[HUDI-5829] Optimize conversion from json to row format when sanitizing field names #11941

Open
wants to merge 9 commits into
base: master

vamsikarnika
Contributor

@vamsikarnika vamsikarnika commented Sep 13, 2024

Change Logs

Currently, when source data has to be read in row format and sanitization is enabled, we first read the data in Avro format (which supports sanitization) and then convert it from Avro to row. The new approach simplifies this process by converting directly from JSON to row while applying sanitization.
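The sanitization logic itself does not appear in this conversation. As a rough illustration only, the sketch below assumes that sanitizing means replacing every character that is not valid in an Avro name (names must start with a letter or underscore and continue with letters, digits, or underscores) with a mask string. The class `FieldNameSanitizerSketch`, its methods, and the `"__"` mask used in the usage note are hypothetical and are not taken from the PR. The point it shows is that keys can be renamed during a single pass over the parsed JSON, without building an intermediate Avro record.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not the converter added by this PR. */
final class FieldNameSanitizerSketch {

  private FieldNameSanitizerSketch() {
  }

  /** Replaces each character that Avro does not allow in a name with the mask. */
  static String sanitizeName(String name, String mask) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < name.length(); i++) {
      char c = name.charAt(i);
      boolean letterOrUnderscore =
          (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
      boolean digit = c >= '0' && c <= '9';
      // Digits are valid everywhere except in the first position.
      if (letterOrUnderscore || (digit && i > 0)) {
        out.append(c);
      } else {
        out.append(mask);
      }
    }
    return out.toString();
  }

  /** Copies a parsed JSON object, sanitizing keys at every nesting level in one pass. */
  @SuppressWarnings("unchecked")
  static Map<String, Object> sanitizeKeys(Map<String, Object> record, String mask) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : record.entrySet()) {
      Object value = entry.getValue();
      if (value instanceof Map) {
        value = sanitizeKeys((Map<String, Object>) value, mask);
      }
      out.put(sanitizeName(entry.getKey(), mask), value);
    }
    return out;
  }
}
```

Under these assumptions, `sanitizeName("first-name", "__")` yields `first__name`, and a leading digit is masked as well, so `9lives` becomes `__lives`.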

Impact

When source data has to be read in row format and sanitization is enabled, this change should make the conversion faster by converting directly from JSON to row instead of going through Avro.

This change directly affects existing streams. It is currently behind a flag, which can be disabled if any issues are found.

Risk level (write none, low medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Sep 13, 2024
@vamsikarnika vamsikarnika changed the title from "convert json to row using MercifulJsonToRowConverter" to "[HUDI-5829] Optimize conversion from json to row format when sanitizing field names" Sep 16, 2024
Comment on lines -313 to -325
// As we don't do rounding, the validation will enforce the scale part and the integer part are all within the
// limit. As a result, if scale is 2 precision is 5, we only allow 3 digits for the integer.
// Allowed: 123.45, 123, 0.12
// Disallowed: 1234 (4 digit integer while the scale has already reserved 2 digit out of the 5 digit precision)
// 123456, 0.12345
if (bigDecimal.scale() > decimalType.getScale()
    || (bigDecimal.precision() - bigDecimal.scale()) > (decimalType.getPrecision() - decimalType.getScale())) {
  // Correspond to case
  // org.apache.avro.AvroTypeException: Cannot encode decimal with scale 5 as scale 2 without rounding.
  // org.apache.avro.AvroTypeException: Cannot encode decimal with scale 3 as scale 2 without rounding
  return Pair.of(false, null);
}

Contributor Author

I have moved this code to the DecimalFieldProcessor class so that it can be reused by both the JsonToAvro and JsonToRow processors.
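The rule described in the removed lines can be checked in isolation. The following is a minimal sketch of that condition, assuming only `java.math.BigDecimal`; the class `DecimalFitSketch` and the method `fitsWithoutRounding` are hypothetical names for illustration and are not the DecimalFieldProcessor implementation.

```java
import java.math.BigDecimal;

/** Hypothetical standalone version of the precision/scale check quoted above. */
final class DecimalFitSketch {

  private DecimalFitSketch() {
  }

  /**
   * Returns true when the value fits a decimal type with the given precision and
   * scale without rounding: the fractional digits must not exceed the scale, and
   * the integer digits must not exceed (precision - scale).
   */
  static boolean fitsWithoutRounding(BigDecimal value, int precision, int scale) {
    int integerDigits = value.precision() - value.scale();
    int allowedIntegerDigits = precision - scale;
    return value.scale() <= scale && integerDigits <= allowedIntegerDigits;
  }
}
```

With precision 5 and scale 2, this accepts `123.45`, `123`, and `0.12`, and rejects `1234`, `123456`, and `0.12345`, matching the cases listed in the code comment.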

Comment on lines +41 to +46
static Stream<Object> decimalBadCases() {
  return Stream.of(
      // Invalid schema definition.
      Arguments.of(DECIMAL_AVRO_FILE_INVALID_PATH, "123.45", null, false),
      // Schema set precision as 5, input overwhelmed the precision.
      Arguments.of(DECIMAL_AVRO_FILE_PATH, "123333.45", null, false),
Contributor Author

@vamsikarnika vamsikarnika Sep 16, 2024

I have moved the test data generators from TestMercifulJsonConvertor to this base class so that both the row and Avro conversions can use the same test data.

Comment on lines -76 to +71
@ValueSource(strings = {
"{\"first\":\"John\",\"last\":\"Smith\"}",
"[{\"first\":\"John\",\"last\":\"Smith\"}]",
"{\"first\":\"John\",\"last\":\"Smith\",\"suffix\":3}",
})
@MethodSource("dataNestedJsonAsString")
Contributor Author

I have converted this to a MethodSource and moved it to the test base class so that it can be reused by MercifulJsonToRowConverter.

Contributor

@jonvex jonvex left a comment

A few comments.


import org.apache.hudi.exception.HoodieJsonConversionException;

public class HoodieJsonToRowConversionException extends HoodieJsonConversionException {
}
}

public Either<Row, String> fromJsonToRowWithError(String json) {
Contributor

Please explain the method's return value in a comment. I also think the name could be clearer: fromJsonToRowForErrortable? fromJsonToRowSafe?

Contributor Author

I added the comment. I followed the naming used in AvroConverter. Since the method returns either a Row or an error string, I think fromJsonToRowWithError makes sense.
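To make the return convention under discussion concrete, here is a minimal sketch of an either-style result, where the left side carries the converted value and the right side carries an error message, mirroring the `Either<Row, String>` signature above. Everything in it is hypothetical: `EitherSketch` is a stand-in for whatever `Either` type the PR uses, and parsing a `BigDecimal` stands in for the JSON-to-Row conversion so that the example needs no external libraries.

```java
import java.math.BigDecimal;

/** Hypothetical two-sided result: exactly one of left (value) or right (error) is set. */
final class EitherSketch<L, R> {
  private final L left;
  private final R right;
  private final boolean isLeft;

  private EitherSketch(L left, R right, boolean isLeft) {
    this.left = left;
    this.right = right;
    this.isLeft = isLeft;
  }

  static <L, R> EitherSketch<L, R> left(L value) {
    return new EitherSketch<>(value, null, true);
  }

  static <L, R> EitherSketch<L, R> right(R value) {
    return new EitherSketch<>(null, value, false);
  }

  boolean isLeft() {
    return isLeft;
  }

  L getLeft() {
    if (!isLeft) {
      throw new IllegalStateException("No left value present");
    }
    return left;
  }

  R getRight() {
    if (isLeft) {
      throw new IllegalStateException("No right value present");
    }
    return right;
  }
}

/** Hypothetical converter that reports failures as data instead of throwing. */
final class SafeConversionSketch {

  private SafeConversionSketch() {
  }

  /** Returns the parsed value on the left, or an error description on the right. */
  static EitherSketch<BigDecimal, String> parseWithError(String text) {
    try {
      return EitherSketch.left(new BigDecimal(text));
    } catch (NumberFormatException e) {
      return EitherSketch.right("Cannot convert '" + text + "': " + e.getMessage());
    }
  }
}
```

A caller can then branch on `isLeft()` and route the error string elsewhere (for example, to an error table) rather than failing the whole batch, which is the use case the reviewer's suggested names hint at.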

@jonvex jonvex self-assigned this Oct 1, 2024
@hudi-bot

hudi-bot commented Oct 1, 2024

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

3 participants