-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-5829] Optimize conversion from json to row format when sanitizing field names #11941
base: master
Are you sure you want to change the base?
Conversation
c86f236
to
1538426
Compare
e941544
to
a82fa20
Compare
// As we don't do rounding, the validation will enforce the scale part and the integer part are all within the | ||
// limit. As a result, if scale is 2 precision is 5, we only allow 3 digits for the integer. | ||
// Allowed: 123.45, 123, 0.12 | ||
// Disallowed: 1234 (4 digit integer while the scale has already reserved 2 digit out of the 5 digit precision) | ||
// 123456, 0.12345 | ||
if (bigDecimal.scale() > decimalType.getScale() | ||
|| (bigDecimal.precision() - bigDecimal.scale()) > (decimalType.getPrecision() - decimalType.getScale())) { | ||
// Correspond to case | ||
// org.apache.avro.AvroTypeException: Cannot encode decimal with scale 5 as scale 2 without rounding. | ||
// org.apache.avro.AvroTypeException: Cannot encode decimal with scale 3 as scale 2 without rounding | ||
return Pair.of(false, null); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have moved this code to DecimalFieldProcessor class, so this can be reused by both JsonToAvro and JsonToRow processors.
static Stream<Object> decimalBadCases() { | ||
return Stream.of( | ||
// Invalid schema definition. | ||
Arguments.of(DECIMAL_AVRO_FILE_INVALID_PATH, "123.45", null, false), | ||
// Schema set precision as 5, input overwhelmed the precision. | ||
Arguments.of(DECIMAL_AVRO_FILE_PATH, "123333.45", null, false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have moved the test data generators from TestMercifulJsonConvertor to this base class, so that both row and avro conversions can use same data for testing.
@ValueSource(strings = { | ||
"{\"first\":\"John\",\"last\":\"Smith\"}", | ||
"[{\"first\":\"John\",\"last\":\"Smith\"}]", | ||
"{\"first\":\"John\",\"last\":\"Smith\",\"suffix\":3}", | ||
}) | ||
@MethodSource("dataNestedJsonAsString") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have converted this MethodSource and moved it Test Base class to be reused by MercifulJsonToRowConverter
2c511f1
to
857f45b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few comments
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/RowConverter.java
Outdated
Show resolved
Hide resolved
|
||
import org.apache.hudi.exception.HoodieJsonConversionException; | ||
|
||
public class HoodieJsonToRowConversionException extends HoodieJsonConversionException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created https://issues.apache.org/jira/browse/HUDI-8279 as follow-up
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/RowConverter.java
Show resolved
Hide resolved
...ties/src/main/java/org/apache/hudi/utilities/sources/helpers/MercifulJsonToRowConverter.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/RowConverter.java
Show resolved
Hide resolved
} | ||
} | ||
|
||
public Either<Row, String> fromJsonToRowWithError(String json) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain method return in comment. I think the name could be more clear. fromJsonToRowForErrortable
? fromJsonToRowSafe
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the comment. I've followed similar naming as AvroConverter, since it returns a pair of Row and Error string, I think fromJsonToRowWithError makes sense.
98f6dde
to
ef80959
Compare
Change Logs
Currently when source data has to read in row format and sanitization is enabled, we first read the data in avro format(which supports sanitization) and later convert from avro to row. This new approach simplifies this process by directly converting from json to row while applying sanitization.
Impact
When source data has to be read in row format, and sanitization is enabled. This change should make the conversion from json to row faster by directly converting from json to row.
This change directly affects the existing streams. This is currently added behind a flag which can be disabled if any issues are to be found.
Risk level (write none, low medium or high below)
None
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist