GpuCast from ArrayType to StringType [databricks] #4221
Conversation
Signed-off-by: Remzi Yang <13716567376@163.com>
…pids into branch-22.02
Signed-off-by: remzi <13716567376yh@gmail.com>
build
Signed-off-by: remzi <13716567376yh@gmail.com>
build
build
Signed-off-by: remzi <13716567376yh@gmail.com>
build
It looks good except several small issues.
@pytest.mark.parametrize('data_gen', all_gens_for_cast_to_string, ids=idfn)
def test_legacy_cast_array_to_string(data_gen):
To me, it is better to merge this test with test_cast_array_to_string, since the castComplexTypesToString config can be a pytest param of this test.
Absolutely agree with you.
However, the current style is more consistent with the tests for casting struct to string.
Just as Bobby said, we should have a follow-on issue for moving those tests to cast_test.py and cleaning up the code.
lambda spark: unary_op_df(spark, data_gen).select(
        f.col('a').cast("STRING")
    )
)
Missing newline at end of file.
Fixed it.
withResource(ColumnVector.fromScalar(space, child.getRowCount.toInt)) { spaceVec =>
  withResource(
    ColumnVector.stringConcatenate(Array(spaceVec, strChild))
  ) { AddSpace =>
nit: lowercase the first character.
Fixed it.
 */
def removeFirstSpace(strVec: ColumnVector): ColumnVector = {
  if (legacyCastToString) {
    withResource(ColumnVector.fromScalar(space, strVec.getRowCount.toInt)) { spaceVec =>
It doesn't seem necessary to construct a spaceVec, since equalTo works between a column and a scalar as well.
Thanks for your advice, I will fix it.
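The reviewer's point can be illustrated with a plain-Python stand-in (the actual PR code is Scala over the cuDF Java API, and both function names below are hypothetical): comparing each element directly against a scalar gives the same result as first materializing a constant column, without the extra allocation.

```python
def equals_scalar_via_constant_column(col, scalar):
    # Wasteful approach: build a column full of the scalar, then compare
    # elementwise (analogous to fromScalar + column equalTo column).
    space_vec = [scalar] * len(col)
    return [a == b for a, b in zip(col, space_vec)]

def equals_scalar_direct(col, scalar):
    # Suggested approach: compare directly against the scalar
    # (analogous to column equalTo scalar).
    return [a == scalar for a in col]
```

Both produce identical boolean masks; only the intermediate constant column differs.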
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCast.scala
Another issue is that the config LEGACY_COMPLEX_TYPES_TO_STRING only exists in Spark 3.1.0+. However, we currently also need to support Spark 3.0.x.
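One common way to handle this kind of version gap, sketched here in Python rather than the plugin's actual Scala shim layer (the function name and tuple-based version check are hypothetical), is to gate the config lookup on the Spark version so that 3.0.x never consults the nonexistent key:

```python
def legacy_complex_types_to_string_enabled(spark_version, conf):
    # spark_version is a (major, minor, patch) tuple, e.g. (3, 1, 2).
    # The key 'spark.sql.legacy.castComplexTypesToString.enabled'
    # only exists in Spark 3.1.0+.
    if spark_version >= (3, 1, 0):
        return conf.get(
            'spark.sql.legacy.castComplexTypesToString.enabled',
            'false') == 'true'
    # Spark 3.0.x: the config does not exist, so legacy semantics
    # are never requested.
    return False
```

In the plugin itself this kind of branching typically lives in the per-Spark-version shim classes rather than inline in GpuCast.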
Signed-off-by: remzi <13716567376yh@gmail.com>
build
# casting these types to string is not an exact match, marked as xfail when testing
not_matched_gens = [float_gen, double_gen, timestamp_gen, decimal_gen_neg_scale]
# casting these types to string is not supported, marked as xfail when testing
not_support_gens = decimal_128_gens
The names of these variables are too generic. Can we put _for_cast_to_string in them?
Thank you, I have revised the names.
@@ -249,3 +249,63 @@ def test_sql_array_scalars(query):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : spark.sql('SELECT {}'.format(query)),
        conf={'spark.sql.legacy.allowNegativeScaleOfDecimal': 'true'})
nit: a lot of the other tests around casting are in cast_test.py. I can see arguments to have them here because it is specifically for arrays, but we already have tests there for casting arrays to arrays. As such I would rather see the tests moved there.
It is much more reasonable to move the tests to cast_test.py, I agree with you!
However, you may find that tests for casting struct to string are all in struct_test.py instead of cast_test.py:
spark-rapids/integration_tests/src/main/python/struct_test.py
Lines 75 to 166 in 877c1b2
# conf with legacy cast to string on
legacy_complex_types_to_string = {'spark.sql.legacy.castComplexTypesToString.enabled': 'true'}

@pytest.mark.parametrize('data_gen', [StructGen([["first", boolean_gen], ["second", byte_gen], ["third", short_gen], ["fourth", int_gen], ["fifth", long_gen], ["sixth", string_gen], ["seventh", date_gen]])], ids=idfn)
def test_legacy_cast_struct_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")),
        conf = legacy_complex_types_to_string)

# https://github.com/NVIDIA/spark-rapids/issues/2309
@pytest.mark.parametrize('cast_conf', ['LEGACY', 'SPARK311+'])
def test_one_nested_null_field_legacy_cast(cast_conf):
    def was_broken_for_nested_null(spark):
        data = [
            (('foo',),),
            ((None,),),
            (None,)
        ]
        df = spark.createDataFrame(data)
        return df.select(df._1.cast(StringType()))

    assert_gpu_and_cpu_are_equal_collect(was_broken_for_nested_null, {
        'spark.sql.legacy.castComplexTypesToString.enabled': cast_conf == 'LEGACY'
    })

# https://github.com/NVIDIA/spark-rapids/issues/2315
@pytest.mark.parametrize('cast_conf', ['LEGACY', 'SPARK311+'])
def test_two_col_struct_legacy_cast(cast_conf):
    def broken_df(spark):
        key_data_gen = StructGen([
            ('a', IntegerGen(min_val=0, max_val=4)),
            ('b', IntegerGen(min_val=5, max_val=9)),
        ], nullable=False)
        val_data_gen = IntegerGen()
        df = two_col_df(spark, key_data_gen, val_data_gen)
        return df.select(df.a.cast(StringType())).filter(df.b > 1)

    assert_gpu_and_cpu_are_equal_collect(broken_df, {
        'spark.sql.legacy.castComplexTypesToString.enabled': cast_conf == 'LEGACY'
    })

@pytest.mark.parametrize('data_gen', [StructGen([["first", float_gen]])], ids=idfn)
@pytest.mark.xfail(reason='casting float to string is not an exact match')
def test_legacy_cast_struct_with_float_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")),
        conf = legacy_complex_types_to_string)

@pytest.mark.parametrize('data_gen', [StructGen([["first", double_gen]])], ids=idfn)
@pytest.mark.xfail(reason='casting double to string is not an exact match')
def test_legacy_cast_struct_with_double_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")),
        conf = legacy_complex_types_to_string)

@pytest.mark.parametrize('data_gen', [StructGen([["first", timestamp_gen]])], ids=idfn)
@pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/219')
def test_legacy_cast_struct_with_timestamp_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")),
        conf = legacy_complex_types_to_string)

@pytest.mark.parametrize('data_gen', [StructGen([["first", boolean_gen], ["second", byte_gen], ["third", short_gen], ["fourth", int_gen], ["fifth", long_gen], ["sixth", string_gen], ["seventh", date_gen]])], ids=idfn)
def test_cast_struct_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")))

@pytest.mark.parametrize('data_gen', [StructGen([["first", float_gen]])], ids=idfn)
@pytest.mark.xfail(reason='casting float to string is not an exact match')
def test_cast_struct_with_float_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")))

@pytest.mark.parametrize('data_gen', [StructGen([["first", double_gen]])], ids=idfn)
@pytest.mark.xfail(reason='casting double to string is not an exact match')
def test_cast_struct_with_double_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")))

@pytest.mark.parametrize('data_gen', [StructGen([["first", timestamp_gen]])], ids=idfn)
@pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/219')
def test_cast_struct_with_timestamp_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")))
Should we be consistent with it?
Yes we should, and the struct tests should probably move too at some point. If you want to file a follow on issue you can. But this is a nit so if you just want both to be a follow on issue that is fine too.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCast.scala
sql-plugin/src/main/scala/com/nvidia/spark/rapids/TypeChecks.scala
…ateListElements API
Signed-off-by: remzi <13716567376yh@gmail.com>
build
The computation is at least documented. I would love it if we could have two follow-on issues to find a way to make this cast better.
- I would like to see if we can clean up the code, even if we have to have a Spark-specific kernel for doing some of the concat.
- We should look at the performance impact of this vs a version that has split paths for legacy vs not legacy. I am a little concerned that we are taking a performance hit when legacy is false because we wanted to reuse as much code as possible.
The only reason I am not approving this is because some of the comments by @sperlingxx have not been addressed yet.
private def castArrayToString(input: ColumnView,
                              elementType: DataType,
                              ansiMode: Boolean,
                              legacyCastToString: Boolean,
                              stringToDateAnsiModeEnabled: Boolean): ColumnVector = {
I suspect that our Scala style doesn't support such alignment 😃
This alignment is auto-generated by IDEA when I press the Enter key, and I have found no way to change it. But it has passed the Scala style check.
LGTM
This is a performance report of casting array to string.
So far, I have tested the performance in local mode. I will upload another report for standalone mode.
Similar results were obtained in standalone mode.
Signed-off-by: HaoYang670 <13716567376yh@gmail.com>
Add a new feature: support GpuCast from ArrayType to StringType.
Closes #4028