
GpuCast from ArrayType to StringType [databricks] #4221

Merged
merged 15 commits into from
Dec 2, 2021

Conversation

HaoYang670
Collaborator

@HaoYang670 HaoYang670 commented Nov 26, 2021

Signed-off-by: HaoYang670 13716567376yh@gmail.com

Add a new feature: support GpuCast from ArrayType to StringType.

Closes #4028

HaoYang670 and others added 11 commits November 19, 2021 11:42
Signed-off-by: Remzi Yang <13716567376@163.com>
Signed-off-by: Remzi Yang <13716567376@163.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Collaborator Author

build

@HaoYang670 HaoYang670 added the feature request New feature or request label Nov 26, 2021
Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Collaborator Author

build

@pxLi pxLi changed the title Support "GpuCast from non-nested ArrayType to StringType" Support "GpuCast from non-nested ArrayType to StringType" [databricks] Nov 26, 2021
@pxLi
Collaborator

pxLi commented Nov 26, 2021

build

Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Collaborator Author

build

Collaborator

@sperlingxx sperlingxx left a comment

It looks good except for several small issues.



@pytest.mark.parametrize('data_gen', all_gens_for_cast_to_string, ids=idfn)
def test_legacy_cast_array_to_string(data_gen):
Collaborator

To me, it is better to merge this test with test_cast_array_to_string, since the castComplexTypesToString config can be a pytest param of this test.
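A hypothetical sketch of the merged test the reviewer describes, with the legacy flag as a pytest parameter. Only the conf-building helper is modeled here; `assert_gpu_and_cpu_are_equal_collect`, `unary_op_df`, and the data gens come from the spark-rapids integration-test utilities, and the merged shape itself is an assumption:

```python
def cast_to_string_conf(legacy_enabled):
    """Spark conf for one parametrized case of the merged test."""
    return {'spark.sql.legacy.castComplexTypesToString.enabled':
            'true' if legacy_enabled else 'false'}

# The merged test would then look roughly like:
#   @pytest.mark.parametrize('data_gen', all_gens_for_cast_to_string, ids=idfn)
#   @pytest.mark.parametrize('legacy_enabled', [True, False])
#   def test_cast_array_to_string(data_gen, legacy_enabled):
#       assert_gpu_and_cpu_are_equal_collect(
#           lambda spark: unary_op_df(spark, data_gen).select(
#               f.col('a').cast("STRING")),
#           conf=cast_to_string_conf(legacy_enabled))
```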

Collaborator Author

Absolutely agree with you.
However, the current style is more consistent with the tests for casting struct to string.
Just as Bobby said, we should have a follow-on issue for moving those tests to cast_test.py and cleaning up the code.

    lambda spark: unary_op_df(spark, data_gen).select(
        f.col('a').cast("STRING")
    )
)
Collaborator

missing newline at end of file

Collaborator Author

fix it

withResource(ColumnVector.fromScalar(space, child.getRowCount.toInt)) { spaceVec =>
  withResource(
    ColumnVector.stringConcatenate(Array(spaceVec, strChild))
  ) { AddSpace =>
Collaborator

nit: lower-case the first character.

Collaborator Author

fix it.

 */
def removeFirstSpace(strVec: ColumnVector): ColumnVector = {
  if (legacyCastToString) {
    withResource(ColumnVector.fromScalar(space, strVec.getRowCount.toInt)) { spaceVec =>
Collaborator

It doesn't seem necessary to construct a spaceVec, since `equalTo` works between a column and a scalar as well.

Collaborator Author

Thanks for your advice, I will fix it
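As a rough CPU-side model of what the Scala snippets above compute (prepend a space to every element string via `stringConcatenate`, join with commas, then `removeFirstSpace` drops the space before the first element), here is a pure-Python sketch. It assumes already-stringified, non-null elements; null handling and the legacy path are omitted:

```python
def cast_array_to_string(element_strs):
    # Mirror the GPU steps: a space column is concatenated in front of each
    # element, the results are joined with ",", and the leading space of the
    # first element is removed so the output reads "[a, b, c]".
    spaced = [" " + s for s in element_strs]
    body = ",".join(spaced)
    if body.startswith(" "):
        body = body[1:]
    return "[" + body + "]"
```

For example, `cast_array_to_string(["1", "2", "3"])` yields `"[1, 2, 3]"`, matching Spark's non-legacy format for `CAST(array(...) AS STRING)`.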

Collaborator

@sperlingxx sperlingxx left a comment

Another issue is that the config LEGACY_COMPLEX_TYPES_TO_STRING only exists in Spark 3.1.0+, while we currently also need to support Spark 3.0.x.
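One way to express this version guard, sketched as a hypothetical helper (the real plugin handles version-specific behavior through its shim layer, not a function like this):

```python
def supports_legacy_cast_conf(spark_version):
    # spark.sql.legacy.castComplexTypesToString.enabled only exists in
    # Spark 3.1.0+, so on 3.0.x the plugin must fall back to a fixed
    # default instead of reading the conf.
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    return (major, minor) >= (3, 1)
```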

Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Copy link
Collaborator Author

build

@revans2 revans2 changed the title Support "GpuCast from non-nested ArrayType to StringType" [databricks] GpuCast from ArrayType to StringType [databricks] Nov 29, 2021
# casting these types to string is not an exact match, marked as xfail when testing
not_matched_gens = [float_gen, double_gen, timestamp_gen, decimal_gen_neg_scale]
# casting these types to string is not supported, marked as xfail when testing
not_support_gens = decimal_128_gens
Collaborator

The names of these variables are too generic. Can we put `_for_cast_to_string` in them?

Collaborator Author

Thank you, I have revised the names

@@ -249,3 +249,63 @@ def test_sql_array_scalars(query):
assert_gpu_and_cpu_are_equal_collect(
lambda spark : spark.sql('SELECT {}'.format(query)),
conf={'spark.sql.legacy.allowNegativeScaleOfDecimal': 'true'})


Collaborator

nit: a lot of the other tests around casting are in cast_test.py. I can see arguments to have them here because it is specifically for arrays, but we already have tests there for casting arrays to arrays. As such I would rather see the tests moved there.

Collaborator Author

It is much more reasonable to move the tests to cast_test.py; I agree with you!

However, you may find that tests for casting struct to string are all in struct_test.py instead of cast_test.py:

# conf with legacy cast to string on
legacy_complex_types_to_string = {'spark.sql.legacy.castComplexTypesToString.enabled': 'true'}

@pytest.mark.parametrize('data_gen', [StructGen([["first", boolean_gen], ["second", byte_gen], ["third", short_gen], ["fourth", int_gen], ["fifth", long_gen], ["sixth", string_gen], ["seventh", date_gen]])], ids=idfn)
def test_legacy_cast_struct_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")),
        conf = legacy_complex_types_to_string)

# https://github.com/NVIDIA/spark-rapids/issues/2309
@pytest.mark.parametrize('cast_conf', ['LEGACY', 'SPARK311+'])
def test_one_nested_null_field_legacy_cast(cast_conf):
    def was_broken_for_nested_null(spark):
        data = [
            (('foo',),),
            ((None,),),
            (None,)
        ]
        df = spark.createDataFrame(data)
        return df.select(df._1.cast(StringType()))

    assert_gpu_and_cpu_are_equal_collect(was_broken_for_nested_null, {
        'spark.sql.legacy.castComplexTypesToString.enabled': cast_conf == 'LEGACY'
    })

# https://github.com/NVIDIA/spark-rapids/issues/2315
@pytest.mark.parametrize('cast_conf', ['LEGACY', 'SPARK311+'])
def test_two_col_struct_legacy_cast(cast_conf):
    def broken_df(spark):
        key_data_gen = StructGen([
            ('a', IntegerGen(min_val=0, max_val=4)),
            ('b', IntegerGen(min_val=5, max_val=9)),
        ], nullable=False)
        val_data_gen = IntegerGen()
        df = two_col_df(spark, key_data_gen, val_data_gen)
        return df.select(df.a.cast(StringType())).filter(df.b > 1)

    assert_gpu_and_cpu_are_equal_collect(broken_df, {
        'spark.sql.legacy.castComplexTypesToString.enabled': cast_conf == 'LEGACY'
    })

@pytest.mark.parametrize('data_gen', [StructGen([["first", float_gen]])], ids=idfn)
@pytest.mark.xfail(reason='casting float to string is not an exact match')
def test_legacy_cast_struct_with_float_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")),
        conf = legacy_complex_types_to_string)

@pytest.mark.parametrize('data_gen', [StructGen([["first", double_gen]])], ids=idfn)
@pytest.mark.xfail(reason='casting double to string is not an exact match')
def test_legacy_cast_struct_with_double_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")),
        conf = legacy_complex_types_to_string)

@pytest.mark.parametrize('data_gen', [StructGen([["first", timestamp_gen]])], ids=idfn)
@pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/219')
def test_legacy_cast_struct_with_timestamp_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")),
        conf = legacy_complex_types_to_string)

@pytest.mark.parametrize('data_gen', [StructGen([["first", boolean_gen], ["second", byte_gen], ["third", short_gen], ["fourth", int_gen], ["fifth", long_gen], ["sixth", string_gen], ["seventh", date_gen]])], ids=idfn)
def test_cast_struct_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")))

@pytest.mark.parametrize('data_gen', [StructGen([["first", float_gen]])], ids=idfn)
@pytest.mark.xfail(reason='casting float to string is not an exact match')
def test_cast_struct_with_float_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")))

@pytest.mark.parametrize('data_gen', [StructGen([["first", double_gen]])], ids=idfn)
@pytest.mark.xfail(reason='casting double to string is not an exact match')
def test_cast_struct_with_double_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")))

@pytest.mark.parametrize('data_gen', [StructGen([["first", timestamp_gen]])], ids=idfn)
@pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/219')
def test_cast_struct_with_timestamp_to_string(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).select(
            f.col('a').cast("STRING")))

Should we be consistent with it?

Collaborator

Yes we should, and the struct tests should probably move too at some point. If you want to file a follow on issue you can. But this is a nit so if you just want both to be a follow on issue that is fine too.

…ateListElements API

Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Collaborator Author

build

Collaborator

@revans2 revans2 left a comment

The computation is at least documented. I would love it if we could have two follow-on issues to find a way to make this cast better.

  1. I would like to see if we can clean up the code, even if we have to have a Spark-specific kernel for doing some of the concat.
  2. We should look at the performance impact of this vs. a version that has split paths for legacy vs. non-legacy. I am a little concerned that we are taking a performance hit when legacy is false because we wanted to reuse as much code as possible.

The only reason I am not approving this is because some of the comments by @sperlingxx have not been addressed yet.


Comment on lines +669 to +673
private def castArrayToString(input: ColumnView,
elementType: DataType,
ansiMode: Boolean,
legacyCastToString: Boolean,
stringToDateAnsiModeEnabled: Boolean): ColumnVector = {
Collaborator

I suspect that our Scala style doesn't support such alignment 😃

Collaborator Author

This alignment is auto-generated by IDEA when I press the enter key; I found no way to change it, but it has passed the Scala style check.

Collaborator

@sperlingxx sperlingxx left a comment

LGTM

@HaoYang670 HaoYang670 requested a review from revans2 December 1, 2021 02:39
@HaoYang670
Collaborator Author

HaoYang670 commented Dec 7, 2021

This is a performance report of casting array to string.

1. Aims:
   - Compare the performance of the CPU version and the GPU version (end-to-end).
   - Compare the performance of spark-rapids with legacy enabled vs. legacy disabled (GpuProject transformation).
2. Hardware / software environment:
   - CPU: Intel Core i7-10700K, 16 cores
   - Memory: 64 GB
   - Storage: 813 GB
   - GPU: Quadro RTX 8000, 48578 MB memory, driver 495.29.05, CUDA 11.5
   - Spark: spark-3.2.0-bin-hadoop-3.2
   - spark-rapids: 22.02
   - cuDF: 22.02
3. Test method:
   - Spark mode: local.
   - We use DataFrames with one column; each row contains one array of bytes. The number of rows ranges from 1e8 to 1e9 and the array length from 1 to 50. All DataFrames are stored as Parquet files.
   - All Spark configurations are defaults except spark.sql.legacy.castComplexTypesToString.enabled.
   - In each test, we first load the DataFrame, then cast the column to string, then call the aggregate max, and finally call collect. In Python: spark.read.parquet(data_path).withColumn("b", f.col("value").cast("STRING")).agg({"b": "max"}).collect()
   - For each DataFrame, we first run the test code 5 times for warm-up, then run it 5 more times and report the average elapsed time.
   - When comparing the CPU and GPU versions, we measure end-to-end performance (from reading the Parquet file to collect()). When comparing the GPU version with legacy enabled vs. disabled, we measure only the GpuProject transformation, which contains just the cast from array to string.
4. Test results (ETE = end-to-end, GP = GpuProject transformation; all times in seconds):

| Rows | Array length | CPU ETE | CPU-legacy ETE | GPU ETE | GPU-legacy ETE | GPU GP (total) | GPU-legacy GP (total) |
|------|--------------|---------|----------------|---------|----------------|----------------|-----------------------|
| 1e8  | 1            | 6.14    | 6.13           | 0.24    | 0.28           | 0.050          | 0.098                 |
| 1e8  | 5            | 25.03   | 23.12          | 0.48    | 0.62           | 0.096          | 0.253                 |
| 1e8  | 10           | 47.11   | 47.16          | 0.87    | 1.20           | 0.218          | 0.563                 |
| 1e8  | 50           | 228.55  | 226.89         | 20.67   | 41.19          | 16.00          | 36.50                 |
| 1e9  | 1            | 60.23   | 63.85          | 1.08    | 1.44           | 0.388          | 0.815                 |

5. Conclusions:
   - For end-to-end performance, the GPU gives roughly a 5x to 60x speedup. The speedup grows as the number of rows increases and shrinks as the array length increases.
   - With legacy enabled, spark-rapids takes about twice as long as with legacy disabled. The size of the DataFrame has no obvious impact on this slowdown.
6. Discussion:
   So far, casting array to string takes two different code paths for legacy and non-legacy, which hurts performance. We should clean up this method to reduce the difference.
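For reference, the 5-warmup / 5-measure protocol described above can be expressed with a small harness like this (a sketch; the actual measurements were taken around the Spark query, and `run` here is any zero-argument callable):

```python
import time

def bench(run, warmup=5, reps=5):
    """Average wall-clock seconds of run() after warm-up iterations."""
    for _ in range(warmup):
        run()  # warm up JIT, caches, file-system state, etc.
    start = time.perf_counter()
    for _ in range(reps):
        run()
    return (time.perf_counter() - start) / reps
```

In the report, `run` would wrap the full query, e.g. `lambda: spark.read.parquet(data_path).withColumn("b", f.col("value").cast("STRING")).agg({"b": "max"}).collect()`.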

@HaoYang670
Collaborator Author

So far, I have tested the performance in local mode. I will upload another report for standalone mode.

@HaoYang670
Collaborator Author

Similar results were obtained in standalone mode.

Successfully merging this pull request may close these issues.

[FEA] support GpuCast from non-nested ArrayType to StringType