
Implement duplicate key handling for GpuCreateMap #4007

Merged: 9 commits merged into NVIDIA:branch-22.02 on Dec 1, 2021

Conversation

@andygrove (Contributor) commented Nov 2, 2021

Signed-off-by: Andy Grove andygrove@nvidia.com

Closes #3250

Depends on rapidsai/cudf#9553

scala> val df = Seq(1,2,3).toDF("c0")
df: org.apache.spark.sql.DataFrame = [c0: int]

scala> df.repartition(2).createOrReplaceTempView("foo")

scala> spark.sql("SELECT map(c0, 1, 1 + c0 - 1, 2) FROM foo").collect

21/11/02 20:46:05 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 3)
java.lang.RuntimeException: Duplicate map key was found, please check the input data. If you want to remove the duplicated keys, you can set spark.sql.mapKeyDedupPolicy to LAST_WIN so that the key inserted at last takes precedence.
	at com.nvidia.spark.rapids.GpuMapUtils$.duplicateMapKeyFoundError(GpuMapUtils.scala:134)
	at org.apache.spark.sql.rapids.GpuCreateMap.$anonfun$columnarEval$10(complexTypeCreator.scala:118)
	
scala> spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

scala> spark.sql("SELECT map(c0, 1, 1 + c0 - 1, 2) FROM foo").collect

res3: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 2)], [Map(2 -> 2)], [Map(3 -> 2)])
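The transcript above shows the two behaviors Spark defines via `spark.sql.mapKeyDedupPolicy`: the default `EXCEPTION` policy fails when `map(...)` produces duplicate keys, while `LAST_WIN` keeps the value inserted last. A minimal CPU-side sketch of that semantics (not the actual GPU implementation, which relies on cuDF from rapidsai/cudf#9553; `build_map` is a hypothetical helper for illustration):

```python
def build_map(pairs, dedup_policy="EXCEPTION"):
    """Build a map from (key, value) pairs, mimicking Spark's dedup policies.

    EXCEPTION: raise on a duplicate key (Spark's default).
    LAST_WIN:  the value for the last occurrence of a key takes precedence.
    """
    result = {}
    for key, value in pairs:
        if key in result and dedup_policy == "EXCEPTION":
            raise RuntimeError(
                "Duplicate map key was found, please check the input data.")
        result[key] = value  # later occurrences overwrite earlier ones
    return result


# For the row c0 = 1, map(c0, 1, 1 + c0 - 1, 2) yields pairs (1, 1) and (1, 2):
print(build_map([(1, 1), (1, 2)], dedup_policy="LAST_WIN"))  # {1: 2}
```

With `EXCEPTION` the same input raises, matching the `Duplicate map key was found` error in the stack trace above.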

@andygrove (Contributor, Author) commented:

build

@andygrove andygrove marked this pull request as ready for review November 12, 2021 19:38
@andygrove andygrove changed the title WIP: Implement duplicate key handling for GpuCreateMap Implement duplicate key handling for GpuCreateMap Nov 12, 2021
@andygrove andygrove added this to the Nov 1 - Nov 12 milestone Nov 12, 2021
@andygrove andygrove self-assigned this Nov 12, 2021
@sameerz sameerz added the task Work required that improves the product but is not user facing label Nov 16, 2021
@andygrove andygrove changed the base branch from branch-21.12 to branch-22.02 November 16, 2021 23:37
jlowe previously approved these changes Nov 30, 2021
@jlowe (Member) commented Nov 30, 2021:

build

@andygrove (Contributor, Author) commented:

build

revans2 previously approved these changes Dec 1, 2021
Review comment on integration_tests/src/main/python/map_test.py (resolved)
@andygrove andygrove dismissed stale reviews from revans2 and jlowe via f1f2a94 December 1, 2021 18:36
@revans2 (Collaborator) commented Dec 1, 2021:

build

@andygrove andygrove merged commit d95b043 into NVIDIA:branch-22.02 Dec 1, 2021
@andygrove andygrove deleted the create-map-duplicate-keys branch December 1, 2021 21:49

Successfully merging this pull request may close these issues.

[FEA] Make CreateMap duplicate key handling compatible with Spark and enable CreateMap by default
5 participants