[SPARK-49713][PYTHON][FOLLOWUP] Make function count_min_sketch accept long seed #48223

Closed

zhengruifeng
Contributor

What changes were proposed in this pull request?

Make the function count_min_sketch accept a long seed.

Why are the changes needed?

The existing implementation only accepts an int seed, which is inconsistent with other ExpressionWithRandomSeed expressions. Passing a seed that does not fit in an int fails analysis:

    >>> from pyspark.sql import functions as sf
    >>> spark.range(100).select(
    ...     sf.hex(sf.count_min_sketch("id", sf.lit(1.5), 0.6, 1111111111111111111))
    ... ).show(truncate=False)

...
AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "count_min_sketch(id, 1.5, 0.6, 1111111111111111111)" due to data type mismatch: The 4th parameter requires the "INT" type, however "1111111111111111111" has the type "BIGINT". SQLSTATE: 42K09;
'Aggregate [unresolvedalias('hex(count_min_sketch(id#64L, 1.5, 0.6, 1111111111111111111, 0, 0)))]
+- Range (0, 100, step=1, splits=Some(12))
...
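The error reports the seed as BIGINT because 1111111111111111111 does not fit in a 32-bit signed integer. The sketch below illustrates that range check in plain Python. It assumes an integer literal is given the narrowest signed type that holds it, and the helper name `spark_integral_type` is hypothetical, not part of PySpark.

```python
# Bounds of Spark SQL's INT (32-bit signed) and BIGINT (64-bit signed) types.
INT_MIN, INT_MAX = -(2**31), 2**31 - 1
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1


def spark_integral_type(value: int) -> str:
    """Hypothetical helper: name the narrowest Spark integral type holding `value`.

    Assumption: the literal is typed by range, INT first, then BIGINT.
    """
    if INT_MIN <= value <= INT_MAX:
        return "INT"
    if LONG_MIN <= value <= LONG_MAX:
        return "BIGINT"
    raise ValueError(f"{value} does not fit in a 64-bit signed integer")


seed = 1111111111111111111  # the seed from the failing example
print(seed > INT_MAX)             # True: too large for INT
print(spark_integral_type(seed))  # BIGINT
```

Under that assumption, any seed above 2147483647 reaches the analyzer as BIGINT, which is the type the 4th parameter rejected before this change.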

Does this PR introduce any user-facing change?

no

How was this patch tested?

Added a doctest.

Was this patch authored or co-authored using generative AI tooling?

no

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@dongjoon-hyun
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the count_min_sk_long_seed branch September 24, 2024 23:35
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…pt long seed

### What changes were proposed in this pull request?
Make the function `count_min_sketch` accept a long seed.

### Why are the changes needed?
The existing implementation only accepts an int seed, which is inconsistent with other `ExpressionWithRandomSeed` expressions. Passing a seed that does not fit in an int fails analysis:

```py
>>> from pyspark.sql import functions as sf
>>> spark.range(100).select(
...     sf.hex(sf.count_min_sketch("id", sf.lit(1.5), 0.6, 1111111111111111111))
... ).show(truncate=False)

...
AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "count_min_sketch(id, 1.5, 0.6, 1111111111111111111)" due to data type mismatch: The 4th parameter requires the "INT" type, however "1111111111111111111" has the type "BIGINT". SQLSTATE: 42K09;
'Aggregate [unresolvedalias('hex(count_min_sketch(id#64L, 1.5, 0.6, 1111111111111111111, 0, 0)))]
+- Range (0, 100, step=1, splits=Some(12))
...

```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Added a doctest.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48223 from zhengruifeng/count_min_sk_long_seed.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024