-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-49611][SQL] Introduce TVF collations()
& remove the SHOW COLLATIONS
command
#48087
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, @cloud-fan could you review/merge
@@ -1158,6 +1158,7 @@ object TableFunctionRegistry { | |||
generator[PosExplode]("posexplode"), | |||
generator[PosExplode]("posexplode_outer", outer = true), | |||
generator[Stack]("stack"), | |||
generator[AllCollations]("all_collations"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given we have sql_keywords
, shall we call it string_collations
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: To me this sounds weird. We suggest that there might exist collations for some other type, for keywords it is more likely to have other keywords (python, scala ...) @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, not only string type? then all_collations
is good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, Let's restore to all_collations
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all_collation is fine although I'm not sure why all is necessary. But no strong feelings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's all_collation
, not all_collations
, right?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala
Show resolved
Hide resolved
can we also remove the SHOW COLLATIONS command in this PR? #47364 (comment) I'd like to avoid confusion as Spark has a very special LIKE semantic in SHOW commands (different from Spark's own LIKE operator for string matching) |
Okay, Let me handle it together. |
…essions/generators.scala Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
SHOW COLLATIONS
command
SHOW COLLATIONS
commandstring_collations()
and remove the SHOW COLLATIONS
command
Row("SYSTEM", "BUILTIN", "UTF8_BINARY", null, null, | ||
"ACCENT_SENSITIVE", "CASE_SENSITIVE", "NO_PAD", null)) | ||
|
||
checkAnswer(sql("SHOW COLLATIONS '*zh_Hant_HKG*'"), | ||
checkAnswer(sql("SELECT * FROM string_collations() WHERE COLLATION_NAME LIKE '%zh_Hant_HKG%'"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please do some test where we have for example WHERE country collate UTF8_LCASE like '%china%'
so that we know we can actually use search on other fields, which is one of the main reasons we added this functionality over SHOW COLLATIONS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM apart from small suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
string_collations()
and remove the SHOW COLLATIONS
commandall_collations()
and remove the SHOW COLLATIONS
command
SYSTEM BUILTIN UTF8_BINARY NULL NULL ACCENT_SENSITIVE CASE_SENSITIVE NO_PAD NULL | ||
SYSTEM BUILTIN UTF8_LCASE NULL NULL ACCENT_SENSITIVE CASE_INSENSITIVE NO_PAD NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this output deterministic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be, listCollationMeta enforces UTF8_* collations first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would replace this LIMIT
by WHERE
which always returns one row, to don't depend on order and how LIMIT behaves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
@@ -18,6 +18,7 @@ | |||
package org.apache.spark.sql.catalyst.expressions | |||
|
|||
import scala.collection.mutable | |||
import scala.jdk.CollectionConverters.CollectionHasAsScala |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
UTF8String.fromString(m.catalog), | ||
UTF8String.fromString(m.schema), | ||
UTF8String.fromString(m.collationName), | ||
if (m.language != null) UTF8String.fromString(m.language) else null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fromString()
does null-checking, not need to check this additionally, see:
public static UTF8String fromString(String str) {
return str == null ? null : fromBytes(str.getBytes(StandardCharsets.UTF_8));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
override def elementSchema: StructType = new StructType() | ||
.add("COLLATION_CATALOG", StringType, nullable = false) | ||
.add("COLLATION_SCHEMA", StringType, nullable = false) | ||
.add("COLLATION_NAME", StringType, nullable = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need the COLLATION
prefix? Can it just be CATALOG
, SCHEMA
and NAME
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I noticed an issue. Do we need to register it as a SQL built-in function
? I noticed that some functions in FunctionRegistry
are both TVF
and SQL built-in function
, eg:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
Lines 370 to 371 in 9fc58aa
expression[Inline]("inline"), | |
expressionGeneratorOuter[Inline]("inline_outer"), |
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
Lines 1155 to 1156 in 9fc58aa
generator[Inline]("inline"), | |
generator[Inline]("inline_outer", outer = true), |
so we can use it as follows:
SELECT all_collations();
or
SELECT * FROM all_collations();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using generator functions in SELECT is not recommended. We only keep them for backward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, I see.
An we discuss the name? Are there any other collations? Is there precedence? Can we lean into what we did for keywords function?
Sent from my iPhone
On Sep 13, 2024, at 3:02 AM, Maxim Gekk ***@***.***> wrote:
@MaxGekk commented on this pull request.
________________________________
In sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala<#48087 (comment)>:
+ SYSTEM BUILTIN UTF8_BINARY NULL NULL ACCENT_SENSITIVE CASE_SENSITIVE NO_PAD NULL
+ SYSTEM BUILTIN UTF8_LCASE NULL NULL ACCENT_SENSITIVE CASE_INSENSITIVE NO_PAD NULL
I would replace this LIMIT by WHERE which always returns one row, to don't depend on order and how LIMIT behaves.
—
Reply to this email directly, view it on GitHub<#48087 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AA22CFHCAGKZHSXT3QNJPHDZWKZ3FAVCNFSM6AAAAABODCERQ2VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDGMBSG4YDGNJTGQ>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
all_collations()
and remove the SHOW COLLATIONS
commandall_collations()
& remove the SHOW COLLATIONS
command
@srielau @mihailom-db What are your thoughts about the function name? |
I do not have strong preferences, but to me maybe something like |
Or call it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for collations()
- seems most appropriate to me, no unnecessary complications
@panbingkun Which function name do other systems use? |
Oracle has SHOW COLLATION??
Do you have link?
Also we should have a wider spread… 5 flavors of MySQL re still MySQL.
Collations() works for me. Happens to line up with the ANSI information schema
…Sent from my iPhone
On Sep 14, 2024, at 4:02 AM, panbingkun ***@***.***> wrote:
@panbingkun<https://github.com/panbingkun> Which function name do other systems use?
Database Support Name
mysql SHOW COLLATION COLLATION
MariaDB SHOW COLLATION LIKE 'latin2%'; COLLATION
oracle SHOW COLLATION COLLATION
TiDB SHOW COLLATION COLLATION
PingCap SHOW COLLATION COLLATION
Firebird SHOW COLLATIONS COLLATIONS
MS SQL Server SELECT name, description FROM sys.fn_helpcollations(); ...collations()
—
Reply to this email directly, view it on GitHub<#48087 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AA22CFGODHWK2YYFZUVKQ6LZWQJSVAVCNFSM6AAAAABODCERQ2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJQHE2TENBTGA>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
It was my bad, I misread |
let's go with |
Updated, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, collations() seems like the most appropriate choice
all_collations()
& remove the SHOW COLLATIONS
commandcollations()
& remove the SHOW COLLATIONS
command
thanks, merging to master! |
Thanks all, @mihailom-db @uros-db @MaxGekk @cloud-fan @srielau |
…LLATIONS` command ### What changes were proposed in this pull request? The pr aims to - introduce `TVF` `collations()`. - remove the `SHOW COLLATIONS` command. ### Why are the changes needed? Based on cloud-fan's suggestion: apache#47364 (comment) I believe that after this, we can do many things based on it, such as `filtering` and `querying` based on `LANGUAGE` or `COUNTRY`, etc. eg: ```sql SELECT * FROM collations() WHERE LANGUAGE like '%Chinese%'; ``` ### Does this PR introduce _any_ user-facing change? Yes, provide a new TVF `collations()` for end-users. ### How was this patch tested? - Add new UT. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48087 from panbingkun/SPARK-49611. Lead-authored-by: panbingkun <panbingkun@baidu.com> Co-authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…LLATIONS` command ### What changes were proposed in this pull request? The pr aims to - introduce `TVF` `collations()`. - remove the `SHOW COLLATIONS` command. ### Why are the changes needed? Based on cloud-fan's suggestion: apache#47364 (comment) I believe that after this, we can do many things based on it, such as `filtering` and `querying` based on `LANGUAGE` or `COUNTRY`, etc. eg: ```sql SELECT * FROM collations() WHERE LANGUAGE like '%Chinese%'; ``` ### Does this PR introduce _any_ user-facing change? Yes, provide a new TVF `collations()` for end-users. ### How was this patch tested? - Add new UT. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48087 from panbingkun/SPARK-49611. Lead-authored-by: panbingkun <panbingkun@baidu.com> Co-authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
The pr aims to
TVF
collations()
.SHOW COLLATIONS
command.Why are the changes needed?
Based on @cloud-fan's suggestion: #47364 (comment)
I believe that after this, we can do many things based on it, such as
filtering
andquerying
based onLANGUAGE
orCOUNTRY
, etc. eg:Does this PR introduce any user-facing change?
Yes, provide a new TVF
collations()
for end-users.How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No.