[SPARK-48906][SQL] Introduce `SHOW COLLATIONS LIKE ...` syntax to show all collations #47364
Conversation
Currently only show:
If necessary, we can also show columns:
Another option:
Hi @panbingkun, thanks for taking the initiative to push this work forward. The design of the table was discussed previously, and the structure that was agreed upon should take a slightly different format.
All fields should be of string type, and only language, country, and version should be nullable. Apart from the SQL API, we need to support other APIs as well, which should be used by calling … Please let me know if you have any additional questions and we can work through this PR together.
```diff
@@ -91,7 +91,7 @@ public Optional<String> getVersion() {
   /**
    * Entry encapsulating all information about a collation.
    */
-  public static class Collation {
+  public static class Collation implements Comparable<Collation> {
```
Is this really needed? When do we want to order them? I would say list/table building should be deterministic and should always output collations in the same way. I would also expect the `UTF8_*` family of collations to come first, as they represent OSS-internal implementations.
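The deterministic ordering suggested above could be sketched as follows. This is an illustrative stand-alone example (class and method names are hypothetical, not the PR's actual implementation): the `UTF8_*` family sorts first, and the remaining collations follow alphabetically, so repeated listings always produce the same order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CollationOrder {
    // Sort so that UTF8_* collations (OSS-internal implementations) come
    // first, then the rest in alphabetical order. Boolean keys sort
    // false-before-true, so "!startsWith" puts UTF8_* names up front.
    public static List<String> orderCollations(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator
            .comparing((String n) -> !n.startsWith("UTF8_"))
            .thenComparing(Comparator.naturalOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(orderCollations(
            Arrays.asList("fr", "UTF8_LCASE", "af_CI", "UTF8_BINARY")));
        // [UTF8_BINARY, UTF8_LCASE, af_CI, fr]
    }
}
```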
Okay.
```diff
@@ -442,6 +442,7 @@ Below is a list of all the keywords in Spark SQL.
 |CODEGEN|non-reserved|non-reserved|non-reserved|
 |COLLATE|reserved|non-reserved|reserved|
 |COLLATION|reserved|non-reserved|reserved|
+|COLLATIONS|reserved|non-reserved|reserved|
```
When we look at MySQL, they use `SHOW COLLATION`, and PostgreSQL uses the `pg_collation` catalog, so I would argue the `COLLATION` keyword is sufficient; let's keep this keyword free.
Okay. The reason for using `COLLATIONS` is that I want to follow some of Spark's unwritten rules, e.g. in `spark/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4` (at commit 42d1479):

```
| SHOW TABLES ((FROM | IN) identifierReference)?        // line 223
| SHOW TBLPROPERTIES table=identifierReference          // line 227
| SHOW COLUMNS (FROM | IN) table=identifierReference    // line 229
```

Of course, if we decide to use `COLLATION`, I am fine too.
Will follow up on this a bit later; let's leave it as `COLLATIONS` for now, and then we can just revert the keyword addition if we decide to switch to `COLLATION`.
Okay
```scala
checkAnswer(sql("SHOW COLLATIONS LIKE '*UTF8_BINARY*'"),
  Row("UTF8_BINARY", "spark", "1.0", true, true, false))
checkAnswer(sql("SHOW COLLATIONS '*zh_Hant_HKG*'"),
  Seq(Row("zh_Hant_HKG", "icu", "153.120.0.0", false, false, false),
```
This version seems weird; I would say it should contain 75.1, the version of the ICU library. @nikolamand-db do you have any more info on whether this is expected?
Updated.
```diff
@@ -918,4 +967,8 @@ public static String getClosestSuggestionsOnInvalidName(
     return String.join(", ", suggestions);
   }
+
+  public static List<Collation> fetchAllCollations() {
```
This operation seems a bit too expensive. We always build the whole table and then do a LIKE on it. We need to do something similar to SHOW TABLES, where we have the pattern available, as we do not want to ask for collation information if the LIKE operator is not concerned with it. It also feels pretty weird to have `fetchAllCollations` in many places like this; it is not as expensive as building the whole table, but we are also allocating an ArrayList multiple times.
Let me try to think of a better way.
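One way to address the comment above is to match the pattern against collation names first and only then fetch the expensive per-collation details for the matches. A minimal sketch, with hypothetical names and a deliberately simplified SQL-LIKE-to-regex translation (no escaping handled), not the PR's actual code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CollationListing {
    // Filter by name first; build the (comparatively expensive) info rows
    // only for the names that match, instead of materializing the full
    // table and filtering it afterwards.
    public static List<String> listMatching(List<String> allNames, String likePattern) {
        // Minimal LIKE -> regex translation: % matches any sequence, _ any char.
        Pattern p = Pattern.compile(
            likePattern.replace("%", ".*").replace("_", "."),
            Pattern.CASE_INSENSITIVE);
        return allNames.stream()
            .filter(n -> p.matcher(n).matches()) // cheap name check first
            .map(CollationListing::buildRow)     // expensive lookup only for hits
            .collect(Collectors.toList());
    }

    // Stand-in for fetching full collation info (provider, version, ...).
    private static String buildRow(String name) {
        return name + "\ticu\t75.1";
    }

    public static void main(String[] args) {
        System.out.println(listMatching(
            Arrays.asList("UTF8_BINARY", "zh_Hant_HKG", "fr_CI"), "%Hant%"));
    }
}
```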
@mihailom-db
1. Do we show the full table, e.g.:

| COLLATION_CATALOG | COLLATION_SCHEMA | COLLATION_NAME | LANGUAGE | COUNTRY | ACCENT_SENSITIVITY | CASE_SENSITIVITY | PAD_ATTRIBUTE | ICU_VERSION |
|---|---|---|---|---|---|---|---|---|
| SYSTEM | BUILTIN | UTF8_BINARY | NULL | NULL | ACCENT_SENSITIVE | CASE_SENSITIVE | NO_PAD | NULL |
| SYSTEM | BUILTIN | UTF8_LCASE | NULL | NULL | ACCENT_SENSITIVE | CASE_INSENSITIVE | NO_PAD | NULL |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |

2. Or do we use two commands, e.g.:

A. `SHOW COLLATIONS LIKE ...`

| COLLATION_NAME |
|---|
| UTF8_BINARY |
| UTF8_LCASE |
| ... |

B. `DESCRIBE COLLATION UTF8_BINARY`

| COLLATION_CATALOG | COLLATION_SCHEMA | COLLATION_NAME | LANGUAGE | COUNTRY | ACCENT_SENSITIVITY | CASE_SENSITIVITY | PAD_ATTRIBUTE | ICU_VERSION |
|---|---|---|---|---|---|---|---|---|
| SYSTEM | BUILTIN | UTF8_BINARY | NULL | NULL | ACCENT_SENSITIVE | CASE_SENSITIVE | NO_PAD | NULL |

Which of the above is more suitable?
I believe for now we agreed to have only
Okay.
@mihailom-db All suggestions have been updated and verified locally as follows:
Hi @panbingkun, this is starting to look like what we want to get as a result. Thanks for taking the initiative. Apart from the SQL command for SHOW COLLATIONS, we need to get the other APIs working. Usually, how we list tables and databases for other APIs is through the use of catalog information. This is the point where we need to make sure we get the initial design right. I would argue catalog and schema information should not be stored per collation, but only inferred for the purpose of listing here. This becomes crucially important later when we introduce user-defined collations, as those collations will not be available in the list of ICU locales. I will take one more look into how this could be done completely and will get back to you.
@mihailom-db

```java
COLLATION_SCHEMA,
collationName(),
ICULocaleMap.get(locale).getLanguage(),
ICULocaleMap.get(locale).getCountry(),
```
Could we possibly change this to getDisplayCountry() and getDisplayLanguage()? The idea is to later be able to look up locales not only by name, but also by country name, i.e. Greece, France, etc.
Sure
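The distinction between the two pairs of methods can be seen with a plain `java.util.Locale` (used here only to keep the sketch self-contained; ICU's `ULocale`, which the PR works with, exposes the same pair of `getDisplay*` methods): `getCountry()`/`getLanguage()` return ISO codes, while the `Display` variants return the human-readable names that the lookup-by-country idea above relies on.

```java
import java.util.Locale;

public class LocaleDisplay {
    public static void main(String[] args) {
        Locale el = new Locale("el", "GR");
        System.out.println(el.getLanguage());                      // ISO code: el
        System.out.println(el.getCountry());                       // ISO code: GR
        // Display names are what users would search by:
        System.out.println(el.getDisplayLanguage(Locale.ENGLISH)); // Greek
        System.out.println(el.getDisplayCountry(Locale.ENGLISH));  // Greece
    }
}
```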
```diff
@@ -734,6 +865,10 @@ public CollationIdentifier identifier() {
   public static final String PROVIDER_ICU = "icu";
   public static final List<String> SUPPORTED_PROVIDERS = List.of(PROVIDER_SPARK, PROVIDER_ICU);
 
+  private static final String COLLATION_CATALOG = "SYSTEM";
+  private static final String COLLATION_SCHEMA = "BUILTIN";
```
I do not think this information should be stored here; could you take a look at `ShowTablesCommand`? What we do there is get the catalog and then work on from there. We could do something similar here, as UDF collations will in future be defined with catalog and schema fields and will be stored somewhere. This is a future problem, but for now, let's try to keep the code as close as possible to what we will need in the future.
Ideally, we would love to have a buffer at the session.catalog level called collations and then store any user-defined collations in there.
Okay, let me think about it again.
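The "session-level buffer" idea above could look roughly like the following. Everything here is hypothetical (class, method, and field names are illustrative): builtin collations get `SYSTEM`/`BUILTIN` inferred at listing time rather than stored per collation, while user-defined collations carry their own catalog and schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CollationRegistry {
    public static final class Entry {
        public final String catalog, schema, name;
        public Entry(String catalog, String schema, String name) {
            this.catalog = catalog; this.schema = schema; this.name = name;
        }
    }

    // Session-level buffer holding only user-defined collations.
    private final Map<String, Entry> userDefined = new LinkedHashMap<>();

    public void register(String catalog, String schema, String name) {
        userDefined.put(name, new Entry(catalog, schema, name));
    }

    public List<Entry> list(List<String> builtinNames) {
        List<Entry> out = new ArrayList<>();
        for (String n : builtinNames) {
            // Catalog/schema inferred for listing, never stored per collation.
            out.add(new Entry("SYSTEM", "BUILTIN", n));
        }
        out.addAll(userDefined.values());
        return out;
    }

    public static void main(String[] args) {
        CollationRegistry reg = new CollationRegistry();
        reg.register("my_catalog", "my_schema", "MY_COLLATION");
        for (Entry e : reg.list(Arrays.asList("UTF8_BINARY", "UTF8_LCASE"))) {
            System.out.println(e.catalog + "." + e.schema + "." + e.name);
        }
    }
}
```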
Hi @panbingkun, I left some comments for you, sorry for the delay. I believe you are really close to getting to the point of what we want from the SHOW COLLATIONS command, and if you happen to have any questions, feel free to ping me. Thanks again for taking this initiative.
Thank you very much for your patient review, let me update based on your suggestions.
@cloud-fan @MaxGekk could you take a look at this PR? |
lgtm
@panbingkun Could you resolve conflicts, please.
Done, thanks!
@panbingkun It seems the test failure is related to your changes:

```
[info] - SPARK-43119: Get SQL Keywords *** FAILED *** (11 milliseconds)
[info] "...LLATE,COLLATION,COLL[ATIONS,COLLECTION],COLUMN,COLUMNS,COMM..." did not equal "...LLATE,COLLATION,COLL[ECTION,COLLECTIONS],COLUMN,COLUMNS,COMM..." (ThriftServerWithSparkContextSuite.scala:217)
[info] Analysis:
[info] "...LLATE,COLLATION,COLL[ATIONS,COLLECTION],COLUMN,COLUMNS,COMM..." -> "...LLATE,COLLATION,COLL[ECTION,COLLECTIONS],COLUMN,COLUMNS,COMM..."
```
Could you fix it, please.
Sorry, it was bad. I made a
+1, LGTM. Merging to master.
Thanks all @MaxGekk @mihailom-db @uros-db ❤️ |
Hi all, sorry for the late review as I've been struggling with the user-facing API. I know we have a lot of SHOW commands already, but there are known issues:
Building an information schema is a big effort, but for this SHOW COLLATIONS feature, can we add a builtin TVF like
BTW, we already have a TVF to get all the SQL keywords.
Okay, let's investigate carefully first.
I think it's possible, let me try it out.
about TVF:
I think the command could be useful in the migration from other systems that already have such a command:
It might not be useful if our
…LLATIONS` command

### What changes were proposed in this pull request?
The pr aims to:
- introduce the TVF `collations()`;
- remove the `SHOW COLLATIONS` command.

### Why are the changes needed?
Based on cloud-fan's suggestion: #47364 (comment). I believe that after this, we can do many things based on it, such as `filtering` and `querying` based on `LANGUAGE` or `COUNTRY`, etc., e.g.:

```sql
SELECT * FROM collations() WHERE LANGUAGE like '%Chinese%';
```

### Does this PR introduce _any_ user-facing change?
Yes, provide a new TVF `collations()` for end-users.

### How was this patch tested?
- Add new UT.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48087 from panbingkun/SPARK-49611.

Lead-authored-by: panbingkun <panbingkun@baidu.com>
Co-authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…w all collations

### What changes were proposed in this pull request?
The pr aims to introduce `SHOW COLLATIONS LIKE ...` syntax to show all collations.

### Why are the changes needed?
End-users will be able to obtain the `collations` currently supported by Spark through SQL. Other databases, such as `MySQL`, also have similar syntax.
ref: https://dev.mysql.com/doc/refman/9.0/en/show-collation.html
postgresql: https://database.guide/how-to-return-a-list-of-available-collations-in-postgresql/

### Does this PR introduce _any_ user-facing change?
Yes, end-users will be able to obtain the `collation`s currently supported by Spark through commands similar to the following:

| name | provider | version | binaryEquality | binaryOrdering | lowercaseEquality |
| --------- | ----------- | ----------- | ----------- | ----------- | ----------- |

```
spark-sql (default)> SHOW COLLATIONS;
UTF8_BINARY   spark 1.0         true  true  false
UTF8_LCASE    spark 1.0         false false true
ff_Adlm       icu   153.120.0.0 false false false
ff_Adlm_CI    icu   153.120.0.0 false false false
ff_Adlm_AI    icu   153.120.0.0 false false false
ff_Adlm_CI_AI icu   153.120.0.0 false false false
...
spark-sql (default)> SHOW COLLATIONS LIKE '*UTF8_BINARY*';
UTF8_BINARY   spark 1.0         true  true  false
Time taken: 0.043 seconds, Fetched 1 row(s)
```

### How was this patch tested?
Add new UT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47364 from panbingkun/show_collation_syntax.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
The pr aims to introduce `SHOW COLLATIONS LIKE ...` syntax to show all collations.

### Why are the changes needed?
End-users will be able to obtain the collations currently supported by Spark through SQL. Other databases, such as MySQL, also have similar syntax.
ref: https://dev.mysql.com/doc/refman/9.0/en/show-collation.html
postgresql: https://database.guide/how-to-return-a-list-of-available-collations-in-postgresql/

### Does this PR introduce _any_ user-facing change?
Yes, end-users will be able to obtain the collations currently supported by Spark through commands similar to the following.

### How was this patch tested?
Add new UT.

### Was this patch authored or co-authored using generative AI tooling?
No.