
[SPARK-48906][SQL] Introduce SHOW COLLATIONS LIKE ... syntax to show all collations #47364

Closed
wants to merge 41 commits into apache:master from panbingkun:show_collation_syntax

Conversation

panbingkun
Contributor

@panbingkun panbingkun commented Jul 16, 2024

What changes were proposed in this pull request?

This PR aims to introduce the SHOW COLLATIONS LIKE ... syntax to show all collations.

Why are the changes needed?

End-users will be able to obtain the collations currently supported by Spark through SQL.
Other databases, such as MySQL, also have similar syntax,
ref: https://dev.mysql.com/doc/refman/9.0/en/show-collation.html

postgresql: https://database.guide/how-to-return-a-list-of-available-collations-in-postgresql/

Does this PR introduce any user-facing change?

Yes, end-users will be able to obtain the collations currently supported by Spark through commands similar to the following:

|name|provider|version|binaryEquality|binaryOrdering|lowercaseEquality|
| --------- | ----------- | ----------- | ----------- | ----------- | ----------- |

spark-sql (default)> SHOW COLLATIONS;
UTF8_BINARY	spark	1.0	true	true	false
UTF8_LCASE	spark	1.0	false	false	true
ff_Adlm	icu	153.120.0.0	false	false	false
ff_Adlm_CI	icu	153.120.0.0	false	false	false
ff_Adlm_AI	icu	153.120.0.0	false	false	false
ff_Adlm_CI_AI	icu	153.120.0.0	false	false	false
...

spark-sql (default)> SHOW COLLATIONS LIKE '*UTF8_BINARY*';
UTF8_BINARY	spark	1.0	true	true	false
Time taken: 0.043 seconds, Fetched 1 row(s)

How was this patch tested?

Add new UT.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jul 16, 2024
@panbingkun
Contributor Author

panbingkun commented Jul 16, 2024

cc @cloud-fan @mihailom-db

@panbingkun panbingkun marked this pull request as ready for review July 16, 2024 03:21
@github-actions github-actions bot added the DOCS label Jul 16, 2024
@panbingkun
Contributor Author

Currently, only the normalized collation name is shown.

@panbingkun
Contributor Author

If necessary, we can also show the columns CaseSensitivity and AccentSensitivity.

@panbingkun
Contributor Author

panbingkun commented Jul 17, 2024

Another option:

  • when executing SHOW COLLATIONS ..., only show the columns: name, provider, and version
  • when executing DESCRIBE COLLATIONS ..., display the columns: name, provider, version, binary_equality, binary_ordering, lowercase_equality, case_sensitivity, and accent_sensitivity.

@mihailom-db
Contributor

mihailom-db commented Jul 22, 2024

Hi @panbingkun, thanks for taking the initiative to push this work forward. The design of the table was discussed previously, and the structure that was agreed upon should take a slightly different format.
Let me list out things we need to include:

  • COLLATION_CATALOG (important for udf collations, for now it should be SYSTEM)
  • COLLATION_SCHEMA (important for udf collations, for now it should be BUILTIN)
  • COLLATION_NAME (Full name, with all identifiers, just like you did) ✅
  • LANGUAGE (Name of the language that corresponds to the locale of the given collation. Null if there is no backing language (e.g. for the UTF8_* family of collations))
  • COUNTRY (Name of the country that corresponds to the locale of the given collation. Null if there is no backing country (e.g. for the UTF8_* family of collations))
  • ACCENT_SENSITIVITY (ACCENT_SENSITIVE/ACCENT_INSENSITIVE)
  • CASE_SENSITIVITY (CASE_SENSITIVE/CASE_INSENSITIVE)
  • PAD_ATTRIBUTE (This attribute affects whether leading or trailing spaces are significant in string comparisons. Currently always NO_PAD)
  • ICU_VERSION (Null if not icu collation) (✅, partially done, just switch UTF8_* family to null)

All fields should be of string type and only language, country and version should be nullable.
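
For illustration only (this is not code from the PR; the names and nullability simply restate the list above), the output schema would look roughly like this:

```scala
import org.apache.spark.sql.types.{StringType, StructType}

// Sketch of the agreed-upon output schema: all string columns, with only
// LANGUAGE, COUNTRY and ICU_VERSION nullable.
val collationsSchema: StructType = new StructType()
  .add("COLLATION_CATALOG", StringType, nullable = false)
  .add("COLLATION_SCHEMA", StringType, nullable = false)
  .add("COLLATION_NAME", StringType, nullable = false)
  .add("LANGUAGE", StringType, nullable = true)
  .add("COUNTRY", StringType, nullable = true)
  .add("ACCENT_SENSITIVITY", StringType, nullable = false)
  .add("CASE_SENSITIVITY", StringType, nullable = false)
  .add("PAD_ATTRIBUTE", StringType, nullable = false)
  .add("ICU_VERSION", StringType, nullable = true)
```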

Apart from SQL API, we need to support other APIs as well, which should be used by calling Session.catalog.collation. Because of this, your approach might need to be reworked a bit.

Please let me know if you have any additional questions and we can work through this PR together.

@@ -91,7 +91,7 @@ public Optional<String> getVersion() {
/**
* Entry encapsulating all information about a collation.
*/
public static class Collation {
public static class Collation implements Comparable<Collation> {
Contributor

Is this really needed? When do we want to order them? I would say list/table building should be deterministic and should always output collations in the same way. Also I would expect UTF8_* family collations to come in first as they represent OSS internal implementations.
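
For instance (illustrative only, with made-up names), a deterministic order that puts the spark-provided UTF8_* collations first does not need Comparable at all:

```scala
// Sort key: spark-provided collations first, then ICU collations, alphabetically within each group.
case class CollationInfo(name: String, provider: String)

val all = Seq(
  CollationInfo("ff_Adlm", "icu"),
  CollationInfo("UTF8_BINARY", "spark"),
  CollationInfo("UTF8_LCASE", "spark"))

val ordered = all.sortBy(c => (if (c.provider == "spark") 0 else 1, c.name))
// => UTF8_BINARY, UTF8_LCASE, ff_Adlm
```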

Contributor Author

Okay.

@@ -442,6 +442,7 @@ Below is a list of all the keywords in Spark SQL.
|CODEGEN|non-reserved|non-reserved|non-reserved|
|COLLATE|reserved|non-reserved|reserved|
|COLLATION|reserved|non-reserved|reserved|
|COLLATIONS|reserved|non-reserved|reserved|
Contributor

When we look at MySQL, they use SHOW COLLATION, and PostgreSQL uses the pg_collation catalog, so I would argue the COLLATION keyword is sufficient; let's keep the COLLATIONS keyword free.

Contributor Author

Okay.

The reason for using COLLATIONS is that I want to follow some of Spark's unwritten rules, e.g.:

| SHOW TABLES ((FROM | IN) identifierReference)?

| SHOW TBLPROPERTIES table=identifierReference

| SHOW COLUMNS (FROM | IN) table=identifierReference

Of course, if we decide to use COLLATION, I am fine with that too.

Contributor

I'll follow up on this a bit later; let's leave it as COLLATIONS for now, and then we can simply revert the keyword addition if we decide to switch to COLLATION.

Contributor Author

Okay

checkAnswer(sql("SHOW COLLATIONS LIKE '*UTF8_BINARY*'"),
Row("UTF8_BINARY", "spark", "1.0", true, true, false))
checkAnswer(sql("SHOW COLLATIONS '*zh_Hant_HKG*'"),
Seq(Row("zh_Hant_HKG", "icu", "153.120.0.0", false, false, false),
Contributor

This version seems weird; I would say it should contain 75.1 as the version of the ICU library. @nikolamand-db do you have any more info on whether this is expected?

Contributor Author

Updated.

@@ -918,4 +967,8 @@ public static String getClosestSuggestionsOnInvalidName(

return String.join(", ", suggestions);
}

public static List<Collation> fetchAllCollations() {
Contributor

This operation seems a bit too expensive. We always build the whole table and then do a LIKE on it. We need to do something similar to SHOW TABLES, where we have the pattern available, because we do not want to compute collation information that the LIKE operator is not concerned with. It also feels pretty odd to call fetchAllCollations in many places like this; that is not as expensive as building the whole table, but we are also allocating an ArrayList multiple times.
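
Roughly what I have in mind (just a sketch, not suggested code; filterCollationNames is a made-up helper, though StringUtils.filterPattern is what the SHOW TABLES path already uses):

```scala
import org.apache.spark.sql.catalyst.util.StringUtils

// Filter collation names with the SHOW-style pattern first, so per-collation
// details are only fetched for the names that survive the filter.
def filterCollationNames(allNames: Seq[String], pattern: Option[String]): Seq[String] =
  pattern.map(StringUtils.filterPattern(allNames, _)).getOrElse(allNames)

// filterCollationNames(Seq("UTF8_BINARY", "UTF8_LCASE", "ff_Adlm"), Some("*UTF8*"))
// => Seq("UTF8_BINARY", "UTF8_LCASE")
```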

Contributor Author

Let me try to think of a better way.

@panbingkun
Contributor Author

Apart from SQL API, we need to support other APIs as well, which should be used by calling Session.catalog.collation. Because of this, your approach might need to be reworked a bit.

@mihailom-db
Thank you very much for your (very detailed) reply. Okay, let's work together to complete it.
I only have one small question: do we use a single command to display all the fields in the above list,

1. SHOW COLLATIONS LIKE ...

|COLLATION_CATALOG|COLLATION_SCHEMA|COLLATION_NAME|LANGUAGE|COUNTRY|ACCENT_SENSITIVITY|CASE_SENSITIVITY|PAD_ATTRIBUTE|ICU_VERSION|
|---|---|---|---|---|---|---|---|---|
|SYSTEM|BUILTIN|UTF8_BINARY|NULL|NULL|ACCENT_SENSITIVE|CASE_SENSITIVE|NO_PAD|NULL|
|SYSTEM|BUILTIN|UTF8_LCASE|NULL|NULL|ACCENT_SENSITIVE|CASE_INSENSITIVE|NO_PAD|NULL|
|...|...|...|...|...|...|...|...|...|

2. Or do we use two commands, e.g.:

A. SHOW COLLATIONS LIKE ...

COLLATION_NAME
UTF8_BINARY
UTF8_LCASE
...

B. DESCRIBE COLLATION UTF8_BINARY

|COLLATION_CATALOG|COLLATION_SCHEMA|COLLATION_NAME|LANGUAGE|COUNTRY|ACCENT_SENSITIVITY|CASE_SENSITIVITY|PAD_ATTRIBUTE|ICU_VERSION|
|---|---|---|---|---|---|---|---|---|
|SYSTEM|BUILTIN|UTF8_BINARY|NULL|NULL|ACCENT_SENSITIVE|CASE_SENSITIVE|NO_PAD|NULL|

Which of the above is more suitable?

@mihailom-db
Contributor

I believe for now we agreed to have only SHOW COLLATION(S) as a command, and then add support for both LIKE and ILIKE operators for searching. But it is enough to have LIKE as a start, and ILIKE can be done as an addition later.

@panbingkun
Contributor Author

I believe for now we agreed to have only SHOW COLLATION(S) as a command, and then add support for both LIKE and ILIKE operators for searching. But it is enough to have LIKE as a start, and ILIKE can be done as an addition later.

Okay.

@panbingkun
Contributor Author

@mihailom-db All suggestions have been addressed and verified locally.

@panbingkun panbingkun requested a review from mihailom-db July 23, 2024 03:08
@mihailom-db
Contributor

Hi @panbingkun, this is starting to look like what we want as a result. Thanks for taking the initiative. Apart from the SQL command for SHOW COLLATIONS, we need to get the other APIs working. The way we usually list tables and databases for the other APIs is through catalog information, and this is the point where we need to get the initial design right. I would argue that catalog and schema information should not be stored per collation, but only inferred for the purpose of listing here. This becomes crucially important later when we introduce user-defined collations, as those collations will not be available in the list of ICU locales. I will take one more look into how this could be done completely and will get back to you.

@panbingkun
Contributor Author

Hi @panbingkun, this is starting to look like what we want as a result. Thanks for taking the initiative. Apart from the SQL command for SHOW COLLATIONS, we need to get the other APIs working. The way we usually list tables and databases for the other APIs is through catalog information, and this is the point where we need to get the initial design right. I would argue that catalog and schema information should not be stored per collation, but only inferred for the purpose of listing here. This becomes crucially important later when we introduce user-defined collations, as those collations will not be available in the list of ICU locales. I will take one more look into how this could be done completely and will get back to you.

@mihailom-db
Thank you very much for your detailed explanation and response!
Looking forward to the results of your further look!

COLLATION_SCHEMA,
collationName(),
ICULocaleMap.get(locale).getLanguage(),
ICULocaleMap.get(locale).getCountry(),
Contributor

Could we possibly change this to getDisplayCountry() and getDisplayLanguage()? The idea is to later be able to look up locales not only by name, but also by country name, i.e. Greece, France...
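
For reference, a tiny sketch of the difference (assuming ICU4J's ULocale, which the collation support already builds on; the printed values are illustrative):

```scala
import com.ibm.icu.util.ULocale

val locale = new ULocale("fr_FR")
println(locale.getLanguage)         // "fr"     -- ISO language code
println(locale.getCountry)          // "FR"     -- ISO country code
println(locale.getDisplayLanguage)  // "French" -- human-readable name, searchable
println(locale.getDisplayCountry)   // "France"
```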

Contributor Author

Sure

@@ -734,6 +865,10 @@ public CollationIdentifier identifier() {
public static final String PROVIDER_ICU = "icu";
public static final List<String> SUPPORTED_PROVIDERS = List.of(PROVIDER_SPARK, PROVIDER_ICU);

private static final String COLLATION_CATALOG = "SYSTEM";
private static final String COLLATION_SCHEMA = "BUILTIN";
Contributor

I do not think this information should be stored here; could you take a look at ShowTablesCommand? What we do there is get the catalog and then work on from there. We could do something similar here, as UDF collations will in the future be defined with catalog and schema fields and will be stored somewhere. This is a future problem, but for now, let's try to keep the code as close as possible to what we will need in the future.

Contributor

Ideally we would love to have a buffer at the session.catalog level called collations, and then store any user-defined collations in there.
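
Purely as a speculative sketch of that idea (none of these names exist in Spark today; built-in collations would still be inferred as SYSTEM.BUILTIN rather than stored):

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical session-level buffer for future user-defined collations.
final case class CollationIdent(catalog: String, schema: String, name: String)

class SessionCollationRegistry {
  private val userDefined = TrieMap.empty[CollationIdent, String] // ident -> definition

  def register(ident: CollationIdent, definition: String): Unit =
    userDefined.put(ident, definition)

  def list(): Seq[CollationIdent] = userDefined.keys.toSeq
}
```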

Contributor Author

Okay, let me think about it again.

@mihailom-db
Contributor

Hi @panbingkun, I left some comments for you, sorry for the delay. I believe you are really close to the point of what we want from the SHOW COLLATIONS command, and if you happen to have any questions feel free to ping me. Thanks again for taking this initiative.

@panbingkun
Contributor Author

Hi @panbingkun, I left some comments for you, sorry for the delay. I believe you are really close to the point of what we want from the SHOW COLLATIONS command, and if you happen to have any questions feel free to ping me. Thanks again for taking this initiative.

Thank you very much for your patient review; let me update based on your suggestions.
(Also, this is not the longest-running PR I have worked on. I remember another PR that spanned 1 year and 3 months, joking, 😄)

@mihailom-db
Contributor

@cloud-fan @MaxGekk could you take a look at this PR?

Contributor

@uros-db uros-db left a comment

lgtm

Member

@MaxGekk MaxGekk left a comment

@panbingkun Could you resolve conflicts, please?

@panbingkun
Contributor Author

@panbingkun Could you resolve conflicts, please?

Done, thanks!

Member

@MaxGekk MaxGekk left a comment

@panbingkun It seems the test failure is related to your changes:

[info] - SPARK-43119: Get SQL Keywords *** FAILED *** (11 milliseconds)
[info]   "...LLATE,COLLATION,COLL[ATIONS,COLLECTION],COLUMN,COLUMNS,COMM..." did not equal "...LLATE,COLLATION,COLL[ECTION,COLLECTIONS],COLUMN,COLUMNS,COMM..." (ThriftServerWithSparkContextSuite.scala:217)
[info]   Analysis:
[info]   "...LLATE,COLLATION,COLL[ATIONS,COLLECTION],COLUMN,COLUMNS,COMM..." -> "...LLATE,COLLATION,COLL[ECTION,COLLECTIONS],COLUMN,COLUMNS,COMM..."

Could you fix it, please.

@panbingkun
Contributor Author

@panbingkun It seems the test failure is related to your changes:

[info] - SPARK-43119: Get SQL Keywords *** FAILED *** (11 milliseconds)
[info]   "...LLATE,COLLATION,COLL[ATIONS,COLLECTION],COLUMN,COLUMNS,COMM..." did not equal "...LLATE,COLLATION,COLL[ECTION,COLLECTIONS],COLUMN,COLUMNS,COMM..." (ThriftServerWithSparkContextSuite.scala:217)
[info]   Analysis:
[info]   "...LLATE,COLLATION,COLL[ATIONS,COLLECTION],COLUMN,COLUMNS,COMM..." -> "...LLATE,COLLATION,COLL[ECTION,COLLECTIONS],COLUMN,COLUMNS,COMM..."

Could you fix it, please.

Sorry, that was my mistake. I made a hasty correction, but this time it should be correct.
Let's wait for GA, thank you very much!

@MaxGekk
Member

MaxGekk commented Sep 12, 2024

+1, LGTM. Merging to master.
Thank you, @panbingkun and @mihailom-db @uros-db for review.

@MaxGekk MaxGekk closed this in 0f4d289 Sep 12, 2024
@panbingkun
Contributor Author

+1, LGTM. Merging to master. Thank you, @panbingkun and @mihailom-db @uros-db for review.

Thanks all @MaxGekk @mihailom-db @uros-db ❤️

@cloud-fan
Contributor

Hi all, sorry for the late review as I've been struggling with the user-facing API. I know we have a lot of SHOW commands already, but there are known issues:

  • It's not standard, but we don't have a choice as Spark doesn't have information schema yet.
  • The SHOW ... LIKE ... semantic is different from the LIKE operator. It's also different from Hive's behavior which we copied from.
  • No corresponding DataFrame API.
  • We can't perform transformations with the output, something like SELECT ... FROM (SHOW ...) is not allowed.

Building information schema is a big effort, but for this SHOW COLLATIONS feature, can we add a builtin TVF like SELECT ... FROM all_collations()? Also cc @srielau
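
To illustrate the difference (purely an example; the all_collations() name and the LANGUAGE column are hypothetical here):

```scala
// Unlike SHOW ..., a TVF's output can be used as a normal relation and composed further:
spark.sql("SELECT * FROM all_collations() WHERE LANGUAGE LIKE '%Chinese%'").show()
```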

@cloud-fan
Contributor

BTW we already have a TVF to get all the SQL keywords

@ExpressionDescription(
  usage = """_FUNC_() - Get Spark SQL keywords""",
  examples = """
    Examples:
      > SELECT * FROM _FUNC_() LIMIT 2;
       ADD  false
       AFTER  false
  """,
  since = "3.5.0",
  group = "generator_funcs")
case class SQLKeywords() extends LeafExpression with Generator with CodegenFallback {
  override def elementSchema: StructType = new StructType()
    .add("keyword", StringType, nullable = false)
    .add("reserved", BooleanType, nullable = false)

  override def eval(input: InternalRow): IterableOnce[InternalRow] = {
    val reservedList = getReservedList()
    keywords.zip(reservedList).map { case (keyword, isReserved) =>
      InternalRow(UTF8String.fromString(keyword), isReserved)
    }
  }

  override def prettyName: String = "sql_keywords"
}
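
A hypothetical collations() generator could follow the same shape (this is just a sketch; fetchAllCollations and the Collation field names used below are assumptions, and the schema is trimmed for illustration):

```scala
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Generator, LeafExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.catalyst.util.CollationFactory
import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String

case class Collations() extends LeafExpression with Generator with CodegenFallback {
  override def elementSchema: StructType = new StructType()
    .add("name", StringType, nullable = false)
    .add("provider", StringType, nullable = false)

  override def eval(input: InternalRow): IterableOnce[InternalRow] = {
    // Emit one row per built-in collation, mirroring SQLKeywords above.
    CollationFactory.fetchAllCollations().asScala.map { c =>
      InternalRow(UTF8String.fromString(c.collationName), UTF8String.fromString(c.provider))
    }
  }

  override def prettyName: String = "collations"
}
```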

@panbingkun
Contributor Author

Hi all, sorry for the late review as I've been struggling with the user-facing API. I know we have a lot of SHOW commands already, but there are known issues:

  • It's not standard, but we don't have a choice as Spark doesn't have information schema yet.
  • The SHOW ... LIKE ... semantic is different from the LIKE operator. It's also different from Hive's behavior which we copied from.
  • No corresponding DataFrame API.
  • We can't perform transformations with the output, something like SELECT ... FROM (SHOW ...) is not allowed.

Building information schema is a big effort, but for this SHOW COLLATIONS feature, can we add a builtin TVF like SELECT ... FROM all_collations()? Also cc @srielau

Okay, let's investigate carefully first.

@panbingkun
Contributor Author

I think it's possible, let me try it out.

@panbingkun
Contributor Author

about TVF all_collations: #48087

@MaxGekk
Member

MaxGekk commented Sep 12, 2024

@cloud-fan
Contributor

It might not be useful if our SHOW ... LIKE ... semantic is different from all other systems...

cloud-fan pushed a commit that referenced this pull request Sep 16, 2024
…LLATIONS` command

### What changes were proposed in this pull request?
The pr aims to
- introduce `TVF` `collations()`.
- remove the `SHOW COLLATIONS` command.

### Why are the changes needed?
Based on cloud-fan's suggestion: #47364 (comment)
I believe that after this, we can do many things based on it, such as `filtering` and `querying` based on `LANGUAGE` or `COUNTRY`, etc. eg:
```sql
SELECT * FROM collations() WHERE LANGUAGE like '%Chinese%';
```

### Does this PR introduce _any_ user-facing change?
Yes, provide a new TVF `collations()` for end-users.

### How was this patch tested?
- Add new UT.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48087 from panbingkun/SPARK-49611.

Lead-authored-by: panbingkun <panbingkun@baidu.com>
Co-authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…w all collations

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…LLATIONS` command

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
…w all collations

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
…LLATIONS` command
