[SPARK-49611][SQL] Introduce TVF collations() & remove the SHOW COLLATIONS command #48087

Closed
wants to merge 13 commits into from

Contributor

@panbingkun panbingkun commented Sep 12, 2024

What changes were proposed in this pull request?

This PR aims to:

  • introduce TVF collations().
  • remove the SHOW COLLATIONS command.

Why are the changes needed?

Based on @cloud-fan's suggestion: #47364 (comment)
I believe that with a TVF we can do much more, such as filtering and querying on LANGUAGE or COUNTRY. For example:

SELECT * FROM collations() WHERE LANGUAGE like '%Chinese%';

Does this PR introduce any user-facing change?

Yes. It provides a new TVF, collations(), for end users.

How was this patch tested?

  • Added new unit tests (UT).
  • Passed GitHub Actions (GA).

Was this patch authored or co-authored using generative AI tooling?

No.

@panbingkun panbingkun marked this pull request as ready for review September 12, 2024 11:48
Contributor

@uros-db uros-db left a comment

perhaps we should get @srielau's opinion here (incl. link to original PR: #47364)

Contributor

@mihailom-db mihailom-db left a comment

LGTM, @cloud-fan could you review/merge

@@ -1158,6 +1158,7 @@ object TableFunctionRegistry {
generator[PosExplode]("posexplode"),
generator[PosExplode]("posexplode_outer", outer = true),
generator[Stack]("stack"),
generator[AllCollations]("all_collations"),
Contributor

given we have sql_keywords, shall we call it string_collations?

Contributor Author

Okay

Contributor

nit: this sounds odd to me. The name string_collations suggests that collations might exist for some other type. For keywords, it is more plausible that other kinds of keywords exist (Python, Scala, ...). @cloud-fan

Contributor

oh, not only string type? then all_collations is good

Contributor Author

Okay, let's restore it to all_collations.

Contributor

all_collation is fine although I'm not sure why all is necessary. But no strong feelings.

Contributor Author

It's all_collation, not all_collations, right?

@cloud-fan
Contributor

Can we also remove the SHOW COLLATIONS command in this PR? #47364 (comment) I'd like to avoid confusion, as Spark has very special LIKE semantics in SHOW commands (different from Spark's own LIKE operator for string matching).
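
To make the difference between the two pattern styles concrete, here is a minimal sketch in Java. It is not Spark's implementation. It assumes a simplified model in which SHOW-style patterns treat `*` as any sequence of characters, treat `|` as a separator between alternatives, and match case-insensitively, while the LIKE operator treats `%` as any sequence and `_` as exactly one character and is case-sensitive under the default collation. The class and method names are hypothetical, and escape handling is omitted.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of Spark).
final class PatternSemantics {

    // SHOW-style pattern (simplified assumption): '*' matches any sequence,
    // '|' separates alternatives, and matching ignores case.
    static boolean showStyleMatches(String pattern, String name) {
        for (String sub : pattern.trim().split("\\|")) {
            StringBuilder regex = new StringBuilder();
            for (char c : sub.toCharArray()) {
                if (c == '*') {
                    regex.append(".*");
                } else {
                    // Every other character is matched literally in this sketch.
                    regex.append(Pattern.quote(String.valueOf(c)));
                }
            }
            Pattern p = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
            if (p.matcher(name).matches()) {
                return true;
            }
        }
        return false;
    }

    // LIKE-style pattern (simplified assumption): '%' matches any sequence,
    // '_' matches exactly one character, and matching is case-sensitive.
    static boolean likeMatches(String pattern, String value) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '%') {
                regex.append(".*");
            } else if (c == '_') {
                regex.append(".");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(value).matches();
    }

    public static void main(String[] args) {
        String name = "zh_Hant_HKG";
        // The same intent needs different wildcards in each style.
        System.out.println(showStyleMatches("*zh_hant*", name));
        System.out.println(likeMatches("%zh_Hant%", name));
        // A '*' is just a literal character for LIKE in this model.
        System.out.println(likeMatches("*zh_Hant*", name));
    }
}
```

Under this model, a pattern written for one style does not carry over to the other, which is the kind of confusion the comment above wants to avoid.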

@panbingkun
Contributor Author

Can we also remove the SHOW COLLATIONS command in this PR? #47364 (comment) I'd like to avoid confusion, as Spark has very special LIKE semantics in SHOW commands (different from Spark's own LIKE operator for string matching).

Okay, let me handle it in this PR as well.

@github-actions github-actions bot added the DOCS label Sep 13, 2024
@panbingkun panbingkun changed the title [SPARK-49611][SQL] Introduce TVF all_collations() [SPARK-49611][SQL] Introduce TVF string_collations() and remove the SHOW COLLATIONS command Sep 13, 2024
@panbingkun panbingkun changed the title [SPARK-49611][SQL] Introduce TVF string_collations() and remove the SHOW COLLATIONS command [SPARK-49611][SQL] Introduce TVF string_collations() and remove the SHOW COLLATIONS command Sep 13, 2024
Row("SYSTEM", "BUILTIN", "UTF8_BINARY", null, null,
"ACCENT_SENSITIVE", "CASE_SENSITIVE", "NO_PAD", null))

- checkAnswer(sql("SHOW COLLATIONS '*zh_Hant_HKG*'"),
+ checkAnswer(sql("SELECT * FROM string_collations() WHERE COLLATION_NAME LIKE '%zh_Hant_HKG%'"),
Contributor

Could you please add a test with, for example, WHERE country COLLATE UTF8_LCASE LIKE '%china%', so that we know we can actually search on other fields? That is one of the main reasons we added this functionality over SHOW COLLATIONS.

Contributor Author

Sure

Contributor Author

Done!

Contributor

@mihailom-db mihailom-db left a comment

LGTM apart from small suggestions

Contributor

@mihailom-db mihailom-db left a comment

LGTM

@panbingkun panbingkun changed the title [SPARK-49611][SQL] Introduce TVF string_collations() and remove the SHOW COLLATIONS command [SPARK-49611][SQL] Introduce TVF all_collations() and remove the SHOW COLLATIONS command Sep 13, 2024
Comment on lines 628 to 629
SYSTEM BUILTIN UTF8_BINARY NULL NULL ACCENT_SENSITIVE CASE_SENSITIVE NO_PAD NULL
SYSTEM BUILTIN UTF8_LCASE NULL NULL ACCENT_SENSITIVE CASE_INSENSITIVE NO_PAD NULL
Member

Is this output deterministic?

Contributor Author

Yes, UTF8_BINARY and UTF8_LCASE are always listed first.

Contributor

It should be: listCollationMeta puts the UTF8_* collations first.

Member

I would replace this LIMIT with a WHERE clause that always returns one row, so that the test doesn't depend on row order or on how LIMIT behaves.
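
The concern about LIMIT can be sketched with plain Java collections. This is an illustration under assumptions, not Spark code: the class and method names are hypothetical, and a list of collation names stands in for the rows returned by the TVF. Taking the first row depends on the order in which rows arrive, while filtering on a unique name does not.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a query result, for illustration only.
final class OrderIndependentCheck {

    // Mimics LIMIT n: the result depends on the order of the input rows.
    static List<String> limit(List<String> rows, int n) {
        return rows.stream().limit(n).collect(Collectors.toList());
    }

    // Mimics WHERE name = '...': the result is the same for any input order.
    static List<String> whereName(List<String> rows, String name) {
        return rows.stream().filter(name::equals).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The same rows in two different orders.
        List<String> a = Arrays.asList("UTF8_BINARY", "UTF8_LCASE", "zh_Hant_HKG");
        List<String> b = Arrays.asList("zh_Hant_HKG", "UTF8_LCASE", "UTF8_BINARY");

        // LIMIT-style check: differs between the two orderings.
        System.out.println(limit(a, 1).equals(limit(b, 1)));
        // WHERE-style check: identical for both orderings.
        System.out.println(whereName(a, "UTF8_BINARY").equals(whereName(b, "UTF8_BINARY")));
    }
}
```

A test built on the filter keeps passing even if the listing order changes later.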

Contributor Author

Updated.

@@ -18,6 +18,7 @@
package org.apache.spark.sql.catalyst.expressions

import scala.collection.mutable
import scala.jdk.CollectionConverters.CollectionHasAsScala
Contributor

why this import?

Contributor Author

Because it is needed to convert a Java List to a Scala Iterable.

Contributor

@uros-db uros-db left a comment

lgtm

UTF8String.fromString(m.catalog),
UTF8String.fromString(m.schema),
UTF8String.fromString(m.collationName),
if (m.language != null) UTF8String.fromString(m.language) else null,
Member

fromString() already does the null check, so there is no need to check again here. See:

  public static UTF8String fromString(String str) {
    return str == null ? null : fromBytes(str.getBytes(StandardCharsets.UTF_8));
  }
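
The point can be checked with a small stand-in. This sketch assumes a hypothetical `fromString` that mirrors the null handling quoted above but returns a byte array instead of a UTF8String, so it runs without Spark on the classpath. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in, for illustration only (not Spark's UTF8String).
final class NullSafeConversion {

    // Mirrors the quoted null handling: a null input yields a null result.
    static byte[] fromString(String str) {
        return str == null ? null : str.getBytes(StandardCharsets.UTF_8);
    }

    // Caller with the extra guard, as in the code under review.
    static byte[] withGuard(String s) {
        return s != null ? fromString(s) : null;
    }

    // Caller without the guard, as the reviewer suggests.
    static byte[] withoutGuard(String s) {
        return fromString(s);
    }

    public static void main(String[] args) {
        // Both callers agree on null and on non-null input.
        System.out.println(withGuard(null) == withoutGuard(null));
        System.out.println(Arrays.equals(withGuard("zh"), withoutGuard("zh")));
    }
}
```

Because the converter already maps null to null, the guarded and unguarded callers behave the same, so the guard only adds noise.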

Contributor Author

Updated.

override def elementSchema: StructType = new StructType()
.add("COLLATION_CATALOG", StringType, nullable = false)
.add("COLLATION_SCHEMA", StringType, nullable = false)
.add("COLLATION_NAME", StringType, nullable = false)
Contributor

do we really need the COLLATION prefix? Can it just be CATALOG, SCHEMA and NAME?

Contributor Author

I agree, because sql_keywords doesn't seem to use a prefix either.

Contributor Author

@panbingkun panbingkun Sep 13, 2024

Also, I have a question: do we need to register it as a SQL built-in function as well? I noticed that some functions in FunctionRegistry are registered both as a TVF and as a SQL built-in function, e.g.:

expression[Inline]("inline"),
expressionGeneratorOuter[Inline]("inline_outer"),

generator[Inline]("inline"),
generator[Inline]("inline_outer", outer = true),

so we can use it as follows:

SELECT all_collations();

or

SELECT * FROM all_collations();

Contributor

Using generator functions in SELECT is not recommended. We only keep them for backward compatibility.

Contributor Author

Thank you, I see.

@srielau
Contributor

srielau commented Sep 13, 2024 via email

@panbingkun panbingkun changed the title [SPARK-49611][SQL] Introduce TVF all_collations() and remove the SHOW COLLATIONS command [SPARK-49611][SQL] Introduce TVF all_collations() & remove the SHOW COLLATIONS command Sep 13, 2024
@cloud-fan
Contributor

@srielau @mihailom-db What are your thoughts about the function name? string_collations is also fine to me since collation only makes sense for string.

@mihailom-db
Contributor

I do not have a strong preference, but to me something like collation(s)_info seems most appropriate, since this TVF does not only generate collation names but also provides more information about the given collations. @cloud-fan @srielau

@panbingkun
Contributor Author

Or call it collations() ?

Contributor

@uros-db uros-db left a comment

+1 for collations() - seems most appropriate to me, no unnecessary complications

@MaxGekk
Member

MaxGekk commented Sep 14, 2024

@panbingkun Which function name do other systems use?

@panbingkun
Contributor Author

panbingkun commented Sep 14, 2024

@panbingkun Which function name do other systems use?

There doesn't seem to be much to draw on: most of them use a SQL command, and only MS SQL Server seems to use a TVF.

| Database | Support | Name | Link |
|---|---|---|---|
| MySQL | SHOW COLLATION | COLLATION | https://dev.mysql.com/doc/refman/8.4/en/show-collation.html |
| MariaDB | SHOW COLLATION LIKE 'latin2%'; | COLLATION | https://mariadb.com/kb/en/show-collation/ |
| Oracle | select value from v$nls_valid_values where parameter = 'SORT'; | | https://stackoverflow.com/questions/74795796/how-to-get-all-supported-collate-in-oracle-database https://docs.oracle.com/en/database/oracle/oracle-database/19/nlspg/appendix-A-locale-data.html#GUID-D2FCFD55-EDC3-473F-9832-AAB564457830 https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/V-NLS_VALID_VALUES.html |
| PingCap/TiDB | SHOW COLLATION | COLLATION | https://docs.pingcap.com/tidb/stable/sql-statement-show-collation |
| Firebird | SHOW COLLATIONS | COLLATIONS | https://www.firebirdsql.org/file/documentation/html/en/firebirddocs/isql/firebird-isql.html#isql-show-collations |
| MS SQL Server | SELECT name, description FROM sys.fn_helpcollations(); | ...collations() | https://learn.microsoft.com/en-us/sql/relational-databases/collations/view-collation-information?view=sql-server-ver16 |
| PostgreSQL | SELECT * FROM pg_collation; | | https://www.postgresql.org/docs/current/catalog-pg-collation.html |

@srielau
Contributor

srielau commented Sep 14, 2024 via email

@panbingkun
Contributor Author

panbingkun commented Sep 14, 2024

Oracle has SHOW COLLATION?? Do you have link?

It was my bad, I misread Oracle.
Additionally, I have updated links and screenshots for all databases.

@cloud-fan
Contributor

let's go with collations() then.

@panbingkun
Contributor Author

let's go with collations() then.

Updated, thanks!

Contributor

@mihailom-db mihailom-db left a comment

LGTM, collations() seems like the most appropriate choice

@panbingkun panbingkun changed the title [SPARK-49611][SQL] Introduce TVF all_collations() & remove the SHOW COLLATIONS command [SPARK-49611][SQL] Introduce TVF collations() & remove the SHOW COLLATIONS command Sep 15, 2024
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 2113f10 Sep 16, 2024
@panbingkun
Contributor Author

Thanks all, @mihailom-db @uros-db @MaxGekk @cloud-fan @srielau

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…LLATIONS` command

### What changes were proposed in this pull request?
The pr aims to
- introduce `TVF` `collations()`.
- remove the `SHOW COLLATIONS` command.

### Why are the changes needed?
Based on cloud-fan's suggestion: apache#47364 (comment)
I believe that after this, we can do many things based on it, such as `filtering` and `querying` based on `LANGUAGE` or `COUNTRY`, etc. eg:
```sql
SELECT * FROM collations() WHERE LANGUAGE like '%Chinese%';
```

### Does this PR introduce _any_ user-facing change?
Yes, provide a new TVF `collations()` for end-users.

### How was this patch tested?
- Add new UT.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48087 from panbingkun/SPARK-49611.

Lead-authored-by: panbingkun <panbingkun@baidu.com>
Co-authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
6 participants