-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48906][SQL] Introduce SHOW COLLATIONS LIKE ...
syntax to show all collations
#47364
Changes from 4 commits
1fde4b4
bc5ce3a
7c76601
395e40a
4dfb5e0
a8ed3e3
e30b5a6
0411f6d
5928f59
15cf451
d7a7d81
c0a9808
e48361e
f3ac8ad
270ce4f
f731ce1
37272d9
fd8803f
f96a597
10c22dc
887c468
e51542b
6855b6a
a6870bb
4c567d6
008b9dc
e2c17fc
ee805c9
1a85852
220c5a0
e61f3df
46be9ca
cd31c31
83f4e81
a01b786
bb8760e
261f3c3
949e16a
dbeeb62
811196e
dd966fa
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -91,7 +91,7 @@ public Optional<String> getVersion() { | |
/** | ||
* Entry encapsulating all information about a collation. | ||
*/ | ||
public static class Collation { | ||
public static class Collation implements Comparable<Collation> { | ||
public final String collationName; | ||
public final String provider; | ||
public final Collator collator; | ||
|
@@ -174,6 +174,11 @@ public Collation( | |
} | ||
} | ||
|
||
@Override | ||
public int compareTo(Collation other) { | ||
return this.collationName.compareTo(other.collationName); | ||
} | ||
|
||
/** | ||
* Collation ID is defined as 32-bit integer. We specify binary layouts for different classes of | ||
* collations. Classes of collations are differentiated by most significant 3 bits (bit 31, 30 | ||
|
@@ -342,6 +347,13 @@ private static int collationNameToId(String collationName) throws SparkException | |
} | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think third-party implementation of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That is true actually. I will do a bit more of testing on how we can handle these two, as to me it seems a bit weird to only have it in meta, but not in the Collation itself. Also I feel like if we end up having calls for catalog and schema we need to think of the case where we add user defined collations. Will provide update by the end of the day. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, thanks! |
||
protected abstract Collation buildCollation(); | ||
|
||
public static List<Collation> fetchAllCollations() { | ||
List<Collation> collations = new ArrayList<>(); | ||
collations.addAll(CollationSpecUTF8.fetchAllCollations()); | ||
collations.addAll(CollationSpecICU.fetchAllCollations()); | ||
return collations; | ||
} | ||
} | ||
|
||
private static class CollationSpecUTF8 extends CollationSpec { | ||
|
@@ -428,6 +440,13 @@ protected Collation buildCollation() { | |
/* supportsLowercaseEquality = */ true); | ||
} | ||
} | ||
|
||
public static List<Collation> fetchAllCollations() { | ||
List<Collation> collations = new ArrayList<>(); | ||
collations.add(UTF8_BINARY_COLLATION); | ||
collations.add(UTF8_LCASE_COLLATION); | ||
return collations; | ||
} | ||
} | ||
|
||
private static class CollationSpecICU extends CollationSpec { | ||
|
@@ -704,6 +723,36 @@ private String collationName() { | |
} | ||
return builder.toString(); | ||
} | ||
|
||
private static void fetchCollation(List<Collation> collations, String collationName) { | ||
try { | ||
int collationId = CollationSpecICU.collationNameToId( | ||
collationName, collationName.toUpperCase()); | ||
Collation collation = CollationSpecICU.fromCollationId(collationId).buildCollation(); | ||
collations.add(collation); | ||
} catch (SparkException ignored) { | ||
// ignore | ||
} | ||
} | ||
|
||
public static List<Collation> fetchAllCollations() { | ||
List<Collation> collations = new ArrayList<>(); | ||
for (Map.Entry<String, Integer> entry: ICULocaleToId.entrySet()) { | ||
String locale = entry.getKey(); | ||
// CaseSensitivity.CS + AccentSensitivity.AS | ||
fetchCollation(collations, locale); | ||
|
||
// CaseSensitivity.CI + AccentSensitivity.AS | ||
fetchCollation(collations, locale + "_CI"); | ||
|
||
// CaseSensitivity.CS + AccentSensitivity.AI | ||
fetchCollation(collations, locale + "_AI"); | ||
|
||
// CaseSensitivity.CI + AccentSensitivity.AI | ||
fetchCollation(collations, locale + "_CI_AI"); | ||
} | ||
return collations; | ||
} | ||
} | ||
|
||
/** | ||
|
@@ -918,4 +967,8 @@ public static String getClosestSuggestionsOnInvalidName( | |
|
||
return String.join(", ", suggestions); | ||
} | ||
|
||
public static List<Collation> fetchAllCollations() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This operation seems a bit too expensive. We always build the whole table and then do a like on it. We need to do something similar to show tables where we have the pattern, as we do not want to ask for collation information if like operator is not concerned with it. Also it feels pretty weird to have There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let me try to think of a better way. |
||
return Collation.CollationSpec.fetchAllCollations(); | ||
} | ||
} |
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
|
@@ -442,6 +442,7 @@ Below is a list of all the keywords in Spark SQL. | |||||||
|CODEGEN|non-reserved|non-reserved|non-reserved| | ||||||||
|COLLATE|reserved|non-reserved|reserved| | ||||||||
|COLLATION|reserved|non-reserved|reserved| | ||||||||
|COLLATIONS|reserved|non-reserved|reserved| | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When we look at MySQL, they use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, The reason for using spark/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 Line 223 in 42d1479
spark/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 Line 227 in 42d1479
spark/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 Line 229 in 42d1479
Of course, if we decide to use COLLATION , I am fine too.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will follow-up on this a bit later, let's leave it as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay |
||||||||
|COLLECTION|non-reserved|non-reserved|non-reserved| | ||||||||
|COLUMN|reserved|non-reserved|reserved| | ||||||||
|COLUMNS|non-reserved|non-reserved|non-reserved| | ||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -48,6 +48,7 @@ CLOSE | |
COALESCE | ||
COLLATE | ||
COLLATION | ||
COLLATIONS | ||
COLLECT | ||
COLUMN | ||
COMMIT | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1116,4 +1116,18 @@ class SparkSqlAstBuilder extends AstBuilder { | |
withIdentClause(ctx.identifierReference(), UnresolvedNamespace(_)), | ||
cleanedProperties) | ||
} | ||
|
||
/** | ||
* Create a [[ShowCollationsCommand]] command. | ||
* Expected format: | ||
* {{{ | ||
* SHOW COLLATIONS (LIKE? pattern)?; | ||
* }}} | ||
*/ | ||
override def visitShowCollations(ctx: ShowCollationsContext): LogicalPlan = withOrigin(ctx) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you take a look at what There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, could we move than this function to AstBuilder, where visitShowFunctions is as well There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we need to define the syntax of
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can do two things here. Follow completely functions implementation, but just throw a new error that user defined collations are not supported, or just keep SHOW COLLATIONS (LIKE? pattern=stringLit)? and make sure we implement the listing of collation in a similar way as functions. Second approach might be more clear for now, as we would just throw parse exception if they tried to use some user defined collation specific implementation. ((FROM | IN) ns=identifierReference)? part of the parsing is a good add-on, but for now we are not going to use it. In showFunctions we only use identifier? and ((FROM | IN) ns=identifierReference)? when we are listing udfs and temporary functions from specific catalogs/schemas. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you ask me I would keep SHOW COLLATIONS (LIKE? pattern=stringLit)? but try to implement the underlying stuff as for functions. The addition of other features will not be that hard if we follow the functions implementation now. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let me implement a version first ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Because why use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with that. What do you think about using ShowCollations as a new LogicalPlan node and then resolve that to either ShowCollationsCommand or ShowCollationsExec, as ShowCollationsExec represents a v2 path of execution. (Something similar was done for Functions) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure if this is reasonable, but based on the logic that has already been implemented, is it sufficient? I would like to ask @MaxGekk , what do you think? I would like to hear your advice. |
||
if (!SQLConf.get.collationEnabled) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This config was removed as collations are now on by default in OSS. Please remove it from here as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated, thanks! |
||
throw QueryCompilationErrors.collationNotEnabledError() | ||
} | ||
ShowCollationsCommand(Option(ctx.pattern).map(x => string(visitStringLit(x)))) | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.spark.sql.execution.command | ||
|
||
import java.util.regex.PatternSyntaxException | ||
|
||
import scala.collection.mutable | ||
import scala.jdk.CollectionConverters.CollectionHasAsScala | ||
|
||
import org.apache.spark.sql.{Row, SparkSession} | ||
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference} | ||
import org.apache.spark.sql.catalyst.util.CollationFactory | ||
import org.apache.spark.sql.catalyst.util.CollationFactory.Collation | ||
import org.apache.spark.sql.types.{BooleanType, StringType} | ||
|
||
/** | ||
* A command for `SHOW COLLATIONS`. | ||
* | ||
* The syntax of this command is: | ||
* {{{ | ||
* SHOW COLLATIONS (LIKE? pattern)?; | ||
* }}} | ||
*/ | ||
case class ShowCollationsCommand(pattern: Option[String]) extends LeafRunnableCommand { | ||
override val output: Seq[Attribute] = Seq( | ||
AttributeReference("name", StringType, nullable = false)(), | ||
AttributeReference("provider", StringType, nullable = false)(), | ||
AttributeReference("version", StringType, nullable = false)(), | ||
AttributeReference("binaryEquality", BooleanType, nullable = false)(), | ||
AttributeReference("binaryOrdering", BooleanType, nullable = false)(), | ||
AttributeReference("lowercaseEquality", BooleanType, nullable = false)()) | ||
|
||
override def run(sparkSession: SparkSession): Seq[Row] = { | ||
val allCollations = CollationFactory.fetchAllCollations().asScala.toSeq | ||
val filteredCollations = pattern.map(filterPattern(allCollations, _)).getOrElse(allCollations) | ||
filteredCollations.map(c => Row( | ||
c.collationName, | ||
c.provider, | ||
c.version, | ||
c.supportsBinaryEquality, | ||
c.supportsBinaryOrdering, | ||
c.supportsLowercaseEquality)) | ||
} | ||
|
||
private def filterPattern(collations: Seq[Collation], pattern: String): Seq[Collation] = { | ||
val filteredCollations = mutable.SortedSet.empty[Collation] | ||
pattern.trim().split("\\|").foreach { subPattern => | ||
try { | ||
val regex = ("(?i)" + subPattern.replaceAll("\\*", ".*")).r | ||
filteredCollations ++= collations.filter { | ||
collation => regex.pattern.matcher(collation.collationName).matches() | ||
} | ||
} catch { | ||
case _: PatternSyntaxException => | ||
} | ||
} | ||
filteredCollations.toSeq | ||
} | ||
} |
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
|
@@ -1465,4 +1465,24 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper { | |||||||||
} | ||||||||||
} | ||||||||||
} | ||||||||||
|
||||||||||
test("show collations") { | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. please add another test for just There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated val df = sql("SHOW COLLATIONS").limit(10)
checkAnswer(df,
Seq(Row("SYSTEM", "BUILTIN", "UTF8_BINARY", null, null,
"ACCENT_SENSITIVE", "CASE_SENSITIVE", "NO_PAD", null),
Row("SYSTEM", "BUILTIN", "UTF8_LCASE", null, null,
"ACCENT_SENSITIVE", "CASE_INSENSITIVE", "NO_PAD", null),
Row("SYSTEM", "BUILTIN", "UNICODE", "", "",
"ACCENT_SENSITIVE", "CASE_SENSITIVE", "NO_PAD", "75.1.0.0"),
Row("SYSTEM", "BUILTIN", "UNICODE_AI", "", "",
"ACCENT_SENSITIVE", "CASE_INSENSITIVE", "NO_PAD", "75.1.0.0"),
Row("SYSTEM", "BUILTIN", "UNICODE_CI", "", "",
"ACCENT_INSENSITIVE", "CASE_SENSITIVE", "NO_PAD", "75.1.0.0"),
Row("SYSTEM", "BUILTIN", "UNICODE_CI_AI", "", "",
"ACCENT_INSENSITIVE", "CASE_INSENSITIVE", "NO_PAD", "75.1.0.0"),
Row("SYSTEM", "BUILTIN", "af", "Afrikaans", "",
"ACCENT_SENSITIVE", "CASE_SENSITIVE", "NO_PAD", "75.1.0.0"),
Row("SYSTEM", "BUILTIN", "af_AI", "Afrikaans", "",
"ACCENT_SENSITIVE", "CASE_INSENSITIVE", "NO_PAD", "75.1.0.0"),
Row("SYSTEM", "BUILTIN", "af_CI", "Afrikaans", "",
"ACCENT_INSENSITIVE", "CASE_SENSITIVE", "NO_PAD", "75.1.0.0"),
Row("SYSTEM", "BUILTIN", "af_CI_AI", "Afrikaans", "",
"ACCENT_INSENSITIVE", "CASE_INSENSITIVE", "NO_PAD", "75.1.0.0")))
|
||||||||||
assert(sql("SHOW COLLATIONS").collect().length == 562) | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do you need the strict requirement? Could you either replace it by:
Suggested change
or at least set to:
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated. |
||||||||||
|
||||||||||
checkAnswer(sql("SHOW COLLATIONS LIKE '*UTF8_BINARY*'"), | ||||||||||
Row("UTF8_BINARY", "spark", "1.0", true, true, false)) | ||||||||||
checkAnswer(sql("SHOW COLLATIONS '*zh_Hant_HKG*'"), | ||||||||||
Seq(Row("zh_Hant_HKG", "icu", "153.120.0.0", false, false, false), | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This version seems weird, I would say it should contain 75.1 as the version of ICU library. @nikolamand-db do you have any more info on is this expected? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated. |
||||||||||
Row("zh_Hant_HKG_CI", "icu", "153.120.0.0", false, false, false), | ||||||||||
Row("zh_Hant_HKG_AI", "icu", "153.120.0.0", false, false, false), | ||||||||||
Row("zh_Hant_HKG_CI_AI", "icu", "153.120.0.0", false, false, false))) | ||||||||||
withSQLConf(SQLConf.COLLATION_ENABLED.key -> "false") { | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ditto: on removing config calls There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated. |
||||||||||
checkError( | ||||||||||
exception = intercept[AnalysisException] { | ||||||||||
sql("SHOW COLLATIONS") | ||||||||||
}, | ||||||||||
errorClass = "UNSUPPORTED_FEATURE.COLLATION" | ||||||||||
) | ||||||||||
} | ||||||||||
} | ||||||||||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really needed? When do we want to order them? I would say list/table building should be deterministic and should always output collations in the same way. Also I would expect
UTF8_*
family collations to come in first as they represent OSS internal implementations.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay.