Revisit case sensitivity of identifiers #1122

vgapeyev · 2023-06-13T00:46:39Z

In short: What is currently implemented for identifiers in partiql-lang-kotlin:

Is not what is specified in SQL;
Not specified or described anywhere;
The implementation is hard to follow and understand, as it is not encapsulated, but rather spread around and duplicated at use sites places.

Ideally, we implement what SQL prescribes, while taking care of properly abstracting it.
This will surely be backward-incompatible, but everyone's sanity can be worth it.

SQL

In SQL, there are two syntactic variants of identifiers: regular identifiers like foo and delimited/quoted identifiers like "fOO".

A regular identifier is considered case-insensitive and has the fully-capitalized variant as the canonical form.
A delimited identifier is case-sensitive.
The regular vs delimited distinction is only lexical; after parsing and canonicalization, all identifiers are essentially strings. This is to say that foo, Foo, fOo, FOO and, notably, "FOO" are all the same identifier, canonically being FOO. This identifier is distinct from "foo", "Foo", "fOo", all of which are also distinct from each other.
The equivalence of identifiers is the equality of these canonicalized strings, in case-sensitive way.
Either lexical kind of an identifier (regular or delimited) can be used either for binding a variable (as in FROM or GROUP BY) or as a reference variable to something bound elsewhere.
Importantly, it is ok to name an object "FOO" and then refer to it as FOO or foo and, conversely, name an object gOO and then refer to it as "GOO".

partiql-lang-kotlin

In contrast, in the current implementation:

While there are lexical regular and delimited identifiers of the same lexical appearance as in SQL, there are three kinds of identifiers internally. The first distinction is based on whether it is an identifier used as a binder vs used as a reference.
For a binder, the case of characters in the lexical appearance is preserved semantically, while the fact whether the identifier was regular or delimited is discarded. The binder identifiers are used as keys for things like environments.
For a reference identifier, a delimited identifier becomes "an identifier with case-sensitive lookup", while a regular identifier becomes "an identifier with case-insensitive lookup". For the latter, the canonical representation is lower case.
The lookup of a reference identifier in an environment depends on whether it is case-sensitive or case-insensitive.
The case-sensitive lookup is straightforward, using case-sensitive string equality.
The case-insensitive lookup uses case-insensitive string equality. It can give an ambiguous result (reference Foo matching binders FOO, Foo, foo) -- apparently, identifiers that are equivalent in case-insensitive sense can bind to different entities, without shadowing.

[Disclaimer: the above is an attempt at a rational reconstruction of the design in place; no claim is made about its fidelity to the intent or to all intricacies of the implementation.]

Known history

Lower case was chosen as the canonical case in order to conform with PostgreSQL: https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS

Conclusion

While the SQL's design is not an exhibit of simplicity, the approach in p-l-k is even more complicated. While the two seem to coincide in common usage (say, only non-delimited identifiers and no malicious introduction of ambiguous binders), the differences are significant and break SQL conformance claims.

The text was updated successfully, but these errors were encountered:

johnedquinn · 2023-07-19T17:25:23Z

Adding some background from SQL-1999, section 20.1:

The representation of an <identifier> in the base tables and views of the Information Schema is by
a character string corresponding to its <identifier body> (in the case of a <regular identifier>) or its
<delimited identifier body> (in the case of a <delimited identifier>). Within this character string,
any lower-case letter appearing in a <regular identifier> is replaced by the equivalent upper-case
letter, and any <doublequote symbol> appearing in a <delimited identifier body> is replaced by a
<double quote>. Where an <actual identifier> has multiple forms that are equal according to the
rules of Subclause 8.2, ‘‘<comparison predicate>’’, the form stored is that encountered at definition
time.

The specification opts for using uppercase

RCHowell · 2023-07-21T23:14:30Z

Related, we will need to reify the case-sensitivity of binders in the AST so that when we pretty-print we will know to quote or not to quote (in the case of keywords). However, the quotes should not matter for binders as we use that literal value regardless of quotes.

Today's behavior is what one would expect, that is SELECT x as a FROM T to output << { 'a': .. } ... >> rather than << { 'A': ... } ... >>

dlurton · 2023-09-14T21:10:36Z

TBH, I'm not sure where this is coming from. The SQL-92 specification AFAICT is clear about it in section 5.2:

Two <delimited identifier>s are equivalent if their <delimitedidentifier body>s (with all occurrences of <quote> replaced by and all occurrences of <doublequote symbol> replaced by <doublequote>), considered as the repetition of acharacter string literal> that specifies a <character set specification> of SQL_TEXT and an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, "<comparison predicate>

RCHowell · 2024-08-21T18:36:42Z

@yliuuuu could you please provide an update here as to why #1519 was closed?

vgapeyev added the enhancement New feature or request label Jun 13, 2023

vgapeyev changed the title ~~Revisit cases sensitivity of identifiers~~ Revisit case sensitivity of identifiers Jun 13, 2023

vgapeyev added SQL92Spec Features from the SQL92 spec and removed enhancement New feature or request labels Jun 13, 2023

vgapeyev mentioned this issue Jul 26, 2023

Parsing of GPML label patterns. #1139

Merged

am357 mentioned this issue Aug 16, 2023

Revisit name case sensitivy in Catalog partiql/partiql-lang-rust#426

Open

vgapeyev mentioned this issue Aug 28, 2023

Introduces "definitional identifier" AST node #1187

Closed

This was referenced Sep 5, 2023

Being more straightforward on variables vs identifiers in the AST #1194

Closed

Cumulative preliminaries for SQL-conformant identifiers #1195

Closed

Change AST terminology from "case sensitivity" to "identifier kinds" #1196

Closed

yliuuuu mentioned this issue Jun 27, 2024

PartiQL Identifier - Variable Binding and Variable Look Up. partiql/partiql-lang#83

Open

This was referenced Jul 12, 2024

Identifier mode #1511

Closed

identifier normalization #1519

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revisit case sensitivity of identifiers #1122

Revisit case sensitivity of identifiers #1122

vgapeyev commented Jun 13, 2023 •

edited

Loading

johnedquinn commented Jul 19, 2023 •

edited by vgapeyev

Loading

RCHowell commented Jul 21, 2023

dlurton commented Sep 14, 2023 •

edited

Loading

RCHowell commented Aug 21, 2024

Revisit case sensitivity of identifiers #1122

Revisit case sensitivity of identifiers #1122

Comments

vgapeyev commented Jun 13, 2023 • edited Loading

SQL

partiql-lang-kotlin

Known history

Conclusion

johnedquinn commented Jul 19, 2023 • edited by vgapeyev Loading

RCHowell commented Jul 21, 2023

dlurton commented Sep 14, 2023 • edited Loading

RCHowell commented Aug 21, 2024

vgapeyev commented Jun 13, 2023 •

edited

Loading

johnedquinn commented Jul 19, 2023 •

edited by vgapeyev

Loading

dlurton commented Sep 14, 2023 •

edited

Loading