Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit case sensitivity of identifiers #1122

Open
vgapeyev opened this issue Jun 13, 2023 · 4 comments
Open

Revisit case sensitivity of identifiers #1122

vgapeyev opened this issue Jun 13, 2023 · 4 comments
Labels
SQL92Spec Features from the SQL92 spec

Comments

@vgapeyev
Copy link
Contributor

vgapeyev commented Jun 13, 2023

In short: What is currently implemented for identifiers in partiql-lang-kotlin:

  • Is not what is specified in SQL;
  • Not specified or described anywhere;
  • The implementation is hard to follow and understand, as it is not encapsulated, but rather spread around and duplicated at use sites places.

Ideally, we implement what SQL prescribes, while taking care of properly abstracting it.
This will surely be backward-incompatible, but everyone's sanity can be worth it.

SQL

In SQL, there are two syntactic variants of identifiers: regular identifiers like foo and delimited/quoted identifiers like "fOO".

  • A regular identifier is considered case-insensitive and has the fully-capitalized variant as the canonical form.
  • A delimited identifier is case-sensitive.
  • The regular vs delimited distinction is only lexical; after parsing and canonicalization, all identifiers are essentially strings. This is to say that foo, Foo, fOo, FOO and, notably, "FOO" are all the same identifier, canonically being FOO. This identifier is distinct from "foo", "Foo", "fOo", all of which are also distinct from each other.
  • The equivalence of identifiers is the equality of these canonicalized strings, in case-sensitive way.
  • Either lexical kind of an identifier (regular or delimited) can be used either for binding a variable (as in FROM or GROUP BY) or as a reference variable to something bound elsewhere.
    Importantly, it is ok to name an object "FOO" and then refer to it as FOO or foo and, conversely, name an object gOO and then refer to it as "GOO".

partiql-lang-kotlin

In contrast, in the current implementation:

  • While there are lexical regular and delimited identifiers of the same lexical appearance as in SQL, there are three kinds of identifiers internally. The first distinction is based on whether it is an identifier used as a binder vs used as a reference.
  • For a binder, the case of characters in the lexical appearance is preserved semantically, while the fact whether the identifier was regular or delimited is discarded. The binder identifiers are used as keys for things like environments.
  • For a reference identifier, a delimited identifier becomes "an identifier with case-sensitive lookup", while a regular identifier becomes "an identifier with case-insensitive lookup". For the latter, the canonical representation is lower case.
  • The lookup of a reference identifier in an environment depends on whether it is case-sensitive or case-insensitive.
  • The case-sensitive lookup is straightforward, using case-sensitive string equality.
  • The case-insensitive lookup uses case-insensitive string equality. It can give an ambiguous result (reference Foo matching binders FOO, Foo, foo) -- apparently, identifiers that are equivalent in case-insensitive sense can bind to different entities, without shadowing.

[Disclaimer: the above is an attempt at a rational reconstruction of the design in place; no claim is made about its fidelity to the intent or to all intricacies of the implementation.]

Known history

Lower case was chosen as the canonical case in order to conform with PostgreSQL: https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS

Conclusion

While the SQL's design is not an exhibit of simplicity, the approach in p-l-k is even more complicated. While the two seem to coincide in common usage (say, only non-delimited identifiers and no malicious introduction of ambiguous binders), the differences are significant and break SQL conformance claims.

@vgapeyev vgapeyev added the enhancement New feature or request label Jun 13, 2023
@vgapeyev vgapeyev changed the title Revisit cases sensitivity of identifiers Revisit case sensitivity of identifiers Jun 13, 2023
@vgapeyev vgapeyev added SQL92Spec Features from the SQL92 spec and removed enhancement New feature or request labels Jun 13, 2023
@johnedquinn
Copy link
Member

johnedquinn commented Jul 19, 2023

Adding some background from SQL-1999, section 20.1:

The representation of an <identifier> in the base tables and views of the Information Schema is by
a character string corresponding to its <identifier body> (in the case of a <regular identifier>) or its
<delimited identifier body> (in the case of a <delimited identifier>). Within this character string,
any lower-case letter appearing in a <regular identifier> is replaced by the equivalent upper-case
letter, and any <doublequote symbol> appearing in a <delimited identifier body> is replaced by a
<double quote>. Where an <actual identifier> has multiple forms that are equal according to the
rules of Subclause 8.2, ‘‘<comparison predicate>’’, the form stored is that encountered at definition
time.

The specification opts for using uppercase

@RCHowell
Copy link
Member

Related, we will need to reify the case-sensitivity of binders in the AST so that when we pretty-print we will know to quote or not to quote (in the case of keywords). However, the quotes should not matter for binders as we use that literal value regardless of quotes.

Today's behavior is what one would expect, that is SELECT x as a FROM T to output << { 'a': .. } ... >> rather than << { 'A': ... } ... >>

@dlurton
Copy link
Member

dlurton commented Sep 14, 2023

TBH, I'm not sure where this is coming from. The SQL-92 specification AFAICT is clear about it in section 5.2:

  1. Two <delimited identifier>s are equivalent if their <delimitedidentifier body>s (with all occurrences of <quote> replaced by and all occurrences of <doublequote symbol> replaced by <doublequote>), considered as the repetition of acharacter string literal> that specifies a <character set specification> of SQL_TEXT and an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, "<comparison predicate>

@RCHowell
Copy link
Member

@yliuuuu could you please provide an update here as to why #1519 was closed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
SQL92Spec Features from the SQL92 spec
Projects
None yet
Development

No branches or pull requests

4 participants