
[Feature Request] Support additional generation expressions for automatic data skipping #1442

Open
allisonport-db opened this issue Oct 18, 2022 · 5 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@allisonport-db
Collaborator

allisonport-db commented Oct 18, 2022

Overview

For partition columns that are generated columns, we are able to automatically generate partition filters when we see a data filter on its generating columns. Right now we automatically generate these for a small subset of possible generation expressions (defined here.)
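To illustrate the optimization this issue is about, here is a minimal, hypothetical sketch in plain Python (not Delta's actual implementation): when a partition column is generated from a data column via a known expression, an equality filter on the data column implies an equality filter on the partition column, so whole partitions can be skipped. All names (`derive_partition_filter`, `event_ts`, `event_date`) are illustrative.

```python
from datetime import datetime, date

def derive_partition_filter(data_filter_col, data_filter_value, generated_cols):
    """Map an equality filter on a data column to filters on any
    partition columns generated from it via a known expression.

    generated_cols: {partition_col: (source_col, generation_expr)}
    """
    derived = {}
    for part_col, (source_col, expr) in generated_cols.items():
        if source_col == data_filter_col:
            # Apply the generation expression to the filter value to get
            # the implied partition filter value.
            derived[part_col] = expr(data_filter_value)
    return derived

# Partition column `event_date` is generated as CAST(event_ts AS DATE).
generated = {"event_date": ("event_ts", lambda ts: ts.date())}

# A query filter `event_ts = '2022-10-18 12:34:56'` implies
# `event_date = '2022-10-18'`, which lets the engine prune partitions.
ts = datetime(2022, 10, 18, 12, 34, 56)
print(derive_partition_filter("event_ts", ts, generated))
# {'event_date': datetime.date(2022, 10, 18)}
```

The issue is about extending the set of generation expressions for which Delta can perform this kind of rewrite automatically.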

Details

We can add support for additional expressions. Here are the supported generation expressions in Delta.

A few specific expressions that would make sense to add include:

These are just a few, feel free to comment or create a new sub-issue for any additional expressions you think would be beneficial.

What to update?

@justinTM

justinTM commented Nov 4, 2022

would generated automatically as identity for bigint columns be applicable here? (idk)

@allisonport-db
Collaborator Author

> would generated automatically as identity for bigint columns be applicable here? (idk)

Can you clarify what you mean (maybe provide the table schema you have in mind)? We do support the identity function currently, which shouldn't be data-type specific.

@justinTM

justinTM commented Nov 8, 2022

Sure thing, yeah.

It seems like it's not supported to create a table with an identity column? I was following an issue/PR that was going to bring GENERATED ALWAYS AS IDENTITY into Spark SQL for Delta.

If there's already a way to specify generated columns in a schema, like StructType([StructField("mycol1", StringType(), generated_as="identity")]), that would be super sweet.

Otherwise, this is what works in Databricks but not locally:

>>> # custom package spark_utils just calls delta.configure_spark_with_delta_pip()
>>> from spark_utils.delta import get_delta_postgres_aws_spark
>>> spark = get_delta_postgres_aws_spark()
>>> spark.sql("""
...     CREATE TABLE mytable1 (
...         mycol1 BIGINT GENERATED ALWAYS AS IDENTITY
...     )
...     LOCATION '/tmp/delta/mytable1'
...     USING delta
... """)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/justin.mai/Library/Caches/pypoetry/virtualenvs/my-secret-work-project-cVVc29YA-py3.8/lib/python3.8/site-packages/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/Users/justin.mai/Library/Caches/pypoetry/virtualenvs/my-secret-work-project-cVVc29YA-py3.8/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/Users/justin.mai/Library/Caches/pypoetry/virtualenvs/my-secret-work-project-cVVc29YA-py3.8/lib/python3.8/site-packages/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.ParseException: 
Syntax error at or near 'GENERATED'(line 3, pos 22)

== SQL ==

    CREATE TABLE mytable1 (
        mycol1 BIGINT GENERATED ALWAYS AS IDENTITY
----------------------^^^
    )
    LOCATION '/tmp/delta/mytable1'
    USING delta

>>> 

@allisonport-db
Collaborator Author

I misunderstood. Identity columns aren't supported yet but are on the roadmap and being tracked by #1072. I don't see how identity columns would apply to automatic data skipping, however.

@theelderbeever

Feature Request: Complex/Nested SQL methods for partition filters.

Other databases let you do hash partitioning, for example. It would be really useful to be able to define generated columns that get used in partition filters in the following manner:

    (DeltaTable.createOrReplace(spark)
        .tableName("table")
        .addColumn("very_cardinal_string_column", "string")
        .addColumn("hashmod", "integer", generatedAlwaysAs="ABS(HASHTEXT(very_cardinal_string_column)) % 100")
        .partitionBy("hashmod")
        .execute())

FWIW, I don't know what Spark's SQL function for hashing strings to integers is; HASHTEXT is the Postgres one.
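The hash-mod bucketing described above can be sketched in plain Python. This uses an MD5-based hash purely for illustration; it is NOT Spark's or Delta's hash algorithm, and `hashmod`/`NUM_BUCKETS` are made-up names. The point is that the mapping is deterministic, so an equality filter on the high-cardinality column could be rewritten into an equality filter on the bucket column.

```python
import hashlib

NUM_BUCKETS = 100  # illustrative bucket count, matching the `% 100` above

def hashmod(value: str, buckets: int = NUM_BUCKETS) -> int:
    """Deterministically map a high-cardinality string to a small bucket id."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

# Every row with the same string lands in the same bucket, so a data filter
# `very_cardinal_string_column = 'user-12345'` would imply a partition filter
# `hashmod = hashmod('user-12345')`.
print(hashmod("user-12345"))
```

Supporting this automatically would require Delta's partition-filter generation to understand hash-style generation expressions, which is exactly the kind of extension this issue proposes.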
