
[Feature Request] Support additional generation expressions for automatic data skipping #1442

Open
allisonport-db opened this issue Oct 18, 2022 · 5 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@allisonport-db
Collaborator

allisonport-db commented Oct 18, 2022

Overview

For partition columns that are generated columns, we are able to automatically generate partition filters when we see a data filter on its generating columns. Right now we automatically generate these for a small subset of possible generation expressions (defined here.)
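To illustrate the optimization this issue is about, here is a minimal, hypothetical sketch in plain Python (not Delta's actual implementation): when a partition column is generated from a data column via a known expression, an equality filter on the data column implies an equality filter on the partition column, so whole partitions can be skipped. All names (`derive_partition_filter`, `event_ts`, `event_date`) are illustrative.

```python
from datetime import datetime, date

def derive_partition_filter(data_filter_col, data_filter_value, generated_cols):
    """Map an equality filter on a data column to filters on any
    partition columns generated from it via a known expression.

    generated_cols: {partition_col: (source_col, generation_expr)}
    """
    derived = {}
    for part_col, (source_col, expr) in generated_cols.items():
        if source_col == data_filter_col:
            # Apply the generation expression to the filter value to get
            # the implied partition filter value.
            derived[part_col] = expr(data_filter_value)
    return derived

# Partition column `event_date` is generated as CAST(event_ts AS DATE).
generated = {"event_date": ("event_ts", lambda ts: ts.date())}

# A query filter `event_ts = '2022-10-18 12:34:56'` implies
# `event_date = '2022-10-18'`, which lets the engine prune partitions.
ts = datetime(2022, 10, 18, 12, 34, 56)
print(derive_partition_filter("event_ts", ts, generated))
# {'event_date': datetime.date(2022, 10, 18)}
```

The issue is about extending the set of generation expressions for which Delta can perform this kind of rewrite automatically.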

Details

We can add support for additional expressions. Here are the supported generation expressions in Delta.

A few specific expressions that would make sense to add include:

These are just a few, feel free to comment or create a new sub-issue for any additional expressions you think would be beneficial.

What to update?

@justinTM

justinTM commented Nov 4, 2022

would generated automatically as identity for bigint columns be applicable here? (idk)

@allisonport-db
Collaborator Author

> would generated automatically as identity for bigint columns be applicable here? (idk)

Can you clarify what you mean (maybe provide the table schema you have in mind)? We do support the identity function currently, which shouldn't be data-type specific.

@justinTM

justinTM commented Nov 8, 2022

Sure thing, yeah.

It seems like it's not supported to create a table with an identity column? I was following an issue/PR that was going to bring GENERATED ALWAYS AS IDENTITY into Spark SQL for Delta.

If there's already a way to specify generated columns in a schema, like StructType([StructField("mycol1", StringType(), generated_as="identity")]), that would be super sweet.

Otherwise, this is what works in Databricks but not locally:

>>> # custom package spark_utils just calls delta.configure_spark_with_delta_pip()
>>> from spark_utils.delta import get_delta_postgres_aws_spark
>>> spark = get_delta_postgres_aws_spark()
>>> spark.sql("""
...     CREATE TABLE mytable1 (
...         mycol1 BIGINT GENERATED ALWAYS AS IDENTITY
...     )
...     LOCATION '/tmp/delta/mytable1'
...     USING delta
... """)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/justin.mai/Library/Caches/pypoetry/virtualenvs/my-secret-work-project-cVVc29YA-py3.8/lib/python3.8/site-packages/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/Users/justin.mai/Library/Caches/pypoetry/virtualenvs/my-secret-work-project-cVVc29YA-py3.8/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/Users/justin.mai/Library/Caches/pypoetry/virtualenvs/my-secret-work-project-cVVc29YA-py3.8/lib/python3.8/site-packages/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.ParseException: 
Syntax error at or near 'GENERATED'(line 3, pos 22)

== SQL ==

    CREATE TABLE mytable1 (
        mycol1 BIGINT GENERATED ALWAYS AS IDENTITY
----------------------^^^
    )
    LOCATION '/tmp/delta/mytable1'
    USING delta

>>> 

@allisonport-db
Collaborator Author

I misunderstood. Identity columns aren't supported yet but are on the roadmap and being tracked by #1072. I don't see how identity columns would apply to automatic data skipping, however.

@theelderbeever

Feature Request: Complex/Nested SQL methods for partition filters.

Other databases let you do hash partitioning, for example. It would be really useful to be able to define generated columns that get used in partition filters in the following manner:

    (DeltaTable.createOrReplace(spark)
        .tableName("table")
        .addColumn("very_cardinal_string_column", "string")
        .addColumn("hashmod", "integer", generatedAlwaysAs="ABS(HASHTEXT(very_cardinal_string_column)) % 100")
        .partitionBy("hashmod")
        .execute())

FWIW, I don't know what Spark's SQL function for hashing strings to integers is; HASHTEXT is the Postgres one.
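The hash-mod bucketing described above can be sketched in plain Python. This uses an MD5-based hash purely for illustration; it is NOT Spark's or Delta's hash algorithm, and `hashmod`/`NUM_BUCKETS` are made-up names. The point is that the mapping is deterministic, so an equality filter on the high-cardinality column could be rewritten into an equality filter on the bucket column.

```python
import hashlib

NUM_BUCKETS = 100  # illustrative bucket count, matching the `% 100` above

def hashmod(value: str, buckets: int = NUM_BUCKETS) -> int:
    """Deterministically map a high-cardinality string to a small bucket id."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

# Every row with the same string lands in the same bucket, so a data filter
# `very_cardinal_string_column = 'user-12345'` would imply a partition filter
# `hashmod = hashmod('user-12345')`.
print(hashmod("user-12345"))
```

Supporting this automatically would require Delta's partition-filter generation to understand hash-style generation expressions, which is exactly the kind of extension this issue proposes.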
