[Feature Request] Support additional generation expressions for automatic data skipping #1442
Comments
Can you clarify for me what you mean (maybe provide the table schema you have in mind)? We do support the identity function currently, which shouldn't be data-type specific.
Sure thing, yeah. It seems like it's not supported to create a table with an identity column? I was following an issue/PR that was going to bring `GENERATED ALWAYS AS IDENTITY` into Spark SQL for Delta. If there's already a way to specify generated columns in the schema, I'd love to know; otherwise, this is what works in Databricks but not locally:

```
>>> # custom package spark_utils just calls delta.configure_spark_with_delta_pip()
>>> from spark_utils.delta import get_delta_postgres_aws_spark
>>> spark = get_delta_postgres_aws_spark()
>>> spark.sql("""
... CREATE TABLE mytable1 (
...     mycol1 BIGINT GENERATED ALWAYS AS IDENTITY
... )
... LOCATION '/tmp/delta/mytable1'
... USING delta
... """)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/justin.mai/Library/Caches/pypoetry/virtualenvs/my-secret-work-project-cVVc29YA-py3.8/lib/python3.8/site-packages/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/Users/justin.mai/Library/Caches/pypoetry/virtualenvs/my-secret-work-project-cVVc29YA-py3.8/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/Users/justin.mai/Library/Caches/pypoetry/virtualenvs/my-secret-work-project-cVVc29YA-py3.8/lib/python3.8/site-packages/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.ParseException:
Syntax error at or near 'GENERATED'(line 3, pos 22)

== SQL ==

CREATE TABLE mytable1 (
    mycol1 BIGINT GENERATED ALWAYS AS IDENTITY
----------------------^^^
)
LOCATION '/tmp/delta/mytable1'
USING delta
```
I misunderstood. Identity columns aren't supported yet; they are on the roadmap and are being tracked by #1072. I don't see how identity columns would apply to automatic data skipping, however.
Feature request: complex/nested SQL methods for partition filters. Other databases let you do hash partitioning, for example. It would be really useful to be able to define generated columns that get used in the partition filters in the following manner:

```python
DeltaTable.createOrReplace(spark)
    .tableName("table")
    .addColumn("very_cardinal_string_column", "string")
    .addColumn("hashmod", "integer", generatedAlwaysAs="ABS(HASHTEXT(very_cardinal_string_column)) % 100")
    .partitionBy("hashmod")
```

FWIW, I don't know what Spark's SQL function for hashing strings to integers is, but …
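The hash-bucket idea above can be sketched in plain Python. Note that `HASHTEXT` is a Postgres function, not a Spark one; this standalone stand-in uses `hashlib` so bucket assignment is deterministic, and the choice of 100 buckets is arbitrary:

```python
import hashlib

NUM_BUCKETS = 100  # arbitrary bucket count for illustration


def hash_bucket(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministically map a high-cardinality string to a small bucket id.

    A stand-in for a generated column such as
    ABS(hash(very_cardinal_string_column)) % 100 in Spark SQL.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


# Because the mapping is deterministic, a data filter like
# very_cardinal_string_column == "user-123" implies the partition filter
# hashmod == hash_bucket("user-123"), letting the engine skip other buckets.
```

This is exactly why such an expression is a good candidate for automatic partition-filter generation: equality on the generating column always implies equality on the generated column.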
Overview
For partition columns that are generated columns, we are able to automatically generate partition filters when we see a data filter on their generating columns. Right now we automatically generate these for only a small subset of possible generation expressions (defined here).
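As a rough illustration of the mechanism (the names here are hypothetical, not Delta's internals): if a table partitions on a generated column such as `part_date = DATE(event_time)`, then an equality data filter on `event_time` implies an equality filter on `part_date`, which lets the engine skip every other partition:

```python
from datetime import date, datetime


def implied_partition_filter(event_time_value: datetime) -> dict:
    """Given an equality data filter event_time == event_time_value, derive
    the partition filter implied by the (hypothetical) generated column
    part_date = DATE(event_time)."""
    return {"part_date": event_time_value.date()}


# A query filtering event_time == 2023-05-17 12:30 only needs to scan the
# partition where part_date == 2023-05-17.
print(implied_partition_filter(datetime(2023, 5, 17, 12, 30)))
```

Each generation expression needs its own rule of this shape, which is why support currently covers only a small allow-list of expressions.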
Details
We can add support for additional expressions. Here are the currently supported generation expressions in Delta.
A few specific expressions that would make sense to add include:
These are just a few, feel free to comment or create a new sub-issue for any additional expressions you think would be beneficial.
What to update?