[CT-1372] [Feature] Support for adding customized file formats #501
Comments
Hi @osopardo1, could you be a bit more specific? That is, what is the file format? Is there already a Spark extension (i.e. a JAR) that allows you to read this format?
Hi! Yes, the format has a Spark library extension. You can read and write it through both Spark SQL and the Spark DataFrame API. Our particular case is actually an extension of the Delta Lake table format (open-sourced as well, at https://github.com/Qbeast-io/qbeast-spark). It adds certain metadata to the commit log to enable data skipping for both multidimensional filtering and sampling.
Hey @osopardo1! I'd love to use this issue to talk about ways we could more generically support additional file formats, since there are a number of places that require updating today. You can see many of them in the PRs for adding Hudi + Iceberg support: #187, #432. I just want to clarify one thing in your description:
You can override this macro's behavior by defining a macro with the same name in your own dbt project, without needing to fork the repo/plugin. So for instance:

-- macros/some_file_in_your_project.sql
{% macro dbt_spark_validate_get_file_format(raw_file_format) %}
  {#-- Validate the file format #}
  {# added "qbeast"! #}
  {% set accepted_formats = ['qbeast', 'text', 'csv', 'json', 'jdbc', 'parquet', 'orc', 'hive', 'delta', 'libsvm', 'hudi'] %}

  {% set invalid_file_format_msg -%}
    Invalid file format provided: {{ raw_file_format }}
    Expected one of: {{ accepted_formats | join(', ') }}
  {%- endset %}

  {% if raw_file_format not in accepted_formats %}
    {% do exceptions.raise_compiler_error(invalid_file_format_msg) %}
  {% endif %}

  {% do return(raw_file_format) %}
{% endmacro %}
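Once that override lives under your project's macros/ directory, the custom format is selected per model through the usual config block. A minimal sketch, assuming a hypothetical model and upstream ref name (materialized and file_format are standard dbt-spark configs; everything else here is illustrative):

-- models/my_qbeast_model.sql  (hypothetical file name)
{{
  config(
    materialized='incremental',
    file_format='qbeast'
  )
}}

select * from {{ ref('some_upstream_model') }}

With the override in place, dbt_spark_validate_get_file_format accepts 'qbeast' instead of raising the compiler error; as noted above, other places in the adapter (e.g. the incremental strategies touched in the Hudi and Iceberg PRs) may still need format-specific changes.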
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Is this your first time submitting a feature request?
Describe the feature
I am a developer of an open-source table format based on Delta Lake, and we wanted to use it in a dbt project. But when trying to use incremental writes, an error appears:
I am wondering if there's any possibility of adding support for customized file formats to the dbt-spark project. Since I am very green on the code, I cannot provide a more detailed description of the new feature, and it may not be possible to advance further with it. Sorry for the inconvenience 😢

Describe alternatives you've considered
An alternative is to fork the repository and add the desired file format to the accepted_formats list in the validation of incremental writes, at dbt-spark/dbt/include/spark/macros/materializations/incremental/validate.sql, line 4 (commit 37dcfe3).
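For reference, that line is the accepted_formats set quoted in the override example earlier in this thread, just without the added 'qbeast' entry (reproduced here for convenience, not copied verbatim from commit 37dcfe3):

{% set accepted_formats = ['text', 'csv', 'json', 'jdbc', 'parquet', 'orc', 'hive', 'delta', 'libsvm', 'hudi'] %}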
Also make the corresponding changes to use the new format properly, and open a PR to contribute it back to the dbt-spark project.

Who will this benefit?
I think many developers could benefit from this use case: it would make it easy to take advantage of extended table features in the tables created by their SQL transformations.
Are you interested in contributing this feature?
I am very green on the dbt code; I can provide some feedback, but I don't think it will be of much use :(
Anything else?
No response