
[CT-1372] [Feature] Support for adding customized file formats #501

Closed
3 tasks done
osopardo1 opened this issue Oct 19, 2022 · 4 comments
Labels
enhancement (New feature or request) · Stale

Comments

@osopardo1

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-spark functionality, rather than a Big Idea better suited to a discussion

Describe the feature

I am a developer of an open-source table format based on Delta Lake, and we wanted to use it in a dbt project. But when we try to use incremental writes, an error appears:

09:06:29    Invalid file format provided: qbeast
09:06:29        Expected one of: text, csv, json, jdbc, parquet, orc, hive, delta, libsvm, hudi

I am wondering whether there is any possibility of adding support for customized file formats to the dbt-spark project. Since I am very green on the code, I cannot provide a more detailed description of the new feature, and it may not be possible to advance further with it. Sorry for the inconvenience 😢

Describe alternatives you've considered

An alternative is to fork the repository and add the desired file format to the accepted_formats list used to validate incremental writes:

{% set accepted_formats = ['text', 'csv', 'json', 'jdbc', 'parquet', 'orc', 'hive', 'delta', 'libsvm', 'hudi'] %}

Then make the corresponding changes to handle the new format properly and open a PR to contribute them back to the dbt-spark project.

Who will this benefit?

I think many developers could benefit from an easy way to use extended table features in the tables created by their SQL transformations.

Are you interested in contributing this feature?

I am very green on the dbt codebase; I can provide some feedback, but I don't think it will be of much use :(

Anything else?

No response

@osopardo1 added the enhancement and triage labels on Oct 19, 2022
@github-actions bot changed the title from "[Feature] Support for adding customized file formats" to "[CT-1372] [Feature] Support for adding customized file formats" on Oct 19, 2022
@guillesd

Hi @osopardo1, could you be a bit more specific? What is the file format? Is there already a Spark extension (i.e. a JAR) that allows you to read this format?

@osopardo1
Author

Hi! Yes, the format has a Spark library extension. You can read and write it through both Spark SQL and the Spark DataFrame API. Our particular case is an extension of the Delta Lake table format (open-sourced as well, at https://github.com/Qbeast-io/qbeast-spark). It adds certain metadata to the commit log to enable data skipping for both multidimensional filtering and sampling.
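
For illustration, using such a format through Spark SQL might look like the following. This is a hypothetical sketch: the table and column names are made up, and it assumes the qbeast extension is installed on the cluster and that columnsToIndex is the option its writer expects.

-- Create a table backed by the custom format; skipping metadata
-- is built for the indexed columns.
CREATE TABLE events
USING qbeast
OPTIONS (columnsToIndex 'user_id,price')
AS SELECT * FROM parquet.`/tmp/raw_events`;

-- Reads are plain Spark SQL; multidimensional filters can then
-- benefit from data skipping.
SELECT * FROM events WHERE user_id = 42 AND price < 10;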

@jtcohen6
Contributor

jtcohen6 commented Dec 1, 2022

Hey @osopardo1! I'd love to use this issue to talk about ways we could more generically support additional file formats, since there are a number of places that require updating today. You can see many of them in the PRs for adding Hudi + Iceberg support: #187, #432.

I just want to clarify one thing in your description:

An alternative is to fork the repository and add the desired file format into the accepted_formats in the validation of incremental writes:

You can override this macro's behavior by defining a macro with the same name in your own dbt project, without needing to fork the repo/plugin. So for instance:

-- macros/some_file_in_your_project.sql
{% macro dbt_spark_validate_get_file_format(raw_file_format) %}
  {#-- Validate the file format #}

  {# added "qbeast"! #}
  {% set accepted_formats = ['qbeast', 'text', 'csv', 'json', 'jdbc', 'parquet', 'orc', 'hive', 'delta', 'libsvm', 'hudi'] %}

  {% set invalid_file_format_msg -%}
    Invalid file format provided: {{ raw_file_format }}
    Expected one of: {{ accepted_formats | join(', ') }}
  {%- endset %}

  {% if raw_file_format not in accepted_formats %}
    {% do exceptions.raise_compiler_error(invalid_file_format_msg) %}
  {% endif %}

  {% do return(raw_file_format) %}
{% endmacro %}
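
With that override in place, a model can then opt into the new format through its config as usual. A minimal sketch (the model and source names are made up, and it assumes the qbeast Spark libraries are available on the cluster; note that this validation macro is only one of the places that reference the file format):

-- models/my_qbeast_model.sql
-- file_format flows into the generated CREATE TABLE ... USING clause
{{ config(
    materialized='incremental',
    file_format='qbeast'
) }}

select * from {{ source('raw', 'events') }}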

@jtcohen6 removed the triage label on Dec 1, 2022
@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
