
[CT-1372] [Feature] Support for adding customized file formats #501

Closed
3 tasks done
osopardo1 opened this issue Oct 19, 2022 · 4 comments
Labels
enhancement (New feature or request) · Stale

Comments

@osopardo1

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-spark functionality, rather than a Big Idea better suited to a discussion

Describe the feature

I am a developer of an open-source table format based on Delta Lake, and we wanted to use it in a dbt project. But when we try to use incremental writes, an error appears:

09:06:29    Invalid file format provided: qbeast
09:06:29        Expected one of: text, csv, json, jdbc, parquet, orc, hive, delta, libsvm, hudi

I am wondering whether there is any possibility of adding support for customized file formats to the dbt-spark project. Since I am very green on the code, I cannot provide a more detailed description of the new feature, and it may not be possible to advance further with it. Sorry for the inconvenience 😢

Describe alternatives you've considered

An alternative is to fork the repository and add the desired file format to the accepted_formats list used to validate incremental writes:

{% set accepted_formats = ['text', 'csv', 'json', 'jdbc', 'parquet', 'orc', 'hive', 'delta', 'libsvm', 'hudi'] %}

Then make the corresponding changes to handle the new format properly and open a PR to contribute them back to the dbt-spark project.

Who will this benefit?

I think many developers could benefit from an easy way to use extended table features in the tables created by their SQL transformations.

Are you interested in contributing this feature?

I am very green on the dbt codebase; I can provide some feedback, but I don't think it will be of much use :(

Anything else?

No response

@osopardo1 added the enhancement and triage labels on Oct 19, 2022
@github-actions bot changed the title from "[Feature] Support for adding customized file formats" to "[CT-1372] [Feature] Support for adding customized file formats" on Oct 19, 2022
@guillesd

Hi @osopardo1, could you be a bit more specific? What is the file format? Is there already a Spark extension (i.e. a JAR) that allows you to read this format?

@osopardo1
Author

Hi! Yes, the format has a Spark library extension. You can read and write it through both Spark SQL and the Spark DataFrame API. Our particular case is an extension of the Delta Lake table format (open-sourced as well, at https://github.com/Qbeast-io/qbeast-spark). It adds certain metadata to the commit log to enable data skipping for both multidimensional filtering and sampling.
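
For illustration, using such a format through Spark SQL might look like the following. This is a hypothetical sketch: the table and column names are made up, and it assumes the qbeast extension is installed on the cluster and that columnsToIndex is the option its writer expects.

-- Create a table backed by the custom format; skipping metadata
-- is built for the indexed columns.
CREATE TABLE events
USING qbeast
OPTIONS (columnsToIndex 'user_id,price')
AS SELECT * FROM parquet.`/tmp/raw_events`;

-- Reads are plain Spark SQL; multidimensional filters can then
-- benefit from data skipping.
SELECT * FROM events WHERE user_id = 42 AND price < 10;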

@jtcohen6
Contributor

jtcohen6 commented Dec 1, 2022

Hey @osopardo1! I'd love to use this issue to talk about ways we could more generically support additional file formats, since there are a number of places that require updating today. You can see many of them in the PRs for adding Hudi + Iceberg support: #187, #432.

I just want to clarify one thing in your description:

An alternative is to fork the repository and add the desired file format into the accepted_formats in the validation of incremental writes:

You can override this macro's behavior by defining a macro with the same name in your own dbt project, without needing to fork the repo/plugin. So for instance:

-- macros/some_file_in_your_project.sql
{% macro dbt_spark_validate_get_file_format(raw_file_format) %}
  {#-- Validate the file format #}

  {# added "qbeast"! #}
  {% set accepted_formats = ['qbeast', 'text', 'csv', 'json', 'jdbc', 'parquet', 'orc', 'hive', 'delta', 'libsvm', 'hudi'] %}

  {% set invalid_file_format_msg -%}
    Invalid file format provided: {{ raw_file_format }}
    Expected one of: {{ accepted_formats | join(', ') }}
  {%- endset %}

  {% if raw_file_format not in accepted_formats %}
    {% do exceptions.raise_compiler_error(invalid_file_format_msg) %}
  {% endif %}

  {% do return(raw_file_format) %}
{% endmacro %}
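
With that override in place, a model can then opt into the new format through its config as usual. A minimal sketch (the model and source names are made up, and it assumes the qbeast Spark libraries are available on the cluster; note that this validation macro is only one of the places that reference the file format):

-- models/my_qbeast_model.sql
-- file_format flows into the generated CREATE TABLE ... USING clause
{{ config(
    materialized='incremental',
    file_format='qbeast'
) }}

select * from {{ source('raw', 'events') }}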

@jtcohen6 removed the triage label on Dec 1, 2022
@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
