
[CT-1482] [Feature] Add support for data share objects #217

Closed
3 tasks done
Tracked by #742
randy-rc opened this issue Nov 9, 2022 · 12 comments
Assignees
Labels
enhancement New feature or request ra3_node issues relating to ra3 node support

Comments

@randy-rc

randy-rc commented Nov 9, 2022

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-redshift functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Issue 1

Calling dbt_utils.get_column_values on a data share object results in an error:

{% macro test5() -%}

    {% set details = dbt_utils.get_column_values(table=source('[name]', '[object]'), column='[column]') %}

    {% for detail in details %}
        {{ log('Detail: ' ~ detail, True) }}
    {% endfor %}

{%- endmacro %}
09:25:34  Encountered an error while running operation: Compilation Error in macro test5 (macros/test.sql)
  In get_column_values(): relation "[database]"."[schema]"."[object]" does not exist and no default value was provided.
  
  > in macro statement (macros/etc/statement.sql)
  > called by macro default__get_column_values (macros/sql/get_column_values.sql)
  > called by macro get_column_values (macros/sql/get_column_values.sql)
  > called by macro test5 (macros/test.sql)
  > called by macro test5 (macros/test.sql)

The dbt Cloud system logs (Details) capture the statement used to check whether the object of concern exists. Unfortunately, data share objects are stored in neither pg_tables nor pg_views, hence the "does not exist" error.
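For context, inbound data share objects can instead be listed from the svv_datashare_objects system view on the consumer cluster, where object names are stored as 'schema.object'. A minimal sketch (this view typically requires elevated privileges, as later comments in this thread note):

```sql
-- Sketch: list the inbound datashare tables/views that the
-- pg_tables/pg_views query above cannot see.
select
  split_part(object_name, '.', 1) as schema_name,
  split_part(object_name, '.', 2) as object_name,
  object_type
from svv_datashare_objects
where share_type = 'INBOUND';
```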

...
09:25:33  On macro_test5: /* {"app": "dbt", "dbt_version": "1.2.2", "profile_name": "user", "target_name": "default", "connection_name": "macro_test5"} */
select
      '[database]' as database,
      tablename as name,
      schemaname as schema,
      'table' as type
    from pg_tables
    where schemaname ilike '[schema]'
    union all
    select
      '[database]' as database,
      viewname as name,
      schemaname as schema,
      'view' as type
    from pg_views
    where schemaname ilike '[schema]'
09:25:34  SQL status: SELECT in 0.24 seconds
09:25:34  with database=[database], schema=[schema], relations=[]
09:25:34  Postgres adapter: Error running SQL: macro test5
...

Issue 2

Calling adapter.get_columns_in_relation on a data share object does not raise an error, but returns an empty result:

{% macro test6() -%}
    
    {%- set cols = adapter.get_columns_in_relation(source('[name]', '[object]')) -%}
    {{ log('Detail: ' ~ cols, True) }}

{%- endmacro %}
05:01:51  Detail: []

Describe alternatives you've considered

The statement used to check whether the object of concern exists is found in the macro redshift__list_relations_without_caching (which references the macro postgres__list_relations_without_caching), so one solution is to override redshift__list_relations_without_caching to get the required information from svv_datashare_objects.

Similarly, the statement used to check which columns exist is found in the macro redshift__get_columns_in_relation, so a solution is to override redshift__get_columns_in_relation to get the required information from svv_all_columns and svv_datashare_objects.

{% macro redshift__list_relations_without_caching(schema_relation) %}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    select
      '{{ schema_relation.database }}' as database,
      tablename as name,
      schemaname as schema,
      'table' as type
    from pg_tables
    where schemaname ilike '{{ schema_relation.schema }}'
    union all
    select
      '{{ schema_relation.database }}' as database,
      viewname as name,
      schemaname as schema,
      'view' as type
    from pg_views
    where schemaname ilike '{{ schema_relation.schema }}'
    union all
    select
      '{{ schema_relation.database }}' as database,
      split_part(object_name, '.', 2) as name,
      split_part(object_name, '.', 1) as schema,
      'datashare' as type
    from svv_datashare_objects
    where share_type = 'INBOUND'
    and   object_type in ('table', 'view', 'materialized view', 'late binding view')
    and   split_part(object_name, '.', 1) ilike '{{ schema_relation.schema }}'
  {% endcall %}
  {{ return(load_result('list_relations_without_caching').table) }}
{% endmacro %}

{% macro redshift__get_columns_in_relation(relation) -%}
  {% call statement('get_columns_in_relation', fetch_result=True) %}
      with bound_views as (
        select
          ordinal_position,
          table_schema,
          column_name,
          data_type,
          character_maximum_length,
          numeric_precision,
          numeric_scale

        from information_schema."columns"
        where table_name = '{{ relation.identifier }}'
    ),

    unbound_views as (
      select
        ordinal_position,
        view_schema,
        col_name,
        case
          when col_type ilike 'character varying%' then
            'character varying'
          when col_type ilike 'numeric%' then 'numeric'
          else col_type
        end as col_type,
        case
          when col_type like 'character%'
          then nullif(REGEXP_SUBSTR(col_type, '[0-9]+'), '')::int
          else null
        end as character_maximum_length,
        case
          when col_type like 'numeric%'
          then nullif(
            SPLIT_PART(REGEXP_SUBSTR(col_type, '[0-9,]+'), ',', 1),
            '')::int
          else null
        end as numeric_precision,
        case
          when col_type like 'numeric%'
          then nullif(
            SPLIT_PART(REGEXP_SUBSTR(col_type, '[0-9,]+'), ',', 2),
            '')::int
          else null
        end as numeric_scale

      from pg_get_late_binding_view_cols()
      cols(view_schema name, view_name name, col_name name,
           col_type varchar, ordinal_position int)
      where view_name = '{{ relation.identifier }}'
    ),

    external_views as (
      select
        columnnum,
        schemaname,
        columnname,
        case
          when external_type ilike 'character varying%' or external_type ilike 'varchar%'
          then 'character varying'
          when external_type ilike 'numeric%' then 'numeric'
          else external_type
        end as external_type,
        case
          when external_type like 'character%' or external_type like 'varchar%'
          then nullif(
            REGEXP_SUBSTR(external_type, '[0-9]+'),
            '')::int
          else null
        end as character_maximum_length,
        case
          when external_type like 'numeric%'
          then nullif(
            SPLIT_PART(REGEXP_SUBSTR(external_type, '[0-9,]+'), ',', 1),
            '')::int
          else null
        end as numeric_precision,
        case
          when external_type like 'numeric%'
          then nullif(
            SPLIT_PART(REGEXP_SUBSTR(external_type, '[0-9,]+'), ',', 2),
            '')::int
          else null
        end as numeric_scale
      from
        pg_catalog.svv_external_columns
      where
        schemaname = '{{ relation.schema }}'
        and tablename = '{{ relation.identifier }}'

    ),

    data_share as (
      select
        ordinal_position as columnnum,
        schema_name as schemaname,
        column_name as columnname,
        case
          when data_type ilike 'character varying%' then
            'character varying'
          when data_type ilike 'numeric%' then 'numeric'
          else data_type
        end as col_type,
        case
          when data_type like 'character%'
          then nullif(REGEXP_SUBSTR(data_type, '[0-9]+'), '')::int
          else null
        end as character_maximum_length,
        case
          when data_type like 'numeric%'
          then nullif(
            SPLIT_PART(REGEXP_SUBSTR(data_type, '[0-9,]+'), ',', 1),
            '')::int
          else null
        end as numeric_precision,
        case
          when data_type like 'numeric%'
          then nullif(
            SPLIT_PART(REGEXP_SUBSTR(data_type, '[0-9,]+'), ',', 2),
            '')::int
          else null
        end as numeric_scale
      from svv_all_columns a
      inner join (
        select *
        from svv_datashare_objects
        where share_type = 'INBOUND'
        and   object_type in ('table', 'view', 'materialized view', 'late binding view')
        and   object_name = '{{ relation.schema }}' || '.' || '{{ relation.identifier }}'
      ) b on a.schema_name || '.' || a.table_name  = b.object_name
    ),

    unioned as (
      select * from bound_views
      union all
      select * from unbound_views
      union all
      select * from external_views
      union all
      select * from data_share
    )

    select
      column_name,
      data_type,
      character_maximum_length,
      numeric_precision,
      numeric_scale

    from unioned
    {% if relation.schema %}
    where table_schema = '{{ relation.schema }}'
    {% endif %}
    order by ordinal_position
  {% endcall %}
  {% set table = load_result('get_columns_in_relation').table %}
  {{ return(sql_convert_columns_in_relation(table)) }}
{% endmacro %}
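For anyone trying the overrides above: placing both macros in the project's own macros/ directory causes dbt's dispatch to prefer them over the adapter's built-in implementations. A quick smoke test via run-operation (the source name here is hypothetical, not from this issue):

```sql
{% macro check_datashare_columns() %}
  {# Dispatches to the project-level redshift__get_columns_in_relation override #}
  {% set cols = adapter.get_columns_in_relation(source('warehouse', 'fct_table_1')) %}
  {% for col in cols %}
    {{ log(col.name ~ ': ' ~ col.data_type, info=True) }}
  {% endfor %}
{% endmacro %}
```

Invoked with `dbt run-operation check_datashare_columns`; a non-empty column list confirms the override is being picked up.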

Who will this benefit?

Any users who are exploring the use of Redshift data shares.

Are you interested in contributing this feature?

No response

Anything else?

No response

@randy-rc randy-rc added enhancement New feature or request triage labels Nov 9, 2022
@github-actions github-actions bot changed the title [Feature] Add support for data share objects [CT-1482] [Feature] Add support for data share objects Nov 9, 2022
@jtcohen6
Contributor

@randy-rc Thanks for the detailed write-up!

This issue sounds a lot like #179. It would probably make sense to tackle the two together.

Tables in other databases (accessed via RA3 cross-db reads) don't appear in pg_tables or pg_views, either. They have to come from svv_tables.

In this case, it looks like there's an svv_datashare_objects system view, and that we'll also want to pull column-level info from svv_all_columns (in addition to information_schema."columns" and pg_catalog.svv_external_columns).

@jtcohen6 jtcohen6 removed the triage label Nov 14, 2022
@Fleid Fleid self-assigned this Nov 14, 2022
@randy-rc
Author

randy-rc commented Nov 28, 2022

Updated the data share section of the redshift__list_relations_without_caching macro to check for relations under the right database. Prior to this change, generating snapshots in the consumer failed because it detected the resulting tables in the data share.

{% macro redshift__list_relations_without_caching(schema_relation) %}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    select
      '{{ schema_relation.database }}' as database,
      tablename as name,
      schemaname as schema,
      'table' as type
    from pg_tables
    where schemaname ilike '{{ schema_relation.schema }}'
    union all
    select
      '{{ schema_relation.database }}' as database,
      viewname as name,
      schemaname as schema,
      'view' as type
    from pg_views
    where schemaname ilike '{{ schema_relation.schema }}'
    union all
    (
      with
      shared_db_schema as (
        select
          database_name,
          schema_name
        from  svv_all_schemas
        where schema_type='shared'
      ),
      inbound_datashare_obj as (
        select
          split_part(object_name, '.', 1) as schema_name,
          split_part(object_name, '.', 2) as table_name
        from  svv_datashare_objects
        where share_type='INBOUND'
        and   object_type in ('table', 'view', 'materialized view', 'late binding view')
      )
      select
        a.database_name as database,
        a.table_name    as name,
        a.schema_name   as schema,
        'datashare'     as type
      from       svv_all_tables a
      inner join shared_db_schema b
      on         a.database_name = b.database_name
      and        a.schema_name = b.schema_name
      inner join inbound_datashare_obj c
      on         a.schema_name = c.schema_name
      and        a.table_name = c.table_name
      where      a.database_name ilike '{{ schema_relation.database }}'
      and        a.schema_name ilike '{{ schema_relation.schema }}'
    )
  {% endcall %}
  {{ return(load_result('list_relations_without_caching').table) }}
{% endmacro %}

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the Stale label May 28, 2023
@Fleid Fleid removed the Stale label May 31, 2023
@rarup1

rarup1 commented Jun 28, 2023

@jiezhen-chen Is this issue going to be resolved soon?
We are using Redshift datasharing with a Redshift serverless as our CICD / Dev environment as it helps separate our resource use. The dbt.utils.union_relations is an example of a utils that doesn't work if you are using a datashare since it is looking in the info_schema before a model is created in the server (if you are deferring to a datashare). The work around we are using at the moment is to have to build all upstream models of a union in the Redshift Serverless environment before we call the macro. It is not the best experience as it stands.

Grateful if this can be resolved in the macro / passing an override to point at the datashare schema...

Thanks!

@rarup1

rarup1 commented Jul 30, 2023

Just seen that this didn't end up being fixed in dbt 1.6 - is there any chance this will be looked at soon?

Cheers

@yuna-tang

@Fleid Just would like some updates: when will this issue be looked at for a future dbt version? We are planning to use Serverless for our CI/CD, and currently our workaround is to use the above override macros.

@jiezhen-chen
Contributor

@rarup1 Unfortunately the PR that was merged didn't end up providing support for datashares. We decided not to use the redshift_connector API call, and instead queried information_schema, because the redshift_connector API call doesn't support materialized views at the moment. Apologies for the confusion!

@yuna-tang

@jiezhen-chen Can you please let us know when dbt will support Redshift datashare?

@rarup1

rarup1 commented Aug 10, 2023

@yuna-tang are you able to give me a run through of how you have got your workaround CICD/Dev process working with a RS datashare?

I am trying to implement the override macros that were shared above by @randy-rc:

{% macro redshift__list_relations_without_caching(schema_relation) %}
{% macro redshift__get_columns_in_relation(relation) -%}

I have set up a test model test_union.sql using the dbt_utils.union_relations macro as follows:

WITH _union as (
{{ dbt_utils.union_relations(relations=[ref('fct_table_1'), ref('fct_table_2')]) }}
)

SELECT * FROM _union

Where fct_table_1 and fct_table_2 are only in a datashare named warehouse and the target db is dev.

And then running a deferral:

dbt run -m test_union --state target_defer --defer

Where the default target db is dev, and deferrals all resolve back to the datashare db warehouse through target_defer/manifest.json if the models do not exist in the developer's schema (e.g. dbt_rupert_core.fct_table_1 and dbt_rupert_core.fct_table_2). In this example they do not, and that is where the problems arise for us.

The issue I am having at the moment, after deploying the macros above, is actually getting them to pick up the datashare columns. The compiled SQL is column-less:

  create view "dev"."dbt_rupert_core"."test_union__dbt_tmp" as (
    WITH _union as (
        (select cast('"warehouse"."models_core"."fct_table_1"' as varchar) as _dbt_source_relation,
            from "warehouse"."models_core"."fct_table_1"
        )

        union all
        (select cast('"warehouse"."models_core"."fct_table_2"' as varchar) as _dbt_source_relation,
            from "warehouse"."models_core"."fct_table_2"
        )
)
SELECT * FROM _union
  ) with no schema binding;

Interested to hear how you got yours working! Cheers

update

It was due to svv_datashare_objects not being accessible for the warehouse datashare to any user other than root/admin. I got it working by testing out the whole process with root as the user. The performance was shockingly slow, however, and I will be looking into saving the svv_ tables in our dwh as a workaround.
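The workaround mentioned above could look something like the following, run periodically as an admin (a sketch only; the "admin" schema and "dbt_developers" group are hypothetical names, and the snapshot goes stale between refreshes):

```sql
-- Sketch: snapshot the privileged system view into a regular table
-- that non-admin users can be granted access to.
create table admin.svv_datashare_objects_copy as
select * from svv_datashare_objects;

grant select on admin.svv_datashare_objects_copy to group dbt_developers;
```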

@yuna-tang

@rarup1 sorry for the late reply. I also tried the test model test_union.sql, which generated empty columns for me.

For us to use serverless with a datashare, we need to configure all our datasource.yml files to hardcode the database & schema to our main cluster; then we are able to create sample versions of downstream models in serverless.

version: 2
sources:
  - name: <your_data_source>
    database: <your_database>
    schema: <your_data_source>

Then we need to do the below in the datashare:

  • add all the schemas from production into the datashare
ALTER DATASHARE <your_datashare_name> ADD SCHEMA <your_data_source>;
ALTER DATASHARE <your_datashare_name> ADD ALL TABLES IN SCHEMA <your_data_source>;
ALTER DATASHARE <your_datashare_name> SET INCLUDENEW = TRUE FOR SCHEMA <your_data_source>;
  • create all the users in the datashare that exist in the main cluster

Then I followed the steps in [this article](https://medium.com/building-inventa/how-we-slimmed-down-slim-ci-for-dbt-cloud-6a944e7574e2) to create a sample production database and run CI/CD against it using serverless. We are still at the testing stage, but we are able to create a full sample database in serverless via the datashare with 1 day of data.



This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Feb 12, 2024
@mikealfare mikealfare removed the Stale label Feb 12, 2024
@Fleid Fleid removed their assignment Feb 14, 2024
@Fleid Fleid self-assigned this Feb 22, 2024
@dataders dataders added the ra3_node issues relating to ra3 node support label Mar 26, 2024
@dataders
Copy link
Contributor

dataders commented Apr 3, 2024

closing in favor of #742

@dataders dataders closed this as completed Apr 3, 2024