
New command: dbt clone #7258

Closed · wants to merge 7 commits
6 changes: 6 additions & 0 deletions .changes/unreleased/Features-20230401-193614.yaml
@@ -0,0 +1,6 @@
kind: Features
body: 'New command: ''dbt clone'''
time: 2023-04-01T19:36:14.622217+02:00
custom:
Author: jtcohen6
Issue: "7256"
37 changes: 37 additions & 0 deletions core/dbt/cli/main.py
@@ -25,6 +25,7 @@
from dbt.task.build import BuildTask
from dbt.task.clean import CleanTask
from dbt.task.compile import CompileTask
from dbt.task.clone import CloneTask
from dbt.task.debug import DebugTask
from dbt.task.deps import DepsTask
from dbt.task.freshness import FreshnessTask
@@ -393,6 +394,42 @@ def show(ctx, **kwargs):
return results, success


# dbt clone
@cli.command("clone")
@click.pass_context
@p.exclude
@p.full_refresh
@p.profile
@p.profiles_dir
@p.project_dir
@p.resource_type
@p.select
@p.selector
@p.state # required
@p.target
@p.target_path
@p.threads
@p.vars
@p.version_check
@requires.preflight
@requires.profile
@requires.project
@requires.runtime_config
@requires.manifest
@requires.postflight
def clone(ctx, **kwargs):
"""Create clones of selected nodes based on their location in the manifest provided to --state."""
task = CloneTask(
ctx.obj["flags"],
ctx.obj["runtime_config"],
ctx.obj["manifest"],
)

results = task.run()
success = task.interpret_results(results)
return results, success


# dbt debug
@cli.command("debug")
@click.pass_context
14 changes: 14 additions & 0 deletions core/dbt/context/providers.py
@@ -1427,6 +1427,20 @@ def this(self) -> Optional[RelationProxy]:
return None
return self.db_wrapper.Relation.create_from(self.config, self.model)

@contextproperty
def state_relation(self) -> Optional[RelationProxy]:
Comment (Contributor Author): Open to naming suggestions! This will only be available in the context for the clone command currently.

Comment (Contributor): Is "relation" being used as a byword for table/view? "Relation" connotes so many things, and I feel like we want an additional term or a better term that can be more specific about what you're trying to achieve. Perhaps "stateful_db_relation". I really don't know the subtleties here though, so you might have picked the best one.

"""
For commands which add information about this node's corresponding
production version (via a --state artifact), access the Relation
object for that stateful other
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other feels vague (well, really, just Latinate/French in phrasing :) ). "source node", etc.?

"""
if getattr(self.model, "state_relation", None):
return self.db_wrapper.Relation.create_from_node(
self.config, self.model.state_relation # type: ignore
)
else:
return None


# This is called by '_context_for', used in 'render_with_context'
def generate_parser_model_context(
27 changes: 27 additions & 0 deletions core/dbt/contracts/graph/manifest.py
@@ -37,6 +37,7 @@
GraphMemberNode,
ResultNode,
BaseNode,
StateRelation,
ManifestOrPublicNode,
)
from dbt.contracts.graph.unparsed import SourcePatch, NodeVersion, UnparsedVersion
@@ -1103,6 +1104,30 @@ def merge_from_artifact(
sample = list(islice(merged, 5))
fire_event(MergedFromState(num_merged=len(merged), sample=sample))

# Called by CloneTask.defer_to_manifest
def add_from_artifact(
self,
other: "WritableManifest",
) -> None:
"""Update this manifest by *adding* information about each node's location
in the other manifest.

Only non-ephemeral refable nodes are examined.
"""
refables = set(NodeType.refable())
for unique_id, node in other.nodes.items():
current = self.nodes.get(unique_id)
if current and (node.resource_type in refables and not node.is_ephemeral):
other_node = other.nodes[unique_id]
state_relation = StateRelation(
other_node.database, other_node.schema, other_node.alias
)
self.nodes[unique_id] = current.replace(state_relation=state_relation)

# Rebuild the flat_graph, which powers the 'graph' context variable,
# now that we've deferred some nodes
self.build_flat_graph()
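The merge logic above can be sketched with simplified stand-in types. This is a hypothetical sketch: the real method operates on dbt's `Manifest`, `WritableManifest`, and node classes, and `NodeType.refable()` covers more than the hard-coded set used here.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional

@dataclass(frozen=True)
class StateRelation:
    database: Optional[str]
    schema: str
    alias: str

@dataclass(frozen=True)
class Node:
    resource_type: str
    database: Optional[str]
    schema: str
    alias: str
    is_ephemeral: bool = False
    state_relation: Optional[StateRelation] = None

# Stand-in for NodeType.refable()
REFABLE = {"model", "seed", "snapshot"}

def add_from_artifact(current: Dict[str, Node], other: Dict[str, Node]) -> Dict[str, Node]:
    """Annotate each matching refable, non-ephemeral node in 'current' with
    the database/schema/alias it has in the 'other' (state) manifest."""
    out = dict(current)
    for unique_id, node in other.items():
        cur = out.get(unique_id)
        if cur and node.resource_type in REFABLE and not node.is_ephemeral:
            out[unique_id] = replace(
                cur, state_relation=StateRelation(node.database, node.schema, node.alias)
            )
    return out
```

Nodes that exist only in the state manifest are ignored: only nodes present in the current manifest get a `state_relation` attached.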

# Methods that were formerly in ParseResult

def add_macro(self, source_file: SourceFile, macro: Macro):
@@ -1316,6 +1341,8 @@ def __post_serialize__(self, dct):
for unique_id, node in dct["nodes"].items():
if "config_call_dict" in node:
del node["config_call_dict"]
if "state_relation" in node:
del node["state_relation"]
Comment on lines +1344 to +1345 (Contributor Author): It doesn't feel like state_relation is a thing we should need/want to include in serialized manifest.json. It's really just for our internal use.
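The stripping done in `__post_serialize__` amounts to dropping internal-only keys from each serialized node. A minimal sketch, with plain dicts standing in for the real serialization path:

```python
def post_serialize(dct: dict) -> dict:
    """Drop keys that are for internal use only, so they never appear
    in the serialized manifest.json."""
    for node in dct.get("nodes", {}).values():
        node.pop("config_call_dict", None)
        node.pop("state_relation", None)
    return dct
```

Using `dict.pop` with a default makes the stripping a no-op for nodes that never had those keys, unlike the `if ... in node: del node[...]` pattern in the diff, which is equivalent but more verbose.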

return dct


16 changes: 15 additions & 1 deletion core/dbt/contracts/graph/nodes.py
@@ -270,6 +270,17 @@ def add_public_node(self, value: str):
self.public_nodes.append(value)


@dataclass
class StateRelation(dbtClassMixin):
database: Optional[str]
schema: str
alias: str

@property
def identifier(self):
return self.alias


@dataclass
class ParsedNodeMandatory(GraphNode, HasRelationMetadata, Replaceable):
alias: str
@@ -358,7 +369,7 @@ def __post_serialize__(self, dct):
@classmethod
def _deserialize(cls, dct: Dict[str, int]):
# The serialized ParsedNodes do not differ from each other
# in fields that would allow 'from_dict' to distinguis
# in fields that would allow 'from_dict' to distinguish
# between them.
resource_type = dct["resource_type"]
if resource_type == "model":
@@ -615,6 +626,7 @@ class ModelNode(CompiledNode):
constraints: List[ModelLevelConstraint] = field(default_factory=list)
version: Optional[NodeVersion] = None
latest_version: Optional[NodeVersion] = None
state_relation: Optional[StateRelation] = None

@property
def is_latest_version(self) -> bool:
@@ -797,6 +809,7 @@ class SeedNode(ParsedNode):  # No SQLDefaults!
# and we need the root_path to load the seed later
root_path: Optional[str] = None
depends_on: MacroDependsOn = field(default_factory=MacroDependsOn)
state_relation: Optional[StateRelation] = None

def same_seeds(self, other: "SeedNode") -> bool:
# for seeds, we check the hashes. If the hashes are different types,
@@ -995,6 +1008,7 @@ class IntermediateSnapshotNode(CompiledNode):
class SnapshotNode(CompiledNode):
resource_type: NodeType = field(metadata={"restrict": [NodeType.Snapshot]})
config: SnapshotConfig
state_relation: Optional[StateRelation] = None


# ====================================
@@ -0,0 +1,115 @@
{% macro can_clone_tables() %}
{{ return(adapter.dispatch('can_clone_tables', 'dbt')()) }}
{% endmacro %}


{% macro default__can_clone_tables() %}
{{ return(False) }}
{% endmacro %}


{% macro snowflake__can_clone_tables() %}
{{ return(True) }}
{% endmacro %}
Comment on lines +1 to +13

Comment (Contributor Author): This kind of True/False conditional behavior might be better as an adapter property/method (Python), since it's really a property of the adapter / data platform, rather than something a specific user wants to reimplement. Comparable to the "boolean macros" we defined for logic around grants. Open to thoughts!

Comment (Contributor Author): Just to state the obvious: any snowflake__ macros would want to move to dbt-snowflake as part of implementing & testing this on our adapters! I'm just defining them here for now for the sake of comparison & convenience (= laziness).

Comment (Contributor): I concur with both comments.
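The adapter-property alternative suggested here could look roughly like this. A hypothetical sketch: the real `BaseAdapter` interface lives in dbt's adapter framework, and these class and method names are illustrative only.

```python
class BaseAdapter:
    @classmethod
    def can_clone_tables(cls) -> bool:
        # Default: assume the data platform has no zero-copy clone support
        return False

class SnowflakeAdapter(BaseAdapter):
    @classmethod
    def can_clone_tables(cls) -> bool:
        # Snowflake supports CREATE TABLE ... CLONE
        return True
```

A capability flag like this would replace the dispatched `can_clone_tables()` macro trio above, keeping a platform fact out of user-overridable macro space.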



{% macro get_pointer_sql(to_relation) %}
{{ return(adapter.dispatch('get_pointer_sql', 'dbt')(to_relation)) }}
{% endmacro %}


{% macro default__get_pointer_sql(to_relation) %}
{% set pointer_sql %}
select * from {{ to_relation }}
{% endset %}
{{ return(pointer_sql) }}
{% endmacro %}

Comment (Contributor, @VersusFacit, May 24, 2023): I feel uncomfortable with this use of "pointer". I'm not exactly sure why we're drawing that comparison, when perhaps we should use something closer to "reference", or allude to the fact that there's a shadow copy. In my mind "pointer" is just a very specific thing and somewhat anachronistic in a SQL context.


{% macro get_clone_table_sql(this_relation, state_relation) %}
{{ return(adapter.dispatch('get_clone_table_sql', 'dbt')(this_relation, state_relation)) }}
{% endmacro %}


{% macro default__get_clone_table_sql(this_relation, state_relation) %}
create or replace table {{ this_relation }} clone {{ state_relation }}
{% endmacro %}


{% macro snowflake__get_clone_table_sql(this_relation, state_relation) %}
create or replace
{{ "transient" if config.get("transient", true) }}
table {{ this_relation }}
clone {{ state_relation }}
{{ "copy grants" if config.get("copy_grants", false) }}
{% endmacro %}

Comment (Contributor Author): Snowflake requires that table clones match the transience/permanence of the table they're cloning. We determine whether the other relation is a table based on the cache result from the other/prod schema, but here we're just using the current (dev) configuration for transient. There's the possibility of a mismatch if a user has updated the transient config in development.


{%- materialization clone, default -%}

{%- set relations = {'relations': []} -%}

{%- if not state_relation -%}
-- nothing to do
{{ log("No relation found in state manifest for " ~ model.unique_id) }}
{{ return(relations) }}
{%- endif -%}

{%- set existing_relation = load_cached_relation(this) -%}

{%- if existing_relation and not flags.FULL_REFRESH -%}
-- noop!
{{ log("Relation " ~ existing_relation ~ " already exists") }}
{{ return(relations) }}
{%- endif -%}

{%- set other_existing_relation = load_cached_relation(state_relation) -%}

-- If this is a database that can do zero-copy cloning of tables, and the other relation is a table, then this will be a table
-- Otherwise, this will be a view
Comment on lines +68 to +69

Comment (Contributor Author): Data platforms that support table cloning:

  • Snowflake (docs)
  • BigQuery (docs)
  • Databricks (docs, with two modes: "shallow" (zero-copy) and "deep" (full copy))

Data platforms that don't:

  • Postgres
  • Redshift
  • Trino

{% set can_clone_tables = can_clone_tables() %}

{%- if other_existing_relation and other_existing_relation.type == 'table' and can_clone_tables -%}

{%- set target_relation = this.incorporate(type='table') -%}
{% if existing_relation is not none and not existing_relation.is_table %}
{{ log("Dropping relation " ~ existing_relation ~ " because it is of type " ~ existing_relation.type) }}
{{ drop_relation_if_exists(existing_relation) }}
{% endif %}

-- as a general rule, data platforms that can clone tables can also do atomic 'create or replace'
Comment (Contributor Author): This is me being a little bit lazy, but it does hold true for Snowflake/BigQuery/Databricks. My goal is for all of our data platforms to be able to use this materialization "as is," and only reimplement a targeted set of macros where the behavior genuinely differs.

{% call statement('main') %}
{{ get_clone_table_sql(target_relation, state_relation) }}
{% endcall %}

{% set should_revoke = should_revoke(existing_relation, full_refresh_mode=True) %}
{% do apply_grants(target_relation, grant_config, should_revoke=should_revoke) %}
{% do persist_docs(target_relation, model) %}
Comment on lines +86 to +88 (Contributor Author): I'm thinking we should still apply grants & table/column-level comments. I have a suspicion that whether these things are copied over during cloning varies by data platform; I should really look to confirm/reject that suspicion. It's also possible that the user has defined conditional logic for these that differs between dev & prod, especially grants.


{{ return({'relations': [target_relation]}) }}

{%- else -%}

{%- set target_relation = this.incorporate(type='view') -%}

-- TODO: this should probably be illegal
-- I'm just doing it out of convenience to reuse the 'view' materialization logic
{%- do context.update({
'sql': get_pointer_sql(state_relation),
'compiled_code': get_pointer_sql(state_relation)
}) -%}

-- reuse the view materialization
-- TODO: support actual dispatch for materialization macros
{% set search_name = "materialization_view_" ~ adapter.type() %}
{% if not search_name in context %}
{% set search_name = "materialization_view_default" %}
{% endif %}
{% set materialization_macro = context[search_name] %}
{% set relations = materialization_macro() %}
Comment on lines +96 to +110

Comment (Contributor Author): This is some of the most inspired (read: horrifying) code I've written in years. Big idea:

  • We want to create a view.
  • We want the SQL in that view definition to be select * from {other_relation}.
  • We already know how to create views for this data platform — it's the view materialization! — with all its atomicity and accoutrements.
  • What if we could just call and reuse the view materialization, within this one? Well... we actually can :)

Specific things to call out:

  • Writing back into the context (via context.update) feels a little bit illegal... but I don't think it's actually dangerous; the context is specific to this node.
  • It would be all-around better if materialization (+ test) macros followed the same naming convention as other dispatched macros, with adapter__ prefixes, so that this could just be adapter.dispatch("materialization_view") ([CT-112] Better UX for macro dispatch #4646).
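The name-based lookup in the macro code above, falling back from materialization_view_<adapter> to materialization_view_default, behaves like this Python sketch, with a plain dict standing in for the Jinja context:

```python
def find_materialization(context: dict, adapter_type: str):
    """Look up an adapter-specific view materialization in the context;
    fall back to the default implementation when none is registered."""
    search_name = "materialization_view_" + adapter_type
    if search_name not in context:
        search_name = "materialization_view_default"
    return context[search_name]
```

This is the manual equivalent of what adapter.dispatch already does for ordinary macros, which is exactly the gap the linked issue (#4646) describes.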

{{ return(relations) }}

{%- endif -%}

{%- endmaterialization -%}