Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1162] Blueprint optional project ref statement parameter #5826

Closed
dpguthrie opened this issue Sep 13, 2022 · 11 comments
Closed

[CT-1162] Blueprint optional project ref statement parameter #5826

dpguthrie opened this issue Sep 13, 2022 · 11 comments
Assignees
Labels
stale Issues that have gone stale

Comments

@dpguthrie
Copy link

Success Criteria:

  • Links to existing files in dbt-core that need to be updated
  • Net new files in dbt-core and why it needs new files
  • Potential "gotcha" implementations based on the above(think: if we update this file here, we have to update 100 files downstream)
  • pseudo-code snippets on initial approach for the above to make this functionality work
@dpguthrie dpguthrie self-assigned this Sep 13, 2022
@github-actions github-actions bot changed the title Blueprint optional project ref statement parameter [CT-1162] Blueprint optional project ref statement parameter Sep 13, 2022
@dpguthrie
Copy link
Author

@dpguthrie
Copy link
Author

dpguthrie commented Sep 13, 2022

Links to Existing Files (WIP)

  • Model uniqueness constraint across projects (core/dbt/parser/manifest.py#971)
  • Some possibilities related to the target_model_package which we currently allow as an argument to ref:
    • Here's where we begin to process the ref and the package name comes into play (dbt/parser/manifest.py)
    • The manifest is then responsible for resolving this ref with an internal method here. This method accepts the target_model_package as an Optional argument.

@dpguthrie
Copy link
Author

dpguthrie commented Sep 19, 2022

Progress Update

Question 1: What do we need to do to make ref accept a project parameter?

Answer: Nothing! Currently, the ref argument accepts two arguments:

  1. target_model_package
  2. target_model_name

The logic exists within _process_refs_for_node

if len(ref) == 1:
    target_model_name = ref[0]
elif len(ref) == 2:
    target_model_package, target_model_name = ref

However, the projects are scoped to what we've defined in our packages.yml file as well as internal packages. The upstream project is not a package though. We don't need to have the project locally but just a representation of it, something like a pared-down manifest.json file.

Question 2: What do we need to do then to resolve a reference to an upstream contract?

This is where it starts to get tricky.

The _process_refs_for_node function mentioned above will eventually call resolve_ref, which is a method on the Manifest class. This eventually calls an internal object called ref_lookup, which is a pointer to a RefableLookup object on the manifest instance. This object has a storage property that is populated from the nodes in the instantiated manifest

def populate(self, manifest):
    for node in manifest.nodes.values():
        self.add_node(node)

My guess is we don't want to include a "node" from an upstream project within the nodes in our manifest, as we don't want these to actually be run in our project's scope. We simply need a way to resolve the appropriate references to the locations of those contract nodes.

So, what could be some possible implementations:

Alter the way we populate the ref_lookup?

def populate(self, manifest):
    for node in chain(manifest.nodes.values(), manifest.contracts_upstream.values()):
        self.add_node(node)

This assumes:

  • we create a contracts_upstream property in our manfiest
    • The Manifest dataclass would have to be updated to account for a new field (dbt/contracts/graph/manifest.py)
    • [WIP] - Still investigating other places where we'd have to update to accommodate for a change like this
  • it would be constructed very similarly to our nodes, in that each resource would be of the ManifestNode type. Presumably, though, the structure for this would already have been created by the upstream project and we should just be able to leverage that during parse time.
contracts_upstream: MutableMapping[str, ManifestNode] = field(default_factory=dict)

Gotchas

Without making any changes to how this is populated, the names of each node would have to be unique across each project. Otherwise, you'd run into scenarios where you would overwrite an existing node within the storage dictionary if it exists more than one time across each project.

This is how a node is added to the storage dictionary:

def add_node(self, node: ManifestNode):
    if node.resource_type in self._lookup_types:
        if node.name not in self.storage:
            self.storage[node.name] = {}
        self.storage[node.name][node.package_name] = node.unique_id

Create a new ContractLookup

When we resolve_ref the node, it could be set to this:

node = (
    self.ref_lookup.find(target_model_name, pkg, self)
    or self.contract_lookup.find(target_model_name, pkg, self)
)

If it's not found in the manifest's nodes then look in the contracts_upstream.

Open Questions:

  • How does this hinder being able to use a contracts-like selector to build from?
  • What would we have to do to include these contracts in docs?
  • Would this even work?

@sungchun12
Copy link
Contributor

After your first answer to your open questions, let me know where I can help round out the answers

@dpguthrie dpguthrie linked a pull request Sep 20, 2022 that will close this issue
6 tasks
@dpguthrie
Copy link
Author

dpguthrie commented Sep 20, 2022

Internal slack feedback for the first attempt in the related PR

Things that this makes me start thinking about:

  • Does this work if the upstream model has the same name as a model in the current project? What would need to change (this PR)?
  • How should the upstream project specify which models are public / ref'able?
  • What sort of error should the downstream project see when trying to ref a private model?
  • Should a "bad ref from other project" error be raised at parse time, or as the very first step during runtime? The former sounds better at first blush, but it would mean that you always need the full "export" artifact from the upstream project just to parse your own project (including dbt ls). I'm honestly not sure about this one!

@dpguthrie
Copy link
Author

Does this work if the upstream model has the same name as a model in the current project? What would need to change (#3053)?

My first inclination is that this will work if an upstream model has the same name as one in the current project. The reason being is how uniqueness is determined:

def _check_resource_uniqueness(
    manifest: Manifest,
    config: RuntimeConfig,
) -> None:
    names_resources: Dict[str, ManifestNode] = {}
    alias_resources: Dict[str, ManifestNode] = {}

    for resource, node in manifest.nodes.items():

This function iterates through the nodes property on the manifest. This proposal creates an entirely new property, consumers on the manifest, which wouldn't be considered, at least in its current form, as dbt checks for uniqueness.

@dpguthrie
Copy link
Author

dpguthrie commented Sep 20, 2022

How should the upstream project specify which models are public / ref'able?

I really like @christineberger's suggestion in the original discussion where the developer explicitly defines what models to make public. As an aside, I think the default here is false and it's up to the developer to opt-in to anything being made public.

models:
    # Model folder level
    +shared: true
    # Sub-folder level
    staging:
       +shared: false

This configuration would share all models except for one's located in the staging folder. This configuration would then be used to inform what actually makes it into a json file that a downstream project could consume.

@dpguthrie
Copy link
Author

What sort of error should the downstream project see when trying to ref a private model?

I think the same compilation error that you would get today if refing a model that doesn't exist in your own project. We shouldn't even state that they're trying to ref a private model.

I think you should treat it as web services treat private resources. For instance, I know the urls of private repos at my last company, but because I no longer have permission to view those, I'll receive 404s if I try directly navigating to those pages. They exist! But that doesn't mean I should have that information.

@dpguthrie dpguthrie linked a pull request Sep 28, 2022 that will close this issue
6 tasks
@dpguthrie dpguthrie removed a link to a pull request Sep 28, 2022
6 tasks
@dpguthrie
Copy link
Author

Another implementation thought I had is to add some more properties to our NodeConfig (or something in a higher parent class). The idea being that these nodes that we're pulling in from upstream projects, are just that, nodes. Shouldn't they live alongside the nodes that we've compiled from our project and other packages? What could some other properties include that would delineate these from the others though?
- execute - This should have a default to True. The thinking behind this is we need a way to identify nodes from contracts so we don't run them within our project's scope (unless given permission to do so by the upstream project)
- share - This would be used by whatever functions we develop to create the json artifact an upstream producer produces.
- is_contract - Pretty self explanatory

I think the downstream impacts of this approach are much greater than the initial implementation proposed using a ConsumerLookup but it may make more sense.

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Mar 28, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Apr 5, 2023

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Issues that have gone stale
Development

Successfully merging a pull request may close this issue.

2 participants