Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbt_project.yml v2 #2300

Closed
beckjake opened this issue Apr 7, 2020 · 11 comments · Fixed by #2312
Closed

dbt_project.yml v2 #2300

beckjake opened this issue Apr 7, 2020 · 11 comments · Fixed by #2312
Labels
1.0.0 Issues related to the 1.0.0 release of dbt enhancement New feature or request

Comments

@beckjake
Copy link
Contributor

beckjake commented Apr 7, 2020

Describe the feature

dbt_project.yml should get a version 2. We can denote it by adding an optional config-version: field that defaults to v1 and encourage users to migrate to v2 by creating a migration script.

Many of these changes will enable/make it easier to implement #2269 and #2283 - in fact, v2 will probably be required for those features. It'll also make fixing #2265 a lot easier.

The changes:

  • The models config block will be split out into seeds, models, analyses, sources, etc.
  • In those config blocks, every entry will be used as a part of the FQN hierarchy, except for the key named config - that key will denote the actual values to be applied to the config.
    • This will resolve some ambiguities
    • Directories and models named config in the search paths will be an error.
  • Give vars its own top-level key. vars will be project-scoped only, no full paths/FQNs
    • this will still make it possible to override dependency vars, which is important
    • but we won't have to do a bunch of hierarchy-parsing from configs.
    • vars in config blocks will be treated as a config value named vars.

We should provide a migration path: 0.17.0 will support both versions (but you can't have the "good stuff" on v1) and emit deprecation warnings. 0.18.0 will only support v2.

Describe alternatives you've considered

We can just not do it.

Additional context

N/A

Who will this benefit?

anyone developing sources

@beckjake beckjake added enhancement New feature or request triage labels Apr 7, 2020
@drewbanin drewbanin added 1.0.0 Issues related to the 1.0.0 release of dbt and removed triage labels Apr 7, 2020
@drewbanin
Copy link
Contributor

An example dbt_project.yml file that conforms to the v2 spec:

name: project_name
version: 1.0.0

config-version: 2 

# vars cannot be scoped to specific directories
# they may only be scoped to packages, or "global"
vars:
  testing_days_of_data: 7

  my_project:
    testing_days_of_data: 2

  some_package:
    testing_days_of_data: 4

models:
  project_name:
    config:
      materialized: table
    base:
      disabled:
        config:
          enabled: false
      config:
        materialized: ephemeral

seeds:
  snowplow:
    config:
      enabled: false

snapshots:
  config:
    strategy: check
    check_cols: ['id']

The config key

The dbt_project.yml file spec is currently ambiguous. Consider the existing (v1) syntax:

models:
  my_project:
    partition_by:
      field_name: date_day
      data_type: date

From this syntax alone, it is not clear that partition_by is a configuration, rather than a path in the dbt project directory! Today, we're handling this by 1) hoping users don't name folders something like partition_by and 2) enumerating the list of configurations that dbt is aware of, eg:

  • enabled, materialized
  • sort, dist
  • cluster_by, partition_by
  • etc etc etc

If one of these keys is found in the dbt_project.yml file, then dbt knows that it is a config, not a folder path or model name. Historically, this has been mostly fine, most people don't name folders "partition_by", but it would be much better if there was no ambiguity here. Our proposed solution is that we have one special key, config, which denotes that the corresponding value is a configuration instead of a hierarchical file tree.

vars

Today, vars are scoped hierarchically in a dbt project. There isn't a ton of merit to doing it this way, and it's more a fallout of the initial implementation than it is a well-considered design decision. The big downside of this approach is that variable resolution is hard to reason about, and hard to implement in dbt.

For models, folder-scoped variables are reasonably well defined. Consider:

# dbt_project.yml

models:
  my_project:
    folder_path:
      vars:
        abc: 123
    some_other_path:
      vars:
        abc: 456

Here, the abc variable will have the value 123 for all models defined in or under models/folder_path. It is less clear what should happen for a schema.yml file located at models/folder_path/schema.yml. Does it inherit the variable scope based on its path? Or should the variable scope be inferred from the model it is describing?

Consider the following two files:

# models/folder_path/schema.yml

models:
  - name: my_model
    description: "{{ var('abc') }}"
-- models/some_other_path/my_model.sql

select 1 as id

In this example, both 123 and 456 would both be pretty reasonable values for {{ var('abc') }} in the description field, and to be honest, I don't think I could tell you which one dbt returns off the top of my head! Making vars non-hierarchical won't totally solve this exact problem, but it would certainly help!


Let's try to collect some feedback here around how folks are using vars. My personal take is:

  • I don't think moving var out of models: will be incredibly controversial (confidence = 60%)
  • People aren't going to love the config: change to models:, and the more I look at it, the more I'm inclined to agree with them :)

Let's keep the goals in mind here (consistency, unambiguity, ease of parsing) and remain flexible on the exact implementation that we move forwards with.

@clrcrl
Copy link
Contributor

clrcrl commented Apr 8, 2020

Yes love all of this.

Yeah, there's never a time when I use vars scoped for a subdirectory. And moving it out of models: makes sense to me (esp. if you want to use them in sources too)

I'm in favour of config: (configs:?). From a documentation standpoint, it's hard to reason about "the list of keys that can be under models:" (I was literally thinking about this the other day).

I would also be in favor of:

models:
  project_name:
    materialized: view
    enabled: true
    tags: ['my_project']
    base:
      my_big_table:
        materialized: table
        materialization_configs: # ok not this but something like this
         partition_by: 
            field_name: date_day
            data_type: date

For me, there's a difference between the configs that a materialization uses (and are consumed in Jinja-land), versus configs that are consumed by the python parts of dbt. IDK if that's valid though

@jthandy
Copy link
Member

jthandy commented Apr 8, 2020

  1. I'm not worried about the "using vars in a subdirectory" because empirically you can still accomplish the same use cases, you would just need to create multiple variables. This is how you would solve this problem in...normal programming :) There is no concept of "hierarchical variables scoped at a folder level" in Python, so we shouldn't attempt to create one if it's just creating problems for us.

  2. I 100% massively agree that we should have a config key in the yaml spec for this file.

  3. If we are going to make a bunch of changes to the spec for this file, I have two others that I'd really like to get in there!

  • dbt_project.yml should only include the directives required to start up dbt, and I think that's just the file paths. All other directives should follow our other yaml should go into arbitrarily-named files, just like sources.yml works. This would include models, vars, seeds (anything else?). I would imagine creating a single models.yml or seeds.yml for most projects, but I could imagine wanting to separate those contents out into multiple files in certain situations (large projects, decentralized teams...). But dbt would parse them all in the same way that it currently parses sources.yml, which is to say that it shouldn't care about what specific file the yaml content comes from. Beyond just making dbt_project.yml more tightly scoped and resistant to sprawl, this also gives dbt project developers more control about how they package and architect their code.

  • The top-level config in models can be either a project name or a top-level-directory name within a project. I think this inconsistency is incredibly confusing and that we should be less flexible on this, whatever the right answer is. Maybe we have a project: key that is optional and defaults to the current project? Dunno. But resolving this feels consistent with our drive here to reduce uncertainty.

@clrcrl
Copy link
Contributor

clrcrl commented Apr 8, 2020

and I think that's just the file paths.

In fact, you only need:

name: jaffle_shop
version: 0.0.1

Everything else is optional 😄

Speaking of, while we're here, do we want to do something about version:? Either use it meaningfully or nix it?

@beckjake
Copy link
Contributor Author

beckjake commented Apr 8, 2020

dbt_project.yml should only include the directives required to start up dbt, and I think that's just the file paths. All other directives should follow our other yaml should go into arbitrarily-named files, just like sources.yml works.

I agree! This appears to be the cause of a decent amount of confusion. Unfortunately, this is very tricky to specify without also breaking schema.yml behavior and/or creating ambiguities. Currently models, sources, etc are all defined as lists of dicts, where each dict represents one model/source/whatever. I'm not sure I can come up with a model that preserves that without adding the ambiguity back in or having a second layer of keys. and this feels really ugly and verbose:

config:
  models:
    my_project:
      my_folder:
        config:
          enabled: true

The top-level config in models can be either a project name or a top-level-directory name within a project. I think this inconsistency is incredibly confusing and that we should be less flexible on this, whatever the right answer is. Maybe we have a project: key that is optional and defaults to the current project? Dunno. But resolving this feels consistent with our drive here to reduce uncertainty.

This should just be required to be a project name. I actually thought it was that way already, though config resolution is currently so opaque I could easily be wrong.

@jthandy
Copy link
Member

jthandy commented Apr 8, 2020

@clrcrl Everything is optional but if you do want to specify a file path, that specification is required at startup :)

And good ping on version.

@beckjake
Copy link
Contributor Author

beckjake commented Apr 8, 2020

Speaking of, while we're here, do we want to do something about version:? Either use it meaningfully or nix it?

It's used by the hubsite to determine what version a project is, I thought?

@clrcrl
Copy link
Contributor

clrcrl commented Apr 8, 2020

No, not at all. Hubsite uses GitHub releases

@jthandy
Copy link
Member

jthandy commented Apr 8, 2020

Unfortunately, this is very tricky to specify without also breaking schema.yml behavior and/or creating ambiguities.

@beckjake I don't fully understand the challenges you're describing here, although I certainly trust that you're correct. My hope would be that we not totally give up on moving towards that desirable end-state without a fight though :) I'm very game to hop on a call to dig in further on this topic if others have an appetite to do that.

@beckjake
Copy link
Contributor Author

beckjake commented Apr 8, 2020

@jthandy more than happy to do so! Just so everyone can follow along, this is a currently valid schema.yml:

version: 2
models:
- name: emails
  columns:
  - name: email
    tests:
    - unique
- name: users
  columns:
  - name: id
    tests:
    - unique

That models section becomes this in python:

[
  {'name': 'emails', 'columns': [{'name': 'email', 'tests': ['unique']}]},
  {'name': 'users', 'columns': [{'name': 'id', 'tests': ['unique']}]}
]

It's not possible to mix lists and dicts in yaml, so this naive merge of the dbt_project.yml and schema.yml syntax isn't valid:

version: 2
models:
 my_project:
   my_folder:
     config:
       enabled: true
- name: emails
  columns:
  - name: email
    tests:
    - unique
- name: users
  columns:
  - name: id
    tests:
    - unique

I think we probably could get this to work, but I don't love it - I think it's even less intuitive than having two config files:

version: 2
models:
- config:
    my_project:
     my_folder:
      config:
       enabled: true
- name: emails
  columns:
  - name: email
    tests:
    - unique
- name: users
  columns:
  - name: id
    tests:
    - unique

@aescay
Copy link
Member

aescay commented Apr 8, 2020

I really like the proposed solution for handling vars. There was a time where I felt a bit too clever trying to folder scope vars but in reality I didn't really need to, I just thought it would be cleaner. I do think setting them globally/package specific prevents users from building out crazy use cases for vars in the long run! And having its own top-level key makes it really easy to find and define!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.0.0 Issues related to the 1.0.0 release of dbt enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants