Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional field cardinality in documentation #2929

Closed
stevebuckingham opened this issue Dec 3, 2020 · 6 comments
Closed

Optional field cardinality in documentation #2929

stevebuckingham opened this issue Dec 3, 2020 · 6 comments
Labels
enhancement New feature or request stale Issues that have gone stale

Comments

@stevebuckingham
Copy link

Describe the feature

Understanding field cardinality is often important and useful for the effective use of data, e.g. what are the possible values of a customer_status field. This change would add a field level flag, defaulted to no, to add a list of the field values and maybe their count to the DBT documentation. In simple terms it would run one of these for each field:

SELECT DISTINCT field_name FROM schema_name.table_name and possibly LIMIT 10

OR

SELECT field_name, COUNT(*) as field_total FROM schema_name.table_name ORDER BY COUNT(*) DESC and possibly LIMIT 10

Describe alternatives you've considered

Manually annotating the descriptions with expected values.

Additional context

This could be applied to all databases. Default to off for a field would help eliminate accidental long build times for the documentation.

Who will this benefit?

Any organization that uses the DBT documentation to share their model definitions and would like a deeper understanding of what key fields look like.

Are you interested in contributing this feature?

I would be interested in writing some documentation code to deliver this.

@stevebuckingham stevebuckingham added enhancement New feature or request triage labels Dec 3, 2020
@jtcohen6 jtcohen6 removed the triage label Dec 3, 2020
@jtcohen6
Copy link
Contributor

jtcohen6 commented Dec 3, 2020

Manually annotating the descriptions with expected values.

Or adding an accepted_values test!

@stevebuckingham I buy it, I think this is a really cool idea. Let's figure out what this will involve.

(And I agree that having it off by default is critical, especially on databases like BigQuery where non-metadata queries can be costly.)

under the hood

dbt powers the documentation site through two artifacts:

  • manifest: information about your project
  • catalog: information from the database

Presently, the catalog contains information about each relational object (view, table, etc) that dbt grabs, either from the information_schema or by running show/describe commands. The exact SQL to run is defined in an adapter-specific macro called get_catalog. It returns metadata including (e.g.) a table's name, overall stats (size in bytes), and the names and data types of its columns.

In order to power the inclusion of additional info, such as cardinality, we'd need to:

  • define adapter-specific macros to wrap the database-specific SQL for calculating cardinality
  • add a stats field at the column level, which can be empty (default) or updated by dbt running subsequent queries in the docs generate step

developer experience

We already have a general resource property, docs, which is actually a dictionary: it could contain multiple attributes. At present, the only available attribute is show, which determines whether they should be hidden from the docs.

models:
  - name: my_model
    docs:
      show: false

Per your proposal, I think we could extend docs to include a cardinality property, which is false by default (as you recommend) but could be set to true. I imagine it would be set at the column level. Should it also be possible to set it at the resource level, and thereby turn it on for all columns in a given model/source/seed?

Let's take this one step further:

  • Top 10 values of a varchar column, with counts: absolute or relative %? Should we include or exclude the counts of nulls?
  • Min, max, quartiles, mean (integer)
  • Counts by year (date, timestamp)

(If it's not clear, I'm inspired by R's summary function, as well as a past hackathon project of some colleagues of mine to write a Jinja {{ summary(relation) }} function.)

Should we offer each of cardinality, min_max, year_bucket as a nested property within docs? Or should dbt take care of which query to run, based on the data type of each column? E.g. I could imagine defining:

models:
  - name: my_model
    docs:
      summary: true

dbt will calculate the 10 ten values of each string-type column, the min/mean/max/etc of each int column, the bucketed counts of each date/time column, and so on.

Curious to hear what you think :)

@stevebuckingham
Copy link
Author

  • define adapter-specific macros to wrap the database-specific SQL for calculating cardinality

Yes - there are a few gotchas - for instance in Redshift it is probably better to use SELECT field_name FROM x GROUP BY field_name because that performs better with External Tables. That might work for every DB - except perhaps Big Query.

  • add a stats field at the column level, which can be empty (default) or updated by dbt running subsequent queries in the docs generate step

Per your proposal, I think we could extend docs to include a cardinality property, which is false by default (as you recommend) but could be set to true. I imagine it would be set at the column level. Should it also be possible to set it at the resource level, and thereby turn it on for all columns in a given model/source/seed?

Agreed. Being able to set at table level is a nice to have - but maybe dangerous, e.g. someone applies it to a large time-series table.

Let's take this one step further:

  • Top 10 values of a varchar column, with counts: absolute or relative %? Should we include or exclude the counts of nulls?
  • Min, max, quartiles, mean (integer)
  • Counts by year (date, timestamp)

I like this idea. I would make Top 10 either optional or configurable with a default to 10. But I would also allow it for integers because things like priority are often integers even though we use them like dimensions.

Should we offer each of cardinality, min_max, year_bucket as a nested property within docs? Or should dbt take care of which query to run, based on the data type of each column? E.g. I could imagine defining:

A simple split might be cardinality and stats - we could go more fine-grained but I think this would be a simple level for people to reason about.

What do you think?

@jtcohen6
Copy link
Contributor

jtcohen6 commented Dec 4, 2020

for instance in Redshift it is probably better to use SELECT field_name FROM x GROUP BY field_name because that performs better with External Tables

More performant than select distinct? I didn't know this!

But I would also allow it for integers because things like priority are often integers even though we use them like dimensions.

Heard—I think of those as "factors" (again inspired by R). They're really enum values that should be thought of like strings. You're right to raise it, dbt would not be able to infer the correct move based on its data type alone.

So, to get started, it sounds like you're thinking about docs.cardinality and docs.stats defined at the column level. I imagine these should be lightly configurable, with sensible defaults:

  • docs.cardinality: Most frequently occurring values. Accepts a dictionary of arguments: number of values (default 10); whether to include absolute or relative (%) counts (default is neither).
  • docs.stats: Accepts an array of numeric aggregate functions. (Default is ['min', 'max', 'mean'].)

@stevebuckingham
Copy link
Author

Thanks for the feedback and advice on where to look. I'll have a deeper look at the code now.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Jan 8, 2021

There are some closely related suggestions for improvements to statistics in catalog.json: #1449, #1990

It would be great to coordinate these efforts!

  • Support stats on columns, including summary stats
  • All statistics (table + column level) include name, value, and how it should be formatted

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Oct 27, 2021
@github-actions github-actions bot closed this as completed Nov 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request stale Issues that have gone stale
Projects
None yet
Development

No branches or pull requests

2 participants