Optional field cardinality in documentation #2929

stevebuckingham · 2020-12-03T12:31:00Z

Describe the feature

Understanding field cardinality is often important and useful for the effective use of data, e.g. what are the possible values of a customer_status field. This change would add a field level flag, defaulted to no, to add a list of the field values and maybe their count to the DBT documentation. In simple terms it would run one of these for each field:

SELECT DISTINCT field_name FROM schema_name.table_name and possibly LIMIT 10

OR

SELECT field_name, COUNT(*) as field_total FROM schema_name.table_name ORDER BY COUNT(*) DESC and possibly LIMIT 10

Describe alternatives you've considered

Manually annotating the descriptions with expected values.

Additional context

This could be applied to all databases. Default to off for a field would help eliminate accidental long build times for the documentation.

Who will this benefit?

Any organization that uses the DBT documentation to share their model definitions and would like a deeper understanding of what key fields look like.

Are you interested in contributing this feature?

I would be interested in writing some documentation code to deliver this.

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2020-12-03T23:29:17Z

Manually annotating the descriptions with expected values.

Or adding an accepted_values test!

@stevebuckingham I buy it, I think this is a really cool idea. Let's figure out what this will involve.

(And I agree that having it off by default is critical, especially on databases like BigQuery where non-metadata queries can be costly.)

under the hood

dbt powers the documentation site through two artifacts:

manifest: information about your project
catalog: information from the database

Presently, the catalog contains information about each relational object (view, table, etc) that dbt grabs, either from the information_schema or by running show/describe commands. The exact SQL to run is defined in an adapter-specific macro called get_catalog. It returns metadata including (e.g.) a table's name, overall stats (size in bytes), and the names and data types of its columns.

In order to power the inclusion of additional info, such as cardinality, we'd need to:

define adapter-specific macros to wrap the database-specific SQL for calculating cardinality
add a stats field at the column level, which can be empty (default) or updated by dbt running subsequent queries in the docs generate step

developer experience

We already have a general resource property, docs, which is actually a dictionary: it could contain multiple attributes. At present, the only available attribute is show, which determines whether they should be hidden from the docs.

models:
  - name: my_model
    docs:
      show: false

Per your proposal, I think we could extend docs to include a cardinality property, which is false by default (as you recommend) but could be set to true. I imagine it would be set at the column level. Should it also be possible to set it at the resource level, and thereby turn it on for all columns in a given model/source/seed?

Let's take this one step further:

Top 10 values of a varchar column, with counts: absolute or relative %? Should we include or exclude the counts of nulls?
Min, max, quartiles, mean (integer)
Counts by year (date, timestamp)

(If it's not clear, I'm inspired by R's summary function, as well as a past hackathon project of some colleagues of mine to write a Jinja {{ summary(relation) }} function.)

Should we offer each of cardinality, min_max, year_bucket as a nested property within docs? Or should dbt take care of which query to run, based on the data type of each column? E.g. I could imagine defining:

models:
  - name: my_model
    docs:
      summary: true

dbt will calculate the 10 ten values of each string-type column, the min/mean/max/etc of each int column, the bucketed counts of each date/time column, and so on.

Curious to hear what you think :)

stevebuckingham · 2020-12-04T08:11:08Z

define adapter-specific macros to wrap the database-specific SQL for calculating cardinality

Yes - there are a few gotchas - for instance in Redshift it is probably better to use SELECT field_name FROM x GROUP BY field_name because that performs better with External Tables. That might work for every DB - except perhaps Big Query.

add a stats field at the column level, which can be empty (default) or updated by dbt running subsequent queries in the docs generate step

Per your proposal, I think we could extend docs to include a cardinality property, which is false by default (as you recommend) but could be set to true. I imagine it would be set at the column level. Should it also be possible to set it at the resource level, and thereby turn it on for all columns in a given model/source/seed?

Agreed. Being able to set at table level is a nice to have - but maybe dangerous, e.g. someone applies it to a large time-series table.

Let's take this one step further:

Top 10 values of a varchar column, with counts: absolute or relative %? Should we include or exclude the counts of nulls?

Min, max, quartiles, mean (integer)

Counts by year (date, timestamp)

I like this idea. I would make Top 10 either optional or configurable with a default to 10. But I would also allow it for integers because things like priority are often integers even though we use them like dimensions.

Should we offer each of cardinality, min_max, year_bucket as a nested property within docs? Or should dbt take care of which query to run, based on the data type of each column? E.g. I could imagine defining:

A simple split might be cardinality and stats - we could go more fine-grained but I think this would be a simple level for people to reason about.

What do you think?

jtcohen6 · 2020-12-04T21:37:15Z

for instance in Redshift it is probably better to use SELECT field_name FROM x GROUP BY field_name because that performs better with External Tables

More performant than select distinct? I didn't know this!

But I would also allow it for integers because things like priority are often integers even though we use them like dimensions.

Heard—I think of those as "factors" (again inspired by R). They're really enum values that should be thought of like strings. You're right to raise it, dbt would not be able to infer the correct move based on its data type alone.

So, to get started, it sounds like you're thinking about docs.cardinality and docs.stats defined at the column level. I imagine these should be lightly configurable, with sensible defaults:

docs.cardinality: Most frequently occurring values. Accepts a dictionary of arguments: number of values (default 10); whether to include absolute or relative (%) counts (default is neither).
docs.stats: Accepts an array of numeric aggregate functions. (Default is ['min', 'max', 'mean'].)

stevebuckingham · 2020-12-09T06:45:07Z

Thanks for the feedback and advice on where to look. I'll have a deeper look at the code now.

jtcohen6 · 2021-01-08T11:31:24Z

There are some closely related suggestions for improvements to statistics in catalog.json: #1449, #1990

It would be great to coordinate these efforts!

Support stats on columns, including summary stats
All statistics (table + column level) include name, value, and how it should be formatted

github-actions · 2021-10-27T01:47:02Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

stevebuckingham added enhancement New feature or request triage labels Dec 3, 2020

jtcohen6 removed the triage label Dec 3, 2020

github-actions bot added the stale Issues that have gone stale label Oct 27, 2021

github-actions bot closed this as completed Nov 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optional field cardinality in documentation #2929

Optional field cardinality in documentation #2929

stevebuckingham commented Dec 3, 2020

jtcohen6 commented Dec 3, 2020 •

edited

Loading

stevebuckingham commented Dec 4, 2020

jtcohen6 commented Dec 4, 2020

stevebuckingham commented Dec 9, 2020

jtcohen6 commented Jan 8, 2021

github-actions bot commented Oct 27, 2021

Optional field cardinality in documentation #2929

Optional field cardinality in documentation #2929

Comments

stevebuckingham commented Dec 3, 2020

Describe the feature

Describe alternatives you've considered

Additional context

Who will this benefit?

Are you interested in contributing this feature?

jtcohen6 commented Dec 3, 2020 • edited Loading

under the hood

developer experience

stevebuckingham commented Dec 4, 2020

jtcohen6 commented Dec 4, 2020

stevebuckingham commented Dec 9, 2020

jtcohen6 commented Jan 8, 2021

github-actions bot commented Oct 27, 2021

jtcohen6 commented Dec 3, 2020 •

edited

Loading