
BigQuery: bytes processed should be logged for all dbt commands that perform queries #2747

Closed
bodschut opened this issue Sep 10, 2020 · 9 comments · Fixed by #2961
Labels: artifacts, enhancement

Comments

@bodschut
Contributor

Describe the feature

With the release of dbt 0.18.0, launching a dbt run command against a BigQuery database will output the bytes processed when a model is successfully deployed. However, to get a full picture of the number of bytes dbt processes on BigQuery (and hence an idea of the cost of running a specific dbt command), this should also be logged for dbt test, dbt source snapshot-freshness, and dbt run-operation commands.

Additional context

For dbt run commands, this feature was discussed in #2526; it would be nice to extend it to also log this number for the other commands mentioned above.

Who will this benefit?

This is a BigQuery-specific issue.

Are you interested in contributing this feature?

Sure, happy to see how we could add this.

@bodschut bodschut added enhancement and triage labels Sep 10, 2020
@jtcohen6
Contributor

This makes sense to me! I think the long-term answer here still looks like giving users more control via configurable log output (#2580). In the meantime, I see the value of reporting bytes processed for queries that can be sneakily expensive, such as schema tests or freshness checks that scan massive tables.

I think the change would need to happen in the "everything else" query result:

https://github.com/fishtown-analytics/dbt/blob/fa8a4f2020655c60bdb66270616f5654e168a729/plugins/bigquery/dbt/adapters/bigquery/connections.py#L306-L307

In particular, you'll want to pull in total_bytes_processed, as in the query types above. From a brief glance at BigQuery's docs:

Return total bytes processed from job statistics, if present.

Given that "if present" caveat, it'd be a good idea to add some handling in the event that total_bytes_processed is missing.

Take a swing at this, and let me know if you run into any trouble!

@jtcohen6 jtcohen6 added bigquery and good_first_issue labels and removed triage label Sep 10, 2020
@bodschut
Contributor Author

@jtcohen6 Is the STATUS produced in this else clause actually used in the log output for tests? Test log outputs look like [PASS in 1.63s] or [WARN 3 in 1.52s].

@alepuccetti
Contributor

Looking at the BigQuery logs, I noticed that the statement_type could be the string "SELECT". So maybe we could just include it together with "CREATE_TABLE_AS_SELECT" and keep the else branch as a generic catch-all.
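Roughly something like this (a sketch; the helper name is made up, and the real change would live in the connections.py branching linked above):

```python
def query_status(statement_type: str, num_rows: int, bytes_processed: int) -> str:
    # Treat plain SELECT statements like CTAS so their bytes processed
    # get reported too; everything else falls through to a generic OK.
    if statement_type in ("CREATE_TABLE_AS_SELECT", "SELECT"):
        return f"{statement_type} ({num_rows} rows, {bytes_processed} bytes processed)"
    return "OK"
```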

@jtcohen6
Contributor

@alepuccetti @bodschut That sounds like a good approach. I knew this team would find a way :)

@bodschut
Contributor Author

Hi guys

I've done some digging in the code and unfortunately I'm afraid it won't be as simple as we think :-) Let me explain what I found.

As you can see in the execute function of the BigQuery connection, as of the latest dbt release there is indeed a status string being built that contains the bytes processed. This status string is returned by the execute function together with the result of the query, so far so good:
https://github.com/fishtown-analytics/dbt/blob/fa8a4f2020655c60bdb66270616f5654e168a729/plugins/bigquery/dbt/adapters/bigquery/connections.py#L309

For a normal dbt run, this connection is probably used when a macro is run, as all materialisations are achieved through macros in dbt. I haven't looked into the details of how those macros are internally executed, but it's clear that the status string containing the bytes processed is properly handled by the macro code, returned to the calling code, and eventually used in the function that prints the log line to the terminal:
https://github.com/fishtown-analytics/dbt/blob/fa8a4f2020655c60bdb66270616f5654e168a729/core/dbt/task/printer.py#L137-L147

I'm pretty sure that the status string created in the execute function of the BigQuery connection is accessed here in the get_printable_result function from the result object (which is of the class RunModelResult) and then passed to the print_fancy_output_line function.
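In outline, that flow looks something like this (my simplified reading of the linked printer.py snippet, not a verbatim copy):

```python
def get_printable_result(result, success: str, error: str):
    # result.status carries the adapter's status string (bytes included);
    # it ends up in the bracketed part of the log line.
    if result.error is not None:
        info = f"ERROR {error}"
    else:
        info = f"OK {success}"
    return info, result.status
```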

The issue with the dbt test command is that the code that calls the BigQuery connection to execute the query doesn't put the status string in the status field of the result object (which is also of type RunModelResult in the case of a test run). Rather, the code sets the result.status field to the number of rows that do not pass the test predicate, as you can see in the snippet below (this is for the execute_data_test function; the same is true for execute_schema_test):
https://github.com/fishtown-analytics/dbt/blob/fa8a4f2020655c60bdb66270616f5654e168a729/core/dbt/task/test.py#L43-L58

As you can see, the adapter returns both the status string (captured here in the res variable) and the query output (captured in the table variable), but only the table var is returned to the calling code, so the status string containing the bytes processed info gets lost here ...
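Schematically, the pattern is (a simplified sketch, not the actual dbt source):

```python
def execute_data_test_sketch(adapter, sql):
    # execute() hands back (status_string, agate_table), but only a row
    # count derived from the table is returned to the caller, so the
    # status string (and the bytes processed it carries) is dropped.
    res, table = adapter.execute(sql, auto_begin=True, fetch=True)
    num_rows = len(table.rows)
    return num_rows  # `res` never reaches the printer
```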

So, I think we will need to adapt how this status string is passed up the call chain, but as we're stepping out of the BigQuery plugin and into the core dbt code, I'm a bit reluctant to start changing stuff... @jtcohen6, could you maybe find someone with knowledge of this part of the core code to see what we could do here?

@jtcohen6
Contributor

jtcohen6 commented Sep 11, 2020

Good catch @bodschut, and sorry for leading you astray further up. I believe this is the code responsible for printing data + schema test results:

https://github.com/fishtown-analytics/dbt/blob/4a21ea65759d78fe56f33b71b671948ca747161b/core/dbt/task/printer.py#L118-L135

To include bytes processed, I think we'd need to change execute_schema_test and execute_data_test to return a dictionary instead of just an integer status. That brings us within orbit of a much larger topic: changes to run_results.json (#2493) that we already have on the short-term docket.
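For illustration, the return value could take a shape like this (field names are hypothetical, not the final spec):

```python
num_failing_rows = 3          # what execute_data_test returns today
bytes_processed = 1_048_576   # would be surfaced from the adapter status

# Hypothetical field names, not the final spec:
test_result = {
    "failures": num_failing_rows,
    "bytes_processed": bytes_processed,
}
```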

@beckjake Do you have thoughts about how we could incorporate a structured representation of integer values (rows affected, bytes processed) into the proposed run results spec? I wouldn't say that the way we've done this for model run results today is all that good, either: status is just a string value "CREATE TABLE (1.0 rows, 0.0 Bytes processed)".

@beckjake
Contributor

The proposal in #2493 already handles the "status string vs status integer" issue, so that's nice.

For the bytes processed/rows affected aspect, I see a couple of ways we could do it:

  • add an unstructured dict of result data or something like it, populated on a per-adapter basis. The adapter is responsible for generating the status text and any extra result data it might think is appropriate, then that gets decorated with any task-specific result data that might be appropriate.
  • per-adapter per-task result types. This turns result types into a combination of the existing "what is this a result of" and a new axis that's "what adapter is this a result for".

I don't have strong feelings either way; both are probably valid. The first option is easier to implement and probably doesn't have to be a breaking change for adapters. The second option is probably more maintainable in the long run, gives adapters more flexibility to implement custom result types, and is probably better for consumers on some level, since they can know what they're consuming.

We could probably start with the first (a rough sketch follows) and transition to the second, though it'll be unpleasant to maintain if we have to preserve the schema, and the total development time will surely be much longer than either of the choices on their own.
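A rough sketch of the first option, with made-up function names: the adapter generates its own response dict, and the task layer decorates it with task-specific fields.

```python
def bigquery_adapter_response(statement_type: str, rows: int, bytes_processed: int) -> dict:
    # The adapter owns the status text plus any extra fields it finds useful.
    return {
        "_message": f"{statement_type} ({rows} rows, {bytes_processed} bytes processed)",
        "rows_affected": rows,
        "bytes_processed": bytes_processed,
    }

def decorate_with_task_data(adapter_response: dict, **task_fields) -> dict:
    # The task layer merges in its own fields, e.g. failures for a test run.
    return {**adapter_response, **task_fields}
```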

@jtcohen6
Contributor

jtcohen6 commented Sep 29, 2020

IIRC, after discussing this live, we decided we'd opt for:

an unstructured dict of result data ... populated on a per-adapter basis. The adapter is responsible for generating the status text and any extra result data it might think is appropriate, then that gets decorated with any task-specific result data that might be appropriate.

The trickiness will be in surfacing that data to stdout. While this isn't a blocker for #2493, it's highly related. I'm removing the "good first issue" and "bigquery" labels (this gets into our core plumbing) and pulling it into v0.19.

Edit: Eventually, we may want this to be a list of dictionaries (query_stats). One model/materialization may include multiple queries and we'd want to collect stats for all of them. This isn't something we do a good job of today in run results more generally.

@jtcohen6
Contributor

#2961 adds an adapter-specific dict to run_results.json. Let's consider that a partway resolution to this issue; meanwhile, I'll open a new issue to address the desire to record bytes_processed for queries from other invocation types (namely dbt test), which we can revisit for v0.20.0.
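For reference, a single result entry might look roughly like this once the adapter-specific dict is included (shown as a Python dict; the key names are illustrative rather than the final artifact schema):

```python
result_entry = {
    "unique_id": "model.my_project.my_model",
    "status": "success",
    "adapter_response": {
        "_message": "CREATE TABLE (100 rows, 2.5 GB processed)",
        "rows_affected": 100,
        "bytes_processed": 2_684_354_560,
    },
}
```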
