
BigQuery: bytes processed should be logged for all dbt commands that perform queries #2747

Closed
bodschut opened this issue Sep 10, 2020 · 9 comments · Fixed by #2961
Labels: artifacts, enhancement

Comments

@bodschut
Contributor

Describe the feature

With the release of dbt 0.18.0, launching a dbt run command against a BigQuery database will output the bytes processed when a model is successfully deployed. However, to get a full picture of the number of bytes dbt processes on BigQuery (and hence an idea of the cost of running a specific dbt command), this should also be logged for dbt test, dbt source snapshot-freshness, and dbt run-operation commands.

Additional context

For dbt run commands, this feature was discussed in #2526; it would be nice to extend it to also log this number for the other commands mentioned above.

Who will this benefit?

This is a BigQuery-specific issue.

Are you interested in contributing this feature?

Sure, happy to see how we could add this.

@bodschut bodschut added enhancement and triage labels Sep 10, 2020
@jtcohen6
Contributor

This makes sense to me! I think the long-term answer here still looks like giving users more control via configurable log output (#2580). In the meantime, I see the value of reporting bytes processed for queries that can be sneakily expensive, such as schema tests or freshness checks that scan massive tables.

I think the change would need to happen in the "everything else" query result:

https://github.com/fishtown-analytics/dbt/blob/fa8a4f2020655c60bdb66270616f5654e168a729/plugins/bigquery/dbt/adapters/bigquery/connections.py#L306-L307

In particular, you'll want to pull in total_bytes_processed, as in the query types above. From a brief glance at BigQuery's docs:

Return total bytes processed from job statistics, if present.

Given that "if present" caveat, it'd be a good idea to add some handling in the event that total_bytes_processed is missing.

Take a swing at this, and let me know if you run into any trouble!

@jtcohen6 jtcohen6 added bigquery and good_first_issue labels and removed triage label Sep 10, 2020
@bodschut
Contributor Author

@jtcohen6 Is the STATUS produced in this else clause actually used in the log output for tests? Test log outputs look like [PASS in 1.63s] or [WARN 3 in 1.52s].

@alepuccetti
Contributor

Looking at the BigQuery logs, I noticed that the statement_type could be the string "SELECT". So maybe we could just include it together with "CREATE_TABLE_AS_SELECT" and keep the else branch as a generic catch-all.
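Roughly something like this (a sketch; the helper name is made up, and the real change would live in the connections.py branching linked above):

```python
def query_status(statement_type: str, num_rows: int, bytes_processed: int) -> str:
    # Treat plain SELECT statements like CTAS so their bytes processed
    # get reported too; everything else falls through to a generic OK.
    if statement_type in ("CREATE_TABLE_AS_SELECT", "SELECT"):
        return f"{statement_type} ({num_rows} rows, {bytes_processed} bytes processed)"
    return "OK"
```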

@jtcohen6
Contributor

@alepuccetti @bodschut That sounds like a good approach. I knew this team would find a way :)

@bodschut
Contributor Author

Hi guys

I've done some digging in the code and unfortunately I'm afraid it won't be as simple as we think :-) Let me explain what I found.

As you can see in the execute function of the BigQuery connection, as of the latest dbt release there is indeed a status string being built that contains the bytes processed. This status string is returned by the execute function together with the result of the query, so far so good:
https://github.com/fishtown-analytics/dbt/blob/fa8a4f2020655c60bdb66270616f5654e168a729/plugins/bigquery/dbt/adapters/bigquery/connections.py#L309

For a normal dbt run, this connection is probably used when a macro is run, as all materialisations are achieved through macros in dbt. I haven't looked into the details of how those macros are internally executed, but it's clear that the status string containing the bytes processed is properly handled by the macro code, returned to the calling code, and eventually used in the function that prints the log line to the terminal:
https://github.com/fishtown-analytics/dbt/blob/fa8a4f2020655c60bdb66270616f5654e168a729/core/dbt/task/printer.py#L137-L147

I'm pretty sure that the status string created in the execute function of the BigQuery connection is accessed here in the get_printable_result function from the result object (which is of the class RunModelResult) and then passed to the print_fancy_output_line function.
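In outline, that flow looks something like this (my simplified reading of the linked printer.py snippet, not a verbatim copy):

```python
def get_printable_result(result, success: str, error: str):
    # result.status carries the adapter's status string (bytes included);
    # it ends up in the bracketed part of the log line.
    if result.error is not None:
        info = f"ERROR {error}"
    else:
        info = f"OK {success}"
    return info, result.status
```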

The issue with the dbt test command is that the code that calls the BigQuery connection to execute the query doesn't put the status string in the status field of the result object (which is also of type RunModelResult in the case of a test run). Rather, the code sets the result.status field to the number of rows that do not pass the test predicate, as you can see in the snippet below (this is for the execute_data_test function; the same is true for execute_schema_test):
https://github.com/fishtown-analytics/dbt/blob/fa8a4f2020655c60bdb66270616f5654e168a729/core/dbt/task/test.py#L43-L58

As you can see, the adapter returns both the status string (captured here in the res variable) and the query output (captured in the table variable), but only the table var is returned to the calling code, so the status string containing the bytes processed info gets lost here ...
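Schematically, the pattern is (a simplified sketch, not the actual dbt source):

```python
def execute_data_test_sketch(adapter, sql):
    # execute() hands back (status_string, agate_table), but only a row
    # count derived from the table is returned to the caller, so the
    # status string (and the bytes processed it carries) is dropped.
    res, table = adapter.execute(sql, auto_begin=True, fetch=True)
    num_rows = len(table.rows)
    return num_rows  # `res` never reaches the printer
```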

So, I think we will need to adapt how this status string is passed up the call chain, but as we're stepping out of the BigQuery plugin and into the core dbt code, I'm a bit reluctant to start changing stuff... @jtcohen6, could you maybe find someone with knowledge of this part of the core code to see what we could do here?

@jtcohen6
Contributor

jtcohen6 commented Sep 11, 2020

Good catch @bodschut, and sorry for leading you astray further up. I believe this is the code responsible for printing data + schema test results:

https://github.com/fishtown-analytics/dbt/blob/4a21ea65759d78fe56f33b71b671948ca747161b/core/dbt/task/printer.py#L118-L135

To include bytes processed, I think we'd need to change execute_schema_test and execute_data_test to return a dictionary instead of just an integer status. That brings us within orbit of a much larger topic: changes to run_results.json (#2493) that we already have on the short-term docket.
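For illustration, the return value could take a shape like this (field names are hypothetical, not the final spec):

```python
num_failing_rows = 3          # what execute_data_test returns today
bytes_processed = 1_048_576   # would be surfaced from the adapter status

# Hypothetical field names, not the final spec:
test_result = {
    "failures": num_failing_rows,
    "bytes_processed": bytes_processed,
}
```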

@beckjake Do you have thoughts about how we could incorporate a structured representation of integer values (rows affected, bytes processed) into the proposed run results spec? I wouldn't say that the way we've done this for model run results today is all that good, either: status is just a string value "CREATE TABLE (1.0 rows, 0.0 Bytes processed)".

@beckjake
Contributor

The proposal in #2493 already handles the "status string vs status integer" issue, so that's nice.

For the bytes processed/rows affected aspect, I see a couple of ways we could do it:

  • add an unstructured dict of result data or something like it, populated on a per-adapter basis. The adapter is responsible for generating the status text and any extra result data it might think is appropriate, then that gets decorated with any task-specific result data that might be appropriate.
  • per-adapter per-task result types. This turns result types into a combination of the existing "what is this a result of" and a new axis that's "what adapter is this a result for".

I don't have strong feelings either way; both are probably valid. The first option is easier to implement and probably doesn't have to be a breaking change for adapters. The second option is probably more maintainable in the long run, gives adapters more flexibility to implement custom result types, and is probably better for consumers on some level, since they can know what they're consuming.

We could probably start with the first (a rough sketch follows) and transition to the second, though it'll be unpleasant to maintain if we have to preserve the schema, and the total development time will surely be much longer than either of the choices on their own.
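A rough sketch of the first option, with made-up function names: the adapter generates its own response dict, and the task layer decorates it with task-specific fields.

```python
def bigquery_adapter_response(statement_type: str, rows: int, bytes_processed: int) -> dict:
    # The adapter owns the status text plus any extra fields it finds useful.
    return {
        "_message": f"{statement_type} ({rows} rows, {bytes_processed} bytes processed)",
        "rows_affected": rows,
        "bytes_processed": bytes_processed,
    }

def decorate_with_task_data(adapter_response: dict, **task_fields) -> dict:
    # The task layer merges in its own fields, e.g. failures for a test run.
    return {**adapter_response, **task_fields}
```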

@jtcohen6
Contributor

jtcohen6 commented Sep 29, 2020

IIRC, after discussing this live, we decided we'd opt for:

an unstructured dict of result data ... populated on a per-adapter basis. The adapter is responsible for generating the status text and any extra result data it might think is appropriate, then that gets decorated with any task-specific result data that might be appropriate.

The trickiness will be in surfacing that data to stdout. While this isn't a blocker for #2493, it's highly related. I'm removing the "good first issue" and "bigquery" labels (this gets into our core plumbing) and pulling it into v0.19.

Edit: Eventually, we may want this to be a list of dictionaries (query_stats). One model/materialization may include multiple queries and we'd want to collect stats for all of them. This isn't something we do a good job of today in run results more generally.

@jtcohen6
Contributor

#2961 adds an adapter-specific dict to run_results.json. Let's consider that a partway resolution to this issue; meanwhile, I'll open a new issue to address the desire to record bytes_processed for queries from other invocation types (namely dbt test), which we can revisit for v0.20.0.
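For reference, a single result entry might look roughly like this once the adapter-specific dict is included (shown as a Python dict; the key names are illustrative rather than the final artifact schema):

```python
result_entry = {
    "unique_id": "model.my_project.my_model",
    "status": "success",
    "adapter_response": {
        "_message": "CREATE TABLE (100 rows, 2.5 GB processed)",
        "rows_affected": 100,
        "bytes_processed": 2_684_354_560,
    },
}
```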
