[CT-779] Listing tables on big datasets on every compile can be abnormally long (and incomplete?) #205
Comments
Hi @github-christophe-oudar - thanks for opening this issue. We are experiencing the same performance slowness on dbt Cloud v1.0 with BigQuery... actually even worse performance, since it takes us 35 minutes to compile 😢. Regarding your comment: I believe it explains the log message we have started to see in dbt Cloud when we compile. I found your issue thanks to Jeremy Yeo (dbt Labs), who has been helping on #115. Wondering if maybe the folks at dbt Labs can point us in the right direction for how to override the relevant macros.
This issue deserves a thorough look from the relevant team. Just popping in to offer, as a potential mechanism for temporary relief: you could look into trying out the experimental caching config. It is marked "experimental" because it hasn't been extensively battle-tested. It also won't help with full runs.
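Assuming the experimental config suggested here is dbt's `cache_selected_only` flag (the flag name is not confirmed in the text above), a minimal sketch of enabling it via its environment variable from a wrapper script:

```python
import os
import subprocess

# Assumption: the experimental config is `cache_selected_only`, which dbt
# exposes as the DBT_CACHE_SELECTED_ONLY environment variable. When enabled,
# dbt limits its relation cache to the schemas of the selected nodes.
env = dict(os.environ, DBT_CACHE_SELECTED_ONLY="True")

# Run a selected subset so the smaller cache actually pays off
# ("my_model" is a placeholder selector).
subprocess.run(["dbt", "run", "--select", "my_model+"], env=env, check=True)
```

The same flag can also be set under the `config:` block of `profiles.yml` if a persistent setting is preferred.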
Hi @jtcohen6 - 🙏 thanks for your time and support here! Appreciate the idea about using that config. I have some additional learnings to share regarding workarounds for our... workarounds 😅. Recently, our "initial workaround" for generating docs was as follows (dbt Cloud v1.2 w/BigQuery):
As our model count has continued to grow, the above workaround (of splitting things up) eventually wasn't enough: we continued to hit memory limits in dbt Cloud again... and so here's the "latest" workaround for getting docs to complete successfully:
Some followup thoughts/questions:
Many thanks, and interested to hear if folks have any questions/comments!
@kevinhoe Thanks for all those details! The config I linked is "global," so it should be supported for all commands, if not as a CLI flag in the place you expect then at least as an env var. Out of curiosity:
Agreed! Christophe put this well: we encourage materializing dbt models in namespaces (schemas/datasets) entirely owned by dbt, containing only dbt-produced objects. But I also appreciate that, as dbt projects mature and get larger, "only dbt-produced objects" may still number in the hundreds/thousands. Just to keep ourselves organized, let's keep this issue specific to the slowdown encountered during caching, when dbt uses the "list tables API" at the start of each invocation. Let's keep #115 focused on the scaling limits encountered during docs generation. Separately, we're aware of the scaling limits encountered elsewhere, too.
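For a rough sense of where that caching time goes, a minimal sketch (using the google-cloud-bigquery Python client; the project ID is a placeholder) that times the list-tables call for each dataset dbt would have to cache:

```python
import time
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Time the "list tables" call per dataset -- roughly the work the adapter's
# relation cache has to do at the start of an invocation.
for dataset in client.list_datasets():
    start = time.monotonic()
    n_tables = sum(1 for _ in client.list_tables(dataset.reference))
    elapsed = time.monotonic() - start
    print(f"{dataset.dataset_id}: {n_tables} tables in {elapsed:.1f}s")
```

A single dataset with tens of thousands of tables (like the 60K-table dataset in the original report) tends to dominate the total.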
@jtcohen6 - thank you for the extremely thorough and helpful response! Apologies for the delay from my end 🙏
Thank you again for all your time and insights! Happy to help beta test any ideas, too.
I'm having a hard time wrapping my head around the overall caching performance topic. Should we tackle it case by case, for each adapter, since it's mostly an optimization conversation? Or should we do it top-down, re-designing the entire caching workflow across the board? In any case, I will re-label this specific issue. That's as much a statement on whether, in this specific instance, performance is a must-have or a nice-to-have, as it is an acknowledgement that there is no easy fix here:
Not sure how to help push on this or if anything will come of it, but I wanted to throw another data point in and see if I can help encourage further research here. I'm not sure why our performance is so bad, so hopefully some of this helps. In CI we run:
I created a sample PR with just a trivial change in a leaf node, and this results in the following output:
I'm trying to debug and understand what happens:
Not shown: our CI copies the manifest and run_results from the previous prod run. We do have 120
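The exact CI commands aren't shown above; a typical state-based invocation that compares against copied prod artifacts (paths and selectors are placeholders, not this project's actual configuration) might look roughly like:

```python
import subprocess

# Placeholder path to the manifest/run_results copied from the last prod run.
PROD_ARTIFACTS = "prod-artifacts"

# Build only models changed relative to the prod manifest (plus downstream
# models), deferring unchanged upstream refs to the prod environment.
subprocess.run(
    [
        "dbt", "build",
        "--select", "state:modified+",
        "--defer",
        "--state", PROD_ARTIFACTS,
    ],
    check=True,
)
```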
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Describe the bug
I was troubleshooting a 2-3 minute compile time for a small dbt project, and it turned out that it contained a model that outputs to a dataset with 60K tables.
As dbt-bigquery provides list_relations_without_caching, it uses the list tables API. However, the call is slow with a lot of tables (2-3 minutes).
Also, since max results is set to 100K and it looks like there is no pagination, the listing might be incomplete.
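For illustration, a minimal sketch (google-cloud-bigquery Python client; project and dataset names are placeholders, and this is not the adapter's actual code) showing that the list-tables iterator can page through everything instead of stopping at a fixed max_results:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID
dataset_ref = bigquery.DatasetReference("my-gcp-project", "my_huge_dataset")

# list_tables() returns a paged iterator: iterating it fetches successive
# pages, so even a 60K-table dataset is listed completely. Capping the call
# with max_results instead silently truncates the listing.
table_ids = [t.table_id for t in client.list_tables(dataset_ref, page_size=1000)]
print(f"{len(table_ids)} tables in {dataset_ref.dataset_id}")
```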
Steps To Reproduce
Expected behavior
I would expect the behavior to be faster.
Yet I understand that it's not straightforward to improve.
There are a few leads:
- Use INFORMATION_SCHEMA to retrieve the information, for example `SELECT table_name FROM mydataset.INFORMATION_SCHEMA.TABLES WHERE table_name IN (...)`. You take the performance hit of a query (2-3s overhead) and you're billed 10 MB, though. (A sketch of this approach follows below.)

Obviously a simple workaround is to use a dataset that is not that crowded, especially since the table listings for all datasets run concurrently, but it might not be applicable for all users.
Another workaround would be a config setting that directly assumes whether the table exists or not, to avoid scanning at all.
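As a sketch of the INFORMATION_SCHEMA lead above (google-cloud-bigquery Python client; project, dataset, and table names are placeholders, and the IN list is passed as a query parameter rather than inlined):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Only the relations dbt actually needs to know about (placeholders).
wanted = ["my_model_a", "my_model_b"]

query = """
    SELECT table_name
    FROM `my-gcp-project.my_huge_dataset`.INFORMATION_SCHEMA.TABLES
    WHERE table_name IN UNNEST(@wanted)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ArrayQueryParameter("wanted", "STRING", wanted)]
)

existing = {row.table_name for row in client.query(query, job_config=job_config)}
print(existing)
```

This trades the list-tables call for a small query, with the 2-3s overhead and the 10 MB minimum billing mentioned above.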
System information
The output of `dbt --version`: I'm using dbt-bigquery 1.2.0-b1 (but had the issues as well with 1.1.1)
The operating system you're using: MacOS
The output of `python --version`: Python 3.10.4