Select * Limit is DANGEROUS in BigQuery #17299
Comments
Additional context: BigQuery charges users based (near-exclusively) on the count of bytes loaded into memory. As such, they strongly discourage the use of The ways around this are:
In practice, using the latest partition is the most practical way to constrain costs on bootstrapped queries ( It's also possible to store each partition as a separate table in a dataset and query it hive-style, and it's the way external data is partition queried in BQ. However, this has the disadvantage (for Google) of making it harder for users to write lots of queries that bill the most expensive possible way. I once came across a single query that cost $37,000 USD: A single 37k query, going back over years and trillions of events... just to fetch the table schema 🤦♂️🤣. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue |
@yousoph @eschutho @betodealmeida are we still facing this issue and/or concerned with it? |
@rusackas this gave us great pause on our BigQuery usage with Superset. |
Tempted to close this as stale, but not sure if @yousoph @eschutho @betodealmeida know the current state of affairs here and whether or not this is still a major concern. |
@rusackas if superset hasn't fixed it or addressed it in some way then BigQuery usage is basically a non-starter for anyone doing real work with it. This star query would cost us thousands of dollars on some of our tables so we just avoid them and are moving away from Superset/Preset since Superset doesn't seem to want to care to understand how much of an issue this could be. |
Actually, we do use BigQuery regularly at Preset, as do many other orgs, and this hasn't been raised as an issue by others. That's why I'm asking folks closer to that area of the code. I'll CC @mistercrunch as well here, in case he knows the risks/workarounds to this. |
Not doing select *, and the difference in data usage of limit clauses for BigQuery are two fundamental gotchas of the database. Whether anyone has been caught by it is irrelevant. This approach should not under any circumstances be taken in a BigQuery database without specific intent. A UI submitting one is a landmine even if no one has yet stepped on it. https://www.doit.com/avoiding-eight-common-bigquery-query-mistakes/ |
Oh interesting, I think most other databases have optimizations around this, especially if/when using a simple scan operator on a table with a low limit. For views with breakpoint operators like GROUP BY or ORDER BY it's definitely trickier and a known problem, and I agree it should probably be disabled by default. We do have some internals around querying the last partition, but they don't seem to work, at least not for BigQuery right now. See the logic here -> https://github.com/apache/superset/blob/master/superset/db_engine_specs/base.py#L1420 . To be clear about what I'm pointing at, it's a method of the BaseEngineSpec (where we define database-engine-specific logic) and this method is supposed to be able to generate a Some thoughts related to this:
|
Latest partition is still a suboptimal choice, since it relies on an assumption about the size or relative size of the last partition. The correct way to get the schema in BQ is to access the schema API or the preview query that generates no billing. The approach from other DBs where you do a select * with a low limit just isn't reliably cost efficient. Partitions can be huge, so querying the last one is likely cheaper in percentage terms, but not guaranteed to be actually inexpensive. BQ has affordances for this, but they aren't the same query style as non-columnar DBs. Someone will need to dig into them and figure out which is the best approach. |
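A rough sketch of the non-query route described above, assuming the `google-cloud-bigquery` Python client: `Client.get_table` returns table metadata including the schema, and `Client.list_rows` reads rows through the table data API instead of running a query job. The helper name `schema_and_preview` and its parameters are hypothetical, and the client is passed in so nothing here depends on how Superset actually connects.

```python
from typing import Any, Dict, List, Tuple


def schema_and_preview(
    client: Any, table_id: str, max_rows: int = 10
) -> Tuple[List[Tuple[str, str]], List[Dict[str, Any]]]:
    """Return (column name, type) pairs and up to max_rows sample rows.

    `client` is assumed to behave like google.cloud.bigquery.Client;
    no SELECT statement is issued by this helper.
    """
    # Table metadata lookup: gives the schema without scanning data.
    table = client.get_table(table_id)
    columns = [(field.name, field.field_type) for field in table.schema]

    # Row listing through the table data API, capped at max_rows.
    rows = [
        dict(row.items())
        for row in client.list_rows(table, max_results=max_rows)
    ]
    return columns, rows
```

This probably only covers real tables; views would likely still need a query-based preview, so it is a partial answer at best.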
Is there SQL access to the preview API or only through clients? |
I think the preview API is only through the API. I believe something like this would give you a schema in a query.
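One way such a query could look, as a minimal sketch: it assumes BigQuery's dataset-level `INFORMATION_SCHEMA.COLUMNS` view, which describes columns from metadata instead of reading the table's rows. The helper name `schema_query` and the strict identifier check are illustrative assumptions, not anything Superset ships.

```python
import re

# Deliberately strict allowlist for this sketch; real BigQuery names
# may permit more characters than this.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_\-]*$")


def schema_query(project: str, dataset: str, table: str) -> str:
    """Build a query listing a table's columns from INFORMATION_SCHEMA.COLUMNS."""
    for part in (project, dataset, table):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unexpected identifier: {part!r}")
    return (
        "SELECT column_name, data_type "
        f"FROM `{project}.{dataset}.INFORMATION_SCHEMA.COLUMNS` "
        f"WHERE table_name = '{table}' "
        "ORDER BY ordinal_position"
    )
```

Whether this is billed the same way as the preview API is a separate question; it only avoids scanning the table itself.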
|
Yeah btw it's quite a puzzle because you pretty much need to have a hard-coded value on the right side of your predicate for the optimizer to do partition pruning. Doing anything dynamic like I'm doing below pretty much results in a full scan since the optimizer doesn't know what's going to come out of that MAX, so it can only rely on execution engine optimizations.

SELECT * FROM tbl WHERE _partition_column = (SELECT MAX(_partition_column) FROM tbl)

Also clearly any of this type of stuff using a function on the left side of the predicate just can't work against any database engine.

SELECT * FROM tbl WHERE ANY_FUNCTION(_partition_column) = (...)

Now BigQuery has a useful I think both Oracle and SQL Server had stored procedure methods to run SQL as text so you could get it to prune that way, but all this has been a major pain as you have to go meta at the orchestration level. And clearly the incentives aren't in the right places for cloud vendors (and historically database vendors) to make that easy. The preview/sample API for BigQuery sounds nice, but it's a headache for tool builders that there's no reliable/implemented ANSI SQL support for something like that, and the fact that there's 3-4 ways to /rant

Anyhow, seems like on our side we should build the right abstractions, maybe we add a That, and some "expensive preview prevention" settings at the database connection level, letting the admin decide if/when previews are made available to users and/or auto-fetched for tables, views and partitioned tables.

Now that we're deep in this hole, I should mention that part of the immune system should come from the DBA side of the house, and there are some options there, for instance the "do not allow querying a partitioned table without a predicate against it" or the "don't run queries that cost more than N dollars" with this particular account. Both these settings should prevent the data preview from doing its thing. Not a great user experience, but at least you don't get a surprise on your next bill.

Happy to help push this forward, though the SQL Lab codebase is a bit tricky to work with around these things. Pointing fingers, I think the reason why the SQL Lab codebase is hard to work with is largely because we're dealing with all these subtly different database engines. |
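The "hard-coded value on the right side" point can be sketched as a two-step approach: resolve the latest partition value separately (however that is obtained), then inline it as a literal so the optimizer sees a constant. This is a minimal sketch under assumptions: the helper name `pruned_preview_query` is hypothetical, the table is assumed to be partitioned on a DATE column, and the table and column names are assumed to be trusted identifiers.

```python
import datetime as dt


def pruned_preview_query(
    table: str,
    partition_column: str,
    partition_value: dt.date,
    limit: int = 100,
) -> str:
    """Build a preview query whose partition predicate is a literal.

    A constant on the right-hand side is what lets the optimizer prune
    partitions; a subquery such as SELECT MAX(...) in that position
    would not be known at planning time.
    """
    # Render the date as a BigQuery DATE literal, e.g. DATE '2024-01-31'.
    literal = f"DATE '{partition_value.isoformat()}'"
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE {partition_column} = {literal} "
        f"LIMIT {int(limit)}"
    )
```

Even with pruning this still reads every column of the chosen partition, so it narrows the cost but does not bound it; a large partition remains expensive.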
Hi all! It's been about four months since this thread saw any action, which is putting it on my "close as stale" radar. Also, while this is clearly a "we ought to do something" situation, I'm not sure if this is a bug, per se. If anyone here wants to tackle this as a project or recruit others to do so on Slack or the mailing list, I'd encourage you to do so. Otherwise, we're likely to either move this to a Discussion thread or close it as stale before long. |
Showing complete disregard for the billing model of the underlying datastore is a bug. If it isn't a bug, then it's an underwater rock that I'm surprised has not killed anyone yet. You can triage it to some low priority status but shelving it is irresponsible. |
I agree this is an open issue that we should keep open until resolved or mitigated. For mitigation, about a |
@mistercrunch I think that would be great, a default of "true" for bigquery makes sense |
This should do it -> #30760 |
The SQL Lab (and unknown other places) currently submits a "select *" query with a limit when loading. This is potentially dangerous in BigQuery as it will query the entire table and every column regardless of the limit. BigQuery has other semantics for querying schema or previews of tables. This functionality should be disabled by default for BigQuery databases until a BigQuery "aware" version can be built.