
Select * Limit is DANGEROUS in BigQuery #17299

Open

GLStephen opened this issue Oct 30, 2021 · 19 comments · May be fixed by #30760
Labels: #bug Bug report

Comments

@GLStephen

GLStephen commented Oct 30, 2021

SQL Lab (and possibly other places) currently submits a "SELECT *" query with a LIMIT when loading a table. This is dangerous in BigQuery: the query scans every column of the entire table regardless of the limit, and billing is based on bytes scanned. BigQuery has other semantics for querying schema or previewing tables. This functionality should be disabled by default for BigQuery databases until a BigQuery-aware version can be built.

@GLStephen GLStephen added the #bug Bug report label Oct 30, 2021
@eschutho
Member

eschutho commented Nov 2, 2021

@yousoph

@riordan

riordan commented Dec 29, 2021

Additional context: BigQuery charges users (near-exclusively) on the number of bytes a query scans. As such, they strongly discourage the use of SELECT * to control costs, yet SELECT * ... is a great convention for helping users bootstrap their own queries.

The ways around this are:

  1. Partitioning large datasets (usually by time)
  2. Selecting only the needed columns

In practice, using the latest partition is the most practical way to constrain costs on bootstrapped queries (SELECT * FROM table WHERE date={$latestpartition}). This is how the BQ UI (and most BQ-aware SQL tools) limit costs.

It's also possible to store each partition as a separate table in a dataset and query it hive-style, which is how partitioned external data is queried in BQ. However, this has the disadvantage (for Google) of making it harder for users to write lots of queries that bill in the most expensive possible way.

I once came across a single query that cost $37,000:
SELECT * FROM `eventlogs*` LIMIT 1;

A single $37k query, going back over years and trillions of events... just to fetch the table schema 🤦‍♂️🤣.

@stale

stale bot commented Apr 17, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

@stale stale bot added the inactive Inactive for >= 30 days label Apr 17, 2022
@rusackas
Member

rusackas commented Feb 1, 2023

@yousoph @eschutho @betodealmeida are we still facing this issue and/or concerned with it?

@stale stale bot removed the inactive Inactive for >= 30 days label Feb 1, 2023
@GLStephen
Author

@rusackas this gave us great pause on our BigQuery usage with Superset.

@rusackas
Member

Tempted to close this as stale, but not sure if @yousoph @eschutho @betodealmeida know the current state of affairs here and whether or not this is still a major concern.

@GLStephen
Author

@rusackas if Superset hasn't fixed or addressed this in some way, then BigQuery usage is basically a non-starter for anyone doing real work with it. This star query would cost us thousands of dollars on some of our tables, so we just avoid them, and we're moving away from Superset/Preset since Superset doesn't seem to care to understand how much of an issue this could be.

@rusackas
Member

Actually, we do use BigQuery regularly at Preset, as do many other orgs, and this hasn't been raised as an issue by others. That's why I'm asking folks closer to that area of the code. I'll CC @mistercrunch as well here, in case he knows the risks/workarounds to this.

@GLStephen
Author

GLStephen commented Feb 16, 2024

Avoiding SELECT *, and the fact that a LIMIT clause doesn't reduce the bytes a query scans, are two fundamental gotchas of BigQuery. Whether anyone has been caught by it is irrelevant. This approach should not, under any circumstances, be taken against a BigQuery database without specific intent. A UI submitting one is a landmine even if no one has yet stepped on it.

https://www.doit.com/avoiding-eight-common-bigquery-query-mistakes/

https://shopify.engineering/reducing-bigquery-costs

@mistercrunch
Member

Oh interesting, I think most other databases have optimizations around this, especially if/when using a simple scan operator on a table with a low limit. For views with blocking operators like GROUP BY or ORDER BY it's definitely trickier and a known problem, and I agree it should probably be disabled by default.

We do have some internals around querying the last partition, but they don't seem to work at this time, at least not for BigQuery right now. See the logic here -> https://github.com/apache/superset/blob/master/superset/db_engine_specs/base.py#L1420 . To be clear about what I'm pointing at: it's a method of BaseEngineSpec (where we define database-engine-specific logic), and it's supposed to generate a SELECT * with a latest-partition predicate when the flag I'm pointing to is True (roughly sketched below).
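
For orientation, those hooks look roughly like this (a simplified paraphrase of superset/db_engine_specs/base.py, not the exact signatures, which vary by version):

from typing import Any, Optional

class BaseEngineSpec:
    @classmethod
    def select_star(cls, database: Any, table: Any, engine: Any,
                    limit: int = 100, latest_partition: bool = True,
                    **kwargs: Any) -> str:
        # When latest_partition is True, engines that implement
        # where_latest_partition() get a latest-partition predicate
        # appended to the generated "SELECT * ... LIMIT" preview query.
        ...

    @classmethod
    def where_latest_partition(cls, table: Any, database: Any, query: Any,
                               columns: Optional[list] = None) -> Optional[Any]:
        # Base default: no partition support, so no predicate is added.
        # Engine specs that know their partitioning scheme override this.
        return None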

Some thoughts related to this:

  • it seems the data preview in SQL Lab should always predicate on the latest partition for all databases that support it and have it defined in their db_engine_spec
  • preview / SELECT * LIMIT should probably be disabled for views in general, or ask for confirmation from the user in the preview pane with a warning: "Click here to view preview" and "!note that we will run this potentially costly query ....!"
  • maybe we need new per-database configuration settings:
    • "force latest partition predicate on preview"
    • "ask user for preview"
    • "disable preview for views"

@GLStephen
Author

GLStephen commented Feb 18, 2024

Latest partition is still a suboptimal choice, relying on an assumption about the size (or relative size) of the last partition. The correct way to get schema in BQ is to access the schema API or the preview mechanism, which generates no billing.

The approach from other DBs, where you do a SELECT * with a low limit, just isn't reliably cost efficient here. Partitions can be huge, so querying the last one is likely cheaper in percentage terms, but not guaranteed to be actually inexpensive.

BQ has affordances for this, but they aren't the same query style as non-columnar DBs.

Someone will need to dig into them and figure out which is the best approach.

@mistercrunch
Member

Is there SQL access to the preview API or only through clients?

@GLStephen
Author

I think the preview API is only available through the client APIs, not SQL. I believe something like this would give you the schema in a query:

SELECT
  TO_JSON_STRING(
    ARRAY_AGG(STRUCT(
      IF(is_nullable = 'YES', 'NULLABLE', 'REQUIRED') AS mode,
      column_name AS name,
      data_type AS type)
    ORDER BY ordinal_position), TRUE) AS schema
FROM
  <YOUR_DATASET>.INFORMATION_SCHEMA.COLUMNS
WHERE
  table_name = '<YOUR_TABLE>'
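
For row previews specifically, the free path is the tabledata.list endpoint, which the google-cloud-bigquery Python client exposes as list_rows. A minimal sketch (the table name is a placeholder, and default application credentials are assumed):

from google.cloud import bigquery

client = bigquery.Client()

# list_rows reads stored rows via the tabledata.list API, the same
# mechanism as the BQ console's preview tab, so it is not billed as
# a query (no bytes-scanned charge).
rows = client.list_rows("my_project.my_dataset.my_table", max_results=10)
for row in rows:
    print(dict(row))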

@mistercrunch
Member

mistercrunch commented Feb 19, 2024

Yeah, btw, it's quite a puzzle, because you pretty much need a hard-coded value on the right side of your predicate for the optimizer to do partition pruning. Doing anything dynamic like I'm doing below pretty much results in a full scan, since the optimizer doesn't know what's going to come out of that MAX, so it can only rely on execution-engine optimizations.

SELECT * FROM tbl WHERE _partition_column = (SELECT MAX(_partition_column) FROM tbl)

Also, clearly, anything that wraps the partition column in a function on the left side of the predicate just can't prune, on any database engine.

SELECT * FROM tbl WHERE  ANY_FUNCTION(_partition_column) = (...)

Now BigQuery has a useful <YOUR_DATASET>.INFORMATION_SCHEMA.PARTITIONS table, which is nice and allows for metadata operations in SQL, but for the reasons mentioned above, you pretty much need to do things in two phases or the optimizer won't prune. In dbt we're able to do this using macros that run at compile time against those metadata tables: the macro fetches the value and stamps it, hard-coded, into the SQL. You have to mess around with casting to the right type, though, as the partition_id is stored as a string, but it can be done, especially when you know the type and have reverse-engineered how strings get cast.
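
To make the two-phase approach concrete, here is a rough sketch with the Python client (project, dataset, table, and column names are hypothetical, and the PARSE_DATE cast assumes daily YYYYMMDD partitions):

from google.cloud import bigquery

client = bigquery.Client()

# Phase 1: a cheap metadata-only query to find the newest partition.
meta_sql = """
    SELECT MAX(partition_id) AS latest
    FROM `my_project.my_dataset.INFORMATION_SCHEMA.PARTITIONS`
    WHERE table_name = 'events'
      AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')
"""
latest = next(iter(client.query(meta_sql).result()))["latest"]

# Phase 2: stamp the literal into the SQL so the optimizer can prune.
# partition_id is stored as a string, hence the cast back to DATE.
preview_sql = f"""
    SELECT * FROM `my_project.my_dataset.events`
    WHERE event_date = PARSE_DATE('%Y%m%d', '{latest}')
    LIMIT 100
"""
for row in client.query(preview_sql).result():
    print(dict(row))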

I think both Oracle and SQL Server had stored procedure methods to run SQL as text so you could get it to prune that way, but all this has been a major pain as you have to go meta at the orchestration level. And clearly the incentives aren't in the right places for cloud vendors (and historically database vendors) to make that easy.

The preview/sample API for BigQuery sounds nice, but it's a headache for tool builders that there's no reliable ANSI SQL support for something like that, that there are 3-4 different ways to limit rows (LIMIT, ROWNUM, TOP), and that none of those offer clear cost/perf guarantees depending on the context (whether it's a view, a clustered table, ...).

/rant

Anyhow, it seems like on our side we should build the right abstractions: maybe we add a get_samples(relation, number) -> DataFrame to BaseDbEngineSpec, implement it as a LIMIT by default, and let it be overridden in various ways on a per-engine basis, for instance using the API method for BigQuery (a rough sketch follows below).

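A rough sketch of what that abstraction could look like (names are illustrative, not the current Superset API; pandas and the google-cloud-bigquery client are assumed):

import pandas as pd

class BaseDbEngineSpec:
    @classmethod
    def get_samples(cls, engine, relation: str, number: int) -> pd.DataFrame:
        # Default: a plain LIMIT query, which is fine for most
        # row-oriented engines that can short-circuit a low-limit scan.
        return pd.read_sql(f"SELECT * FROM {relation} LIMIT {number}", engine)

class BigQuerySpec(BaseDbEngineSpec):
    @classmethod
    def get_samples(cls, engine, relation: str, number: int) -> pd.DataFrame:
        # Override: fetch rows via the free tabledata.list endpoint
        # instead of running a billable SELECT * scan.
        from google.cloud import bigquery
        client = bigquery.Client()
        return client.list_rows(relation, max_results=number).to_dataframe()
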
That, and some "expensive preview prevention" settings at the database connection level, letting the admin decide if/when previews are made available to users and/or auto-fetched for tables, views, and partitioned tables.


Now that we're deep in this hole, I should mention that part of the immune system should come from the DBA side of the house, and there are some options there, for instance "do not allow querying a partitioned table without a predicate against it" or "don't run queries that cost more than N dollars with this particular account". Both of these settings would prevent the data preview from doing its thing. Not a great user experience, but at least you don't get a surprise on your next bill.

Happy to help push this forward, though the SQL Lab codebase is a bit tricky to work with around these things. Pointing fingers, I think the reason the SQL Lab codebase is hard to work with is largely that we're dealing with all these subtly different database engines.

@rusackas
Member

Hi all!

It's been about four months since this thread saw any action, which is putting it on my "close as stale" radar. Also, while this is clearly a "we ought to do something" situation, I'm not sure if this is a bug, per se.

If anyone here wants to tackle this as a project or recruit others to do so on Slack or the mailing list, I'd encourage you to do so. Otherwise, we're likely to either move this to a Discussion thread or close it as stale before long.

@GLStephen
Author

GLStephen commented Jun 11, 2024

Showing complete disregard for the billing model of the underlying datastore is a bug. If it isn't a bug, then it's an underwater rock that I'm surprised hasn't sunk anyone yet. You can triage it to some low-priority status, but shelving it is irresponsible.

@mistercrunch
Member

I agree this is an open issue that we should keep open until resolved or mitigated. For mitigation, how about a disable_data_preview flag at the database connection level, set to True when engine = bigquery?
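
A minimal sketch of that mitigation (the attribute name comes from the comment above; illustrative only, not an existing Superset setting):

class BigQuerySpec(BaseDbEngineSpec):
    # Data previews default to off for BigQuery; an admin could flip
    # this per connection once they understand the billing implications.
    disable_data_preview = True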

@GLStephen
Author

@mistercrunch I think that would be great; a default of "true" for BigQuery makes sense.

@mistercrunch
Member

This should do it -> #30760
