
Select * Limit is DANGEROUS in BigQuery #17299

Open

GLStephen opened this issue Oct 30, 2021 · 19 comments · May be fixed by #30760
Labels: #bug Bug report

Comments

@GLStephen

GLStephen commented Oct 30, 2021

SQL Lab (and possibly other places) currently submits a "SELECT *" query with a LIMIT when loading a table. This is dangerous in BigQuery: the query scans every column of the entire table regardless of the limit, and billing is based on bytes scanned. BigQuery has other semantics for querying schema or previewing tables. This functionality should be disabled by default for BigQuery databases until a BigQuery-aware version can be built.

@GLStephen GLStephen added the #bug Bug report label Oct 30, 2021
@eschutho
Member

eschutho commented Nov 2, 2021

@yousoph

@riordan

riordan commented Dec 29, 2021

Additional context: BigQuery charges users (near-exclusively) on the number of bytes a query scans. As such, they strongly discourage the use of SELECT * to control costs, yet SELECT * ... is a great convention for helping users bootstrap their own queries.

The ways around this are:

  1. Partitioning large datasets (usually by time)
  2. Selecting only the needed columns

In practice, using the latest partition is the most practical way to constrain costs on bootstrapped queries (SELECT * FROM table WHERE date={$latestpartition}). This is how the BQ UI (and most BQ-aware SQL tools) limit costs.

It's also possible to store each partition as a separate table in a dataset and query it hive-style, which is how partitioned external data is queried in BQ. However, this has the disadvantage (for Google) of making it harder for users to write lots of queries that bill in the most expensive possible way.

I once came across a single query that cost $37,000:
SELECT * FROM `eventlogs*` LIMIT 1;

A single $37k query, going back over years and trillions of events... just to fetch the table schema 🤦‍♂️🤣.

@stale

stale bot commented Apr 17, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

@stale stale bot added the inactive Inactive for >= 30 days label Apr 17, 2022
@rusackas
Member

rusackas commented Feb 1, 2023

@yousoph @eschutho @betodealmeida are we still facing this issue and/or concerned with it?

@stale stale bot removed the inactive Inactive for >= 30 days label Feb 1, 2023
@GLStephen
Author

@rusackas this gave us great pause on our BigQuery usage with Superset.

@rusackas
Member

Tempted to close this as stale, but not sure if @yousoph @eschutho @betodealmeida know the current state of affairs here and whether or not this is still a major concern.

@GLStephen
Author

@rusackas if Superset hasn't fixed or addressed this in some way, then BigQuery usage is basically a non-starter for anyone doing real work with it. This star query would cost us thousands of dollars on some of our tables, so we just avoid them, and we're moving away from Superset/Preset since Superset doesn't seem to care to understand how much of an issue this could be.

@rusackas
Member

Actually, we do use BigQuery regularly at Preset, as do many other orgs, and this hasn't been raised as an issue by others. That's why I'm asking folks closer to that area of the code. I'll CC @mistercrunch as well here, in case he knows the risks/workarounds to this.

@GLStephen
Author

GLStephen commented Feb 16, 2024

Avoiding SELECT *, and the fact that a LIMIT clause doesn't reduce the bytes a query scans, are two fundamental gotchas of BigQuery. Whether anyone has been caught by it is irrelevant. This approach should not, under any circumstances, be taken against a BigQuery database without specific intent. A UI submitting one is a landmine even if no one has yet stepped on it.

https://www.doit.com/avoiding-eight-common-bigquery-query-mistakes/

https://shopify.engineering/reducing-bigquery-costs

@mistercrunch
Member

Oh interesting, I think most other databases have optimizations around this, especially if/when using a simple scan operator on a table with a low limit. For views with blocking operators like GROUP BY or ORDER BY it's definitely trickier and a known problem, and I agree it should probably be disabled by default.

We do have some internals around querying the last partition, but they don't seem to work at this time, at least not for BigQuery right now. See the logic here -> https://github.com/apache/superset/blob/master/superset/db_engine_specs/base.py#L1420 . To be clear about what I'm pointing at: it's a method of BaseEngineSpec (where we define database-engine-specific logic), and it's supposed to generate a SELECT * with a latest-partition predicate when the flag I'm pointing to is True (roughly sketched below).
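
For orientation, those hooks look roughly like this (a simplified paraphrase of superset/db_engine_specs/base.py, not the exact signatures, which vary by version):

from typing import Any, Optional

class BaseEngineSpec:
    @classmethod
    def select_star(cls, database: Any, table: Any, engine: Any,
                    limit: int = 100, latest_partition: bool = True,
                    **kwargs: Any) -> str:
        # When latest_partition is True, engines that implement
        # where_latest_partition() get a latest-partition predicate
        # appended to the generated "SELECT * ... LIMIT" preview query.
        ...

    @classmethod
    def where_latest_partition(cls, table: Any, database: Any, query: Any,
                               columns: Optional[list] = None) -> Optional[Any]:
        # Base default: no partition support, so no predicate is added.
        # Engine specs that know their partitioning scheme override this.
        return None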

Some thoughts related to this:

  • it seems the data preview in SQL Lab should always predicate on the latest partition for all databases that support it and have it defined in their db_engine_spec
  • preview / SELECT * LIMIT should probably be disabled for views in general, or ask for confirmation from the user in the preview pane with a warning: "Click here to view preview" and "!note that we will run this potentially costly query ....!"
  • maybe we need new per-database configuration settings:
    • "force latest partition predicate on preview"
    • "ask user for preview"
    • "disable preview for views"

@GLStephen
Author

GLStephen commented Feb 18, 2024

Latest partition is still a suboptimal choice, relying on an assumption about the size (or relative size) of the last partition. The correct way to get schema in BQ is to access the schema API or the preview mechanism, which generates no billing.

The approach from other DBs, where you do a SELECT * with a low limit, just isn't reliably cost efficient here. Partitions can be huge, so querying the last one is likely cheaper in percentage terms, but not guaranteed to be actually inexpensive.

BQ has affordances for this, but they aren't the same query style as non-columnar DBs.

Someone will need to dig into them and figure out which is the best approach.

@mistercrunch
Member

Is there SQL access to the preview API or only through clients?

@GLStephen
Author

I think the preview API is only available through the client APIs, not SQL. I believe something like this would give you the schema in a query:

SELECT
  TO_JSON_STRING(
    ARRAY_AGG(STRUCT(
      IF(is_nullable = 'YES', 'NULLABLE', 'REQUIRED') AS mode,
      column_name AS name,
      data_type AS type)
    ORDER BY ordinal_position), TRUE) AS schema
FROM
  <YOUR_DATASET>.INFORMATION_SCHEMA.COLUMNS
WHERE
  table_name = '<YOUR_TABLE>'
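
For row previews specifically, the free path is the tabledata.list endpoint, which the google-cloud-bigquery Python client exposes as list_rows. A minimal sketch (the table name is a placeholder, and default application credentials are assumed):

from google.cloud import bigquery

client = bigquery.Client()

# list_rows reads stored rows via the tabledata.list API, the same
# mechanism as the BQ console's preview tab, so it is not billed as
# a query (no bytes-scanned charge).
rows = client.list_rows("my_project.my_dataset.my_table", max_results=10)
for row in rows:
    print(dict(row))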

@mistercrunch
Member

mistercrunch commented Feb 19, 2024

Yeah, btw, it's quite a puzzle, because you pretty much need a hard-coded value on the right side of your predicate for the optimizer to do partition pruning. Doing anything dynamic like I'm doing below pretty much results in a full scan, since the optimizer doesn't know what's going to come out of that MAX, so it can only rely on execution-engine optimizations.

SELECT * FROM tbl WHERE _partition_column = (SELECT MAX(_partition_column) FROM tbl)

Also, clearly, anything that wraps the partition column in a function on the left side of the predicate just can't prune, on any database engine.

SELECT * FROM tbl WHERE  ANY_FUNCTION(_partition_column) = (...)

Now BigQuery has a useful <YOUR_DATASET>.INFORMATION_SCHEMA.PARTITIONS table, which is nice and allows for metadata operations in SQL, but for the reasons mentioned above, you pretty much need to do things in two phases or the optimizer won't prune. In dbt we're able to do this using macros that run at compile time against those metadata tables: the macro fetches the value and stamps it, hard-coded, into the SQL. You have to mess around with casting to the right type, though, as the partition_id is stored as a string, but it can be done, especially when you know the type and have reverse-engineered how strings get cast.
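
To make the two-phase approach concrete, here is a rough sketch with the Python client (project, dataset, table, and column names are hypothetical, and the PARSE_DATE cast assumes daily YYYYMMDD partitions):

from google.cloud import bigquery

client = bigquery.Client()

# Phase 1: a cheap metadata-only query to find the newest partition.
meta_sql = """
    SELECT MAX(partition_id) AS latest
    FROM `my_project.my_dataset.INFORMATION_SCHEMA.PARTITIONS`
    WHERE table_name = 'events'
      AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')
"""
latest = next(iter(client.query(meta_sql).result()))["latest"]

# Phase 2: stamp the literal into the SQL so the optimizer can prune.
# partition_id is stored as a string, hence the cast back to DATE.
preview_sql = f"""
    SELECT * FROM `my_project.my_dataset.events`
    WHERE event_date = PARSE_DATE('%Y%m%d', '{latest}')
    LIMIT 100
"""
for row in client.query(preview_sql).result():
    print(dict(row))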

I think both Oracle and SQL Server had stored procedure methods to run SQL as text so you could get it to prune that way, but all this has been a major pain as you have to go meta at the orchestration level. And clearly the incentives aren't in the right places for cloud vendors (and historically database vendors) to make that easy.

The preview/sample API for BigQuery sounds nice, but it's a headache for tool builders that there's no reliable ANSI SQL support for something like that, that there are 3-4 different ways to limit rows (LIMIT, ROWNUM, TOP), and that none of those offer clear cost/perf guarantees depending on the context (whether it's a view, a clustered table, ...).

/rant

Anyhow, it seems like on our side we should build the right abstractions: maybe we add a get_samples(relation, number) -> DataFrame to BaseDbEngineSpec, implement it as a LIMIT by default, and let it be overridden in various ways on a per-engine basis, for instance using the API method for BigQuery (a rough sketch follows below).

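A rough sketch of what that abstraction could look like (names are illustrative, not the current Superset API; pandas and the google-cloud-bigquery client are assumed):

import pandas as pd

class BaseDbEngineSpec:
    @classmethod
    def get_samples(cls, engine, relation: str, number: int) -> pd.DataFrame:
        # Default: a plain LIMIT query, which is fine for most
        # row-oriented engines that can short-circuit a low-limit scan.
        return pd.read_sql(f"SELECT * FROM {relation} LIMIT {number}", engine)

class BigQuerySpec(BaseDbEngineSpec):
    @classmethod
    def get_samples(cls, engine, relation: str, number: int) -> pd.DataFrame:
        # Override: fetch rows via the free tabledata.list endpoint
        # instead of running a billable SELECT * scan.
        from google.cloud import bigquery
        client = bigquery.Client()
        return client.list_rows(relation, max_results=number).to_dataframe()
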
That, and some "expensive preview prevention" settings at the database connection level, letting the admin decide if/when previews are made available to users and/or auto-fetched for tables, views, and partitioned tables.


Now that we're deep in this hole, I should mention that part of the immune system should come from the DBA side of the house, and there are some options there, for instance "do not allow querying a partitioned table without a predicate against it" or "don't run queries that cost more than N dollars with this particular account". Both of these settings would prevent the data preview from doing its thing. Not a great user experience, but at least you don't get a surprise on your next bill.

Happy to help push this forward, though the SQL Lab codebase is a bit tricky to work with around these things. Pointing fingers, I think the reason the SQL Lab codebase is hard to work with is largely that we're dealing with all these subtly different database engines.

@rusackas
Member

Hi all!

It's been about four months since this thread saw any action, which is putting it on my "close as stale" radar. Also, while this is clearly a "we ought to do something" situation, I'm not sure if this is a bug, per se.

If anyone here wants to tackle this as a project or recruit others to do so on Slack or the mailing list, I'd encourage you to do so. Otherwise, we're likely to either move this to a Discussion thread or close it as stale before long.

@GLStephen
Author

GLStephen commented Jun 11, 2024

Showing complete disregard for the billing model of the underlying datastore is a bug. If it isn't a bug, then it's an underwater rock that I'm surprised hasn't sunk anyone yet. You can triage it to some low-priority status, but shelving it is irresponsible.

@mistercrunch
Member

I agree this is an open issue that we should keep open until resolved or mitigated. For mitigation, how about a disable_data_preview flag at the database connection level, set to True when engine = bigquery?
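
A minimal sketch of that mitigation (the attribute name comes from the comment above; illustrative only, not an existing Superset setting):

class BigQuerySpec(BaseDbEngineSpec):
    # Data previews default to off for BigQuery; an admin could flip
    # this per connection once they understand the billing implications.
    disable_data_preview = True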

@GLStephen
Author

@mistercrunch I think that would be great; a default of "true" for BigQuery makes sense.

@mistercrunch
Member

This should do it -> #30760
