Limiting total rows copied in COPY TABLE FROM statement #3714

Closed
v0y4g3r opened this issue Apr 16, 2024 · 4 comments · Fixed by #3819
Labels
good first issue Good for newcomers

Comments

v0y4g3r (Contributor) commented Apr 16, 2024

What problem does the new feature solve?

As pointed out by this article, a 42 KiB parquet file may contain hundreds of trillions of values, much like a zip bomb. If GreptimeDB does not limit the total rows copied in a single COPY TABLE FROM statement, such parquet bombs may immediately overload the backend storage.

What does the feature do?

Limit the total number of rows imported by a single COPY TABLE FROM statement.

The execution of a copy statement builds a record batch stream over the file and inserts the yielded batches into the mito engine.

// Read record batches from the file stream and convert them into vectors.
while let Some(r) = stream.next().await {
    let record_batch = r.context(error::ReadDfRecordBatchSnafu)?;
    let vectors =
        Helper::try_into_vectors(record_batch.columns()).context(IntoVectorsSnafu)?;
    pending_mem_size += vectors.iter().map(|v| v.memory_size()).sum::<usize>();
    let columns_values = fields
        .iter()
        .cloned()
        .zip(vectors)
        .collect::<HashMap<_, _>>();
    // Queue an insert request for this batch.
    pending.push(self.inserter.handle_table_insert(
        InsertRequest {
            catalog_name: req.catalog_name.to_string(),
            schema_name: req.schema_name.to_string(),
            table_name: req.table_name.to_string(),
            columns_values,
        },
        query_ctx.clone(),
    ));
    // Flush the queued inserts once the pending batches exceed the memory threshold.
    if pending_mem_size as u64 >= pending_mem_threshold {
        let (rows, cost) = batch_insert(&mut pending, &mut pending_mem_size).await?;
        rows_inserted += rows;
        insert_cost += cost;
    }
}
// Flush whatever is left after the stream is exhausted.
if !pending.is_empty() {
    let (rows, cost) = batch_insert(&mut pending, &mut pending_mem_size).await?;
    rows_inserted += rows;
    insert_cost += cost;
}

We can add a config option that limits the total number of rows read from the record batch stream and terminates the execution once the count exceeds the threshold.
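A minimal sketch of what that check could look like, assuming the limit is threaded in as an Option<usize>; the name max_insert_rows is illustrative and not necessarily what the fix uses:

// Sketch only: evaluated after each batch_insert call in the loop above.
// `max_insert_rows` is an illustrative name for the proposed option.
fn reached_row_limit(rows_inserted: usize, max_insert_rows: Option<usize>) -> bool {
    max_insert_rows.map_or(false, |limit| rows_inserted >= limit)
}

Checked right after rows_inserted += rows;, a true result would break out of the while let loop (or return an error), so no further batches are pulled from the stream.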

v0y4g3r added the good first issue label on Apr 16, 2024
irenjj (Collaborator) commented Apr 19, 2024

I'd like to take it. Could you please assign it to me?

v0y4g3r (Contributor, Author) commented Apr 19, 2024

> I'd like to take it. Could you please assign it to me?

Sure. Are you going to add a new runtime option to limit the rows?

irenjj (Collaborator) commented Apr 20, 2024

> Sure. Are you going to add a new runtime option to limit the rows?

Sure. Should we add a configuration option to the OptionMap in CopyTableArgument to limit it?

v0y4g3r (Contributor, Author) commented Apr 22, 2024

> Sure. Are you going to add a new runtime option to limit the rows?

> Sure. Should we add a configuration option to the OptionMap in CopyTableArgument to limit it?

Maybe we can set a default limit, for example 1000 rows, unless the user explicitly specifies the limit in the COPY statement.
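A hedged sketch of how that could resolve, treating OptionMap as a plain string-keyed map; the option key "limit" and the 1000-row default come from this discussion, not from the final implementation:

use std::collections::HashMap;

// Sketch only: stand-in for reading the limit from CopyTableArgument's OptionMap.
// The key "limit" and the 1000-row default follow the suggestion above.
const DEFAULT_COPY_FROM_LIMIT: usize = 1000;

fn resolve_copy_limit(options: &HashMap<String, String>) -> usize {
    options
        .get("limit")
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(DEFAULT_COPY_FROM_LIMIT)
}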
