Consolidate Projection for Schema and RecordBatch #1425

Closed
alamb opened this issue Dec 9, 2021 · 2 comments
Labels
datafusion Changes in the datafusion crate good first issue Good for newcomers

Comments

alamb commented Dec 9, 2021

Background

@hntd187 fixed #1361 via #1378, but while reviewing the code I found several other places that project RecordBatches and Schemas and may have the same subtle issue of losing metadata. I am not aware of any bugs caused by this yet, but I fear they are lurking.

The basic idea is to add functions like the following (which handle metadata correctly, following the pattern in #1361):

fn project_schema(schema: &Schema, projection: &[usize]) -> Result<Schema> {
...
}

fn project_batch(batch: &RecordBatch, projection: &[usize]) -> Result<RecordBatch> {
...
}
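
To make the intent concrete, here is a minimal sketch of what a metadata-preserving `project_schema` might look like. Note that `Field` and `Schema` below are simplified stand-in types written for this example, not the real arrow types, and the `String` error type is an assumption made to keep the snippet self-contained.

```rust
use std::collections::HashMap;

// Simplified stand-ins for arrow's `Field` and `Schema`
// (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
    metadata: HashMap<String, String>,
}

// Select the fields named by `projection`, in that order, and carry the
// schema-level metadata along with them.
fn project_schema(schema: &Schema, projection: &[usize]) -> Result<Schema, String> {
    let fields = projection
        .iter()
        .map(|&i| {
            // `get` returns None for a bad index instead of panicking.
            schema.fields.get(i).cloned().ok_or_else(|| {
                format!(
                    "Projection index out of range: {} (schema has {} fields)",
                    i,
                    schema.fields.len()
                )
            })
        })
        .collect::<Result<Vec<_>, _>>()?;

    Ok(Schema {
        fields,
        // The step that a hand-rolled `Schema::new(fields)` would drop.
        metadata: schema.metadata.clone(),
    })
}
```

With a helper like this, each call site shrinks to a single `project_schema(&schema, &columns)?` call, and the metadata handling lives in exactly one place.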

These would replace duplicated code such as:

        let projected_schema = match &projection {
            Some(columns) => {
                let fields: Result<Vec<Field>> = columns
                    .iter()
                    .map(|i| {
                        if *i < schema.fields().len() {
                            Ok(schema.field(*i).clone())
                        } else {
                            Err(DataFusionError::Internal(
                                "Projection index out of range".to_string(),
                            ))
                        }
                    })
                    .collect();
                Arc::new(Schema::new(fields?))
            }
            None => Arc::clone(&schema),
        };

And

                Some(columns) => Some(RecordBatch::try_new(
                    self.schema.clone(),
                    columns.iter().map(|i| batch.column(*i).clone()).collect(),
                )),

all over the DataFusion codebase.
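
As a rough sketch of the batch side, a single `project_batch` helper could project the schema and the columns together, so the two cannot drift apart, and could report a bad index as an error rather than indexing directly. The types below (`Schema`, `RecordBatch`, and an `ArrayRef` that is just a shared `Vec<i64>`) are hypothetical stand-ins for the arrow types, and the `String` error is likewise an assumption for the sake of a self-contained example.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for arrow's `ArrayRef`: a cheaply clonable column.
type ArrayRef = Arc<Vec<i64>>;

// Simplified stand-ins for arrow's `Schema` and `RecordBatch`.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    field_names: Vec<String>,
    metadata: HashMap<String, String>,
}

#[derive(Debug, Clone)]
struct RecordBatch {
    schema: Arc<Schema>,
    columns: Vec<ArrayRef>,
}

// Project schema and columns in one pass, keeping the schema metadata.
fn project_batch(batch: &RecordBatch, projection: &[usize]) -> Result<RecordBatch, String> {
    let mut field_names = Vec::with_capacity(projection.len());
    let mut columns = Vec::with_capacity(projection.len());

    for &i in projection {
        // Look up the field and its column together; either lookup failing
        // means the index is out of range.
        let (name, column) = batch
            .schema
            .field_names
            .get(i)
            .zip(batch.columns.get(i))
            .ok_or_else(|| format!("Projection index {} out of range", i))?;
        field_names.push(name.clone());
        // Cloning the Arc shares the column data instead of copying it.
        columns.push(Arc::clone(column));
    }

    Ok(RecordBatch {
        schema: Arc::new(Schema {
            field_names,
            metadata: batch.schema.metadata.clone(),
        }),
        columns,
    })
}
```

Deriving the projected schema from the batch itself, rather than from a separately computed `self.schema`, is a design choice in this sketch; the real call sites may have reasons to keep a precomputed schema.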

Additional context
Here is a corresponding arrow ticket to put the logic into arrow-rs: apache/arrow-rs#1014

@alamb alamb added the datafusion Changes in the datafusion crate label Dec 9, 2021
@alamb alamb changed the title Consolidate Projection Consolidate Projection for Schema and RecordBatch Dec 9, 2021
@alamb alamb assigned alamb and unassigned alamb Dec 9, 2021
@alamb alamb added the good first issue Good for newcomers label Dec 9, 2021

alamb commented Dec 20, 2021

The code for Schema::project and RecordBatch::project from @hntd187 has been merged in apache/arrow-rs#1033.

Once that is available in a release (likely arrow 6.5.0 next week), we can clean up some of this code in DataFusion.

alamb commented Jan 31, 2022

Closed by #1638

@alamb alamb closed this as completed Jan 31, 2022