Add Schema::project and RecordBatch::project functions #1033
…eturning a new schema with those columns only
Thank you @hntd187 ❤️
This is a great start
arrow/src/datatypes/schema.rs
Outdated
        let mut new_fields = vec![];
        for i in indices {
            let f = self.fields[i].clone();
            new_fields.push(f);
        }
I think as written:
- This will panic! if the index is not in bounds
- It is not "idiomatic Rust style" (which to me means avoid mut). This is far less important.
How about something such as (untested):
-        let mut new_fields = vec![];
-        for i in indices {
-            let f = self.fields[i].clone();
-            new_fields.push(f);
-        }
+        let new_fields = indices
+            .into_iter()
+            .map(|i| {
+                // `get` returns None for an out-of-bounds index instead of panicking
+                self.fields.get(i).cloned().ok_or_else(|| {
+                    ArrowError::SchemaError(format!(
+                        "project index {} out of bounds, max field {}",
+                        i,
+                        self.fields().len()
+                    ))
+                })
+            })
+            .collect::<Result<Vec<_>>>()?;
Note the use of https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get to avoid fields[i], and then the somewhat confusing use of the turbofish .collect::<Result<Vec<_>>>() -- it took me quite a while to get used to that pattern.
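For readers new to the pattern, here is a minimal standalone sketch of `get` plus collecting into a `Result`. It uses only the standard library; `select_by_index` and its `String` error type are hypothetical stand-ins for illustration, not part of arrow.

```rust
/// Hypothetical helper (not an arrow API): pick items by index and
/// return an error instead of panicking on an out-of-bounds index.
fn select_by_index<T: Clone>(items: &[T], indices: &[usize]) -> Result<Vec<T>, String> {
    indices
        .iter()
        .map(|&i| {
            // `get` yields `Option<&T>`, so a bad index becomes `None`,
            // which `ok_or_else` turns into an `Err` with a message.
            items.get(i).cloned().ok_or_else(|| {
                format!("project index {} out of bounds, max field {}", i, items.len())
            })
        })
        // Collecting an iterator of `Result<T, E>` into `Result<Vec<T>, E>`
        // stops at the first `Err` and returns it.
        .collect::<Result<Vec<_>, _>>()
}
```

The turbofish only tells `collect` which container to build; the element and error types are inferred from the closure.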
Yea, that seems good to me, the for loop was the first thing that popped into my head, but I can't think of any reason it's better than yours.
I think the for loop is what one would write in other languages like C/C++, Java, Go, etc. :) It is certainly what I was writing when I started learning Rust.
Then I realized that a big part of how Rust avoids bounds checks while still being safe is its use of the functional style.
        assert_eq!(projected.fields()[0].name(), "name");
        assert_eq!(projected.fields()[1].name(), "priority");
        assert_eq!(projected.metadata.get("meta").unwrap(), "data")
    }
Related to above -- I recommend a test for handling if index is out of bounds -- like schema.project([2, 3])
Sure, will do
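A rough sketch of what such an out-of-bounds test could look like, written against a hypothetical `MiniSchema` stand-in (plain strings for fields, a `String` error) so that it runs without arrow. The real test would use `Schema`, `Field` and `ArrowError` instead.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for arrow's `Schema`, for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct MiniSchema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

impl MiniSchema {
    /// Keep only the fields at `indices`; metadata is carried over unchanged.
    fn project(&self, indices: &[usize]) -> Result<MiniSchema, String> {
        let fields = indices
            .iter()
            .map(|&i| {
                self.fields
                    .get(i)
                    .cloned()
                    .ok_or_else(|| format!("project index {} out of bounds", i))
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(MiniSchema { fields, metadata: self.metadata.clone() })
    }
}

/// Three fields (valid indices 0..=2) and one metadata entry.
fn sample_schema() -> MiniSchema {
    let mut metadata = HashMap::new();
    metadata.insert("meta".to_string(), "data".to_string());
    MiniSchema {
        fields: vec!["id".to_string(), "name".to_string(), "priority".to_string()],
        metadata,
    }
}

#[test]
fn project_out_of_bounds_is_an_error() {
    // Index 3 does not exist, so the whole projection must fail.
    assert!(sample_schema().project(&[2, 3]).is_err());
}
```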
arrow/src/record_batch.rs
Outdated
@@ -175,6 +175,12 @@ impl RecordBatch {
        self.schema.clone()
    }

    /// Projects the schema onto the specified columns
    pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> {
The intent of this function was to project the RecordBatch rather than just the schema, with a signature like this:
-    pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> {
+    pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<RecordBatch> {
(so we would have to project the columns as well as the schema)
Ahh, I thought this part was a bit too easy, okay I'll update to reflect that.
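To make "project the columns as well as the schema" concrete, here is a small sketch on a hypothetical `MiniBatch` type. `Column` is an assumed stand-in for arrow's `ArrayRef` (an `Arc`, so cloning a column only bumps a reference count), and the error is a plain `String` rather than `ArrowError`.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `ArrayRef`; cloning is cheap.
type Column = Arc<Vec<i32>>;

// Hypothetical stand-in for `RecordBatch`: column names play the role of the schema.
#[derive(Debug, PartialEq)]
struct MiniBatch {
    names: Vec<String>,
    columns: Vec<Column>,
}

impl MiniBatch {
    /// Select the same indices from both the names and the columns,
    /// so the projected "schema" and data stay in sync.
    fn project(&self, indices: &[usize]) -> Result<MiniBatch, String> {
        let pairs = indices
            .iter()
            .map(|&i| {
                let name = self.names.get(i).cloned();
                let column = self.columns.get(i).cloned();
                // `zip` is `Some` only if both lookups succeeded.
                name.zip(column)
                    .ok_or_else(|| format!("project index {} out of bounds", i))
            })
            .collect::<Result<Vec<_>, _>>()?;
        let (names, columns): (Vec<String>, Vec<Column>) = pairs.into_iter().unzip();
        Ok(MiniBatch { names, columns })
    }
}
```

Because the columns are reference counted, the projected batch shares the underlying data with the original rather than copying it.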
@alamb
It looks like the new code may not yet have been pushed to GitHub
Codecov Report
@@            Coverage Diff             @@
##           master    #1033      +/-   ##
==========================================
- Coverage   82.31%   82.25%   -0.07%
==========================================
  Files         168      168
  Lines       49031    49197     +166
==========================================
+ Hits        40360    40465     +105
- Misses       8671     8732      +61
Thanks for sticking with this @hntd187
        RecordBatch::try_new(SchemaRef::new(projected_schema), batch_fields)
    }
How about some tests?
Perhaps something like
    #[test]
    fn project() {
        let a: ArrayRef = Arc::new(Int32Array::from(vec![
            Some(1),
            None,
            Some(3),
        ]));
        let b: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "c"]));
        let c: ArrayRef = Arc::new(StringArray::from(vec!["d", "e", "f"]));

        let record_batch = RecordBatch::try_from_iter(vec![("a", a.clone()), ("b", b.clone()), ("c", c.clone())])
            .expect("valid conversion");

        let expected = RecordBatch::try_from_iter(vec![("a", a), ("c", c)])
            .expect("valid conversion");

        assert_eq!(expected, record_batch.project(&vec![0, 2]).unwrap());
    }
arrow/src/record_batch.rs
Outdated
        &self,
        indices: impl IntoIterator<Item = usize> + Clone,
    ) -> Result<RecordBatch> {
        let projected_schema = self.schema.project(indices.clone())?;
I see now why you needed to make the iter Clone, which is kind of annoying 🤔
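A small sketch of why the `Clone` bound shows up: `into_iter` consumes its receiver, so a function that needs two passes over `impl IntoIterator` has to clone it first. `project_both` is a hypothetical helper on plain slices, not the PR's code, and it returns `None` on an out-of-bounds index.

```rust
/// Hypothetical two-pass projection over parallel slices.
fn project_both<I>(names: &[&str], values: &[i32], indices: I) -> Option<(Vec<String>, Vec<i32>)>
where
    I: IntoIterator<Item = usize> + Clone,
{
    // First pass: `into_iter` takes `indices` by value, so iterate a clone
    // to keep the original available for the second pass.
    let projected_names = indices
        .clone()
        .into_iter()
        .map(|i| names.get(i).map(|n| n.to_string()))
        .collect::<Option<Vec<_>>>()?;

    // Second pass: this one may consume `indices`.
    let projected_values = indices
        .into_iter()
        .map(|i| values.get(i).copied())
        .collect::<Option<Vec<_>>>()?;

    Some((projected_names, projected_values))
}
```

With a `Vec<usize>` argument the clone copies the whole index list, which is the cost a borrowed slice would avoid.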
arrow/src/datatypes/schema.rs
Outdated
@@ -87,6 +87,24 @@ impl Schema {
        Self { fields, metadata }
    }

    /// Returns a new schema with only the specified columns in the new schema
    /// This carries metadata from the parent schema over as well
    pub fn project(&self, indices: impl IntoIterator<Item = usize>) -> Result<Schema> {
I know I did something different in the ticket, but I think this interface is kind of annoying.
Namely, I couldn't pass in &vec![1, 2]:
   --> arrow/src/datatypes/schema.rs:405:40
    |
405 |         let projected: Schema = schema.project(&vec![0, 2]).unwrap();
    |                                        ^^^^^^^ expected `&{integer}`, found `usize`
What would you think about being less fancy and changing this (and RecordBatch) to something like:
    pub fn project(&self, indices: &[usize]) -> Result<Schema> {
Which would then avoid the need for the clone on RecordBatch::project as well.
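As a quick illustration of the call-site ergonomics of a `&[usize]` parameter, here is a sketch with a hypothetical free function, `project_names`, on string slices. The point is that `&Vec<usize>`, array references and sub-slices all coerce to `&[usize]`, and the slice can be iterated more than once without cloning.

```rust
/// Hypothetical helper (not arrow's `Schema::project`): returns the names at
/// `indices`, or `None` if any index is out of bounds.
fn project_names<'a>(fields: &[&'a str], indices: &[usize]) -> Option<Vec<&'a str>> {
    indices
        .iter()
        // `get` + `copied` turns `Option<&&str>` into `Option<&str>`
        .map(|&i| fields.get(i).copied())
        // Collecting into `Option<Vec<_>>` yields `None` on the first miss
        .collect()
}
```

Callers can then write `project_names(&fields, &vec![0, 2])`, `project_names(&fields, &[1])` or `project_names(&fields, &owned[1..])` interchangeably.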
Looks good -- thank you @hntd187
* Allow Schema and RecordBatch to project schemas on specific columns returning a new schema with those columns only
* Addressing PR updates and adding a test for out of range projection
* switch to &[usize]
* fix: clippy and fmt
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Allow Schema and RecordBatch to project schemas on specific columns returning a new schema with those columns only
* Addressing PR updates and adding a test for out of range projection
* switch to &[usize]
* fix: clippy and fmt
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Stephen Carman <hntd187@users.noreply.github.com>
Which issue does this PR close?
Closes #1014.
Rationale for this change
See #1014. A lot of code can be simplified, and this also fixes silent bugs in metadata handling.
What changes are included in this PR?
Two methods, Schema::project and RecordBatch::project, that project a schema or record batch onto the specified columns.
Are there any user-facing changes?