-
Notifications
You must be signed in to change notification settings - Fork 221
Get StructArray
's child array by column name
#416
Conversation
Codecov Report
@@ Coverage Diff @@
## main #416 +/- ##
=======================================
Coverage 80.96% 80.97%
=======================================
Files 347 347
Lines 22201 22215 +14
=======================================
+ Hits 17975 17988 +13
- Misses 4226 4227 +1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the proposal!
In general, I am opposed to some of these changes. What is added here can be achieved with one liners or introduces functionality that makes some assumptions about the struct.
In this context, both polars
and datafuse
follow the following pattern: arrow2
offers the basics, which is extended via traits specific to each of them.
In this case, the following can be done:
pub trait StructArrayExt {
fn column_names(&self) -> Vec<&str>;
fn column_by_name(&self, column_name: &str) -> Option<&ArrayRef>;
}
impl StructArrayExt for arrow2::arrow::StructArray {
fn column_by_name(&self, column_name: &str) -> Option<&ArrayRef> {
self.fields()
.iter()
.position(|c| c.name() == &column_name)
.map(|pos| self.value(pos))
}
fn column_names(&self) -> Vec<&str> {
self.fields().iter().map(|f| f.name.as_str()).collect()
}
}
The benefit of this approach is that we do only expose methods in arrow2
that we leave specialized APIs to dependencies that make assumptions about the usage of the Struct
.
} | ||
|
||
/// Return the number of fields in this struct array | ||
pub fn num_columns(&self) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe num_fields
for consistency, since Structs have fields.
@@ -8,6 +8,7 @@ use crate::{ | |||
}; | |||
|
|||
use super::{ffi::ToFfi, new_empty_array, new_null_array, Array, FromFfi}; | |||
use crate::array::ArrayRef; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use Arc<dyn Array>
? ArrayRef
is only kept in the public API for compatiblity with DataFusion. For people not used to it, this adds an extra layer of indirection in reading code that is not necessary.
/// Returns the fields of this [`StructArray`]. | ||
pub fn fields(&self) -> &[Field] { | ||
Self::get_fields(&self.data_type) | ||
} | ||
|
||
/// Return field names in this struct array | ||
pub fn column_names(&self) -> Vec<&str> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This operation is O(F)
where F
is the number of fields and requires an allocation. Generally, it is more efficient to use self.fields().iter()
.
self.fields().iter().map(|f| f.name.as_str()).collect() | ||
} | ||
|
||
/// Return child array whose field name equals to column_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is susceptible of errors: it returns the first field, which for structs with duplicated names, can cause hard-to-debug errors. I would prefer to leave this to whoever enforces unique names (or is ok with this semantic).
Thanks @jorgecarleitao for the detailed explanation and review. It makes sense to me. I like the |
It will be nice to be able to access
StructArray
's child array by name.