Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Get StructArray's child array by column name #416

Closed
wants to merge 1 commit into from

Conversation

yjshen
Copy link
Contributor

@yjshen yjshen commented Sep 17, 2021

It will be nice to be able to access StructArray's child array by name.

@codecov
Copy link

codecov bot commented Sep 17, 2021

Codecov Report

Merging #416 (cddb512) into main (06c45f8) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #416   +/-   ##
=======================================
  Coverage   80.96%   80.97%           
=======================================
  Files         347      347           
  Lines       22201    22215   +14     
=======================================
+ Hits        17975    17988   +13     
- Misses       4226     4227    +1     
Impacted Files Coverage Δ
src/array/struct_.rs 44.08% <100.00%> (+6.73%) ⬆️
tests/it/array/growable/struct_.rs 100.00% <100.00%> (ø)
src/array/boolean/ffi.rs 0.00% <0.00%> (-5.89%) ⬇️
src/io/json/read/reader.rs 74.32% <0.00%> (-1.36%) ⬇️
src/io/json_integration/write.rs 6.25% <0.00%> (+6.25%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 06c45f8...cddb512. Read the comment docs.

Copy link
Owner

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the proposal!

In general, I am opposed to some of these changes. What is added here can be achieved with one liners or introduces functionality that makes some assumptions about the struct.

In this context, both polars and datafuse follow the following pattern: arrow2 offers the basics, which is extended via traits specific to each of them.

In this case, the following can be done:

pub trait StructArrayExt {
    fn column_names(&self) -> Vec<&str>;
    fn column_by_name(&self, column_name: &str) -> Option<&ArrayRef>;
}

impl StructArrayExt for arrow2::arrow::StructArray {
    fn column_by_name(&self, column_name: &str) -> Option<&ArrayRef> {
        self.fields()
            .iter()
            .position(|c| c.name() == &column_name)
            .map(|pos| self.value(pos))
    }

    fn column_names(&self) -> Vec<&str> {
         self.fields().iter().map(|f| f.name.as_str()).collect()
    }
}

The benefit of this approach is that we do only expose methods in arrow2 that we leave specialized APIs to dependencies that make assumptions about the usage of the Struct.

}

/// Return the number of fields in this struct array
pub fn num_columns(&self) -> usize {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe num_fields for consistency, since Structs have fields.

@@ -8,6 +8,7 @@ use crate::{
};

use super::{ffi::ToFfi, new_empty_array, new_null_array, Array, FromFfi};
use crate::array::ArrayRef;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use Arc<dyn Array>? ArrayRef is only kept in the public API for compatiblity with DataFusion. For people not used to it, this adds an extra layer of indirection in reading code that is not necessary.

/// Returns the fields of this [`StructArray`].
pub fn fields(&self) -> &[Field] {
Self::get_fields(&self.data_type)
}

/// Return field names in this struct array
pub fn column_names(&self) -> Vec<&str> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This operation is O(F) where F is the number of fields and requires an allocation. Generally, it is more efficient to use self.fields().iter().

self.fields().iter().map(|f| f.name.as_str()).collect()
}

/// Return child array whose field name equals to column_name
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is susceptible of errors: it returns the first field, which for structs with duplicated names, can cause hard-to-debug errors. I would prefer to leave this to whoever enforces unique names (or is ok with this semantic).

@yjshen
Copy link
Contributor Author

yjshen commented Sep 17, 2021

Thanks @jorgecarleitao for the detailed explanation and review. It makes sense to me. I like the StructArrayExt you proposed to put into DataFusion. I'll close this PR then.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants