-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support Map literals in Substrait consumer and producer #11547
Conversation
31d914d
to
b820936
Compare
This and #11510 will conflict, once either one is merged I'll be happy to fix the other one. |
b820936
to
c7bad12
Compare
let mut values_hashes = vec![0u64; array.entries().len()]; | ||
create_hashes(array.entries().columns(), random_state, &mut values_hashes)?; | ||
|
||
// Combine the hashes for entries on each row with each other and previous hash for that row |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adapted this logic from combining List and Struct hashing, the result seemed to make sense to me, but I'm not 100% confident in it
@@ -1773,6 +1773,7 @@ impl ScalarValue { | |||
} | |||
DataType::List(_) | |||
| DataType::LargeList(_) | |||
| DataType::Map(_, _) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this works at least for the test case, given it re-uses arrow::compute::concat I'd hope it does the right thing overall
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is fine I think because ScalarValue::Map
is implemented as a 1 row array
|
||
|
||
query ? | ||
VALUES (MAP(['a'], [1])), (MAP(['b'], [2])), (MAP(['c', 'a'], [3, 1])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the changes in scalar/mod.rs
, this would error:
External error: query failed: DataFusion error: Internal error: Unsupported creation of Map(Field { name: "entries", data_type: Struct([Field { name: "key", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false) array from ScalarValue Some(Map([{}])).
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
[SQL] VALUES(MAP([], []))
at test_files/map.slt:307
dfs_names, | ||
&mut entry_name_idx, | ||
)?; | ||
ScalarStructBuilder::new() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was the most high-level way of creating the map I could think of, lmk if you have better ideas!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also/alternatively, should I add this into ScalarValue? they could sit next to ScalarValue::new_list etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was the most high-level way of creating the map I could think of, lmk if you have better ideas!
It looks good to me. If you intend to be more efficient, I suggest referring to how make_map_batch_internal
creates a MapArray
.
fn make_map_batch_internal( |
You need to partition the key and value pairs into two arrays, and build the MapArray
based on them. Maybe you can refer to plan_make_map
.
fn plan_make_map(&self, args: Vec<Expr>) -> Result<PlannerResult<Vec<Expr>>> { |
However, I think it just some improvements. We don't need to do that in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmh, yeah I think I'll leave it as-is for now, as this code is relevant only for local relations (ie data encoded in the substrait message itself) I don't expect it to be very performance-sensitive, the data scale should hopefully always be small...
@alamb thanks for merging the other PR, this is now rebased on top and also ready for review! :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me @Blizzara
I wonder if @goldmedal you might have some time to review this PR as well?
@@ -1773,6 +1773,7 @@ impl ScalarValue { | |||
} | |||
DataType::List(_) | |||
| DataType::LargeList(_) | |||
| DataType::Map(_, _) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is fine I think because ScalarValue::Map
is implemented as a 1 row array
#[test] | ||
// Tests actual values of hashes, which are different if forcing collisions | ||
#[cfg(not(feature = "force_hash_collisions"))] | ||
fn create_hashes_for_map_arrays() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would help me undertand / verify this test if you could use a MapBuilder
or add a comment showing what MapArray was being built
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done! 06b7d3c
Sure, I'll review this tonight. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Blizzara, Overall LGTM. I just leave some minor comments.
datafusion/common/src/hash_utils.rs
Outdated
assert_ne!(hashes[0], hashes[2]); // different key | ||
assert_ne!(hashes[0], hashes[3]); // different value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert_ne!(hashes[0], hashes[2]); // different key | |
assert_ne!(hashes[0], hashes[3]); // different value | |
assert_ne!(hashes[0], hashes[2]); // different value | |
assert_ne!(hashes[0], hashes[3]); // different key |
I guess the comments are wrong.
The difference between Row 0
and Row 2
is the value of key2
: {'key2': 11}
in Row 0 and {'key2': 12}
in Row 2.
However, the difference between Row 0
and Row 3
is the key: key2
in Row 0 and key3
in Row 3.
Did I say that correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, great catch! I must have confused myself, or changed it after writing 😅 fixed in 62149fa, thanks!
dfs_names, | ||
&mut entry_name_idx, | ||
)?; | ||
ScalarStructBuilder::new() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was the most high-level way of creating the map I could think of, lmk if you have better ideas!
It looks good to me. If you intend to be more efficient, I suggest referring to how make_map_batch_internal
creates a MapArray
.
fn make_map_batch_internal( |
You need to partition the key and value pairs into two arrays, and build the MapArray
based on them. Maybe you can refer to plan_make_map
.
fn plan_make_map(&self, args: Vec<Expr>) -> Result<PlannerResult<Vec<Expr>>> { |
However, I think it just some improvements. We don't need to do that in this PR.
Thanks @goldmedal and @Blizzara -- 👌 very nice |
Which issue does this PR close?
Related to Map epic #11434
Rationale for this change
Substrait didn't support Map literals, since we previously didn't have Map ScalarValues. #11224 implemented ScalarValues (thanks!), so now we can add them into Substrait as well.
There were also couple gaps left by the implementation that I encountered while testing this, so I fixed those as well.
Also, there was a bug in the from_substrait_literal for lists containing structs, which I realized while implementing the map support.
What changes are included in this PR?
ScalarValue::iter_to_array
andcreate_hashes
Are these changes tested?
Tested with new roundtrip tests for the Map literals, and unit tests for hashing.
Also an existing UT for list literals is extended to cover the multiple structs values case.
I'd have added a sql roundtrip test for Map as well, but it seems that the
MAP
command turns into a ScalarFunction rather than ScalarValue. Maybe we'd need to run some round of optimizer to fold it, but I wonder why that is different from theSTRUCT
command which does turn into a struct literal?I.e. doing
roundtrip("VALUES (MAP(['k1', 'k2'], [true, CAST(NULL AS BOOLEAN)]))").await?;
results inError: Substrait("Only literal types can be aliased in Virtual Tables, got: ScalarFunction")
, while similar code for STRUCT works fine.Are there any user-facing changes?