Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support Map literals in Substrait consumer and producer #11547

Merged
merged 9 commits into from
Jul 23, 2024

Conversation

Blizzara
Copy link
Contributor

Which issue does this PR close?

Related to Map epic #11434

Rationale for this change

Substrait didn't support Map literals, since we previously didn't have Map ScalarValues. #11224 implemented ScalarValues (thanks!), so now we can add them into Substrait as well.

There were also couple gaps left by the implementation that I encountered while testing this, so I fixed those as well.

Also, there was a bug in the from_substrait_literal for lists containing structs, which I realized while implementing the map support.

What changes are included in this PR?

  • Implement Substrait roundtrip support for Map literals, incl. null and empty maps.
  • Fix field name handling for Substrait lists containing structs
  • Add Map support to ScalarValue::iter_to_array and create_hashes

Are these changes tested?

Tested with new roundtrip tests for the Map literals, and unit tests for hashing.
Also an existing UT for list literals is extended to cover the multiple structs values case.

I'd have added a sql roundtrip test for Map as well, but it seems that the MAP command turns into a ScalarFunction rather than ScalarValue. Maybe we'd need to run some round of optimizer to fold it, but I wonder why that is different from the STRUCT command which does turn into a struct literal?

I.e. doing roundtrip("VALUES (MAP(['k1', 'k2'], [true, CAST(NULL AS BOOLEAN)]))").await?; results in Error: Substrait("Only literal types can be aliased in Virtual Tables, got: ScalarFunction"), while similar code for STRUCT works fine.

Are there any user-facing changes?

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) substrait labels Jul 19, 2024
@Blizzara Blizzara marked this pull request as ready for review July 19, 2024 12:16
@Blizzara Blizzara changed the title Avo/substrait map literals feat: Support Map literals in Substrait consumer and producer Jul 19, 2024
@Blizzara
Copy link
Contributor Author

This and #11510 will conflict, once either one is merged I'll be happy to fix the other one.

@Blizzara Blizzara changed the title feat: Support Map literals in Substrait consumer and producer feat: support Map literals in Substrait consumer and producer Jul 19, 2024
let mut values_hashes = vec![0u64; array.entries().len()];
create_hashes(array.entries().columns(), random_state, &mut values_hashes)?;

// Combine the hashes for entries on each row with each other and previous hash for that row
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adapted this logic from combining List and Struct hashing, the result seemed to make sense to me, but I'm not 100% confident in it

@@ -1773,6 +1773,7 @@ impl ScalarValue {
}
DataType::List(_)
| DataType::LargeList(_)
| DataType::Map(_, _)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this works at least for the test case, given it re-uses arrow::compute::concat I'd hope it does the right thing overall

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine I think because ScalarValue::Map is implemented as a 1 row array



query ?
VALUES (MAP(['a'], [1])), (MAP(['b'], [2])), (MAP(['c', 'a'], [3, 1]))
Copy link
Contributor Author

@Blizzara Blizzara Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the changes in scalar/mod.rs, this would error:

External error: query failed: DataFusion error: Internal error: Unsupported creation of Map(Field { name: "entries", data_type: Struct([Field { name: "key", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false) array from ScalarValue Some(Map([{}])).
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
[SQL] VALUES(MAP([], []))
at test_files/map.slt:307

dfs_names,
&mut entry_name_idx,
)?;
ScalarStructBuilder::new()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was the most high-level way of creating the map I could think of, lmk if you have better ideas!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also/alternatively, should I add this into ScalarValue? they could sit next to ScalarValue::new_list etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was the most high-level way of creating the map I could think of, lmk if you have better ideas!

It looks good to me. If you intend to be more efficient, I suggest referring to how make_map_batch_internal creates a MapArray.

fn make_map_batch_internal(

You need to partition the key and value pairs into two arrays, and build the MapArray based on them. Maybe you can refer to plan_make_map.

fn plan_make_map(&self, args: Vec<Expr>) -> Result<PlannerResult<Vec<Expr>>> {

However, I think it just some improvements. We don't need to do that in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmh, yeah I think I'll leave it as-is for now, as this code is relevant only for local relations (ie data encoded in the substrait message itself) I don't expect it to be very performance-sensitive, the data scale should hopefully always be small...

@Blizzara
Copy link
Contributor Author

@alamb thanks for merging the other PR, this is now rebased on top and also ready for review! :)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me @Blizzara

I wonder if @goldmedal you might have some time to review this PR as well?

@@ -1773,6 +1773,7 @@ impl ScalarValue {
}
DataType::List(_)
| DataType::LargeList(_)
| DataType::Map(_, _)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine I think because ScalarValue::Map is implemented as a 1 row array

#[test]
// Tests actual values of hashes, which are different if forcing collisions
#[cfg(not(feature = "force_hash_collisions"))]
fn create_hashes_for_map_arrays() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help me undertand / verify this test if you could use a MapBuilder or add a comment showing what MapArray was being built

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done! 06b7d3c

@goldmedal
Copy link
Contributor

I wonder if @goldmedal you might have some time to review this PR as well?

Sure, I'll review this tonight.

Copy link
Contributor

@goldmedal goldmedal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Blizzara, Overall LGTM. I just leave some minor comments.

Comment on lines 786 to 787
assert_ne!(hashes[0], hashes[2]); // different key
assert_ne!(hashes[0], hashes[3]); // different value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert_ne!(hashes[0], hashes[2]); // different key
assert_ne!(hashes[0], hashes[3]); // different value
assert_ne!(hashes[0], hashes[2]); // different value
assert_ne!(hashes[0], hashes[3]); // different key

I guess the comments are wrong.
The difference between Row 0 and Row 2 is the value of key2: {'key2': 11} in Row 0 and {'key2': 12} in Row 2.
However, the difference between Row 0 and Row 3 is the key: key2 in Row 0 and key3 in Row 3.
Did I say that correctly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, great catch! I must have confused myself, or changed it after writing 😅 fixed in 62149fa, thanks!

dfs_names,
&mut entry_name_idx,
)?;
ScalarStructBuilder::new()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was the most high-level way of creating the map I could think of, lmk if you have better ideas!

It looks good to me. If you intend to be more efficient, I suggest referring to how make_map_batch_internal creates a MapArray.

fn make_map_batch_internal(

You need to partition the key and value pairs into two arrays, and build the MapArray based on them. Maybe you can refer to plan_make_map.

fn plan_make_map(&self, args: Vec<Expr>) -> Result<PlannerResult<Vec<Expr>>> {

However, I think it just some improvements. We don't need to do that in this PR.

@alamb
Copy link
Contributor

alamb commented Jul 23, 2024

Thanks @goldmedal and @Blizzara -- 👌 very nice

@alamb alamb merged commit f80dde0 into apache:main Jul 23, 2024
24 checks passed
@Blizzara Blizzara deleted the avo/substrait-map-literals branch July 24, 2024 07:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants