feat: support Map literals in Substrait consumer and producer #11547

Blizzara · 2024-07-19T12:11:53Z

Which issue does this PR close?

Related to Map epic #11434

Rationale for this change

Substrait didn't support Map literals, since we previously didn't have Map ScalarValues. #11224 implemented ScalarValues (thanks!), so now we can add them into Substrait as well.

There were also a couple of gaps left by that implementation that I encountered while testing this, so I fixed those as well.

There was also a bug in from_substrait_literal for lists containing structs, which I noticed while implementing the map support.

What changes are included in this PR?

Implement Substrait roundtrip support for Map literals, incl. null and empty maps.
Fix field name handling for Substrait lists containing structs
Add Map support to ScalarValue::iter_to_array and create_hashes

Are these changes tested?

Tested with new roundtrip tests for the Map literals, and unit tests for hashing.
An existing unit test for list literals is also extended to cover the case of multiple struct values.

I'd have added a SQL roundtrip test for Map as well, but it seems that the MAP command turns into a ScalarFunction rather than a ScalarValue. Maybe we'd need to run a round of the optimizer to fold it, but I wonder why that is different from the STRUCT command, which does turn into a struct literal?

I.e. doing roundtrip("VALUES (MAP(['k1', 'k2'], [true, CAST(NULL AS BOOLEAN)]))").await?; results in Error: Substrait("Only literal types can be aliased in Virtual Tables, got: ScalarFunction"), while similar code for STRUCT works fine.

Are there any user-facing changes?

Blizzara · 2024-07-19T12:24:52Z

This and #11510 will conflict; once either one is merged, I'll be happy to fix the other one.

Blizzara · 2024-07-19T21:30:35Z

datafusion/common/src/hash_utils.rs

+    let mut values_hashes = vec![0u64; array.entries().len()];
+    create_hashes(array.entries().columns(), random_state, &mut values_hashes)?;
+
+    // Combine the hashes for entries on each row with each other and previous hash for that row


I adapted this logic by combining the List and Struct hashing. The result seemed to make sense to me, but I'm not 100% confident in it.
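To make the combining step in the diff comment above easier to check, here is a minimal std-only sketch of the idea: hash every entry once, then fold the entry hashes of each row into that row's existing hash using the map offsets. The names `hash_entry`, `combine_hashes`, and `hash_map_rows`, the use of `DefaultHasher`, and the mixing constants are assumptions for illustration; this is not the code in hash_utils.rs and it ignores null rows.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for hashing one (key, value) entry.
/// This is not the hasher DataFusion uses.
fn hash_entry(key: &str, value: Option<i32>) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    value.hash(&mut hasher);
    hasher.finish()
}

/// Order-sensitive mix of an accumulated hash with the next one.
/// The constants are illustrative assumptions.
fn combine_hashes(acc: u64, next: u64) -> u64 {
    (17u64 * 37)
        .wrapping_add(acc)
        .wrapping_mul(37)
        .wrapping_add(next)
}

/// `offsets` has one more element than there are rows; row `i` owns
/// `entry_hashes[offsets[i]..offsets[i + 1]]`. `row_hashes[i]` may already
/// hold a hash from earlier columns, which is folded in as the seed.
fn hash_map_rows(offsets: &[usize], entry_hashes: &[u64], row_hashes: &mut [u64]) {
    for (row, window) in offsets.windows(2).enumerate() {
        for entry_hash in &entry_hashes[window[0]..window[1]] {
            row_hashes[row] = combine_hashes(row_hashes[row], *entry_hash);
        }
    }
}
```

Because the fold is order-sensitive, two rows with the same entries in a different order would hash differently in this sketch; whether that is desirable for maps is a separate design question that this thread does not settle.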

Blizzara · 2024-07-19T21:37:01Z

datafusion/common/src/scalar/mod.rs

@@ -1773,6 +1773,7 @@ impl ScalarValue {
            }
            DataType::List(_)
            | DataType::LargeList(_)
+            | DataType::Map(_, _)


This works at least for the test case. Given that it re-uses arrow::compute::concat, I'd hope it does the right thing overall.


Blizzara · 2024-07-19T21:37:28Z

datafusion/sqllogictest/test_files/map.slt

+
+
+query ?
+VALUES (MAP(['a'], [1])), (MAP(['b'], [2])), (MAP(['c', 'a'], [3, 1]))


Without the changes in scalar/mod.rs, this would error:

External error: query failed: DataFusion error: Internal error: Unsupported creation of Map(Field { name: "entries", data_type: Struct([Field { name: "key", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false) array from ScalarValue Some(Map([{}])). This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker [SQL] VALUES(MAP([], [])) at test_files/map.slt:307

Blizzara · 2024-07-19T21:39:09Z

datafusion/substrait/src/logical_plan/consumer.rs

+                        dfs_names,
+                        &mut entry_name_idx,
+                    )?;
+                    ScalarStructBuilder::new()


this was the most high-level way of creating the map I could think of, lmk if you have better ideas!

also/alternatively, should I add this into ScalarValue? they could sit next to ScalarValue::new_list etc

goldmedal (2024-07-23T13:41:27Z), quoting "this was the most high-level way of creating the map I could think of, lmk if you have better ideas!":

It looks good to me. If you intend to be more efficient, I suggest referring to how make_map_batch_internal creates a MapArray.

datafusion/datafusion/functions-array/src/map.rs

Line 80 in 77311a5

fn make_map_batch_internal(

You need to partition the key and value pairs into two arrays, and build the MapArray based on them. Maybe you can refer to plan_make_map.

datafusion/datafusion/functions-array/src/planner.rs

Line 104 in 77311a5

fn plan_make_map(&self, args: Vec<Expr>) -> Result<PlannerResult<Vec<Expr>>> {

However, I think these are just improvements. We don't need to do that in this PR.

Hmh, yeah, I think I'll leave it as-is for now. This code is relevant only for local relations (i.e. data encoded in the Substrait message itself), so I don't expect it to be very performance-sensitive; the data scale should hopefully always be small...
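For readers following the suggestion to partition key and value pairs into two arrays, the idea can be sketched without Arrow as below. `FlatMapColumn` and `partition_entries` are hypothetical names, and the plain vectors only model what a MapArray stores (flat keys, flat values, row offsets); this is not how make_map_batch_internal or plan_make_map are written.

```rust
/// Plain-Rust stand-in for a map column: flat keys and values plus row offsets.
/// Hypothetical type for illustration only, not an Arrow MapArray.
#[derive(Debug, PartialEq)]
struct FlatMapColumn {
    /// One more element than there are rows; row `i` owns entries
    /// `offsets[i]..offsets[i + 1]` in `keys` and `values`.
    offsets: Vec<usize>,
    keys: Vec<String>,
    values: Vec<Option<i64>>,
}

/// Split each row's (key, value) pairs into the two flat vectors and
/// record where every row ends.
fn partition_entries(rows: Vec<Vec<(String, Option<i64>)>>) -> FlatMapColumn {
    let mut offsets = Vec::with_capacity(rows.len() + 1);
    offsets.push(0);
    let mut keys = Vec::new();
    let mut values = Vec::new();
    for row in rows {
        let (row_keys, row_values): (Vec<String>, Vec<Option<i64>>) =
            row.into_iter().unzip();
        keys.extend(row_keys);
        values.extend(row_values);
        offsets.push(keys.len());
    }
    FlatMapColumn { offsets, keys, values }
}
```

Feeding it the three maps from the map.slt query in this PR ({'a': 1}, {'b': 2}, {'c': 3, 'a': 1}) would give offsets [0, 1, 2, 4] and keys a, b, c, a.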

Blizzara · 2024-07-19T21:41:30Z

@alamb thanks for merging the other PR, this is now rebased on top and also ready for review! :)

alamb

Makes sense to me @Blizzara

I wonder if @goldmedal you might have some time to review this PR as well?

alamb · 2024-07-22T18:38:11Z

datafusion/common/src/scalar/mod.rs

@@ -1773,6 +1773,7 @@ impl ScalarValue {
            }
            DataType::List(_)
            | DataType::LargeList(_)
+            | DataType::Map(_, _)


this is fine I think because ScalarValue::Map is implemented as a 1 row array
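The reasoning here (each map scalar is a one-row array, so building a column is just concatenation) can be illustrated with a small std-only model. `MapColumn`, `single_row`, and `concat` are hypothetical stand-ins, assuming every input column's offsets start at 0; the real code path goes through arrow::compute::concat, which this sketch does not call.

```rust
/// Plain-Rust model of a map column: flat entries plus row offsets.
/// Hypothetical type for illustration, not an Arrow MapArray.
#[derive(Debug, Clone, PartialEq)]
struct MapColumn {
    /// One more element than there are rows, starting at 0.
    offsets: Vec<usize>,
    entries: Vec<(String, i64)>,
}

impl MapColumn {
    /// Models a scalar map value: a column holding exactly one row.
    fn single_row(entries: Vec<(String, i64)>) -> Self {
        MapColumn { offsets: vec![0, entries.len()], entries }
    }

    fn num_rows(&self) -> usize {
        self.offsets.len() - 1
    }
}

/// Concatenate columns by appending entries and rebasing each column's
/// offsets onto the number of entries already written.
fn concat(columns: &[MapColumn]) -> MapColumn {
    let mut offsets = vec![0];
    let mut entries = Vec::new();
    for column in columns {
        let base = entries.len();
        offsets.extend(column.offsets[1..].iter().map(|offset| base + offset));
        entries.extend(column.entries.iter().cloned());
    }
    MapColumn { offsets, entries }
}
```

Concatenating three single-row columns yields a three-row column whose rows keep their own entries, which is the property the VALUES test in map.slt relies on.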

alamb · 2024-07-22T18:40:22Z

datafusion/common/src/hash_utils.rs

+    #[test]
+    // Tests actual values of hashes, which are different if forcing collisions
+    #[cfg(not(feature = "force_hash_collisions"))]
+    fn create_hashes_for_map_arrays() {


It would help me understand / verify this test if you could use a MapBuilder or add a comment showing what MapArray was being built

done! 06b7d3c

goldmedal · 2024-07-23T00:59:27Z

I wonder if @goldmedal you might have some time to review this PR as well?

Sure, I'll review this tonight.

goldmedal

Thanks @Blizzara, overall LGTM. I just left some minor comments.

goldmedal · 2024-07-23T13:02:10Z

datafusion/common/src/hash_utils.rs

+        assert_ne!(hashes[0], hashes[2]); // different key
+        assert_ne!(hashes[0], hashes[3]); // different value


Suggested change

-        assert_ne!(hashes[0], hashes[2]); // different key
-        assert_ne!(hashes[0], hashes[3]); // different value
+        assert_ne!(hashes[0], hashes[2]); // different value
+        assert_ne!(hashes[0], hashes[3]); // different key

I guess the comments are wrong.
The difference between Row 0 and Row 2 is the value of key2: {'key2': 11} in Row 0 and {'key2': 12} in Row 2.
However, the difference between Row 0 and Row 3 is the key: key2 in Row 0 and key3 in Row 3.
Did I say that correctly?

Yes, great catch! I must have confused myself, or changed it after writing 😅 fixed in 62149fa, thanks!

alamb · 2024-07-23T16:36:39Z

Thanks @goldmedal and @Blizzara -- 👌 very nice

github-actions bot added sqllogictest SQL Logic Tests (.slt) substrait labels Jul 19, 2024

Blizzara force-pushed the avo/substrait-map-literals branch from 31d914d to b820936 Compare July 19, 2024 12:12

Blizzara marked this pull request as ready for review July 19, 2024 12:16

Blizzara changed the title ~~Avo/substrait map literals~~ feat: Support Map literals in Substrait consumer and producer Jul 19, 2024

Blizzara changed the title ~~feat: Support Map literals in Substrait consumer and producer~~ feat: support Map literals in Substrait consumer and producer Jul 19, 2024

Blizzara added 4 commits July 19, 2024 23:23

implement Map literals/nulls conversions in Substrait

0b7b0a5

fix name handling for lists/maps containing structs

f36c96b

add hashing for map scalars

e673c27

add a test for creating a map in VALUES

c7bad12

Blizzara force-pushed the avo/substrait-map-literals branch from b820936 to c7bad12 Compare July 19, 2024 21:24

fix clipppy

0957739

Blizzara commented Jul 19, 2024

better test

3df03ae

Blizzara commented Jul 19, 2024

alamb approved these changes Jul 22, 2024

use MapBuilder in test

06b7d3c

goldmedal approved these changes Jul 23, 2024

Blizzara added 2 commits July 23, 2024 17:49

fix hash test

62149fa

remove unnecessary type variation checks from maps

9b5b1b5

alamb merged commit f80dde0 into apache:main Jul 23, 2024
24 checks passed

Blizzara deleted the avo/substrait-map-literals branch July 24, 2024 07:55
