
ARROW-14658: [C++] Add basic support for nested field refs in scanning #11704

Closed

Conversation


@lidavidm (Member) commented Nov 15, 2021

This implements the following:

  • Being able to project and filter on nested fields in the scanner/query engine.

Parquet, ORC, and Feather are supported/tested. For ORC and Feather, we will read the entire top-level column. (CSV does not support reading any nested types, though if it does in the future, it should behave the same as Feather/ORC.) For Parquet, we could materialize only the leaf nodes necessary for the projection, but without ARROW-1888 this will fail later on in the scanning pipeline, so we behave the same as Feather/ORC.
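For illustration, projecting and filtering on a nested field through the scanner would look roughly like the sketch below (written against the C++ Datasets API; the dataset variable and the c.e field are hypothetical, and exact signatures may differ by version):

#include <arrow/api.h>
#include <arrow/compute/expression.h>
#include <arrow/dataset/api.h>

namespace cp = arrow::compute;
namespace ds = arrow::dataset;

// Assume `dataset` has a struct column "c" with a child field "e".
arrow::Result<std::shared_ptr<arrow::Table>> ScanNested(
    const std::shared_ptr<ds::Dataset>& dataset) {
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  // Filter on the nested field c.e ...
  ARROW_RETURN_NOT_OK(builder->Filter(
      cp::greater(cp::field_ref(arrow::FieldRef("c", "e")), cp::literal(3))));
  // ... and project it out under a new top-level name "e".
  ARROW_RETURN_NOT_OK(builder->Project(
      {cp::field_ref(arrow::FieldRef("c", "e"))}, {"e"}));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}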

The following are not implemented:

  • Normally, the scanner can fill in a column of nulls if a requested column does not exist in a file. This is not supported for nested field refs because we need ARROW-1888 to be implemented.

  • A nested field ref cannot be used as a key/target of an aggregation or join. However, you can first project the nested fields into their own fields, then aggregate/join on them as usual.

    This limitation is because the aggregate/join nodes currently compute a FieldPath to resolve a FieldRef, but then throw away the path, keeping only the first index. To implement this, we would need to store the FieldPath and use the struct_field kernel to resolve the actual array; however, this would have more overhead, and we should be careful about regressions here, especially in the common case of no nested field refs.

  • Only FieldRefs consisting of field names are supported. For FieldRefs consisting of FieldPath (= a sequence of indices), the semantics are unclear. So far, the scanner is robust to individual files having fields in a different order than the overall dataset, but this won't work for FieldPath, so either we must require that the schema is consistent across files, or come up with some way to map file schemas onto the dataset schema so that indices have a consistent meaning.
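To make the last point concrete, the same index path can resolve to different fields in two files whose columns are ordered differently (a sketch; the schemas are made up):

#include <arrow/type.h>

// The dataset schema stores columns as (a, b); one file stores them as (b, a).
auto dataset_schema = arrow::schema(
    {arrow::field("a", arrow::int32()), arrow::field("b", arrow::utf8())});
auto file_schema = arrow::schema(
    {arrow::field("b", arrow::utf8()), arrow::field("a", arrow::int32())});

// FieldPath({0}) means "the first column" and ignores names entirely:
arrow::FieldPath path({0});
auto in_dataset = path.Get(*dataset_schema);  // resolves to field "a"
auto in_file = path.Get(*file_schema);        // resolves to field "b"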

@nealrichardson (Member) commented Nov 15, 2021

> • A nested field ref cannot be used as a key/target of an aggregation or join.

If I project a = struct_col.some_nested_field in one step, can I then aggregate a?

@lidavidm (Member Author):

> • A nested field ref cannot be used as a key/target of an aggregation or join.
>
> If I project a = struct_col.some_nested_field in one step, can I then aggregate a?

Yes, you can (I just pushed a test to confirm that).

@nealrichardson (Member):

> • A nested field ref cannot be used as a key/target of an aggregation or join.
>
> If I project a = struct_col.some_nested_field in one step, can I then aggregate a?
>
> Yes, you can (I just pushed a test to confirm that).

Excellent. Can you note that on the PR description then? (i.e. you can't directly use a nested field ref there but you can project and then use what you projected)

@lidavidm (Member Author):

Done (also clarified the comment about CSV)

@westonpace (Member) left a comment:

Looks like a great addition. I have some minor nits on const auto* that you're welcome to ignore, and a few questions. I think my biggest concern is that the scanner should have a consistent output schema regardless of the format. But maybe I'm reading that test wrong.

Comment on lines +1660 to +1677
const std::vector<FieldRef>* nested_refs() const {
return util::holds_alternative<std::vector<FieldRef>>(impl_)
? &util::get<std::vector<FieldRef>>(impl_)
: NULLPTR;
}
Member:

Why is the logic here different than the logic above in IsNested? I would expect this would be return IsNested() ? ...

Member Author:

Ah, this is because IsNested is checking whether it's either a FieldPath or a series of Names, but this accessor only wants the latter case. (I think the IsNested naming is a little unfortunate…)
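For reference, the distinction is roughly the following (a paraphrase, not the actual source): IsNested answers "does this ref descend more than one level?", which covers both a multi-index FieldPath and a sequence of names, while nested_refs() only exposes the latter variant.

bool IsNested() const {
  if (util::holds_alternative<std::string>(impl_)) return false;  // a single name
  if (util::holds_alternative<FieldPath>(impl_)) {
    // A FieldPath is nested only if it descends more than one level.
    return util::get<FieldPath>(impl_).indices().size() > 1;
  }
  return true;  // a std::vector<FieldRef> of names is always treated as nested
}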

@@ -709,6 +808,35 @@ class FileFormatScanMixin : public FileFormatFixtureMixin<FormatHelper>,
ASSERT_EQ(row_count, expected_rows());
}
}
void TestScanWithDuplicateColumn() {
Member:

Is there any particular reason to allow this?

Member Author:

Perhaps not - I just wanted to make sure I didn't break this inadvertently since it was working.

ASSERT_EQ(row_count, expected_rows());
}
{
// File includes an extra child in struct2
Member:

It seems arbitrary that we can't handle this case but we're fine with a missing child. Though maybe I am reading the test incorrectly.

Member Author:

We can handle missing fields just fine because (once ARROW-1888 is implemented) we can synthesize a null child to stand in for it. But, we can't handle a duplicate name because it's ambiguous which child we're referring to.
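For illustration, FieldRef's resolution API makes the ambiguity concrete (a sketch; the duplicate-name schema is made up):

#include <arrow/type.h>

// struct2 has two children that are both named "x".
auto schema = arrow::schema({arrow::field(
    "struct2", arrow::struct_({arrow::field("x", arrow::int32()),
                               arrow::field("x", arrow::utf8())}))});

auto ref = arrow::FieldRef("struct2", "x");
auto all = ref.FindAll(*schema);  // two candidate paths: {0, 0} and {0, 1}
auto one = ref.FindOne(*schema);  // fails: the ref matches multiple fields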

Member Author:

Ah, the comment here is a little misleading. I'll edit it to reflect that it's a duplicate name.

Comment on lines 1477 to 1492
auto batch_it = record_batches.begin();
for (int fragment_index = 0; fragment_index < 2; ++fragment_index) {
for (int batch_index = 0; batch_index < 2; ++batch_index) {
const auto& batch = *batch_it++;

// the scanned ExecBatches will begin with physical columns
batches.emplace_back(*batch);

// scanned batches will be augmented with fragment and batch indices
batches.back().values.emplace_back(fragment_index);
batches.back().values.emplace_back(batch_index);

// ... and with the last-in-fragment flag
batches.back().values.emplace_back(batch_index == 1);
}
}
Member:

This logic feels like it belongs in a helper method somewhere. Maybe a DatasetAndBatchesFromJSON

Comment on lines 1453 to 1461
RecordBatchFromJSON(physical_schema, R"([{"a": 1, "b": null, "c": {"e": 0}},
{"a": 2, "b": true, "c": {"e": 1}}])"),
RecordBatchFromJSON(physical_schema, R"([{"a": null, "b": true, "c": {"e": 2}},
{"a": 3, "b": false, "c": {"e": null}}])"),
RecordBatchFromJSON(physical_schema, R"([{"a": null, "b": true, "c": {"e": 4}},
{"a": 4, "b": false, "c": {"e": 5}}])"),
RecordBatchFromJSON(physical_schema, R"([{"a": 5, "b": null, "c": {"e": 6}},
{"a": 6, "b": false, "c": {"e": 7}},
{"a": 7, "b": false, "c": {"e": null}}])"),
Member:

Nit: Add some top-level nulls? Or cases where c is null?

const std::unordered_map<std::string, const SchemaField*>& field_lookup,
const std::unordered_set<std::string>& duplicate_fields,
std::vector<int>* columns_selection) {
if (const auto* name = field_ref.name()) {
Member:

Suggested change
if (const auto* name = field_ref.name()) {
if (const std::string* name = field_ref.name()) {

Same optional nit as above.

if (const auto* refs = field_ref.nested_refs()) {
// Only supports a sequence of names
for (const auto& ref : *refs) {
if (const auto* name = ref.name()) {
Member:

Suggested change
if (const auto* name = ref.name()) {
if (const std::string* name = ref.name()) {

Of course, by this point, I think I'm unlikely to forget the rule 😆 Feel free to ignore these.

}

const SchemaField* field = nullptr;
if (const auto* refs = field_ref.nested_refs()) {
Member:

Suggested change
if (const auto* refs = field_ref.nested_refs()) {
if (const std::vector<FieldRef>* refs = field_ref.nested_refs()) {

@@ -534,6 +543,8 @@ class FileFormatFixtureMixin : public ::testing::Test {
std::shared_ptr<ScanOptions> opts_;
};

MATCHER(PointeesEquals, "") { return std::get<0>(arg)->Equals(*std::get<1>(arg)); }
Member:

This seems general enough to go in a test util file?
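For context on usage: since the matcher compares std::get<0> and std::get<1> of a tuple, it is meant for gmock's Pointwise-style assertions over two containers, e.g. (a hypothetical helper, not part of this PR):

#include <gmock/gmock.h>

// Compare two vectors of schemas element-wise, dereferencing each pointer pair.
void AssertSchemasEqual(
    const std::vector<std::shared_ptr<arrow::Schema>>& actual,
    const std::vector<std::shared_ptr<arrow::Schema>>& expected) {
  EXPECT_THAT(actual, ::testing::Pointwise(PointeesEquals(), expected));
}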

Comment on lines 671 to 675
if (fine_grained_selection) {
// Some formats, like Parquet, let you pluck only a part of a complex type
expected_schema = schema({
field("struct1", struct_({f32})),
field("struct2", struct_({i64, struct1})),
});
} else {
expected_schema = schema({struct1, struct2});
}
Member:

If I'm understanding this correctly, I'm not sure I like it. I would expect the resulting schema to be the same regardless of whether the underlying format supports partial projection or not. For formats that don't support partial projection, I would expect it to be simulated by a full read and then a cast.

Member Author:

The overall schema will be the same once we pass through projection, i.e. the cast is done in the scanner instead of inside every file format. However, the tests here are reading from the fragment directly to check the physical schema, instead of the post-projection schema. I'll make sure both cases are covered in tests, though.

Member Author:

Ah…this is a little problematic since filtering/projection cast to the dataset schema first, and then we run into ARROW-1888 again as a result. I might go implement that first since this PR becomes a lot more useful with that.

Member Author:

(And, well, ARROW-1888 is a little easier with ARROW-7051…)

@westonpace (Member) commented Nov 16, 2021:

Yes, projection is now inside the exec plan. Also, projection doesn't occur until near the end of the exec plan (e.g. the filter step runs on the unprojected data). So it is important for the scan to do its own internal projection to the dataset schema.

If the blocker is ARROW-1888 I think it would be fine to implement this as-is with comments for a follow-up JIRA next to the test behavior we expect to change.

Member Author:

No tests will change (other than the one marked already), but as-is, you can't scan a Parquet dataset and project a nested field (since when we project from the specific schema to the dataset schema, we'll fail).

Member Author:

Alright, I've disabled fine-grained projection for now and marked it with ARROW-1888.

@lidavidm (Member Author) commented Dec 3, 2021

CC @westonpace if you have any final comments

@kszucs (Member) commented Jan 14, 2022

@lidavidm this requires a rebase. It'd be a nice addition to 7.0, but if it won't make it, please postpone the JIRA to 8.0.

@lidavidm (Member Author):

I'll try to rebase by EOD today, thanks for the ping.

@lidavidm (Member Author):

Rebased, cc @westonpace if you have any final comments

@westonpace (Member) left a comment:

I think this is good. Some thoughts:

Projecting foo out of struct: { "foo": int32 } yields a field named foo. I think my initial assumption would have been struct/foo, but that begs the question "what happens if a field name includes the delimiter (/)?" so I think this is fine. Plus, users can always supply names when they project if they care.

I wonder if sometime down the road we might want to convert nested field refs to integer arrays at a relatively high level in the API (e.g. sanitizing user input). For example, FieldRef("struct", "foo") becomes [3, 0] (assuming "struct" is the fourth field in the schema and "foo" is the first field in the struct). This would allow users to potentially specify nested refs in the presence of duplicate fields (if the user is willing to specify the ref as an integer array). I think this is closer to how nested refs are going to be coming from Substrait as well. Then, if we only accept integer arrays at the lower levels, it simplifies the logic and avoids the risk of the quadratic-time mapping.
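As a sketch of that idea, a name-based ref can already be resolved to an index path up front and only the indices passed further down (the schema here is hypothetical, matching the [3, 0] example above):

#include <arrow/type.h>

auto schema = arrow::schema(
    {arrow::field("a", arrow::int32()), arrow::field("b", arrow::utf8()),
     arrow::field("c", arrow::float64()),
     arrow::field("struct", arrow::struct_({arrow::field("foo", arrow::int32())}))});

// Resolve the name-based ref once at the API boundary...
auto path = arrow::FieldRef("struct", "foo").FindOne(*schema);  // FieldPath({3, 0})
// ...then hand only the integer path to the lower levels of the scanner.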

Comment on lines 396 to 398
TEST_P(TestCsvFileFormatScan, ScanRecordBatchReaderWithDuplicateColumn) {
// The CSV reader rejects duplicate columns
}
Member:

Nit: Can't we just omit this test? I don't suppose it causes any harm.

Member Author:

I meant this as 'documentation', so I've turned it into just a comment.

@lidavidm (Member Author):

I was thinking about the same thing in regards to indices. The current design, for better or worse, is "robust" to fields being reordered, but I don't think that was intentional and we may not want (or need) that property. If so, I agree that resolving fields to indices ASAP is best.

@lidavidm closed this in 5fb2243 on Jan 18, 2022
@ursabot commented Jan 18, 2022

Benchmark runs are scheduled for baseline = 30ddc2f and contender = 5fb2243. 5fb2243 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.04% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
