Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add invariants spec #443

Merged
merged 5 commits into from
Jun 2, 2021
Merged

add invariants spec #443

merged 5 commits into from
Jun 2, 2021

Conversation

houqp
Copy link
Member

@houqp houqp commented May 30, 2021

Rationale for this change

Align expectations for #55.

Closes #211

What changes are included in this PR?

Datafusion invariants spec adapted from https://docs.google.com/document/d/1dbK-3eaTHlzZcHzpTk1h-LA3b7dcxsVBcoZeVKYIPwI/edit#heading=h.aofypm3h2hs, which is based on the original invariants spec written by @jorgecarleitao at https://docs.google.com/document/d/1Asnz29uUS1t60QNbNBU9SiME274rja-hcDvX_RDraFU/edit#heading=h.vnlz1r2unc76.

Here is the rendered markdown file:
https://github.com/houqp/arrow-datafusion/blob/qp_spec/docs/specification/invariants.md

Are there any user-facing changes?

No.

@codecov-commenter
Copy link

codecov-commenter commented May 30, 2021

Codecov Report

Merging #443 (5ba60e2) into master (321fda4) will increase coverage by 0.67%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #443      +/-   ##
==========================================
+ Coverage   75.16%   75.84%   +0.67%     
==========================================
  Files         150      153       +3     
  Lines       25144    25872     +728     
==========================================
+ Hits        18899    19622     +723     
- Misses       6245     6250       +5     
Impacted Files Coverage Δ
datafusion/src/physical_plan/common.rs 84.21% <0.00%> (-2.00%) ⬇️
ballista/rust/scheduler/src/planner.rs 66.91% <0.00%> (-0.74%) ⬇️
datafusion/tests/sql.rs 99.27% <0.00%> (-0.62%) ⬇️
datafusion-cli/src/main.rs 0.00% <0.00%> (ø)
ballista/rust/executor/src/main.rs 0.00% <0.00%> (ø)
ballista/rust/executor/src/execution_loop.rs 0.00% <0.00%> (ø)
ballista/rust/executor/src/flight_service.rs 0.00% <0.00%> (ø)
datafusion/src/physical_plan/expressions/mod.rs 71.42% <0.00%> (ø)
ballista/rust/core/src/serde/logical_plan/mod.rs 99.40% <0.00%> (ø)
ballista/rust/core/src/serde/physical_plan/mod.rs 100.00% <0.00%> (ø)
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 321fda4...5ba60e2. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @houqp -- I think this is looking great.

docs/specification/invariants.md Outdated Show resolved Hide resolved
that knows how to derive its metadata and compute itself.

Note how the physical plane does not know how to derive field names: field
names are solely a property of the logical plane, as they are not needed in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the assertion that field names aren't needed for physical plans. Field names are part of the Schema that is passed around in Physical plans and the names are used to line up inputs to PhysicalExpr

Copy link
Member Author

@houqp houqp Jun 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao please correct me if I am wrong. I think the focus on this statement is on generation of field names. In the current code base, physical field names are generated using metadata from the logical plane, more specifically from logical Exprs. The actual physical plan and physical expressions are not used to generate the field names.

names are used to line up inputs to PhysicalExpr

This used to be true at the time when Jorge wrote it, but with #55, we will be changing it to line up inputs using field index instead. But regardless we make this change or not, it doesn't change the fact that physical filed names are only generated using metadata from the logical plane.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is my understanding also: naming is in the logical plane; in the physical plane, people can use whatever they want for as long as the record batches respect the output schema declared on the corresponding logical nodes.

People can derive a debug format for the physical expressions, but the naming must be set beforehand, or we risk having a divergence between the names declared in the output schema of the logical node and the record batches returned by the physical node.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I misread the text to say the physical plans had no field names

docs/specification/invariants.md Outdated Show resolved Hide resolved
docs/specification/invariants.md Outdated Show resolved Hide resolved
plan(logical_plan).schema === logical_plan.physical_schema
```

Logical plan's physical schema is defined as logical schema with relation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this -- wouldn't it be more like "logical schema with relation qualifiers squashed into field names"?

So input where (relation: my_table, field: my_column) --> (my_table.my_column) ?

Or perhaps this is saying internally (my_table.my_column) is the field name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a little bit counter intuitive. Basically physical field names are not supposed to include relation qualifiers. Here is a more detailed write up on why this is the case at https://docs.google.com/document/d/158gbfDp8pcakfriT2l7dHChwJB43_RV7lcWfxEC73ng/edit#heading=h.vkwn4d4zskrs. In short, relations need to be decoupled from record batch outputs so we can load the same datafusion output into different relations later.

In the beginning of this spec's Notation section, we defined a physical schema as:

Physical schema: a vector of physical fields, used by both physical plan and Arrow record batch.

The goal here is to make sure the physical schema derived from the logical plan will match exactly with the physical schema of the physical plan and record batches. And since physical plan and record batch schema fields are not supposed to contain relation qualifiers, we would need to do the same thing for logical plan's physical schema as well.

If you think the context on why relation should be tracked externally outside of physical schemas is worth documenting in this doc, I can add that google doc section as appendix as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is good enough as written.

docs/specification/invariants.md Outdated Show resolved Hide resolved
houqp and others added 4 commits June 1, 2021 19:55
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot, @houqp !

@alamb alamb merged commit ce95d3e into apache:master Jun 2, 2021
@houqp houqp deleted the qp_spec branch June 2, 2021 19:24
@houqp houqp added datafusion Changes in the datafusion crate documentation Improvements or additions to documentation labels Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add datafusion design goals / invariants to repository
4 participants