-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add invariants spec #443
add invariants spec #443
Conversation
Codecov Report
@@ Coverage Diff @@
## master #443 +/- ##
==========================================
+ Coverage 75.16% 75.84% +0.67%
==========================================
Files 150 153 +3
Lines 25144 25872 +728
==========================================
+ Hits 18899 19622 +723
- Misses 6245 6250 +5
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @houqp -- I think this is looking great.
that knows how to derive its metadata and compute itself. | ||
|
||
Note how the physical plane does not know how to derive field names: field | ||
names are solely a property of the logical plane, as they are not needed in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the assertion that field names aren't needed for physical plans. Field names are part of the Schema
that is passed around in Physical plans and the names are used to line up inputs to PhysicalExpr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorgecarleitao please correct me if I am wrong. I think the focus on this statement is on generation of field names. In the current code base, physical field names are generated using metadata from the logical plane, more specifically from logical Expr
s. The actual physical plan and physical expressions are not used to generate the field names.
names are used to line up inputs to PhysicalExpr
This used to be true at the time when Jorge wrote it, but with #55, we will be changing it to line up inputs using field index instead. But regardless we make this change or not, it doesn't change the fact that physical filed names are only generated using metadata from the logical plane.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is my understanding also: naming is in the logical plane; in the physical plane, people can use whatever they want for as long as the record batches respect the output schema declared on the corresponding logical nodes.
People can derive a debug format for the physical expressions, but the naming must be set beforehand, or we risk having a divergence between the names declared in the output schema of the logical node and the record batches returned by the physical node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I misread the text to say the physical plans had no field names
plan(logical_plan).schema === logical_plan.physical_schema | ||
``` | ||
|
||
Logical plan's physical schema is defined as logical schema with relation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this -- wouldn't it be more like "logical schema with relation qualifiers squashed into field names"?
So input where (relation: my_table, field: my_column)
--> (my_table.my_column)
?
Or perhaps this is saying internally (my_table.my_column)
is the field name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a little bit counter intuitive. Basically physical field names are not supposed to include relation qualifiers. Here is a more detailed write up on why this is the case at https://docs.google.com/document/d/158gbfDp8pcakfriT2l7dHChwJB43_RV7lcWfxEC73ng/edit#heading=h.vkwn4d4zskrs. In short, relations need to be decoupled from record batch outputs so we can load the same datafusion output into different relations later.
In the beginning of this spec's Notation section, we defined a physical schema as:
Physical schema: a vector of physical fields, used by both physical plan and Arrow record batch.
The goal here is to make sure the physical schema derived from the logical plan will match exactly with the physical schema of the physical plan and record batches. And since physical plan and record batch schema fields are not supposed to contain relation qualifiers, we would need to do the same thing for logical plan's physical schema as well.
If you think the context on why relation should be tracked externally outside of physical schemas is worth documenting in this doc, I can add that google doc section as appendix as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is good enough as written.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks a lot, @houqp !
Rationale for this change
Align expectations for #55.
Closes #211
What changes are included in this PR?
Datafusion invariants spec adapted from https://docs.google.com/document/d/1dbK-3eaTHlzZcHzpTk1h-LA3b7dcxsVBcoZeVKYIPwI/edit#heading=h.aofypm3h2hs, which is based on the original invariants spec written by @jorgecarleitao at https://docs.google.com/document/d/1Asnz29uUS1t60QNbNBU9SiME274rja-hcDvX_RDraFU/edit#heading=h.vnlz1r2unc76.
Here is the rendered markdown file:
https://github.com/houqp/arrow-datafusion/blob/qp_spec/docs/specification/invariants.md
Are there any user-facing changes?
No.