RFC-0033: Graph Query #33
Reviewed the updated data model part.
The following diagram illustrates the model:
For the graph data type, we model something very similar. A graph is a collection of *vertices* and *edges*,
where edges connect pairs of vertices and can be directed or undirected.
Vertices and edges have *labels* (similar to the attribute names in struct).
I think we should have some clarifying text around what this actually means. I think it is different from the struct
example as there is a 1-to-1 correspondence there between attribute names and values.
A PartiQL struct does allow repeating attribute names (with different values at them), doesn't it? So, intuitively, these settings are about the same, in my view: neither a struct's attribute name nor a graph's label is a perfect "address". Of course, repeating an attribute in a struct seems to be a discouraged corner case, while repeating labels in a property graph is more like the norm.
Overall, I am not sure what can be said here without getting into the weeds...
Another wart here is that we (as well as GPML) allow multiple labels on a node/edge. I intentionally wrote line 68 to glide past this matter, since this is an intro/intuition text.
For the graph data type, we model something very similar. A graph is a collection of *vertices* and *edges*,
where edges connect pairs of vertices and can be directed or undirected.
Vertices and edges have *labels* (similar to the attribute names in struct).
There is also a **value at** each vertex and edge, which can be any PartiQL value.
This is an interesting wording choice, I am interested to get your take on the distinction from the original text.
In my language intuition, "value of x" is something that follows from what x is, from what is already known as x's intrinsic properties: "the value of this real estate", "the value of expression 5+6". In contrast, "value at x" would indicate that x and the value are independent things associated with each other by fiat.
Of these two concepts, the latter better fits the situation in this graph data model. Re which word choice captures it better, I'd accept a better judgment -- make your verdict!
Could a midway option, just with typesetting, value at vertex --> value at vertex, irk one less?
- **Nodes** is a finite set of the *nodes* of the graph;
- **Edges** is a finite set of the *edges* of the graph;
Do we have to have a property that identifies each member of these sets?
Sorry, I am not yet getting it. Do you hint at something this definition has but it is not necessary ("do we have to...")? Or you suggest something that is not here that should be added? My guess, something like a function node_id: Nodes --> Ids ? That would be extraneous, in my opinion so far.
RFCs/0025-graph-data-model.md
Outdated
The inhabitants of the sets **Nodes** and **Edges** are understood abstractly; all that is known
about them is given by the functions **ends**, **labels**, and **pay**. Intuitively, one can think
of graph nodes and edges as uninterpreted identifiers, perhaps corresponding to some
implementation-specific memory locations.
Oh here is the notion of identity.
A possibly more succinct wording follows:
The inhabitants of the sets **Nodes** and **Edges** are understood abstractly; all that is known
about them is given by the functions **ends**, **labels**, and **pay**. Intuitively, one can think
of graph nodes and edges as uninterpreted identifiers, perhaps corresponding to some
implementation-specific memory locations.
The members of the sets **Nodes** and **Edges** are some enumeration of unique identifiers such that the structure of the graph is defined in the functions **ends**, **labels**, and **pay**.
How about the following:
The members of the sets Nodes and Edges are abstract entities used for defining the structure of the graph with the functions ends, labels, and payload.
I am reluctant to commit to nodes and edges being identifiers, but ok with someone having this as an intuition.
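The "abstract entities" reading can be sketched in Python as follows. This is only an illustration, not the RFC's definition: the class names, the `add_node`/`add_edge` helpers, and the shape of **ends** (a pair of nodes plus a direction flag) are assumptions made for the example. Nodes and edges are bare objects with no fields and no identifier; everything known about them lives in the three functions, modeled here as dictionaries.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, Set, Tuple


class Node:
    """Opaque node: no fields, no id; equality is object identity."""
    __slots__ = ()


class Edge:
    """Opaque edge, equally uninterpreted."""
    __slots__ = ()


@dataclass
class Graph:
    nodes: Set[Node] = field(default_factory=set)
    edges: Set[Edge] = field(default_factory=set)
    # Assumed shape: edge -> (first end, second end, is_directed).
    ends: Dict[Edge, Tuple[Node, Node, bool]] = field(default_factory=dict)
    # A node or edge may carry several labels, hence a set.
    labels: Dict[Any, FrozenSet[str]] = field(default_factory=dict)
    # The payload can be any value.
    payload: Dict[Any, Any] = field(default_factory=dict)

    def add_node(self, labels=(), payload=None) -> Node:
        n = Node()
        self.nodes.add(n)
        self.labels[n] = frozenset(labels)
        self.payload[n] = payload
        return n

    def add_edge(self, a, b, labels=(), payload=None, directed=True) -> Edge:
        e = Edge()
        self.edges.add(e)
        self.ends[e] = (a, b, directed)
        self.labels[e] = frozenset(labels)
        self.payload[e] = payload
        return e


g = Graph()
a = g.add_node({"Account"}, {"owner": "Ann"})
b = g.add_node({"Account"}, {"owner": "Bob"})
t = g.add_edge(a, b, {"Transfer"}, {"amount": 5})
```

Here `a` and `b` carry the same labels yet remain distinct members of `g.nodes`, without any `node_id` function being part of the model.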
Worked through the graph query doc.
<path_pattern> ::=
    <restrictor>? [ <path_variable> '=' ]? <path_part>+

<path_variable> ::= <<identifier>>

<path_part> ::=
      <node_pattern>
    | <edge_pattern>
    | <group_pattern>
Is this allowed (and should it be)?
g MATCH (v1)(v2)(v3)
Corollary is g MATCH -><-
(probably unintended)
Upon thinking about this after our discussion, I am on the fence. On one hand, the GPML paper does not have straightforward examples with adjacent nodes or edges, and it has only one example of a "bare edge" pattern (`MATCH −[e:Transfer WHERE e.amount>5M]−>` on Page 9). On the other hand, the evaluation process outlined in Section 6 involves syntactic manipulations to deal with these situations that could also be applied in general, as syntactic sugar. I think these two rules should cover it:
- When there are immediately adjacent edges, an anonymous node is inserted in between. (Also, a node is added at the beginning or the end of a path pattern if there isn't one there.) With this, `g MATCH -> <-` is understood as `g MATCH () -> () <- ()`.
- When there are immediately adjacent nodes, they are combined into one. With this, `g MATCH ()(v)()` is understood as `g MATCH (v)`.
Of course, the rule for the nodes would not, alone, deal with all the complexity that can come up inside edge patterns. Say, dealing with `g MATCH (v1)(v2)(v3)` would either have to be prohibited or make use of an ability (internal only?) to associate multiple variables with one node. Either way, this would be a post-parsing task, not to be expressed by the grammar.
The definitive answer would come from the SQL/PGQ spec -- I really can't guess which way they are choosing.
For now, I am inclined to leave things as they are, since the partiql-kotlin parser follows the above grammar and it is in the realm of the plausible, provided this note makes sense. I guess something like this note can appear in the "Alternatives Considered" section?
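As a rough illustration of the two rules as a post-parsing step, here is a minimal Python sketch. It is not what partiql-kotlin does: the `("node", variable)` / `("edge", direction)` tuple encoding and the `normalize` function are hypothetical, group patterns are ignored, and the sketch picks the "prohibit" option for adjacent nodes that carry different variables.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: ("node", variable or None) or ("edge", direction text).
Part = Tuple[str, Optional[str]]


def normalize(parts: List[Part]) -> List[Part]:
    out: List[Part] = []
    for kind, value in parts:
        if kind == "node":
            if out and out[-1][0] == "node":
                # Rule 2: adjacent nodes are combined into one.
                prev = out[-1][1]
                if prev is not None and value is not None and prev != value:
                    # Assumption: two different variables on one node are rejected.
                    raise ValueError(f"adjacent nodes bind both {prev} and {value}")
                out[-1] = ("node", prev if prev is not None else value)
            else:
                out.append(("node", value))
        elif kind == "edge":
            # Rule 1: an edge must be preceded by a node; add an anonymous one.
            if not out or out[-1][0] == "edge":
                out.append(("node", None))
            out.append(("edge", value))
        else:
            raise ValueError(f"unsupported pattern part: {kind}")
    # Rule 1, continued: a path pattern may not end on an edge.
    if out and out[-1][0] == "edge":
        out.append(("node", None))
    return out


bare_edges = normalize([("edge", "->"), ("edge", "<-")])
merged = normalize([("node", None), ("node", "v"), ("node", None)])
```

With this encoding, `bare_edges` corresponds to `() -> () <- ()` and `merged` to `(v)`, while the `(v1)(v2)(v3)` case raises an error instead of being silently accepted.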
- If `x` is a path variable, it does not produce an attribute in the struct,
according to this RFC.
Why do path variables not bind to an attribute? Are they purely internal to the MATCH?
Well, path variables are "internal" to a MATCH in the same sense as the others (node and edge variables) -- in the sense that, outside of MATCH, all their "graphiness" is stripped away and only payloads play a role.
But for that, for a path variable specifically, there is no clear choice for that "payload residual":
- Overall, this probably should be some list drawn from the elements (nodes and edges) of the path.
- But should it contain only the elements of the path named by node/edge variables or include anonymous elements as well?
- Should it be an independent list, or be a super-structure that incorporates the above bindings for nodes/elements?
- In the former case, do we want to represent the fact that some parts of the path's list are also appearing at other variables? How to do that reliably?
- In the latter case, this would be a hierarchical structure (due to nested paths), making access to nodes and edges not as uniform as in the proposal (lines 417-421).
- Or should the structure bound to the path contain only the anonymous elements, because the named ones are represented already?
While it could be possible to cobble together a "reasonable" solution based on some choices from the above, the choices would be essentially arbitrary and likely a misfit with what would be eventually coming in SQL/PGQ.
This was my reason for not proposing anything in this regard in this RFC. But perhaps a reference from here to the "To Be Resolved" section would be helpful (plus the section itself).
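The behavior the RFC text settles on (node and edge variables contribute their payloads, path variables contribute nothing) can be sketched in Python. The `binding_struct` function, the `(kind, element)` encoding of a match, and the string stand-ins for graph elements are all assumptions made for this illustration; the RFC does not define such an API.

```python
from typing import Any, Dict, Tuple


def binding_struct(match: Dict[str, Tuple[str, Any]],
                   payload: Dict[Any, Any]) -> Dict[str, Any]:
    """Build the struct seen outside MATCH for one match.

    Node and edge variables become attributes holding the element's payload;
    path variables are skipped, since no "payload residual" is defined for them.
    """
    return {
        var: payload[element]
        for var, (kind, element) in match.items()
        if kind in ("node", "edge")
    }


# Strings stand in for opaque graph elements in this example.
payload = {"n1": {"owner": "Ann"}, "e1": {"amount": 5}, "n2": {"owner": "Bob"}}
match = {
    "p": ("path", ["n1", "e1", "n2"]),  # path variable: omitted from the struct
    "a": ("node", "n1"),
    "t": ("edge", "e1"),
    "b": ("node", "n2"),
}
row = binding_struct(match, payload)
```

Any of the options listed above for a path's "payload residual" would amount to adding a `"path"` branch to this function, which is where the essentially arbitrary choices would have to be made.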
There is now a section on this in Unresolved Questions, which might also be a tad more coherent than the above comment.
Responded to all comments. Some are to be addressed in text changes (work in progress), but a few would benefit from further discussion/guidance.
Co-authored-by: Almann Goo <almann.goo@gmail.com>
- Capitalized section titles.
- Renamed a data model function from **pay** to **payload**.
- Removed an extraneous subsection level under Unresolved Questions.
Also, moved content of Drawbacks section from RFC-33 to RFC-25.
…ther than the intended ideal as well.
64973ff to 9d8fb00
Looks great--based on previous discussion, I am good with what we have here.
No changes w.r.t. the approved PR #33, so merging without the usual ceremony.
[Draft]
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.