RFC-0033: Graph Query #33

vgapeyev · 2022-10-25T16:35:54Z

[Draft]

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

almann

Reviewed the updated data model part.

almann · 2022-11-07T23:07:36Z

RFCs/0025-graph-data-model.md

-The following diagram illustrates the model:
+For the graph data type, we model something very similar. A graph is a collection of *vertices* and *edges*, 
+where edges connect pairs of vertices and can be directed or undirected.
+Vertices and edges have *labels* (similar to the attribute names in struct). 


I think we should have some clarifying text around what this actually means. I think it is different from the struct example as there is a 1-to-1 correspondence there between attribute names and values.

vgapeyev

A PartiQL struct does allow repeating attribute names (with different values at them), doesn't it? So, intuitively, these settings are about the same, in my view: neither a struct's attribute name nor a graph's label is a perfect "address". Of course, repeating an attribute in a struct seems to be a discouraged corner case, while repeating labels in a property graph is more like the norm.
Overall, I am not sure what can be said here without getting into the weeds...

Another wart here is that we (as well as GPML) allow multiple labels on a node/edge. I intentionally wrote line 68 to glide past this matter, since this is an intro/intuition text.
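
To make the comparison above concrete, here is a minimal Python sketch. It is an illustration only: the list-of-pairs encoding of a struct, the string element names, and the helpers `values_at` and `elements_with` are assumptions made for this example and do not come from either RFC.

```python
# A struct modelled as an ordered list of (name, value) pairs, so that an
# attribute name may repeat. (Hypothetical encoding, for illustration only.)
struct = [("a", 1), ("b", 2), ("a", 3)]


def values_at(struct, name):
    """Return every value stored under `name`: zero, one, or many."""
    return [value for attr, value in struct if attr == name]


# Graph labels modelled as a map from element to a set of labels: an element
# may carry several labels, and a label may be shared by many elements.
labels = {
    "n1": {"Person"},
    "n2": {"Person", "Admin"},
    "n3": {"Account"},
}


def elements_with(labels, label):
    """Return every element carrying `label`, in a stable (sorted) order."""
    return sorted(elem for elem, ls in labels.items() if label in ls)


print(values_at(struct, "a"))            # [1, 3]
print(elements_with(labels, "Person"))   # ['n1', 'n2']
```

In both cases the lookup yields a collection rather than a single item, which is the sense in which neither an attribute name nor a label works as an "address".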

almann · 2022-11-07T23:08:13Z

RFCs/0025-graph-data-model.md

+For the graph data type, we model something very similar. A graph is a collection of *vertices* and *edges*, 
+where edges connect pairs of vertices and can be directed or undirected.
+Vertices and edges have *labels* (similar to the attribute names in struct). 
+There is also a **value at** each vertex and edge, which can be any PartiQL value.


This is an interesting wording choice; I am interested to get your take on the distinction from the original text.

vgapeyev · 2022-11-08T20:46:17Z

In my language intuition, "value of x" is something that follows from what x is, from what is already known as x's intrinsic properties: "the value of this real estate", "the value of expression 5+6". In contrast, "value at x" would indicate that x and the value are independent things associated with each other by fiat.
Of these two concepts, the latter better fits the situation in this graph data model. Re which word choice captures it better, I'd accept a better judgment -- make your verdict!

Could a midway option, just with typesetting, value at vertex --> value at vertex, irk one less?

almann · 2022-11-07T23:16:07Z

RFCs/0025-graph-data-model.md

+- **Nodes** is a finite set of the *nodes* of the graph;
+- **Edges** is a finite set of the *edges* of the graph;


Do we have to have a property that identifies each member of these sets?

vgapeyev · 2022-11-08T20:57:46Z

Sorry, I am not yet getting it. Do you hint at something that this definition has but that is not necessary ("do we have to...")? Or do you suggest something that is not here and should be added? My guess: something like a function node_id: Nodes --> Ids? That would be extraneous, in my opinion so far.

almann · 2022-11-07T23:21:11Z

RFCs/0025-graph-data-model.md

+The inhabitants of the sets **Nodes** and **Edges** are understood abstractly; all that is known
+about them is given by the functions **ends**, **labels**, and **pay**.  Intuitively, one can think
+of graph nodes and edges as uninterpreted identifiers, perhaps corresponding to some
+implementation-specific memory locations.


Oh here is the notion of identity.

A possibly more succinct wording follows:

Suggested change

-The inhabitants of the sets **Nodes** and **Edges** are understood abstractly; all that is known
-about them is given by the functions **ends**, **labels**, and **pay**. Intuitively, one can think
-of graph nodes and edges as uninterpreted identifiers, perhaps corresponding to some
-implementation-specific memory locations.
+The members of the sets **Nodes** and **Edges** are some enumeration of unique identifiers such that the structure of the graph is defined in the functions **ends**, **labels**, and **pay**.

vgapeyev · 2022-11-08T21:09:17Z

How about the following:

The members of the sets Nodes and Edges are abstract entities used for defining the structure of the graph with the functions ends, labels, and payload.

I am reluctant to commit to nodes and edges being identifiers, but I am OK with someone having this as an intuition.
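
As a rough illustration of the "abstract entities" reading, the following Python sketch keeps nodes and edges opaque and puts all structure into three maps named after the functions ends, labels, and payload. The class names, the `(node, node, directed)` encoding of ends, and the `add_node` / `add_edge` helpers are assumptions of this sketch, not definitions from RFC-0025.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, Set, Tuple, Union


class Node:
    """Opaque node: no fields, equal only to itself (identity equality)."""
    __slots__ = ()


class Edge:
    """Opaque edge, likewise without any intrinsic data."""
    __slots__ = ()


Element = Union[Node, Edge]


@dataclass
class Graph:
    nodes: Set[Node] = field(default_factory=set)
    edges: Set[Edge] = field(default_factory=set)
    # ends: edge -> (first node, second node, directed?)  (hypothetical encoding)
    ends: Dict[Edge, Tuple[Node, Node, bool]] = field(default_factory=dict)
    # labels: element -> set of labels (several labels per element are allowed)
    labels: Dict[Element, FrozenSet[str]] = field(default_factory=dict)
    # payload: element -> arbitrary value, standing in for a PartiQL value
    payload: Dict[Element, Any] = field(default_factory=dict)

    def add_node(self, labels=(), payload=None) -> Node:
        n = Node()
        self.nodes.add(n)
        self.labels[n] = frozenset(labels)
        self.payload[n] = payload
        return n

    def add_edge(self, a: Node, b: Node, directed=True,
                 labels=(), payload=None) -> Edge:
        e = Edge()
        self.edges.add(e)
        self.ends[e] = (a, b, directed)
        self.labels[e] = frozenset(labels)
        self.payload[e] = payload
        return e


g = Graph()
alice = g.add_node(labels={"Person"}, payload={"name": "Alice"})
bob = g.add_node(labels={"Person", "Admin"}, payload={"name": "Bob"})
knows = g.add_edge(alice, bob, labels={"Knows"}, payload={"since": 2020})
```

Two nodes created with identical labels and payloads are still distinct members of `nodes`, which matches the intuition that everything known about an element comes from the three functions rather than from the element itself.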

almann

Worked through the graph query doc.

almann · 2022-11-07T23:47:56Z

RFCs/0033-graph-query.md

+<path_pattern> ::=
+    <restrictor>?  [ <path_variable> '=' ]? <path_part>+ 
+
+<path_variable> ::=  <<identifier>>
+
+<path_part> ::=
+      <node_pattern>
+    | <edge_pattern>
+    | <group_pattern>


Is this allowed (and should it be)?

g MATCH (v1)(v2)(v3)

Corollary is g MATCH -><- (probably unintended)

vgapeyev · 2022-11-08T23:16:14Z

Upon thinking about this after our discussion, I am on the fence. On one hand, the GPML paper does not have straightforward examples with adjacent nodes or edges, and it has only one example of a "bare edge" pattern (MATCH −[e:Transfer WHERE e.amount>5M]−> on Page 9). On the other hand, the evaluation process outlined in Section 6 involves syntactic manipulations to deal with these situations that could also be applied in general, as syntactic sugar. I think these two rules should cover it:

- When there are immediately adjacent edges, an anonymous node is inserted in between. (Also, a node is added at the beginning or the end of a path pattern if there isn't one there.)
  With this, g MATCH -> <- is understood as g MATCH () -> () <- ().

- When there are immediately adjacent nodes, they are combined into one.
  With this, g MATCH ()(v)() is understood as g MATCH (v).

Of course, the rule for the nodes would not, alone, deal with all the complexity that can come up inside edge patterns. Say, dealing with g MATCH (v1)(v2)(v3) would either have to be prohibited or make use of an ability (internal only?) to associate multiple variables with one node. Either way, this would be a post-parsing task, not to be expressed by the grammar.

The definitive answer would come from the SQL/PGQ spec -- I really can't guess which way they are choosing.

For now, I am inclined to leave things as they are, since the partiql-kotlin parser follows the above grammar and it is in the realm of the plausible, provided this note makes sense. I guess something like this note can appear in the "Alternatives Considered" section?
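
The two rewrite rules described above could be sketched as a post-parsing pass along the following lines. This is a minimal Python sketch under stated assumptions: the tuple encoding of path parts and the helpers `node`, `edge`, and `normalize` are hypothetical, group patterns are left out, and merged nodes simply keep all of their variable names (one of the two options mentioned above, the other being to prohibit such patterns).

```python
def node(*names):
    """A node pattern carrying zero or more variable names."""
    return ("node", tuple(names))


def edge(arrow):
    """An edge pattern, identified here only by its arrow text."""
    return ("edge", arrow)


def normalize(parts):
    """Apply both rules to a flat list of node/edge path parts."""
    out = []
    for kind, data in parts:
        if kind == "edge":
            # Rule 1: an edge must be preceded by a node; insert an
            # anonymous one at the start or between adjacent edges.
            if not out or out[-1][0] == "edge":
                out.append(node())
            out.append((kind, data))
        elif kind == "node":
            if out and out[-1][0] == "node":
                # Rule 2: adjacent nodes are combined into one node that
                # keeps every variable name seen so far.
                out[-1] = ("node", out[-1][1] + data)
            else:
                out.append((kind, data))
        else:
            raise ValueError(f"unsupported path part: {kind!r}")
    # Rule 1, continued: a path pattern must also end with a node.
    if out and out[-1][0] == "edge":
        out.append(node())
    return out


# g MATCH -> <-   becomes   g MATCH () -> () <- ()
print(normalize([edge("->"), edge("<-")]))
# g MATCH ()(v)() becomes   g MATCH (v)
print(normalize([node(), node("v"), node()]))
```

With this encoding, `(v1)(v2)(v3)` normalizes to a single node carrying the three names `v1`, `v2`, and `v3`; a stricter variant could raise an error instead when a merged node ends up with more than one name.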

almann · 2022-11-08T05:15:31Z

RFCs/0033-graph-query.md

+  - If `x` is a path variable, it does not produce an attribute in the struct,
+    according to this RFC.


Why do path variables not bind to an attribute? Are they purely internal to the MATCH?

vgapeyev · 2022-11-09T01:46:23Z

Well, path variables are "internal" to a MATCH in the same sense as the others (node and edge variables) -- in the sense that, outside of MATCH, all their "graphiness" is stripped away and only payloads play a role.
But for a path variable specifically, there is no clear choice for that "payload residual":

- Overall, this probably should be some list drawn from the elements (nodes and edges) of the path.
- But should it contain only the elements of the path named by node/edge variables, or include anonymous elements as well?
- Should it be an independent list, or a super-structure that incorporates the above bindings for nodes/elements?
  - In the former case, do we want to represent the fact that some parts of the path's list also appear at other variables? How to do that reliably?
  - In the latter case, this would be a hierarchical structure (due to nested paths), making access to nodes and edges not as uniform as in the proposal (lines 417-421).
- Or should the structure bound to the path contain only the anonymous elements, because the named ones are represented already?

While it could be possible to cobble together a "reasonable" solution based on some choices from the above, the choices would be essentially arbitrary and likely a misfit with what would eventually be coming in SQL/PGQ.

This was my reason for not proposing anything in this regard in this RFC. But perhaps a reference from here to the "To Be Resolved" section would be helpful (plus the section itself).

There is now a section on this in Unresolved Questions, which might also be a tad more coherent than the above comment.
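
For readers following this thread, the behaviour under discussion can be pictured with a small Python sketch: node and edge variables contribute their payloads to the result struct, while path variables contribute nothing. The function `bindings_to_struct`, the string element names, and the way a path match is represented as a list are assumptions of this sketch rather than anything specified in the RFC.

```python
def bindings_to_struct(match, payload, path_vars):
    """Build the struct seen outside MATCH from one pattern match.

    match     -- variable name -> matched element (or a list, for a path)
    payload   -- element -> payload value
    path_vars -- names of the path variables, which produce no attribute
    """
    return {
        var: payload[elem]
        for var, elem in match.items()
        if var not in path_vars
    }


# Hypothetical element names and payloads, for illustration only.
payload = {
    "n1": {"name": "Alice"},
    "e1": {"since": 2020},
    "n2": {"name": "Bob"},
}

# One match of a pattern shaped like  p = (a) -[e]-> (b)
match = {"a": "n1", "e": "e1", "b": "n2", "p": ["n1", "e1", "n2"]}

print(bindings_to_struct(match, payload, path_vars={"p"}))
# {'a': {'name': 'Alice'}, 'e': {'since': 2020}, 'b': {'name': 'Bob'}}
```

Any of the alternatives listed above would amount to choosing what value, if any, to emit for `p` here.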

vgapeyev

Responded to all comments. Some are to be addressed in text changes (work in progress), but a few would benefit from further discussion/guidance.

almann

Looks great--based on previous discussion, I am good with what we have here.

vgapeyev added 3 commits October 24, 2022 09:53

Minor editorial edits in graph data model RFC-0025.

d7a69cc

Adds more a formal exposition in the graph data model RFC-0025.

5e67b5b

Initial outline/draft of Graph Query RFC.

be6a9b2

vgapeyev changed the title ~~[draft] Rfc graph query~~ [draft] RFC-0033: Graph Query Oct 25, 2022

vgapeyev added 3 commits October 25, 2022 09:40

Added RFC number: 0033

faabceb

Grammar discussion and file rename to RFC number.

6ea5f3e

Section relating to PGML and SQL/PGQ.

0b55b20

almann added the RFC label Oct 27, 2022

vgapeyev added 3 commits November 3, 2022 18:59

Grammar and evaluation sections Query RFC mostly finished.

74a114d

Swap section order.

ce989cd

Tidy up.

aeb5254

vgapeyev changed the title ~~[draft] RFC-0033: Graph Query~~ RFC-0033: Graph Query Nov 4, 2022

A better citation.

106135b

almann reviewed Nov 7, 2022

almann requested changes Nov 8, 2022

vgapeyev commented Nov 9, 2022

vgapeyev and others added 7 commits November 8, 2022 17:51

Apply suggestions from code review

8716db6

Co-authored-by: Almann Goo <almann.goo@gmail.com>

Straightforward suggestions from the review.

dc0d0cb

- Capitalized section titles. - Renamed a data model function from **pay** to **payload**. - Removed an extraneous subsection level under Unresolved Questions. Also, moved content of Drawbacks section from RFC-33 to RFC-25.

A couple rewrites suggested in the review and more tweaks for clarity.

1f857aa

Defer support of graphical predicates to Unresolved Questions.

3ace887

Path bindings as another unresolved question.

46c15cf

Reduced the grammar section to describe the proposed grammar only, rather than the intended ideal as well.

0699656

Added an example of pattern-match computation.

9d8fb00

vgapeyev force-pushed the rfc-graph-query branch from 64973ff to 9d8fb00 Compare November 11, 2022 01:38

almann mentioned this pull request Nov 11, 2022

RFC-0025: Graph Data Model #25

Closed

almann approved these changes Nov 11, 2022

vgapeyev merged commit c343f4a into rfc-graph-model Nov 11, 2022

vgapeyev deleted the rfc-graph-query branch November 11, 2022 19:58

vgapeyev restored the rfc-graph-query branch November 11, 2022 20:00

vgapeyev mentioned this pull request Nov 11, 2022

Graph model and query RFCs #34

Merged

vgapeyev added a commit that referenced this pull request Nov 11, 2022

Merge pull request #34 from partiql/rfc-graph-model

3c6956e

No changes w.r.t. the approved PR #33, so merging without the usual ceremony.

vgapeyev deleted the rfc-graph-query branch November 11, 2022 20:17
