RFC-0033: Graph Query #33

Merged
vgapeyev merged 17 commits into rfc-graph-model from rfc-graph-query
Nov 11, 2022

vgapeyev
Contributor

[Draft]

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@vgapeyev vgapeyev changed the title [draft] Rfc graph query [draft] RFC-0033: Graph Query Oct 25, 2022
@almann almann added the RFC label Oct 27, 2022
@vgapeyev vgapeyev changed the title [draft] RFC-0033: Graph Query RFC-0033: Graph Query Nov 4, 2022
Contributor

@almann almann left a comment

Reviewed the updated data model part.

The following diagram illustrates the model:
For the graph data type, we model something very similar. A graph is a collection of *vertices* and *edges*,
where edges connect pairs of vertices and can be directed or undirected.
Vertices and edges have *labels* (similar to the attribute names in struct).
Contributor

I think we should have some clarifying text around what this actually means. I think it is different from the struct example as there is a 1-to-1 correspondence there between attribute names and values.

Contributor Author

A PartiQL struct does allow repeating attribute names (with different values at them), doesn't it? So, intuitively, these settings are about the same, in my view: neither a struct's attribute name nor a graph's label is a perfect "address". Of course, repeating an attribute in a struct seems to be a discouraged corner case, while repeating labels in a property graph is more like the norm.
Overall, I am not sure what can be said here without getting into the weeds...

Another wart here is that we (as well as GPML) allow multiple labels on a node/edge. I intentionally wrote line 68 to glide past this matter, since this is an intro/intuition text.

For the graph data type, we model something very similar. A graph is a collection of *vertices* and *edges*,
where edges connect pairs of vertices and can be directed or undirected.
Vertices and edges have *labels* (similar to the attribute names in struct).
There is also a **value at** each vertex and edge, which can be any PartiQL value.
Contributor

This is an interesting wording choice, I am interested to get your take on the distinction from the original text.

Contributor Author

In my language intuition, "value of x" is something that follows from what x is, from what is already known as x's intrinsic properties: "the value of this real estate", "the value of expression 5+6". In contrast, "value at x" would indicate that x and the value are independent things associated with each other by fiat.
Of these two concepts, the latter better fits the situation in this graph data model. Re which word choice captures it better, I'd accept a better judgment -- make your verdict!

Could a midway option, just with typesetting, value at vertex --> value at vertex, irk one less?

Comment on lines +125 to +126
- **Nodes** is a finite set of the *nodes* of the graph;
- **Edges** is a finite set of the *edges* of the graph;
Contributor

Do we have to have a property that identifies each member of these sets?

Contributor Author

@vgapeyev vgapeyev Nov 8, 2022

Sorry, I am not yet getting it. Are you hinting at something this definition has that is not necessary ("do we have to...")? Or are you suggesting something that is not here and should be added? My guess is something like a function node_id: Nodes --> Ids. That would be extraneous, in my opinion so far.

Comment on lines 136 to 139
The inhabitants of the sets **Nodes** and **Edges** are understood abstractly; all that is known
about them is given by the functions **ends**, **labels**, and **pay**. Intuitively, one can think
of graph nodes and edges as uninterpreted identifiers, perhaps corresponding to some
implementation-specific memory locations.
Contributor

Oh here is the notion of identity.

A possibly more succinct wording follows:

Suggested change
The members of the sets **Nodes** and **Edges** are some enumeration of unique identifiers such that the structure of the graph is defined in the functions **ends**, **labels**, and **pay**.

Contributor Author

@vgapeyev vgapeyev Nov 8, 2022

How about the following:

The members of the sets Nodes and Edges are abstract entities used for defining the structure of the graph with the functions ends, labels, and payload.

I am reluctant to commit to nodes and edges being identifiers, but ok with someone having this as an intuition.
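
For readers following the thread, the data model under discussion can be sketched roughly as below. This is only an illustration, not the RFC's definition: the `Element` and `Graph` classes, the `add_node`/`add_edge` helpers, and the exact shape of `ends` (a pair of endpoints plus a directed flag) are assumptions made for this example. Nodes and edges are opaque objects compared by identity, which keeps them "abstract entities" rather than committing to identifiers.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, Set, Tuple


class Element:
    """Hypothetical opaque node or edge: no data of its own, compared by identity."""
    __slots__ = ()


@dataclass
class Graph:
    nodes: Set[Element] = field(default_factory=set)
    edges: Set[Element] = field(default_factory=set)
    # Assumed shape for this sketch: edge -> (endpoint, endpoint, directed?)
    ends: Dict[Element, Tuple[Element, Element, bool]] = field(default_factory=dict)
    # An element may carry several labels, so a set is stored for each one.
    labels: Dict[Element, FrozenSet[str]] = field(default_factory=dict)
    # The "value at" an element; any value is allowed.
    payload: Dict[Element, Any] = field(default_factory=dict)

    def add_node(self, labels=(), payload=None) -> Element:
        node = Element()
        self.nodes.add(node)
        self.labels[node] = frozenset(labels)
        self.payload[node] = payload
        return node

    def add_edge(self, a, b, labels=(), payload=None, directed=True) -> Element:
        if a not in self.nodes or b not in self.nodes:
            raise ValueError("both endpoints must be nodes of this graph")
        edge = Element()
        self.edges.add(edge)
        self.ends[edge] = (a, b, directed)
        self.labels[edge] = frozenset(labels)
        self.payload[edge] = payload
        return edge
```

Two nodes created with identical labels and payloads remain distinct members of `nodes`, which matches the point that neither a label nor a payload serves as an "address" for an element.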

Contributor

@almann almann left a comment

Worked through the graph query doc.

Comment on lines +186 to +194
<path_pattern> ::=
    <restrictor>? [ <path_variable> '=' ]? <path_part>+

<path_variable> ::= <<identifier>>

<path_part> ::=
      <node_pattern>
    | <edge_pattern>
    | <group_pattern>
Contributor

Is this allowed (and should it be)?

g MATCH (v1)(v2)(v3)

Contributor

Corollary is g MATCH -><- (probably unintended)

Contributor Author

Upon thinking about this after our discussion, I am on the fence. On one hand, the GPML paper does not have straightforward examples with adjacent nodes or edges, and it has only one example of a "bare edge" pattern (MATCH −[e:Transfer WHERE e.amount>5M]−> on Page 9). On the other hand, the evaluation process outlined in Section 6 involves syntactic manipulations to deal with these situations that could also be applied in general, as syntactic sugar. I think these two rules should cover it:

  • When there are immediately adjacent edges, an anonymous node is inserted in between. (Also, a node is added in the beginning or the end of a path pattern if there isn't one there.)
    With this, g MATCH -> <- is understood as g MATCH () -> () <- ().

  • When there are immediately adjacent nodes, they are combined into one.
    With this, g MATCH ()(v)() is understood as g MATCH (v).

Of course, the rule for the nodes would not, alone, deal with all the complexity that can come up inside edge patterns. Say, g MATCH (v1)(v2)(v3) would either have to be prohibited or make use of an ability (internal only?) to associate multiple variables with one node. Either way, this would be a post-parsing task, not to be expressed by the grammar.
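
The two rules above could be applied as a post-parsing pass along the following lines. This is a minimal sketch under stated assumptions, not code from partiql-kotlin or from the GPML paper: the `Node` and `Edge` classes and the `normalize` function are hypothetical, group patterns are left out, and the sketch takes the "prohibit" option for adjacent nodes that carry different variables.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Node:
    var: Optional[str] = None  # None means an anonymous node, written ()


@dataclass(frozen=True)
class Edge:
    direction: str  # "->", "<-", or "-"
    var: Optional[str] = None


Part = Union[Node, Edge]


def normalize(parts: List[Part]) -> List[Part]:
    """Rewrite a flat path pattern so that nodes and edges strictly alternate."""
    out: List[Part] = []
    for part in parts:
        prev = out[-1] if out else None
        if isinstance(part, Node):
            if isinstance(prev, Node):
                # Rule 2: immediately adjacent nodes are combined into one.
                if prev.var and part.var and prev.var != part.var:
                    # Assumption: this sketch prohibits (v1)(v2) outright.
                    raise ValueError(
                        f"adjacent nodes bind different variables: "
                        f"{prev.var}, {part.var}"
                    )
                out[-1] = Node(prev.var or part.var)
            else:
                out.append(part)
        else:
            # Rule 1: an edge must be preceded by a node; insert an
            # anonymous one at the start or between adjacent edges.
            if not isinstance(prev, Node):
                out.append(Node())
            out.append(part)
    # Rule 1, continued: a pattern that ends with an edge gets a closing node.
    if out and isinstance(out[-1], Edge):
        out.append(Node())
    return out
```

With this sketch, `normalize([Edge("->"), Edge("<-")])` yields the five-part pattern corresponding to `() -> () <- ()`, and `normalize([Node(), Node("v"), Node()])` yields the single node `(v)`.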

The definitive answer would come from the SQL/PGQ spec -- I really can't guess which way they are choosing.

For now, I am inclined to leave things as they are, since the partiql-kotlin parser follows the above grammar and it is in the realm of the plausible, provided this note makes sense. I guess something like this note can appear in the "Alternatives Considered" section?

Comment on lines +423 to +424
- If `x` is a path variable, it does not produce an attribute in the struct,
according to this RFC.
Contributor

Why do path variables not bind to an attribute? Are they purely internal to the MATCH?

Contributor Author

Well, pattern variables are "internal" to a MATCH in the same sense as others (node and edge variables) -- in the sense that, outside of the MATCH, all their "graphiness" is stripped away and only payloads play a role.
But for a path variable specifically, there is no clear choice for that "payload residual":

  • Overall, this probably should be some list drawn from the elements (nodes and edges) of the path.
  • But should it contain only the elements of the path named by node/edge variables or include anonymous elements as well?
  • Should it be an independent list, or be a super-structure that incorporates the above bindings for nodes/elements?
    • In the former case, do we want to represent the fact that some parts of the path's list are also appearing at other variables? How to do that reliably?
    • In the latter case, this would be a hierarchical structure (due to nested paths), making access to nodes and edges not as uniform as in the proposal (lines 417-421).
  • Or should the structure bound to the path contain only the anonymous elements, because the named ones are represented already?

While it could be possible to cobble together a "reasonable" solution based on some choices from the above, the choices would be essentially arbitrary and likely a misfit with what would be eventually coming in SQL/PGQ.

This was my reason for not proposing anything in this regard in this RFC. But perhaps a reference from here to the "To Be Resolved" section would be helpful (plus the section itself).
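
To make the current behavior concrete, here is a rough sketch of how one match could be turned into a struct. It assumes, based only on the comments in this thread, that node and edge variables bind to the payloads of the matched elements and that path variables are dropped; the `match_to_struct` function and the `(kind, payload)` encoding of bindings are hypothetical and not taken from the RFC text.

```python
from typing import Any, Dict, Tuple

# Hypothetical encoding of one match: variable -> (kind, payload),
# where kind is "node", "edge", or "path".
Bindings = Dict[str, Tuple[str, Any]]


def match_to_struct(bindings: Bindings) -> Dict[str, Any]:
    """Build the struct seen outside MATCH from one match's bindings."""
    struct: Dict[str, Any] = {}
    for var, (kind, payload) in bindings.items():
        if kind == "path":
            # Path variables produce no attribute; what they should
            # bind to is left as an unresolved question.
            continue
        # Node and edge variables keep only the payload of the element.
        struct[var] = payload
    return struct
```

For example, a match binding node `a`, edge `t`, and path `p` would produce a struct with attributes `a` and `t` only.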

Contributor Author

There is now a section on this in Unresolved Questions, which might also be a tad more coherent than the above comment.

Contributor Author

@vgapeyev vgapeyev left a comment

Responded to all comments. Some are to be addressed in text changes (work in progress), but a few would benefit from further discussion/guidance.

vgapeyev and others added 7 commits November 8, 2022 17:51
Co-authored-by: Almann Goo <almann.goo@gmail.com>
- Capitalized section titles.
- Renamed a data model function from **pay** to **payload**.
- Removed an extraneous subsection level under Unresolved Questions.

Also, moved content of Drawbacks section from RFC-33 to RFC-25.
Contributor

@almann almann left a comment

Looks great--based on previous discussion, I am good with what we have here.

@vgapeyev vgapeyev merged commit c343f4a into rfc-graph-model Nov 11, 2022
@vgapeyev vgapeyev deleted the rfc-graph-query branch November 11, 2022 19:58
@vgapeyev vgapeyev restored the rfc-graph-query branch November 11, 2022 20:00
vgapeyev added a commit that referenced this pull request Nov 11, 2022
No changes w.r.t. the approved PR #33, so merging without the usual ceremony.
@vgapeyev vgapeyev deleted the rfc-graph-query branch November 11, 2022 20:17