Create column lineage endpoint proposal #2077

julienledem · 2022-08-19T01:42:51Z

Problem

This is the proposal document for #2045

codecov · 2022-08-19T01:47:32Z

Codecov Report

Merging #2077 (071fdc1) into main (54d0e58) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main    #2077   +/-   ##
=========================================
  Coverage     75.14%   75.14%           
  Complexity     1023     1023           
=========================================
  Files           202      202           
  Lines          4836     4836           
  Branches        392      392           
=========================================
  Hits           3634     3634           
  Misses          762      762           
  Partials        440      440


Signed-off-by: Julien Le Dem <julien@apache.org>

proposals/2045-column-lineage-endpoint.md

wslulciuc · 2022-08-19T18:57:40Z

proposals/2045-column-lineage-endpoint.md

+### Use cases
+ - Find the current upstream dependencies of a column. A column in a dataset is derived from columns in upstream datasets.
+ - See column-level lineage in the dataset level lineage when available.
+ - Retrieve point-in-time upstream lineage for a dataset or a column. What did the lineage look like yesterday compared to today?


What did the lineage look like yesterday compared to today?

The compare use case needs a proposal of its own 😉

Fair enough. I'm hoping someone else can take over that part and go into the details.

I'm thinking of starting with just point-in-time upstream lineage and adding compare later.

I think this proposal should be limited to point-in-time within column-level lineage. We should leave out the compare feature, and also point-in-time for the lineage endpoint, which has nothing to do with column level.

wslulciuc · 2022-08-19T20:00:07Z

proposals/2045-column-lineage-endpoint.md

+    "fields": [{
+		...
+    }],
+>   columnLineage: {


Calls to GET /lineage return an array of nodes, meaning we'll want to add columnLineage to DatasetData. We can use the generated classes in the openlineage-java lib for column-level lineage (vs. maintaining our own), which I think we still need to generate, @julienledem? I don't see them in the javadocs for OpenLineage.

I think yes, we would reuse the columnLineage facet object. The OL javadoc needs to be updated; it is not an automated process at the moment.

wslulciuc · 2022-08-19T22:45:57Z

proposals/2045-column-lineage-endpoint.md

+### add a column-level-lineage endpoint:
+
+```
+GET /column-lineage?nodeId=dataset:food_delivery:public.delivery_7_days&column=a


Given that we have the following endpoint to query lineage:

GET /lineage?nodeId=<node-id>

I'm not sure there's much advantage to defining a separate endpoint for column-level lineage. Although a new endpoint would contextualize the API call, with proper documentation we can extend our current lineage endpoint to support columns:

GET /lineage?nodeId=<node-id>&column=<column>

If the query param column is present, the backend will assume the nodeId to be a dataset node ID and return an upstream lineage graph with only dataset-to-dataset relationships. That said, column-level lineage is an upstream query from the origin <node-id> (as defined in this proposal), whereas the /lineage call assumes both upstream and downstream lineage. To further contextualize the call, we should define (and I would prefer) an upstream-specific lineage endpoint:

GET /lineage/upstream?nodeId=<node-id>
GET /lineage/downstream?nodeId=<node-id> # Add for completeness

On the backend, these calls would be handled differently. When querying for upstream lineage, the graph returned would consist of only nodes upstream of <node-id>; similarly, for downstream lineage, only nodes downstream.

LineageService.upstreamOf(NodeID)
LineageService.downstreamOf(NodeID)

You can then recursively follow the in edges to traverse the upstream graph consisting of job-to-dataset relationships:

{
  .
  .
  "inEdges": [{
    "origin": "job:{namespace}:{job}",
    "destination": "dataset:{namespace}:{dataset}"
  }],
  "outEdges": [{
    "origin": "job:{namespace}:{job}",
    "destination": "dataset:{namespace}:{dataset}"
  }]
}
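As a rough illustration of following in edges recursively, here is a minimal Python sketch. It is not the Marquez implementation: the `upstream_of` function and the in-memory `graph` mapping (node ID to node payload) are hypothetical, and it assumes that each in edge's `origin` is the upstream neighbour of the node holding the edge.

```python
from typing import Dict, List, Set

# Hypothetical in-memory form of a lineage response: node ID -> node payload.
# Only the "inEdges" entries with "origin"/"destination" mirror the example above.
Graph = Dict[str, dict]


def upstream_of(graph: Graph, node_id: str) -> Set[str]:
    """Collect every node ID reachable from node_id by following inEdges."""
    seen: Set[str] = set()
    stack: List[str] = [node_id]
    while stack:
        current = stack.pop()
        # A node missing from the graph simply has no in edges to follow.
        for edge in graph.get(current, {}).get("inEdges", []):
            parent = edge["origin"]  # assumption: origin is the upstream side
            if parent != node_id and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


graph: Graph = {
    "dataset:ns:c": {"inEdges": [{"origin": "job:ns:j2", "destination": "dataset:ns:c"}]},
    "job:ns:j2": {"inEdges": [{"origin": "dataset:ns:b", "destination": "job:ns:j2"}]},
    "dataset:ns:b": {"inEdges": [{"origin": "job:ns:j1", "destination": "dataset:ns:b"}]},
    "job:ns:j1": {"inEdges": []},
}
print(sorted(upstream_of(graph, "dataset:ns:c")))
```

The traversal only works when the job nodes are part of the returned graph, which is the property the rest of this comment is concerned with.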

For column-level lineage, the in / out edges of nodes in the upstream lineage graph would contain both job and dataset node IDs, though only dataset nodes would be present. This means the in / out edges would still be job-to-dataset relationships, but you could no longer recursively follow the in edges as before, because the job nodes aren't present in the returned upstream graph. And though the dataset node contains column-level lineage metadata via columnLineage, the in / out edges of the node don't feel consistent.

By consistent, I mean that the backend can better represent the dataset-to-dataset relationship (or rather, the dataset-column-to-dataset-column relationship) on a given dataset for a particular column by defining the following node ID:

dataset:{namespace}:{dataset}#{field}

Note: As an alternative, we can use datasetField:{namespace}:{dataset}:{field}.
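To make the two candidate node ID formats concrete, here is a small Python sketch that parses either one. The `FieldRef` type and `parse_field_node_id` function are hypothetical names for illustration, and the sketch assumes that namespace, dataset, and field contain no `:` or `#` characters; real namespaces may contain colons, so an actual parser would need its own escaping or splitting rules.

```python
from typing import NamedTuple


class FieldRef(NamedTuple):
    namespace: str
    dataset: str
    field: str


def parse_field_node_id(node_id: str) -> FieldRef:
    """Parse dataset:{ns}:{ds}#{field} or datasetField:{ns}:{ds}:{field}."""
    if node_id.startswith("datasetField:"):
        parts = node_id.split(":")
        if len(parts) != 4:
            raise ValueError(f"malformed datasetField node ID: {node_id!r}")
        _, namespace, dataset, field = parts
    elif node_id.startswith("dataset:") and "#" in node_id:
        # Split the field off first, then the remaining dataset node ID.
        head, field = node_id.rsplit("#", 1)
        parts = head.split(":")
        if len(parts) != 3:
            raise ValueError(f"malformed dataset node ID: {node_id!r}")
        _, namespace, dataset = parts
    else:
        raise ValueError(f"not a dataset field node ID: {node_id!r}")
    if not (namespace and dataset and field):
        raise ValueError(f"empty component in node ID: {node_id!r}")
    return FieldRef(namespace, dataset, field)


print(parse_field_node_id("dataset:my-namespace:my-dataset#my-field"))
print(parse_field_node_id("datasetField:my-namespace:my-dataset:my-field"))
```

Both forms carry the same three components, so the choice between them is mostly about how the ID is parsed and displayed.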

For example, with the node ID defined, an upstream lineage call would now be:

GET /lineage/upstream?nodeId=dataset:my-namespace:my-dataset#my-field

{
  .
  .
  "inEdges": [{
    "origin": "dataset:my-namespace:my-dataset#my-field",
    "destination": "dataset:my-namespace:my-other-dataset#my-other-field"
  }],
  "outEdges": [{
    "origin": "dataset:my-namespace:my-other-dataset#my-other-field",
    "destination": "dataset:my-namespace:some-other-dataset#some-other-field"
  }]
}

I'm proposing a different endpoint for /column-lineage because the payload would be different, containing only datasets. I was considering that the columnLineage facet was already providing edges and that the inEdges and outEdges fields of the lineage graph became unnecessary.

To me, /upstream and /downstream are not endpoints; they are more of a filter on the lineage than a different result.

I was considering that the columnLineage facet was already providing edges and that the inEdges and outEdges fields of the lineage graph became unnecessary

I would then change the payload from a graph consisting of nodes (with in / out edges that aren't really relevant) to an array of dataset objects without in / out edges, as much of the metadata that is relevant for lineage wouldn't apply here.

My thinking is this: the lineage call returns a set of nodes, but doesn't specify whether they all have to be datasets or all have to be jobs. It's generic in that way. What matters are the node IDs and that the origin and destination point to a node in the returned node set. Adding a new node type datasetField would fulfill the API contract. But, like you said, whether a query is upstream or downstream can be a backend implementation detail based on the node type:

GET /lineage?nodeId=datasetField:{namespace}:{dataset}:{field}

Basically, I think column-level lineage should still be represented via a graph data structure. If we will only be using columnLineage to establish relationships between datasets, then it's less of a graph and more of a list of objects that are assumed to be connected.
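The contract described above (every edge's origin and destination must point to a node in the returned set) can be checked mechanically. The following Python sketch is only an illustration: `dangling_edges` is a hypothetical helper, and it assumes each node is a dict with an `id` plus optional `inEdges` / `outEdges` lists.

```python
from typing import List


def dangling_edges(nodes: List[dict]) -> List[dict]:
    """Return every edge whose origin or destination is not in the node set."""
    ids = {node["id"] for node in nodes}
    dangling = []
    for node in nodes:
        for key in ("inEdges", "outEdges"):
            for edge in node.get(key, []):
                if edge["origin"] not in ids or edge["destination"] not in ids:
                    dangling.append(edge)
    return dangling


# Graph of datasetField nodes only: every edge endpoint is present.
field_graph = [
    {"id": "datasetField:ns:a:x", "inEdges": [],
     "outEdges": [{"origin": "datasetField:ns:a:x", "destination": "datasetField:ns:b:x"}]},
    {"id": "datasetField:ns:b:x",
     "inEdges": [{"origin": "datasetField:ns:a:x", "destination": "datasetField:ns:b:x"}],
     "outEdges": []},
]

# Dataset-only graph whose edges still reference a job that was not returned.
dataset_only = [
    {"id": "dataset:ns:b",
     "inEdges": [{"origin": "job:ns:j1", "destination": "dataset:ns:b"}],
     "outEdges": []},
]

print(dangling_edges(field_graph))
print(dangling_edges(dataset_only))
```

The second case is the inconsistency raised earlier in the thread: a dataset-only payload keeps job IDs in its edges, so the edges cannot be followed within the returned node set.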

Oh man, it's a great discussion, although it took me ten reads to understand what you are talking about.

I tried to combine Julien's initial idea with Willy's feedback.

Some key design decisions:

 - The existing lineage endpoint will be enriched with column lineage as-is (column lineage facet included within the dataset).
 - The new column-lineage endpoint will return a column lineage graph with edges between dataset fields. It will reuse the existing Graph data structure with a new dataset_field node type.
 - Jobs won't be included in the graph, as a single job may have tons of edges to the fields. Edges will connect dataset_field nodes directly.

Other:

 - The depth of the returned graph can be controlled by a URL parameter.
 - Downstream lineage will be turned off by default and can be turned on when requested.
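A depth-limited traversal with downstream lineage off by default could look roughly like the Python sketch below. Everything here is an assumption made for illustration: the `column_lineage` function, its `depth` and `with_downstream` parameters, the default depth value, and the `upstream` mapping (field node ID to the field node IDs it is derived from) are not taken from the proposal or from Marquez.

```python
from collections import deque
from typing import Dict, List, Set


def column_lineage(
    upstream: Dict[str, List[str]],
    start: str,
    depth: int = 20,  # hypothetical default
    with_downstream: bool = False,
) -> Set[str]:
    """Return field node IDs within `depth` hops of `start`."""
    # Invert the upstream mapping to get downstream neighbours.
    downstream: Dict[str, List[str]] = {}
    for child, parents in upstream.items():
        for parent in parents:
            downstream.setdefault(parent, []).append(child)

    def walk(neighbours: Dict[str, List[str]]) -> Set[str]:
        found: Set[str] = set()
        queue = deque([(start, 0)])
        while queue:
            node, dist = queue.popleft()
            if dist == depth:
                continue  # do not expand past the requested depth
            for nxt in neighbours.get(node, []):
                if nxt != start and nxt not in found:
                    found.add(nxt)
                    queue.append((nxt, dist + 1))
        return found

    result = walk(upstream)
    if with_downstream:
        result |= walk(downstream)
    return result


upstream = {
    "datasetField:ns:c:x": ["datasetField:ns:b:x"],
    "datasetField:ns:b:x": ["datasetField:ns:a:x"],
    "datasetField:ns:d:x": ["datasetField:ns:b:x"],
}
print(sorted(column_lineage(upstream, "datasetField:ns:c:x", depth=1)))
print(sorted(column_lineage(upstream, "datasetField:ns:b:x", depth=1, with_downstream=True)))
```

Because edges connect field nodes directly, the walk never has to step through job nodes, which keeps the depth parameter meaningful in terms of dataset hops.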

Thanks for the update @pawel-big-lebowski. This looks good to me. I left a minor comment below.

proposals/2045-column-lineage-endpoint.md

julienledem

This looks good to me, just a minor comment on the column-lineage payload

julienledem · 2022-08-30T21:26:52Z

proposals/2045-column-lineage-endpoint.md

+         "inEdges": [
+          {
+             "origin": "datasetField:db1:table1:a",
+             "destination": "datasetField:DBA:tableA:columnA"
+          },
+          {
+             "origin": "datasetField:db1:table1:a",
+             "destination": "datasetField:DBB:tableB:columnB"
+          },
+          {
+             "origin": "datasetField:db1:table1:a",
+             "destination": "datasetField:DBB:tableB:columnC"
+          }
+         ],


This is redundant with inputFields. The namespace and name in inputFields always match origin in inEdges. Maybe we combine those?

I thought it might be useful to present inputFields in the web UI. If so, it is beneficial to keep this information redundant, to avoid parsing edges into inputFields.

julienledem · 2022-08-30T21:32:34Z

I can't approve my own PR, but I approve @pawel-big-lebowski 's changes to it :)

proposals/2045-column-lineage-endpoint.md

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

julienledem · 2022-09-10T00:52:37Z

Are we ready to merge this? Did you want to add something?

Create column lineage endpoint proposal

bd9cea9

add more details

258f38c

Signed-off-by: Julien Le Dem <julien@apache.org>

julienledem mentioned this pull request Aug 19, 2022

Add column level lineage endpoint proposal #2045

Closed

wslulciuc reviewed Aug 19, 2022


pawel-big-lebowski reviewed Aug 25, 2022


boring-cyborg bot added docs proposal labels Aug 30, 2022

pawel-big-lebowski force-pushed the column-lineage-proposal branch from d27e259 to 6278ed0 Compare August 30, 2022 08:40

mzareba382 mentioned this pull request Aug 30, 2022

Model and store column lineage in Marquez DB #2096

Merged

11 tasks

julienledem commented Aug 30, 2022


pawel-big-lebowski force-pushed the column-lineage-proposal branch from 6278ed0 to 57b00d0 Compare September 5, 2022 07:00

pawel-big-lebowski self-assigned this Sep 6, 2022

mobuchowski reviewed Sep 8, 2022


pawel-big-lebowski force-pushed the column-lineage-proposal branch from 57b00d0 to 9d91962 Compare September 9, 2022 12:10

extend existing proposal with comment suggestion

c4fdaea

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

pawel-big-lebowski force-pushed the column-lineage-proposal branch from 9d91962 to c4fdaea Compare September 9, 2022 12:19

pawel-big-lebowski requested a review from mobuchowski September 9, 2022 12:25

mobuchowski approved these changes Sep 9, 2022


Merge branch 'main' into column-lineage-proposal

071fdc1

pawel-big-lebowski merged commit 00226b2 into main Sep 12, 2022

pawel-big-lebowski deleted the column-lineage-proposal branch September 12, 2022 05:57
