Support streaming jobs in Marquez #2682
Conversation
Codecov Report

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2682      +/-   ##
============================================
+ Coverage     84.15%   84.24%    +0.09%
- Complexity     1390     1405       +15
============================================
  Files           249      249
  Lines          6322     6371       +49
  Branches        286      291        +5
============================================
+ Hits           5320     5367       +47
- Misses          850      851        +1
- Partials        152      153        +1

View full report in Codecov by Sentry.
force-pushed from d7166cc to 17810f2
force-pushed from 8e55ff4 to b4590d0
force-pushed from 0f4c077 to 6008834
force-pushed from 8efc020 to 2ff9048
force-pushed from 17810f2 to c00b988
force-pushed from e90d7ec to 9d2eea8
*
* <p>In this case, a job version is created based on the list of input and output datasets
* referenced by this job. If a job starts with inputs:{A,B} and outputs:{C}, new job version is
* created immediately at job start. If a following event produces inputs:{A}, outputs:{C}, then
If a job has {A,B} as initial inputs, then only {A}, shouldn't that be a new job version? Given that all inputs/outputs are expected when a job run has started, we should create a job version anytime the inputs or outputs change and associate the run with the new job version.
The existing algorithm to compute the version of a job relies on all the input & output datasets for the particular run. We should not modify it, as this would cause a new job version for all existing jobs, but we could create a separate way to evaluate the version of a streaming job if we wanted.

However, when looking into the approach, I found it useful. It's a cumulative approach, where a new job version is created if a new input/output dataset becomes involved in processing. If some dataset was included in past events but is no longer present, the version does not change.

The benefit of this assumption is that we don't require the producer to emit all the datasets all the time. If you emit the amount of bytes written into an output dataset, without including the input datasets in the event, it doesn't mean there is a new job version without the inputs.

Is it OK?
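For illustration, here is a minimal, hypothetical sketch of that cumulative rule (not the actual Marquez code): the version changes only when a previously unseen input or output dataset shows up, and datasets absent from an event are simply ignored.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch of the cumulative versioning rule discussed above.
// Dataset identifiers are simplified to strings ("namespace:name").
class StreamingJobVersioner {
  private final Set<String> seenDatasets = new HashSet<>();
  private UUID currentVersion;

  /** Returns the job version implied by an incoming event's datasets. */
  UUID versionFor(Set<String> inputs, Set<String> outputs) {
    boolean newDatasetSeen = false;
    for (String ds : inputs) newDatasetSeen |= seenDatasets.add(ds);
    for (String ds : outputs) newDatasetSeen |= seenDatasets.add(ds);
    // A dataset missing from this event does NOT change the version;
    // only a previously unseen dataset triggers a new one.
    if (currentVersion == null || newDatasetSeen) {
      currentVersion = UUID.randomUUID();
    }
    return currentVersion;
  }
}
```

Under this sketch, an event with inputs:{A,B}/outputs:{C} followed by one with inputs:{A}/outputs:{C} keeps the same version, while a later event introducing a dataset D would produce a new one.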
I do feel that if a stream is removed, you'll want to remove that edge from the current lineage graph. But I also understand the limitations here, as you mentioned, with bytes written to an output dataset present with no input datasets specified; that said, and after thinking it through, this may be a noop and the logic you have is reasonable.

For example, let's say we have a streaming job X with inputs {A,B} and outputs {C}. The job runs and is marked as RUNNING with the run ID 74f117af-eb90-4ffd-98e1-c1bc215934df. To change the inputs from {A,B} to {B} (or similarly for the outputs), the user will have to redeploy the job (possibly with new code) and therefore be associated with a new run ID. So, what I think you have is logically correct given how streaming jobs are deployed. For batch jobs, versioning is more straightforward, as we version from run to run.
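To ground this example, below is a hedged sketch of the START event such a deployment might emit, built with the openlineage-java client. The producer URI, namespace, and integration value are made up for illustration, and the builder calls assume a client version that already ships the JobTypeJobFacet.

```java
import io.openlineage.client.OpenLineage;
import java.net.URI;
import java.time.ZonedDateTime;
import java.util.List;
import java.util.UUID;

public class StreamingStartEventExample {
  public static void main(String[] args) {
    OpenLineage ol = new OpenLineage(URI.create("https://example.com/producer"));

    // Streaming job X reading {A,B} and writing {C}; all names are illustrative.
    OpenLineage.RunEvent start =
        ol.newRunEventBuilder()
            .eventType(OpenLineage.RunEvent.EventType.START)
            .eventTime(ZonedDateTime.now())
            .run(ol.newRunBuilder()
                .runId(UUID.fromString("74f117af-eb90-4ffd-98e1-c1bc215934df"))
                .build())
            .job(ol.newJobBuilder()
                .namespace("example")
                .name("X")
                .facets(ol.newJobFacetsBuilder()
                    // Marks the job as streaming so it can be versioned at start.
                    .jobType(ol.newJobTypeJobFacetBuilder()
                        .processingType("STREAMING")
                        .integration("FLINK")
                        .jobType("JOB")
                        .build())
                    .build())
                .build())
            .inputs(List.of(
                ol.newInputDatasetBuilder().namespace("example").name("A").build(),
                ol.newInputDatasetBuilder().namespace("example").name("B").build()))
            .outputs(List.of(
                ol.newOutputDatasetBuilder().namespace("example").name("C").build()))
            .build();
  }
}
```

Redeploying with inputs {B} would emit a new START event with a fresh runId, which is what makes the per-deployment versioning described above hold together.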
We finally have support for streaming jobs! 💯 💯 💯
force-pushed from c00b988 to a8cdbe0
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
force-pushed from 9d2eea8 to 93922db
✅ Deploy Preview for peppy-sprite-186812 canceled.
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
force-pushed from 93922db to a4e6fe7
Problem

Currently, a job_version is created once the job finishes. This assumption no longer holds for streaming jobs, which can run for days or weeks while continuously writing output datasets. In this case, it makes sense to create the job version and identify the input/output datasets at the beginning of the job, and to modify them on the fly if the job changes.

In particular, the lineage endpoint shall include streaming jobs in the lineage graph while they're running, and datasets being written by streaming jobs should be available (and have their dataset version modified) once the job is started, as sketched below.
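To make the intended behavior concrete, here is a minimal sketch of the new decision point; the names are hypothetical and not Marquez's actual implementation, with the processing type taken from the JobTypeJobFacet.

```java
// Hypothetical sketch: when a job version gets created, by processing type.
// "STREAMING" comes from the JobTypeJobFacet; everything else is illustrative.
enum EventType { START, RUNNING, COMPLETE }

class JobVersionPolicy {
  boolean shouldCreateOrUpdateVersion(String processingType, EventType eventType) {
    if ("STREAMING".equals(processingType)) {
      // Streaming: version at START and while the job runs, so the lineage
      // graph shows the job (and its datasets) before it ever completes.
      return eventType == EventType.START || eventType == EventType.RUNNING;
    }
    // Batch (the previous behavior): version only once the run completes.
    return eventType == EventType.COMPLETE;
  }
}
```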
Solution

Use the JobTypeJobFacet to determine if a job is batch or streaming, and store the job type in the jobs table.

One-line summary:
Checklist

- Entry added to CHANGELOG.md (depending on the change, this may not be necessary).
- .sql database schema migration versioned according to Flyway's naming convention (if relevant).