Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support inputFacets and outputFacets #2417

Merged
merged 1 commit into from
Apr 11, 2023

Conversation

pawel-big-lebowski
Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski commented Feb 11, 2023

Signed-off-by: Pawel Leszczynski leszczynski.pawel@gmail.com

Problem

Marquez does not support inputFacets and outputFacets sent in Ol event. They're not saved nor exposed in api.

Closes: #2320

Solution

We do have dataset_facets table designed for this but we don't store inputFacets and outputFacets there.
Although inputFacets and outputFacets contained within Openlineage event as dataset properties, we want to expose them in api as part of a run bcz they describe specific runs rather than datasets or dataset's versions.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added api API layer changes client/java labels Feb 11, 2023
@codecov
Copy link

codecov bot commented Feb 11, 2023

Codecov Report

Merging #2417 (3264a59) into main (c038f73) will increase coverage by 0.24%.
The diff coverage is 96.55%.

❗ Current head 3264a59 differs from pull request most recent head feba65f. Consider uploading reports for the commit feba65f to get more accurate results

@@             Coverage Diff              @@
##               main    #2417      +/-   ##
============================================
+ Coverage     83.60%   83.85%   +0.24%     
- Complexity     1213     1234      +21     
============================================
  Files           231      233       +2     
  Lines          5520     5629     +109     
  Branches        266      269       +3     
============================================
+ Hits           4615     4720     +105     
- Misses          762      766       +4     
  Partials        143      143              
Impacted Files Coverage Δ
api/src/main/java/marquez/db/Columns.java 79.74% <ø> (ø)
api/src/main/java/marquez/db/DatasetDao.java 98.64% <ø> (ø)
...pi/src/main/java/marquez/db/DatasetVersionDao.java 95.83% <ø> (ø)
api/src/main/java/marquez/db/RunDao.java 92.40% <ø> (ø)
...pi/src/main/java/marquez/db/mappers/RunMapper.java 90.21% <91.83%> (+0.21%) ⬆️
...n/java/marquez/common/models/RunDatasetFacets.java 100.00% <100.00%> (ø)
api/src/main/java/marquez/db/DatasetFacetsDao.java 100.00% <100.00%> (ø)
api/src/main/java/marquez/db/OpenLineageDao.java 96.41% <100.00%> (+0.11%) ⬆️
...main/java/marquez/service/models/LineageEvent.java 95.23% <100.00%> (+0.71%) ⬆️
api/src/main/java/marquez/service/models/Run.java 92.30% <100.00%> (ø)
... and 2 more

... and 2 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@pawel-big-lebowski pawel-big-lebowski force-pushed the inputFacets-write-and-expose branch 2 times, most recently from 857f23e to ede31df Compare February 11, 2023 15:30
@boring-cyborg boring-cyborg bot added the docs label Feb 11, 2023
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review February 13, 2023 07:17
@pawel-big-lebowski pawel-big-lebowski self-assigned this Feb 13, 2023
@Nullable final Map<String, Object> facets) {
@Nullable final Map<String, Object> facets,
@Nullable final List<RunDatasetFacets> inputFacets,
@Nullable final List<RunDatasetFacets> outputFacets) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class Run defines a list of inputVersions and inputVersions as DatasetVersionId entries. Given that a class DatasetVersionId is needed for RunDatasetFacets (along with the in/out facets) can we define a class DatasetVersionIdAndFacets as:

class DatasetVersionIdAndFacets {
  final DatasetVersionId datasetVersionId;
  final ImmutableMap<String, Object> facets;
}

Then, update Run to use DatasetVersionIdAndFacets for inputVersions and outputVersions?

public class Run {
  .
  final List<DatasetVersionIdAndFacets> inputVersions;
  List<DatasetVersionIdAndFacets> outputVersions;
  .
  .

It just seems RunDatasetFacets is really a DatasetVersionId with facets?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind also updating the spec/openapi.yml with the additional properties?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wslulciuc I agree that merging these two entries is fine, but I would like to avoid And in class name. Could we stay within RunDatasetFacets which contain both dataset version id's and facets? I've modified the code to return Run in that form.

Additionally, I've included changes in spec.
Screenshot 2023-03-03 at 09 29 36

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wslulciuc can we proceed with PR?

Copy link
Member

@wslulciuc wslulciuc Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Sorry, I can't hide it. I've been heads down focusing on other things and this PR totally slipped off my radar. Ok, so final thoughts here: we'll want to avoid being overly specific and use facets in RunDatasetFacets. We'll mostly likely add additional properties other than just datasetVersionId and facets, so let's just go with either:

DatasetVersion

Or, we can be more specific (as in/out dataset versions will most likely contain different properties at some point):

InputDatasetVersion
OutDatasetVersion

Thoughts?

@pawel-big-lebowski pawel-big-lebowski force-pushed the inputFacets-write-and-expose branch 2 times, most recently from ce7d31e to 3264a59 Compare March 3, 2023 08:30
@boring-cyborg boring-cyborg bot added the spec label Mar 3, 2023
Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pawel-big-lebowski, approving as to not block the PR based on our offline discussion (i.e. you're cool with my suggested naming). Great work, and merge when ready 👍

@pawel-big-lebowski pawel-big-lebowski force-pushed the inputFacets-write-and-expose branch 3 times, most recently from e5d48fe to 7b52941 Compare April 11, 2023 09:33
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@pawel-big-lebowski pawel-big-lebowski merged commit 2a41675 into main Apr 11, 2023
@pawel-big-lebowski pawel-big-lebowski deleted the inputFacets-write-and-expose branch April 11, 2023 09:57
Xavier-Cliquennois pushed a commit to Xavier-Cliquennois/marquez that referenced this pull request Jul 26, 2023
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
Signed-off-by: Xavier-Cliquennois <xavier.cliquennois@wearegraphite.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataset metrics facets aren't being returned by GET Datasets API
2 participants