-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimizing current runs query for lieage api #2211
optimizing current runs query for lieage api #2211
Conversation
Thanks for opening your first pull request in the Marquez project! Please check out our contributing guidelines (https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md). |
6ca59e1
to
bc2349d
Compare
fea1168
to
b39fcc5
Compare
@@ -268,7 +268,7 @@ public void testLineageWithDeletedDataset() { | |||
.hasSize(0); | |||
runAssert | |||
.extracting(Run::getOutputVersions, InstanceOfAssertFactories.list(DatasetVersionId.class)) | |||
.hasSize(1); | |||
.hasSize(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a little misleading since we're not asking the database for this information anymore. Should we just remove this assertion or change the type definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree and had initially removed this completely but then added it back again, because getInputVersions also had an assertion. I guess i should remove both of them now.
runAssert | ||
.extracting(Run::getOutputVersions, InstanceOfAssertFactories.list(DatasetVersionId.class)) | ||
.hasSize(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are asserting on the Lineage
object returned from lineageService.lineage
call, and it is the response payload of GET lineage API. Does this mean that API response is being changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't like all the information being returned here, but breaking API compatibility is not good. If we want a lighter-weight version of the lineage API, I think it's better to either include an optional parameter to exclude the superfluous data or to create a new API and deprecate the old one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the PR and added a withRunFacets
flag to get runs with all the superfluous data and this flag always true in Marquez apis, so no affect on the API compatibility.
601278b
to
52ebaa3
Compare
@@ -46,7 +46,7 @@ public LineageService(LineageDao delegate, JobDao jobDao) { | |||
this.jobDao = jobDao; | |||
} | |||
|
|||
public Lineage lineage(NodeId nodeId, int depth) { | |||
public Lineage lineage(NodeId nodeId, int depth, boolean withRunFacets) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good change, but I worry that we'll want to add more options to this method (e.g., include job facets? dataset facets? exclude runs altogether?). I don't think we should take this on now, but let's add a TODO to make the input parameters here more easily extendable so that we can add those other options later one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't agree more. I had small aversion to adding a flag to make this work, but there was no other better option. I also thought in future if more changes like this come along that alter api significantly, we could add these as options to api query parameters, or create more broken apis to get specific data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added TODO
@SqlQuery( | ||
"SELECT DISTINCT on(r.job_name, r.namespace_name) r.*, jv.version as job_version\n" | ||
+ " FROM runs_view r\n" | ||
+ " INNER JOIN job_versions jv ON jv.uuid=r.job_version_uuid\n" | ||
+ " INNER JOIN jobs_view j ON j.uuid=jv.job_uuid\n" | ||
+ " WHERE j.uuid in (<jobUuid>) OR j.symlink_target_uuid IN (<jobUuid>)\n" | ||
+ " ORDER BY r.job_name, r.namespace_name, created_at DESC\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use the more readable syntax variant as we update our queries.
"""
SELECT DISTINCT on(r.job_name, r.namespace_name) r.*, jv.version as job_version
FROM runs_view
INNER JOIN job_versions jv ON jv.uuid=r.job_version_uuid
INNER JOIN jobs_view j ON j.uuid=jv.job_uuid
WHERE j.uuid in (<jobUuid>) OR j.symlink_target_uuid IN (<jobUuid>)
ORDER BY r.job_name, r.namespace_name, created_at DESC
"""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the syntax as you asked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much!
Signed-off-by: Prachi Mishra <prachi.mishra@astronomer.io>
Signed-off-by: Prachi Mishra <prachi.mishra@astronomer.io>
Signed-off-by: Prachi Mishra <prachi.mishra@astronomer.io>
Signed-off-by: Prachi Mishra <prachi.mishra@astronomer.io>
4662e51
to
a330beb
Compare
Great job! Congrats on your first merged pull request in the Marquez project! |
Problem
Introduce a simpler alternate getCurrentRuns query that gets only simple runs from DB without the additional data from tables such as run_args, job_context, facets, input/output versions etc which required the extra table joins in the old getCurrentRuns query. This new getCurrentRuns DAO is NOT being used in Marquez as of now.
Closes: #4425
Solution
getCurrentRuns
DAO is renamed togetCurrentRunsWithFacets
without any change to the sql query .getCurrentRuns
and is NOT called from /lineage api as of now so NO change is required to the /lineage api response spec.withRunFacets
is also introduced as parameter to lineage api, which is always set totrue
to callgetCurrentRunsWithFacets
, so that /lineage api and hence all the tests still call the old DAO .Checklist
CHANGELOG.md
with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary).sql
database schema migration according to Flyway's naming convention (if relevant)