Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Lineage metrics to FileSystems #32090

Merged
merged 3 commits into from
Aug 14, 2024
Merged

Add Lineage metrics to FileSystems #32090

merged 3 commits into from
Aug 14, 2024

Conversation

Abacn
Copy link
Contributor

@Abacn Abacn commented Aug 6, 2024

This PR depends on #32068

  • read covered by FileBasedSource.split
  • readAll covered by ReadAllViaFileBasedSourceTransform . It is based on FileBasedSink, but it does not split, instead creating a reader and read directly inside a DoFn
  • write covered by FileBasedSink

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@Abacn
Copy link
Contributor Author

Abacn commented Aug 6, 2024

Tested on TextIOIT:

Lineage.query(SOURCE) and (SINK) both get

    [gcs:temp-storage-for-perf-tests.temp/_1722976503531-00004-of-00005, gcs:temp-storage-for-perf-tests.temp/_1722976503531-00001-of-00005, gcs:temp-storage-for-perf-tests.temp/_1722976503531-00003-of-00005, gcs:temp-storage-for-perf-tests.temp/_1722976503531-00002-of-00005, gcs:temp-storage-for-perf-tests.temp/_1722976503531-00000-of-00005]
    [gcs:temp-storage-for-perf-tests.temp/_1722976503531-00004-of-00005, gcs:temp-storage-for-perf-tests.temp/_1722976503531-00001-of-00005, gcs:temp-storage-for-perf-tests.temp/_1722976503531-00003-of-00005, gcs:temp-storage-for-perf-tests.temp/_1722976503531-00002-of-00005, gcs:temp-storage-for-perf-tests.temp/_1722976503531-00000-of-00005]

@Abacn Abacn marked this pull request as ready for review August 6, 2024 20:41
Copy link
Contributor

github-actions bot commented Aug 6, 2024

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@Abacn
Copy link
Contributor Author

Abacn commented Aug 7, 2024

FQN requires segments having reserved characters (e.g. colon, dot, white space) to be wrapped with backtick. Since this is pretty common in file paths, added corresponding logic and refactor to handle them

@Abacn
Copy link
Contributor Author

Abacn commented Aug 8, 2024

R: @rohitsinha54 @robertwb

Abacn added a commit to Abacn/beam that referenced this pull request Aug 8, 2024
* Introduce metric.Lineage StringSet wrapper
  Reflect Java SDK apache#32090

* Direct Read

* Export Read

* ReadAllFromBigQuery

* FILE_LOAD Write
Copy link
Contributor

github-actions bot commented Aug 8, 2024

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@Abacn Abacn mentioned this pull request Aug 13, 2024
3 tasks
@@ -624,6 +625,11 @@ protected S3ResourceId matchNewResource(String singleResourceSpec, boolean isDir
return S3ResourceId.fromUri(singleResourceSpec);
}

@Override
protected void reportLineage(S3ResourceId resourceId, Lineage lineage) {
lineage.add("s3", ImmutableList.of(resourceId.getBucket(), resourceId.getKey()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the relative path not possible here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For GCS there is a GcsPath that also handles relative path. GcsResourceId is a wrapper of it so in theory there is possibility encounter relative path, that's why I added a warning in GcsFileSystem.reportLineage.

Current codepath should never encounter relative gcs path in reportLineage as the resourceId parsed in are all matched result that was assembled from GcsPath.fromObject(storageObject), where storageObject comes from List API call response, which then resolved to full path.

For s3 FileSystem it's not possible. S3ResourceId stores the absolute path directly (there is no equivalent of GcsPath here). There is essentially single entrance to new an S3ResourceId object which is here:

static S3ResourceId fromComponents(String scheme, String bucket, String key) {

and it explicitly add a "/" in key.

damccorm added a commit that referenced this pull request Aug 14, 2024
* Add Lineage metrics to Python BigQueryIO

* Introduce metric.Lineage StringSet wrapper
  Reflect Java SDK #32090

* Direct Read

* Export Read

* ReadAllFromBigQuery

* FILE_LOAD Write

* fix lint; add tests

* Consistent metrics name

* Update sdks/python/apache_beam/metrics/metric.py

Co-authored-by: Danny McCormick <dannymccormick@google.com>

---------

Co-authored-by: Danny McCormick <dannymccormick@google.com>
@Abacn
Copy link
Contributor Author

Abacn commented Aug 14, 2024

PreCommit Java known flaky test under org.apache.beam.runners.fnexecution not related to this change

@Abacn Abacn merged commit 2c49243 into apache:master Aug 14, 2024
28 of 29 checks passed
@Abacn Abacn deleted the lineagegcs branch August 14, 2024 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants