Skip to content
This repository has been archived by the owner on May 1, 2024. It is now read-only.

Tasks for loading GA data into Snowflake (PART 1) #721

Merged
merged 1 commit into from
Apr 23, 2019

Conversation

pwnage101
Copy link
Contributor

@pwnage101 pwnage101 commented Apr 19, 2019

This is the first part of the work to add a pipeline to load GA360 data into Snowflake.

This part of the code change DOES NOT depend on a Luigi upgrade, and is ready for review/merging now.

Other PRs:

Analytics Pipeline Pull Request

Make sure that the following steps are done before merging:

  • If you have a migration please contact data engineering team before merging.
  • Before merging run full acceptance tests suite and provide URL for the acceptance tests run.
  • A member of data engineering team has approved the pull request.

@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part1 branch 2 times, most recently from b51a455 to 35301c3 Compare April 19, 2019 17:11
@pwnage101 pwnage101 mentioned this pull request Apr 19, 2019
3 tasks
Copy link
Contributor

@brianhw brianhw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not done a full review, but I expect the S3 changes to not work generally. Run acceptance tests, for example.

edx/analytics/tasks/common/snowflake_load.py Show resolved Hide resolved
edx/analytics/tasks/util/s3_util.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Apr 19, 2019

Codecov Report

Merging #721 into master will decrease coverage by <.01%.
The diff coverage is 65.51%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #721      +/-   ##
==========================================
- Coverage   75.18%   75.18%   -0.01%     
==========================================
  Files         203      203              
  Lines       22888    22911      +23     
==========================================
+ Hits        17209    17226      +17     
- Misses       5679     5685       +6
Impacted Files Coverage Δ
edx/analytics/tasks/common/bigquery_load.py 25.75% <0%> (ø) ⬆️
edx/analytics/tasks/util/tests/test_s3_util.py 100% <100%> (ø) ⬆️
...lytics/tasks/warehouse/load_warehouse_snowflake.py 47.71% <100%> (ø) ⬆️
edx/analytics/tasks/common/snowflake_load.py 31.39% <33.33%> (+1.58%) ⬆️
edx/analytics/tasks/util/s3_util.py 65.27% <91.66%> (+1.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6db66ef...9f64895. Read the comment docs.

@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part1 branch from 35301c3 to d8d2b04 Compare April 19, 2019 18:37
@pwnage101
Copy link
Contributor Author

@pwnage101
Copy link
Contributor Author

pwnage101 commented Apr 22, 2019

Acceptance tests passed.

One test failed initially, but it was apparently flaky because I re-ran that module and it passed: http://jenkins-ci.analytics.edx.org/view/ad-hoc/job/edx-analytics-pipeline-acceptance-test-manual/1729/console

edx/analytics/tasks/util/s3_util.py Outdated Show resolved Hide resolved
@@ -162,3 +159,29 @@ def open(self, mode='r'):
if not hasattr(self, 's3_client'):
self.s3_client = ScalableS3Client()
return AtomicS3File(safe_path, self.s3_client, policy=DEFAULT_KEY_ACCESS_POLICY)


def canonicalize_s3_url(url):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to provide some unit tests for this, since you're modifying test_s3_util.py?

edx/analytics/tasks/common/snowflake_load.py Show resolved Hide resolved
Copy link
Contributor

@brianhw brianhw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of my remaining suggestions are optional. 👍

edx/analytics/tasks/util/s3_util.py Outdated Show resolved Hide resolved
@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part1 branch 4 times, most recently from d692598 to d5719e3 Compare April 23, 2019 17:52
@pwnage101
Copy link
Contributor Author

it's actually good that I added the canonicalize_s3_url unit test because it caught what would have been a major flaw: it would have rejected "s3" as a URL scheme due to a code typo.

@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part1 branch from d5719e3 to d3c5ac1 Compare April 23, 2019 18:56
This is part 1 of the GA loading pipeline which DOES NOT depend on a
Luigi upgrade.

DE-1374 (PART 1)
@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part1 branch from d3c5ac1 to 9f64895 Compare April 23, 2019 19:12
@pwnage101 pwnage101 merged commit 12686d0 into master Apr 23, 2019
@nedbat nedbat deleted the pwnage101/ga-snowflake-DE-1374-part1 branch December 13, 2023 18:26
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants