Skip to content
This repository has been archived by the owner on May 1, 2024. It is now read-only.

Tasks for loading GA data into Snowflake (PART 2) #722

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

pwnage101
Copy link
Contributor

@pwnage101 pwnage101 commented Apr 19, 2019

This is the second part of the work to add a pipeline to load GA360 data into Snowflake.

This part of the code change DOES depend on a Luigi upgrade, and is NOT ready for merging until we have done the following things:

  • Upgrade luigi/boto and tested all the rest of the pipelines for regressions.
    • edx-specific changes in our current luigi fork are rebased on top of luigi>=2.7.6
  • Upgrade the BigQuery loading tasks to take advantage of new API methods in google-cloud-bigquery==1.11.2

Other PRs:

Analytics Pipeline Pull Request

Make sure that the following steps are done before merging:

  • If you have a migration please contact data engineering team before merging.
  • Before merging run full acceptance tests suite and provide URL for the acceptance tests run.
  • A member of data engineering team has approved the pull request.

@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part2 branch from 2064e1e to 4f49352 Compare April 19, 2019 17:06
@pwnage101 pwnage101 changed the base branch from master to pwnage101/ga-snowflake-DE-1374-part1 April 19, 2019 17:09
@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part1 branch from b51a455 to 35301c3 Compare April 19, 2019 17:11
@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part2 branch 2 times, most recently from 4e568e8 to ce63f5a Compare April 19, 2019 17:24
@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part1 branch from 35301c3 to d8d2b04 Compare April 19, 2019 18:37
@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part2 branch from ce63f5a to 3fae201 Compare April 19, 2019 18:39
Copy link
Contributor

@brianhw brianhw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments.

@@ -4,23 +4,23 @@
-r base.txt

argparse==1.2.1 # Python Software Foundation License
boto3==1.4.8 # Apache 2.0
boto3==1.9.131 # Apache 2.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for choosing 1.4.8, as I recall, was that it was the smallest change needed to get support for regions in Europe, that open-source users needed.

requirements/default.in Show resolved Hide resolved
graphitesend==0.10.0 # Apache
html5lib==1.0b3 # MIT
isoweek==1.3.3 # BSD
numpy==1.11.3 # BSD
paypalrestsdk==1.9.0 # Paypal SDK License
psycopg2==2.6.2 # LGPL
psycopg2==2.8.1 # LGPL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what actually requires this, but I assume it's for vertica-python. If so, perhaps it doesn't need to be pinned in the *.in file, but only pinned in *.txt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if it were for vertica-python, why would that make pinning this in default.in not required?

I only upgraded it because I literally could not run make upgrade to even generate the *.txt files on Ubuntu 18.04 due to a bug somewhere in the dependency resolution in pip-tools or in some random setup.py.

requirements/default.in Show resolved Hide resolved
edx/analytics/tasks/util/url.py Outdated Show resolved Hide resolved
"""
Overriding so that we can pass `self.fs` to the new marker.
"""
return GCSMarkerTarget(path=path, client=self.fs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear what this is doing. We get a GCSMarkerTarget somehow, and then use it to clone other GCSMarkerTarget? Some kind of "clone" operation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I introduced new_with_credentials() to MarkerMixin to allow markers to construct new markers (something we already do) but optionally also passing along any credentials which would otherwise get dropped in the case of GCSMarkerTarget. The result is typically a marker target which is the exact same in every way, except with /_SUCCESS appended to the path attribute which, in retrospect, should never exist as an object and I probably should not have built on top of that functionality.

The more correct thing to do would be to construct a new target object which is the non-marker analogy of the current class. What do you think is the best way to handle this?

@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part2 branch from 667625d to 00e65aa Compare April 22, 2019 14:11
@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part1 branch 6 times, most recently from d3c5ac1 to 9f64895 Compare April 23, 2019 19:12
@pwnage101 pwnage101 changed the base branch from pwnage101/ga-snowflake-DE-1374-part1 to master April 24, 2019 18:46
This is part 2 of the GA loading pipeline which DOES depend on a Luigi
upgrade.

DE-1374 (PART 2)
@pwnage101 pwnage101 force-pushed the pwnage101/ga-snowflake-DE-1374-part2 branch from 00e65aa to 38e313a Compare April 24, 2019 18:50
@nedbat
Copy link
Contributor

nedbat commented Jan 9, 2024

This PR is now really old. You have a few more old PRs in flight: https://github.com/pulls?q=is%3Aopen+is%3Apr+archived%3Afalse+author%3Apwnage101+sort%3Aupdated-asc+org%3Aedx+org%3Aopenedx Do you want to keep them all open?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants