Update get_data.sh #81

Merged: 2 commits merged on Jun 3, 2024

Conversation

conceptofmind
Contributor

Only the most recent dump should be used, which is now 2024-05-06. Otherwise you are making numerous duplicates of the same data. This was confirmed by Mike Lissner.

I will update a script to include the rest of the Spark code next.

Only the most recent dump should be used. You are making numerous duplicates of the same data.
@conceptofmind
Contributor Author

The previous dumps also contain data that was purposefully removed in the newer dumps, so the diffs between dumps should not be included.

@blester125
Collaborator

LGTM, let's fix the lint error and add a comment about why only the newest data is needed, so we don't forget and think that adding more dates is an easy way to get more data lol

@conceptofmind
Contributor Author

conceptofmind commented Jun 3, 2024

Will do. Other parts of the script need to be changed too, but this is the most glaring issue. I will also need to fix the rest of the text columns.

@conceptofmind
Contributor Author

Added comments:

# Only download the data from most recent CL dump
# The newest dump contains the previous dumps data 
# Differences from the previous data should not be included

And the lint should be ok for the other file now. Used black/isort.
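
For reference, a minimal sketch of the idea behind those comments, not the actual contents of get_data.sh. The function name `latest_dump` is hypothetical, and the two older dates are illustrative placeholders; only 2024-05-06 comes from this thread. It picks the single newest dump date instead of looping over every date:

```shell
#!/bin/sh
# Hypothetical sketch: select only the newest dump date from a list of candidates.
# `latest_dump` is an assumed helper name, not something defined in get_data.sh.

# Print the most recent ISO-8601 (YYYY-MM-DD) date among the arguments.
# ISO dates sort chronologically when compared as plain strings.
latest_dump() {
  printf '%s\n' "$@" | LC_ALL=C sort | tail -n 1
}

# The older dates below are placeholders for illustration only.
newest="$(latest_dump 2022-08-02 2023-12-04 2024-05-06)"

# Only this one dump would be downloaded; older dumps duplicate its contents
# and may still contain records that were purposefully removed later.
echo "Using only the ${newest} dump"
```

Selecting one date up front keeps the download step from ever seeing the older dumps, so duplicates cannot creep back in if someone extends the date list later.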

@blester125 blester125 merged commit 1ef9a1c into r-three:main Jun 3, 2024
2 checks passed