-
-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bulk data export script bugfixes #4223
Conversation
…traint violations
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! At a glance, this looks like it's doing a couple things:
-
Fixing/tweaking the escape character. @quevon24 will know if this is a good idea.
-
Adding a bunch of tables. I'm happy to make sure we have all the financial disclosure tables, but we don't release the party and PACER filings tables in the bulk data. Would you mind tweaking the PR to remove those?
Certainly happy to, I didn't realize. To make sure I understand, do you mean all the |
Yes, exactly, thank you! |
@mlissner, updated pushed! I'm happy to make further changes if you spot any other problems. When you're satisfied and merge this in, would it be possible for you to run the export script to update the public S3 datafiles? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more, if you don't mind?
Sorry to go back and forth on this. It's hard reading the diff and the comments. I didn't catch at first that you removed a few tables. Any reason for that? Ideally, we'd just want to add missing tables, without removing the ones we already have, no? |
Apologies, when I asked if you meant all the disclosures_*` I thought you meant all in the script, not just any of the ones I had added. I just put them back in now. The delta now is +4 tables in total from original to this version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm...maybe one more round...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this looks right. The last thing I want to check, sorry, is whether the change to quoting makes sense to the dev that last worked on this. I'll get him to chime in, and if he approves, let's merge.
@quevon24 can you review the quoting change in this and see what you think? |
Sounds good, thanks @mlissner. @quevon24, for context of why I'm requesting it change - there is a column in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hopperj good idea to change the quote character, the idea of forcing the quote character was to be able to differentiate the value that goes in each column, since that caused errors when trying to import it. A character that is difficult to appear will help reduce possible errors.
Thank you for noticing that some tables were missing in the schema 👍
Everything looks good @mlissner
Great. Set for auto-merge! Thank you @hopperj for the enhancements and getting it through review! |
Thanks to you both for providing this data, I'm happy to help! Will it be possible to run a data export to S3 in the near term? |
Just a heads up, there's another hurdle, I fear, that may be coming once we export this: #4241 |
Adding missing tables, setting import order to avoid foreign key constraint violations, and changing quote character for CSV data to '`' so it doesn't cause parsing errors with quotes in the XML data column.