Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk data export script bugfixes #4223

Merged
merged 10 commits into from
Jul 25, 2024
Merged

Conversation

hopperj
Copy link
Contributor

@hopperj hopperj commented Jul 19, 2024

Adding missing tables, setting import order to avoid foreign key constraint violations, and changing quote character for CSV data to '`' so it doesn't cause parsing errors with quotes in the XML data column.

@CLAassistant
Copy link

CLAassistant commented Jul 19, 2024

CLA assistant check
All committers have signed the CLA.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! At a glance, this looks like it's doing a couple things:

  1. Fixing/tweaking the escape character. @quevon24 will know if this is a good idea.

  2. Adding a bunch of tables. I'm happy to make sure we have all the financial disclosure tables, but we don't release the party and PACER filings tables in the bulk data. Would you mind tweaking the PR to remove those?

@hopperj
Copy link
Contributor Author

hopperj commented Jul 19, 2024

Thanks! At a glance, this looks like it's doing a couple things:

  1. Fixing/tweaking the escape character. @quevon24 will know if this is a good idea.
  2. Adding a bunch of tables. I'm happy to make sure we have all the financial disclosure tables, but we don't release the party and PACER filings tables in the bulk data. Would you mind tweaking the PR to remove those?

Certainly happy to, I didn't realize. To make sure I understand, do you mean all the disclosures_* and people_db_party tables?

@mlissner
Copy link
Member

Yes, exactly, thank you!

@hopperj
Copy link
Contributor Author

hopperj commented Jul 22, 2024

@mlissner, updated pushed! I'm happy to make further changes if you spot any other problems. When you're satisfied and merge this in, would it be possible for you to run the export script to update the public S3 datafiles?

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more, if you don't mind?

scripts/make_bulk_data.sh Outdated Show resolved Hide resolved
scripts/make_bulk_data.sh Outdated Show resolved Hide resolved
scripts/make_bulk_data.sh Outdated Show resolved Hide resolved
@mlissner
Copy link
Member

Sorry to go back and forth on this. It's hard reading the diff and the comments. I didn't catch at first that you removed a few tables. Any reason for that?

Ideally, we'd just want to add missing tables, without removing the ones we already have, no?

@hopperj
Copy link
Contributor Author

hopperj commented Jul 23, 2024

Apologies, when I asked if you meant all the disclosures_*` I thought you meant all in the script, not just any of the ones I had added. I just put them back in now. The delta now is +4 tables in total from original to this version.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm...maybe one more round...?

scripts/make_bulk_data.sh Outdated Show resolved Hide resolved
scripts/make_bulk_data.sh Outdated Show resolved Hide resolved
Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this looks right. The last thing I want to check, sorry, is whether the change to quoting makes sense to the dev that last worked on this. I'll get him to chime in, and if he approves, let's merge.

@mlissner
Copy link
Member

@quevon24 can you review the quoting change in this and see what you think?

@mlissner mlissner requested a review from quevon24 July 24, 2024 15:56
@hopperj
Copy link
Contributor Author

hopperj commented Jul 24, 2024

Sounds good, thanks @mlissner. @quevon24, for context of why I'm requesting it change - there is a column in opinions that contains full XML which uses double quote for attribute values and can be hundreds of lines long. This makes parsing/ingesting back into postgres extremely difficult/impossible. When ingesting this file postgres threw syntax errors related to this. I chose the '`' character simply because I expect it won't appear in the data values, although there are probably other choices that would work fine too.

Copy link
Member

@quevon24 quevon24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hopperj good idea to change the quote character, the idea of ​​forcing the quote character was to be able to differentiate the value that goes in each column, since that caused errors when trying to import it. A character that is difficult to appear will help reduce possible errors.

Thank you for noticing that some tables were missing in the schema 👍

Everything looks good @mlissner

@mlissner mlissner enabled auto-merge July 25, 2024 15:01
@mlissner
Copy link
Member

Great. Set for auto-merge! Thank you @hopperj for the enhancements and getting it through review!

@hopperj
Copy link
Contributor Author

hopperj commented Jul 25, 2024

Thanks to you both for providing this data, I'm happy to help! Will it be possible to run a data export to S3 in the near term?

@mlissner mlissner merged commit 9b9bc1b into freelawproject:main Jul 25, 2024
9 checks passed
@mlissner
Copy link
Member

Just a heads up, there's another hurdle, I fear, that may be coming once we export this: #4241

@hopperj hopperj deleted the export-fix branch July 26, 2024 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants