Bulk Data shell script not working #4222

Open
khanken opened this issue Jul 19, 2024 · 7 comments

Comments

@khanken

khanken commented Jul 19, 2024

load-bulk-data-2024-05-07.sh is not working for me:

  1. schema-2024-05-07.sql defines 80 tables, but there are far fewer CSV files than tables. Many tables have no corresponding CSV file to load their data from.
  2. load-bulk-data-2024-05-07.sh tries to load some CSV files into tables that do not exist in schema-2024-05-07.sql. For example: ERROR: relation "public.disclosures_financialdisclosure" does not exist.
  3. The shell script often loads a table with a foreign key that references a table which has not been loaded yet. The load order in the script needs to be fixed.
  4. The file "people-db-races-2024-05-06.csv.bz2" is empty. Since it appears to be a lookup table, I tried to get the data from models.py, but the primary key in the table is an integer while the definition in models.py is a char. Could you please upload the correct file to S3?
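
For anyone wanting to reproduce points 1 and 2, here is a minimal sketch that compares the tables created by a schema file with the tables a loader tries to fill. It assumes the schema uses plain `CREATE TABLE schema.name (` statements, as pg_dump typically emits; the helper names `tables_in_schema` and `compare_tables` are hypothetical and not part of the repository.

```python
import re

# Matches "CREATE TABLE [IF NOT EXISTS] public.some_table" (assumed pg_dump style).
CREATE_TABLE_RE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?([\w.\"]+)", re.IGNORECASE
)


def tables_in_schema(sql_text):
    """Return the set of table names created by the given SQL text."""
    return {name.replace('"', "") for name in CREATE_TABLE_RE.findall(sql_text)}


def compare_tables(schema_tables, load_targets):
    """Return (targets missing from the schema, schema tables with no data file)."""
    targets = set(load_targets)
    missing_from_schema = sorted(targets - schema_tables)
    without_data = sorted(schema_tables - targets)
    return missing_from_schema, without_data
```

Feeding it the text of schema-2024-05-07.sql and the list of tables named in the load script would list both kinds of mismatch in one pass. The list of load targets has to be taken from the script itself, since the mapping from CSV file names to table names is not derivable from the file names alone.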

Thank you so much for all you have done! I really appreciate it!

@mlissner
Member

Thanks for reporting this. We don't export all the tables, but we do figure it's useful to have a fairly complete schema. The ordering is definitely an issue. If that's something you're game to fix, we'd welcome that.

There's a PR from a few minutes ago that may have some of these fixes too: #4223.

I think it fixes the missing race table, and the missing schema files. The author mentioned the issue with the foreign keys being out of order, but I don't think their PR has the fix for that yet.

@hopperj
Contributor

hopperj commented Jul 19, 2024

@khanken my MR should fix your first 3 issues, although I don't think it will help with the 4th. From what I can tell, the load-bulk-data-2024-05-07.sh script loads tables in the order they are defined in the array, so I have ordered them in a way that shouldn't trigger any FK errors when the script is run.
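
As an illustration of how such an ordering could be derived rather than maintained by hand, here is a minimal sketch using a topological sort from Python's standard library (`graphlib`, Python 3.9+). This is not how the PR computes its order; the function name `load_order` and the table names in the usage example are illustrative assumptions.

```python
from graphlib import TopologicalSorter


def load_order(fk_dependencies):
    """Order tables so every referenced table comes before the tables that reference it.

    fk_dependencies maps each table name to the set of tables its foreign keys point at.
    Raises graphlib.CycleError if the tables reference each other in a cycle.
    """
    sorter = TopologicalSorter()
    for table, referenced in fk_dependencies.items():
        # A self-referencing FK does not constrain the load order between tables,
        # so it is skipped here.
        sorter.add(table, *(t for t in referenced if t != table))
    return list(sorter.static_order())


if __name__ == "__main__":
    # Illustrative dependency map, not taken from the real schema.
    deps = {
        "search_opinion": {"search_opinioncluster"},
        "search_opinioncluster": {"search_docket"},
        "search_docket": {"search_court"},
        "search_court": set(),
    }
    print(load_order(deps))
```

The resulting list could then be pasted into the array in the shell script. Genuine cycles between tables would still need to be handled separately, for example by loading the data before the constraints are created.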

@khanken
Author

khanken commented Jul 19, 2024 via email

@khanken
Author

khanken commented Jul 19, 2024 via email

@mlissner
Member

P.S. What are the minimum hardware requirements for running this database?

I think it's around 500GB, but honestly, we have lots of other stuff in our DB, so it's hard to say. It takes a big machine though.

@khanken
Author

khanken commented Jul 22, 2024 via email

@mlissner
Member

You can chunk on your side, if that's helpful. I think we'd prefer it that way.
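
Assuming "chunk" here means splitting a large CSV export into smaller pieces before loading, a minimal sketch could look like the following. The helper name `chunk_rows` is hypothetical, and the interpretation of the request is an assumption, since the email it replies to is not shown in the thread.

```python
import itertools


def chunk_rows(rows, chunk_size):
    """Yield lists of at most chunk_size data rows, each preceded by the header row.

    rows is any iterable whose first item is the header.
    Yields nothing if there is no header or no data rows.
    """
    iterator = iter(rows)
    try:
        header = next(iterator)
    except StopIteration:
        return
    while True:
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            return
        yield [header] + chunk
```

The rows could come from `csv.reader` wrapped around `bz2.open(path, "rt", newline="")`, which keeps multi-line quoted fields intact where naive line splitting would not. The CSV dialect (quoting and escape characters) used by the bulk files would need to be matched when reading and writing the pieces.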
