
Order of bulk data csv fields makes it difficult (impossible?) to import #4241

Closed
mlissner opened this issue Jul 25, 2024 · 5 comments

Comments

@mlissner
Member

Lots of people are using the bulk data, so I'm not sure how this hasn't been fixed, but one of our users is reporting that our CSVs cannot be imported using the schema we provide because the columns in the CSVs are in a different order than the schema.

I think we should fix that.

I imagine a related issue will be that the schema has all of our columns while the CSV has a subset.

I'm not sure of the fix for this. Maybe we need to provide better schemas? Maybe we need to re-order the CSV fields properly?

The error they get is:

 Conversion Error: CSV Error on Line: 2

Original Line: "10838944","2019-01-21 09:17:15.707272+00","2022-02-04 21:50:01.416445+00","9","","Susan Illston","","","2021-01-21 17:13:19.363539+00",,,,,,"1998-09-03","1999-09-14","1999-09-14","","Advent Software Inc. v. Stratum Business","","advent-software-inc-v-stratum-business","3:98-cv-03398","9803398","119636","","840 Trademark","","Federal question","","","","","","",,"t","2021-01-21 17:13:19.309558+00","0",,"f",,"1588","cand","19579836",,

Error when converting column "source". Could not convert string "9" to 'DATE'
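The failure is positional: COPY maps the file's fields to table columns by position, not by name, so when the orders differ a value like "9" (the source field) lands under a column expecting a DATE. A toy illustration of the mismatch (the three-column orders here are hypothetical, not the real schema):

```python
# Toy illustration of a positional mismatch between schema order and CSV order.
schema_order = ["id", "date_filed", "source"]  # hypothetical DDL column order
csv_order = ["id", "source", "date_filed"]     # hypothetical file field order
row = ["10838944", "9", "1998-09-03"]          # one row, in CSV field order

# COPY assigns fields to columns by position, not by name:
assigned = dict(zip(schema_order, row))
print(assigned["date_filed"])  # "9" ends up where a DATE is expected
```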
@rbpasker

I reported this.

Since I couldn't reorder the columns in the 6MM-row CSV file, I reordered the columns in the CREATE TABLE statement to match the order of the columns in the CSV (see below).

The alternative solution would be to re-order the columns in the script that creates the bulk data files to match the order in the DDL:

docket_fields='(id, date_created, date_modified, source, appeal_from_str,

CREATE TABLE dockets (
id integer,
date_created timestamp with time zone,
date_modified timestamp with time zone,
source smallint,
appeal_from_str text,
assigned_to_str text,
referred_to_str text,
panel_str text,
date_last_index timestamp with time zone,
date_cert_granted date,
date_cert_denied date,
date_argued date,
date_reargued date,
date_reargument_denied date,
date_filed date,
date_terminated date,
date_last_filing date,
case_name_short text,
case_name text,
case_name_full text,
slug character varying(75),
docket_number text,
docket_number_core character varying(20),
pacer_case_id character varying(100),
nature_of_suit character varying(1000),
cause character varying(2000),
jury_demand character varying(500),
jurisdiction_type character varying(100),
appellate_fee_status text,
appellate_case_type_information text,
mdl_status character varying(100),
filepath_local character varying(1000),
filepath_ia character varying(1000),
filepath_ia_json character varying(1000),
ia_upload_failure_count smallint,
ia_needs_upload boolean,
ia_date_first_change timestamp with time zone,
view_count integer,
date_blocked date,
blocked boolean,
appeal_from_id character varying(15),
assigned_to_id integer,
court_id character varying(15),
idb_data_id integer,
originating_court_information_id integer,
referred_to_id integer
);
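Rather than hand-reordering the DDL, the CSV's own header row can drive the column list. A minimal sketch, assuming the bulk CSV carries a header row (the `copy_statement` helper and the four-column sample are hypothetical, for illustration):

```python
import csv
import io


def copy_statement(table, csv_file):
    """Build a COPY statement whose column list matches the order of
    the CSV's header row, so fields map to the right columns."""
    header = next(csv.reader(csv_file))
    return "COPY {} ({}) FROM STDIN WITH (FORMAT csv, HEADER true)".format(
        table, ", ".join(header)
    )


# Illustrative header in the order the file actually uses:
sample = io.StringIO("id,date_created,date_modified,source\n")
print(copy_statement("dockets", sample))
```

The generated statement can then be fed to `psql` or a driver, leaving the canonical CREATE TABLE untouched.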

@rbpasker

Maybe add a COPY FROM script, restore_bulk_data.sh, with the proper column order?

https://www.postgresql.org/docs/current/sql-copy.html

If a column list is specified, COPY TO copies only the data in the specified columns to the file. For COPY FROM, each field in the file is inserted, in order, into the specified column. Table columns not specified in the COPY FROM column list will receive their default values.
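As the quoted docs note, a column list lets COPY FROM map each file field to the named column in order. Failing that, the CSV itself can be rewritten into schema order; a minimal sketch assuming a header row (`reorder_csv` and the two-column file are hypothetical):

```python
import csv
import io


def reorder_csv(src, dest, target_order):
    """Rewrite a CSV so its columns follow target_order; every name in
    target_order must appear in the source header."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dest, fieldnames=target_order)
    writer.writeheader()
    for row in reader:
        writer.writerow({name: row[name] for name in target_order})


src = io.StringIO("source,id\n9,10838944\n")
out = io.StringIO()
reorder_csv(src, out, ["id", "source"])
print(out.getvalue())
```

For a 6MM-row file this streams row by row, so memory use stays flat; the COPY column-list approach is still cheaper since it avoids rewriting the file at all.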

@mlissner
Member Author

Actually, wouldn't load_bulk_data.sh (in the bulk directory) do this?

@rbpasker

rbpasker commented Jul 25, 2024

Ok, well that would do it.

I followed these instructions:

https://www.courtlistener.com/help/api/bulk-data/#formats

I guess I missed the fact that it lists the column names. This could be cleared up with better documentation.

@mlissner
Member Author

That's generally good, so it sounds like the real issue here is a documentation issue.
