
Order of bulk data csv fields makes it difficult (impossible?) to import #4241

Closed
mlissner opened this issue Jul 25, 2024 · 5 comments

Comments

@mlissner
Member

Lots of people are using the bulk data, so I'm not sure how this hasn't been fixed, but one of our users is reporting that our CSVs cannot be imported using the schema we provide because the columns in the CSVs are in a different order than the schema.

I think we should fix that.

I imagine a related issue will be that the schema has all of our columns while the CSV has a subset.

I'm not sure of the fix for this. Maybe we need to provide better schemas? Maybe we need to re-order the CSV fields properly?

The error they get is:

 Conversion Error: CSV Error on Line: 2

Original Line: "10838944","2019-01-21 09:17:15.707272+00","2022-02-04 21:50:01.416445+00","9","","Susan Illston","","","2021-01-21 17:13:19.363539+00",,,,,,"1998-09-03","1999-09-14","1999-09-14","","Advent Software Inc. v. Stratum Business","","advent-software-inc-v-stratum-business","3:98-cv-03398","9803398","119636","","840 Trademark","","Federal question","","","","","","",,"t","2021-01-21 17:13:19.309558+00","0",,"f",,"1588","cand","19579836",,

Error when converting column "source". Could not convert string "9" to 'DATE'
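The failure is positional: COPY maps the file's fields to table columns by position, not by name, so when the orders differ a value like "9" (the source field) lands under a column expecting a DATE. A toy illustration of the mismatch (the three-column orders here are hypothetical, not the real schema):

```python
# Toy illustration of a positional mismatch between schema order and CSV order.
schema_order = ["id", "date_filed", "source"]  # hypothetical DDL column order
csv_order = ["id", "source", "date_filed"]     # hypothetical file field order
row = ["10838944", "9", "1998-09-03"]          # one row, in CSV field order

# COPY assigns fields to columns by position, not by name:
assigned = dict(zip(schema_order, row))
print(assigned["date_filed"])  # "9" ends up where a DATE is expected
```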
@rbpasker

I reported this.

Since I couldn't reorder the columns in the 6MM-row CSV file, I reordered the columns in the CREATE TABLE statement to match the order of the columns in the CSV (see below).

The alternative solution would be to re-order the columns in the script that creates the bulk data files to match the order in the DDL:

docket_fields='(id, date_created, date_modified, source, appeal_from_str,

CREATE TABLE dockets (
id integer,
date_created timestamp with time zone,
date_modified timestamp with time zone,
source smallint,
appeal_from_str text,
assigned_to_str text,
referred_to_str text,
panel_str text,
date_last_index timestamp with time zone,
date_cert_granted date,
date_cert_denied date,
date_argued date,
date_reargued date,
date_reargument_denied date,
date_filed date,
date_terminated date,
date_last_filing date,
case_name_short text,
case_name text,
case_name_full text,
slug character varying(75),
docket_number text,
docket_number_core character varying(20),
pacer_case_id character varying(100),
nature_of_suit character varying(1000),
cause character varying(2000),
jury_demand character varying(500),
jurisdiction_type character varying(100),
appellate_fee_status text,
appellate_case_type_information text,
mdl_status character varying(100),
filepath_local character varying(1000),
filepath_ia character varying(1000),
filepath_ia_json character varying(1000),
ia_upload_failure_count smallint,
ia_needs_upload boolean,
ia_date_first_change timestamp with time zone,
view_count integer,
date_blocked date,
blocked boolean,
appeal_from_id character varying(15),
assigned_to_id integer,
court_id character varying(15),
idb_data_id integer,
originating_court_information_id integer,
referred_to_id integer
);
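Rather than hand-reordering the DDL, the CSV's own header row can drive the column list. A minimal sketch, assuming the bulk CSV carries a header row (the `copy_statement` helper and the four-column sample are hypothetical, for illustration):

```python
import csv
import io


def copy_statement(table, csv_file):
    """Build a COPY statement whose column list matches the order of
    the CSV's header row, so fields map to the right columns."""
    header = next(csv.reader(csv_file))
    return "COPY {} ({}) FROM STDIN WITH (FORMAT csv, HEADER true)".format(
        table, ", ".join(header)
    )


# Illustrative header in the order the file actually uses:
sample = io.StringIO("id,date_created,date_modified,source\n")
print(copy_statement("dockets", sample))
```

The generated statement can then be fed to `psql` or a driver, leaving the canonical CREATE TABLE untouched.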

@rbpasker

Maybe add a COPY FROM script, restore_bulk_data.sh, with the proper column order?

https://www.postgresql.org/docs/current/sql-copy.html

If a column list is specified, COPY TO copies only the data in the specified columns to the file. For COPY FROM, each field in the file is inserted, in order, into the specified column. Table columns not specified in the COPY FROM column list will receive their default values.
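As the quoted docs note, a column list lets COPY FROM map each file field to the named column in order. Failing that, the CSV itself can be rewritten into schema order; a minimal sketch assuming a header row (`reorder_csv` and the two-column file are hypothetical):

```python
import csv
import io


def reorder_csv(src, dest, target_order):
    """Rewrite a CSV so its columns follow target_order; every name in
    target_order must appear in the source header."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dest, fieldnames=target_order)
    writer.writeheader()
    for row in reader:
        writer.writerow({name: row[name] for name in target_order})


src = io.StringIO("source,id\n9,10838944\n")
out = io.StringIO()
reorder_csv(src, out, ["id", "source"])
print(out.getvalue())
```

For a 6MM-row file this streams row by row, so memory use stays flat; the COPY column-list approach is still cheaper since it avoids rewriting the file at all.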

@mlissner
Member Author

Actually, wouldn't load_bulk_data.sh (in the bulk directory) do this?

@rbpasker

rbpasker commented Jul 25, 2024

Ok, well that would do it.

I followed these instructions:

https://www.courtlistener.com/help/api/bulk-data/#formats

I guess I missed the fact that it lists the column names. This could be cleared up with better documentation.

@mlissner
Member Author

That's generally good, so it sounds like the real issue here is a documentation issue.
