
Tutorial: Using CSVkit

Amanda on Rajje edited this page Mar 8, 2019 · 2 revisions

Once you've installed CSVkit you probably want to use it. I already wrote up one walkthrough that drills down into MapPluto. Here's another:

The NYC DOB publishes regular updates on complaints they've handled. The data on their site is pure PDF, but the city's data portal has the data as a CSV. Go ahead and download it. I did.

I was going to use it for a pivot table exercise, but I more or less crashed Calc trying to load all 30MB of it. So I'm examining it with CSVkit. First I want to know what columns it contains:

amanda@mona:~$ csvcut -n /tmp/DOB_Complaints_Received.csv
  1: Complaint Number
  2: Status
  3: Date Entered
  4: House Number
  5: House Street
  6: BIN
  7: Community Board
  8: Special District
  9: Complaint Category
 10: Unit
 11: Disposition Date
 12: Disposition Code
 13: Inspection Date
 14: DOBRunDate
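Under the hood, `csvcut -n` just reads the header row and numbers the fields. If csvkit isn't handy you can approximate it with awk; a minimal sketch on a made-up header file (the real file and column list are above — note awk's naive comma split will break on quoted fields, which is exactly why csvcut is safer):

```shell
# A made-up stand-in header, not the real DOB file
printf 'Complaint Number,Status,Date Entered\n' > /tmp/sample.csv

# Number the header fields, roughly what `csvcut -n` prints
head -n1 /tmp/sample.csv | awk -F',' '{ for (i = 1; i <= NF; i++) printf "%3d: %s\n", i, $i }'
```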

I'm primarily interested in open complaints, but I don't know how the DOB data characterizes those. I'm also curious which complaint categories come up most often. So ...

amanda@mona:~$ csvstat -c 2,9 /tmp/DOB_Complaints_Received.csv
  2. Status
    <type 'unicode'>
    Nulls: False
    Values: ACTIVE, CLOSED
  9. Complaint Category
    <type 'unicode'>
    Nulls: False
    Unique values: 99
    5 most frequent values:
	    45:	47643
	    05:	32629
	    63:	13148
	    4B:	12592
	    59:	8326
    Max length: 2

Row count: 224448
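That "5 most frequent values" list is essentially a group-and-count, and you can get the same tally for a single column with sort and uniq. A sketch on a tiny made-up file (the column here stands in for Status, not the real 224,448-row data):

```shell
# Tiny made-up stand-in for the complaints file
printf 'num,status\n1,ACTIVE\n2,CLOSED\n3,ACTIVE\n4,ACTIVE\n' > /tmp/mini.csv

# Count each value in column 2, most frequent first --
# the same tally csvstat reports for a column
tail -n +2 /tmp/mini.csv | cut -d',' -f2 | sort | uniq -c | sort -rn
```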

So now I know I'm looking for "ACTIVE" in column 2. (I also know there aren't any NULL values, a.k.a. empty cells, and that every row is either "ACTIVE" or "CLOSED".) I could just run csvgrep -c 2 -m "ACTIVE" /tmp/DOB_Complaints_Received.csv to see only the open cases. I don't want to just dump that to stdout, though, so I'll redirect the output to a new file:

csvgrep -c 2 -m "ACTIVE" /tmp/DOB_Complaints_Received.csv > Desktop/DOB_complaints_active.csv
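What that csvgrep is doing — keep the header, keep rows whose second field matches — can be approximated in awk if you want to see the mechanics. A sketch on made-up data (again, awk's comma split is naive about quoted fields, so prefer csvgrep on real CSVs):

```shell
# Made-up miniature of the complaints file
printf 'num,status\n1,ACTIVE\n2,CLOSED\n3,ACTIVE\n' > /tmp/mini2.csv

# Keep the header plus rows where field 2 is exactly ACTIVE,
# roughly what `csvgrep -c 2 -m "ACTIVE"` does
awk -F',' 'NR == 1 || $2 == "ACTIVE"' /tmp/mini2.csv > /tmp/mini2_active.csv
cat /tmp/mini2_active.csv
```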

I have to look up the complaint codes to know what they mean, but I can do that. When I ran csvstat on my unfiltered data, these were my top values:

Code  Complaint Count  Complaint
45    47643            ILLEGAL CONVERSION
05    32629            PERMIT - NONE (BUILDING/ PA/ DEMO ETC.)
63    13148            ELEVATOR - DEFECTIVE / INOPERATIVE
4B    12592            SEP - PROFESSIONAL CERTIFICATION COMPLIANCE AUDIT
59    8326             ELECTRICAL WIRING – DEFECTIVE / EXPOSED, IN PROGRESS
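Looking codes up one at a time gets old. If you keep a small two-column code-to-description file, you can bolt the descriptions onto your counts with join (csvkit's own csvjoin does the same job for CSVs). A sketch with made-up file names and only two codes; both files must be sorted on the join key:

```shell
# Made-up lookup and counts files, sorted by code
printf '05,PERMIT - NONE\n45,ILLEGAL CONVERSION\n' > /tmp/codes.csv
printf '05,32629\n45,47643\n' > /tmp/counts.csv

# Join the two files on their first (comma-separated) column
join -t',' /tmp/counts.csv /tmp/codes.csv
```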

After filtering out the closed complaints, I can run csvstat again:

amanda@mona:~$ csvstat -c 9 Desktop/DOB_complaints_active.csv
  9. Complaint Category
    <type 'unicode'>
    Nulls: False
    Unique values: 86
    5 most frequent values:
	    45:	27875
	    05:	16393
	    63:	7301
	    4B:	7065
	    59:	3899
    Max length: 2

Row count: 113909

The number of complaints is a lot smaller, but the top complaints haven't changed. And my file is about half the size it was, which means I can actually open it in Calc (or Excel, if that's your jam) and do more with it.

Mike Stucka provided another CSVkit tutorial:

Picture this: Medicaid just dumped another big hunk of data, big enough to break Access and probably large enough to slow down folks who only want to work with their state's data. What are you going to do?

An easy fix is csvkit, which is a beautiful little hunk o' software that makes it easy to slice your data. So start by installing CSVkit.

Then, from the command-line: mv Medicaid-Hugely-Long-Filename-Really-Oh-My-God-Make-It-Stop13.txt medpuf13.txt

(If we lost you already, you might need some command-line basics first.)

We're going to use csvcut to figure out what this file is. We're working with tab-delimited, not comma-delimited, files, so we need the -t flag. And let's just have csvkit tell us what our columns are with the -n flag.

csvcut -t -n medpuf13.txt

Column 12 is the provider state. Take a quick look to see how that state is actually listed:

csvcut -t -c 12 medpuf13.txt    

Use Ctrl-C to stop it if you don't want to watch the whole thing.
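Instead of Ctrl-C-ing a full dump, you can pipe the column through head for a quick peek, or through sort -u to list each distinct value exactly once. A sketch on a made-up tab-delimited file (cut splits on tabs by default, which is handy here):

```shell
# Made-up tab-delimited stand-in for medpuf13.txt
printf 'id\tstate\n1\tFR\n2\tNY\n3\tFR\n' > /tmp/med.txt

# Peek at the first few values of column 2
cut -f2 /tmp/med.txt | head -n 3

# Or list each distinct state once, skipping the header row
tail -n +2 /tmp/med.txt | cut -f2 | sort -u
```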

It's an abbreviation, such as FR for the American state of Freedonia. So let's just grab all of the one state's stuff, telling it to look in our column of interest for our matching pattern of interest in our file of interest, then dump the stuff to a new file. It's one command:

csvgrep -t -c 12 -m "FR" medpuf13.txt > medpuf13freedonia.txt

It's going to take a number of minutes to process the original file, which is around 2GB, before bringing you back to the command line. A dump of a state might run about 8MB per 1 million residents. Be wary of providers with offices in adjoining states.
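It's worth a quick sanity check on the output before loading it anywhere: wc -l gives you a row count to compare against expectations. A sketch of the whole filter-and-count, on a made-up file (the awk line plays the role of the csvgrep above; remember the count includes the header row):

```shell
# Made-up tab-delimited stand-in for the Medicaid file
printf 'id\tstate\n1\tFR\n2\tNY\n3\tFR\n' > /tmp/med2.txt

# Keep the header plus rows where column 2 is FR,
# roughly what `csvgrep -t -c 12 -m "FR"` does
awk -F'\t' 'NR == 1 || $2 == "FR"' /tmp/med2.txt > /tmp/med2_fr.txt

# Sanity check: how many lines did we keep? (header included)
wc -l < /tmp/med2_fr.txt
```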

But you just reduced your file to a manageable size.