Tutorial: Using CSVkit
Once you've installed CSVkit, you probably want to use it. I already wrote up one walkthrough that drills down into MapPluto. Here's another:
The NYC DOB publishes regular updates on complaints they've handled. The data on their site is pure PDF, but the city's data portal has the data as a CSV. Go ahead and download it. I did.
I was going to use it for a pivot table exercise, but I more or less crashed Calc trying to load all 30 MB of it. So I'm examining it with CSVkit. First I want to know what columns it contains:
amanda@mona:~$ csvcut -n /tmp/DOB_Complaints_Received.csv
1: Complaint Number
2: Status
3: Date Entered
4: House Number
5: House Street
6: BIN
7: Community Board
8: Special District
9: Complaint Category
10: Unit
11: Disposition Date
12: Disposition Code
13: Inspection Date
14: DOBRunDate
I'm primarily interested in open complaints, but I don't know how the DOB data characterizes those. I'm also curious about the complaint categories. So ...
amanda@mona:~$ csvstat -c 2,9 /tmp/DOB_Complaints_Received.csv
2. Status
<type 'unicode'>
Nulls: False
Values: ACTIVE, CLOSED
9. Complaint Category
<type 'unicode'>
Nulls: False
Unique values: 99
5 most frequent values:
45: 47643
05: 32629
63: 13148
4B: 12592
59: 8326
Max length: 2
Row count: 224448
So now I know I'm looking for "ACTIVE" in column 2. (I know that there aren't any NULL values, aka empty cells, and that every row is either "ACTIVE" or "CLOSED".) I could just do
csvgrep -c 2 -m "ACTIVE" /tmp/DOB_Complaints_Received.csv
to see only the open cases. I don't just want to dump that to stdout, though, so I'll redirect the output to a new file:
csvgrep -c 2 -m "ACTIVE" /tmp/DOB_Complaints_Received.csv > Desktop/DOB_complaints_active.csv
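If you want to see what that filter-and-redirect pattern is doing under the hood, here's a minimal stand-in using awk on a toy CSV (toy data and paths of my own invention, not the real DOB file):

```shell
# Toy stand-in for the DOB file: a header plus three rows.
printf 'Complaint Number,Status,Date Entered\n1,ACTIVE,2020-01-01\n2,CLOSED,2020-01-02\n3,ACTIVE,2020-01-03\n' > /tmp/toy.csv

# Keep the header row, then keep rows whose 2nd field is ACTIVE --
# the same idea as: csvgrep -c 2 -m "ACTIVE" toy.csv > toy_active.csv
awk -F, 'NR==1 || $2=="ACTIVE"' /tmp/toy.csv > /tmp/toy_active.csv

cat /tmp/toy_active.csv
```

One difference worth knowing: csvgrep is CSV-aware (it handles quoted fields with embedded commas), while the awk one-liner is not, so treat it as a sketch of the concept rather than a replacement.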
I have to look up the complaint codes to know what they mean, but I can do that. When I ran csvstat on my unfiltered data, these were my top values:
Code | Complaint Count | Complaint |
---|---|---|
45 | 47643 | ILLEGAL CONVERSION |
05 | 32629 | PERMIT - NONE (BUILDING/ PA/ DEMO ETC.) |
63 | 13148 | ELEVATOR - DEFECTIVE / INOPERATIVE |
4B | 12592 | SEP - PROFESSIONAL CERTIFICATION COMPLIANCE AUDIT |
59 | 8326 | ELECTRICAL WIRING – DEFECTIVE / EXPOSED , IN PROGRESS |
After filtering out the closed complaints, I can run csvstat again:
amanda@mona:~$ csvstat -c 9 Desktop/DOB_complaints_active.csv
9. Complaint Category
<type 'unicode'>
Nulls: False
Unique values: 86
5 most frequent values:
45: 27875
05: 16393
63: 7301
4B: 7065
59: 3899
Max length: 2
Row count: 113909
The number of complaints is a lot smaller, but the top complaints haven't changed. And my file is about half the size it was, which means I can actually open it in Calc (or Excel, if that's your jam) and do more with it.
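To put a number on "about half": the filtered file keeps 113,909 of the original 224,448 rows, which a one-line awk arithmetic check confirms is just over 50 percent:

```shell
# Share of DOB complaints that are still ACTIVE, from the two
# csvstat row counts above (113909 active of 224448 total).
awk 'BEGIN { printf "%.1f%%\n", 100 * 113909 / 224448 }'
# prints 50.8%
```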
Picture this: Medicaid just dumped another big hunk of data, big enough to break Access and probably large enough to slow down folks who only want to work with their own state's data. What are you going to do?
An easy fix is csvkit, a beautiful little hunk o' software that makes it easy to slice your data. So start by installing CSVkit.
Then, from the command-line: mv Medicaid-Hugely-Long-Filename-Really-Oh-My-God-Make-It-Stop13.txt medpuf13.txt
(If we lost you already, you might need some command line basics.)
We're going to use csvcut to figure out what this file is. We're working with tab-delimited, not comma-delimited, files, so we need the -t flag. And let's just have csvkit tell us what our columns are with the -n flag.
csvcut -t -n medpuf13.txt
Column 12 is the provider state. Take a quick look to see how that state is actually listed:
csvcut -t -c 12 medpuf13.txt
Use Ctrl-C to stop it if you don't want to watch the whole thing.
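If you'd rather tally the states than watch them scroll by, the same column peek can be sketched with standard tools. This uses a tiny tab-delimited toy file of my own invention (three columns instead of the real file's dozens, with the state in column 3 rather than 12):

```shell
# Toy tab-delimited stand-in for the Medicaid dump.
printf 'npi\tname\tstate\nA1\tDr X\tFR\nA2\tDr Y\tFR\nA3\tDr Z\tNY\n' > /tmp/toy.tsv

# Peek at the first few values in the state column --
# same idea as: csvcut -t -c 12 medpuf13.txt | head
cut -f3 /tmp/toy.tsv | head -n 5

# Or skip the header and count rows per state:
tail -n +2 /tmp/toy.tsv | cut -f3 | sort | uniq -c
```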
It's an abbreviation, such as FR for the American state of Freedonia. So let's grab all of one state's stuff, telling it to look in our column of interest for our matching pattern of interest in our file of interest, then dump the results to a new file. It's one command:
csvgrep -t -c 12 -m "FR" medpuf13.txt > medpuf13freedonia.txt
It's going to take a few minutes to process the original file, which is around 2 GB, before bringing you back to the command line. A one-state dump might run about 8 MB per million residents. Be wary of providers with offices in adjoining states.
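To sanity-check the result before opening it, the same filter can be rehearsed on toy data (a made-up three-column file, with the state in column 3 rather than the real file's column 12), then the before-and-after row counts compared:

```shell
# Toy version of filtering one state out of a tab-delimited dump.
printf 'npi\tname\tstate\nA1\tDr X\tFR\nA2\tDr Y\tNY\nA3\tDr Z\tFR\n' > /tmp/med_toy.txt

# Keep the header plus rows whose state field is FR --
# the same shape as: csvgrep -t -c 12 -m "FR" medpuf13.txt > medpuf13freedonia.txt
awk -F'\t' 'NR==1 || $3=="FR"' /tmp/med_toy.txt > /tmp/med_toy_fr.txt

# Compare row counts before and after the filter:
wc -l /tmp/med_toy.txt /tmp/med_toy_fr.txt
```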
But you just reduced your file to a manageable size.