-
Notifications
You must be signed in to change notification settings - Fork 0
Tutorial
Author: Rodrigo Botafogo
MDArray-jCSV (jCSV for short) is the first and only (as far as I know) multidimensional CSV reader. Multidimensional? Yes... jCSV can read multidimensional data, also known sometimes as "panel data".
From Wikipedia: “In statistics and econometrics, the term panel data refers to multi-dimensional data frequently involving measurements over time. Panel data contain observations of multiple phenomena obtained over multiple time periods for the same firms or individuals. In biostatistics, the term longitudinal data is often used instead, wherein a subject or cluster constitutes a panel member or individual in a longitudinal study.” jCSV makes this definition a bit less strict as it can read observations of multiple phenomena obtained over multiple time periods for multiple firms or individuals.
Other than reading panel data, jCSV is also a very powerful and feature packed CSV reader. The CSV file format is a common format for data exchange between diverse applications. It is widely used; however, suprisingly, there aren't that many good libraries for CSV reading and writing. In Ruby there are a couple of well known libraries to accomplish this task. First, there is the standard Ruby CSV that comes with any Ruby implementation. This library according to Smarter CSV (https://github.com/tilo/smarter_csv) has the following limitations:
"Ruby's CSV library's API is pretty old, and it's processing of CSV-files returning Arrays of Arrays feels 'very close to the metal'. The output is not easy to use - especially not if you want to create database records from it. Another shortcoming is that Ruby's CSV library does not have good support for huge CSV-files, e.g. there is no support for 'chunking' and/or parallel processing of the CSV-content (e.g. with Resque or Sidekiq).
In order to eliminate those restrictions, smarter_csv was developed. Although it does remove those restrictions it removes support for Arrays of Arrays. Altough such format is really 'very close to metal' in some cases this is actually what is needed. This format is less memory intensive than the 'hash' approach from smarter_csv and it might make it easier to put the date in a simple table. When reading scientific data, such as an matrix or multidimensional array, it might also be better to remove headers and informational columns and read the actual data as just a plain array.
jCSV was developed to be the "ultimate" CSV reader (and soon writer). It tries to merge all the good features of standard Ruby CSV library, smarter_csv, and other CSV libraries from other languages. jCSV is based on Super CSV (http://super-csv.github.io/super-csv/index.html), a java CSV library. According to Super CSV web page its motivation is "for Super CSV is to be the foremost, fastest, and most programmer-friendly, free CSV package for Java". jCSV motivation is to bring this view to the Ruby world, and since we are in Ruby, make it even easier and more programmer-friendly.
jCSV reading features are:
-
Reads data as lists (Array of Arrays);
-
Reads data as maps (Array of hashes);
-
Reads multidimensional (panel) data to lists or hashes;
-
Reads multidimensional data to vectors, i.e., a multidimensional array (MDArray);
-
When reading panel data, use dimensions as keys, allowing random access to any row in the data by use of the key. For instance, if first_name, last_name are dimensions, then one can access data by doing data[“John.Smith”];
-
Read panel data with the ‘critbit’ reader which automagically sorts keys and allows for prefix retrieval of data, i.e., doing data.each(“D”) { } will retrieve all names starting with “D” and give it to the block;
-
When reading panel data, organize data as maps of maps (deep_map);
-
Able to read files with headers or no-headers;
-
When the file has no-headers, allow the user to provide headers so that reading can be done either as array of arrays, array of hashes, or multidimensional with keys;
-
Able to process large CSV-files;
-
Able to chunk the input from the CSV file to avoid loading the whole CSV file into memory;
-
Able to treat the file as an enumerator, so that reading more data can be done at any time during the script execution, it can be stopped and restarted at any time;
-
Able to pass a block to the read method, so data from the CSV file can be directly processed (e.g. Resque.enqueue )
-
Allows a bit more flexible input format, where comments are possible, and col_sep, row_sep can be set to any character sequence, including control characters;
-
Able to re-map CSV "column names" to Hash-keys of your choice (normalization);
-
Able to ignore "columns" in the input (delete columns);
-
Able to change columns´ order, when reading to an Array of Arrays;
-
Provide dozens of filters/validators for the data;
-
Filters can be chained allowing for complex data manipulation. For instance, suppose one column can have empty values or dollar values. If it is a dollar values, then it should be a float. Consider that the data is stored using a Brazilian locale format, i.e., decimal separator is ‘,’ and grouping is ‘.’ (the reverse of US locale). Suppose also that the value should be in the range of US$ 1.000,00 and US$ 2.000,00 and finally suppose that we actually want to see this data not as dollar amounts but as Brazilian Reais, converted with the day´s current rate.
Then this sequence of filters should do it:Jcsv.optional >> Jcsv.float(locale: Brazil) >> Jcsv.in_range(1000, 2000) >> Jcsv.dynamic { |value| rate * value }
-
Date can be parsed by any of Ruby DateTime formats: httpdate, iso8601, jd, etc.;
-
Can filter data by any of the Ruby String methods: :[], :reverse, :gsub, :prepend, etc.
In this section we will read the following 'customer.csv' file. Some things should be observed in the records of this data:
-
It has a header;
-
4 rows of data, all with 10 columns;
-
Records can have line break in them;
-
The mailingAddress column contains data that spans multiple lines
-
The favouriteQuote column contains data with escaped quotes.
customerNo,firstName,lastName,birthDate,mailingAddress,married,numberOfKids,favouriteQuote,email,loyaltyPoints 1,John,Dunbar,13/06/1945,"1600 Amphitheatre Parkway Mountain View, CA 94043 United States",,,"""May the Force be with you."" - Star Wars",jdunbar@gmail.com,0
2,Bob,Down,25/02/1919,"1601 Willow Rd. Menlo Park, CA 94025 United States",Y,0,"""Frankly, my dear, I don't give a damn."" - Gone With The Wind",bobdown@hotmail.com,123456
3,Alice,Wunderland,08/08/1985,"One Microsoft Way Redmond, WA 98052-6399 United States",Y,0,"""Play it, Sam. Play ""As Time Goes By."""" - Casablanca",throughthelookingglass@yahoo.com,2255887799
4,Bill,Jobs,10/07/1973,"2701 San Tomas Expressway Santa Clara, CA 95050 United States",Y,3,"""You've got to ask yourself one question: ""Do I feel lucky?"" Well, do ya, punk?"" - Dirty Harry",billy34@hotmail.com,36
The simplest way of reading a csv file is as a list or an array of arrays.
Reading in this way is a simple call to Jcsv.reader with the filename, in
this case the file is called 'customer.csv'. In the next
examples we will parse a CSV file and show all the features and options that can be
applied for changing the parsing. The file we are reading has headers.
Headers are converted from string to symbol.
require 'jcsv'
require 'pp' # only needed for pretty printing
# Create a new reader by passing the filename to be parsed
reader = Jcsv.reader("../data/customer.csv")
# now read the whole csv file and stores it in the 'content' variable
content = reader.read
When reading a file with headers, the 'headers' instance variable from reader has the headers read from the file. Headers are converted to symbols:
> p reader.headers
[:customer_no, :first_name, :last_name, :birth_date, :mailing_address, :married, :number_of_kids, :favourite_quote, :email, :loyalty_points]
We now take a look at the content of the file and should note that line breaks were read as \n and quotes are properly escaped with ". The 'content' variable has an array of arrays and each line is an array.
> content.each do |row|
+ p row
+ end
["1", "John", "Dunbar", "13/06/1945", "1600 Amphitheatre Parkway\nMountain View, CA 94043\nUnited States", nil, nil, "\"May the Force be with you.\" - Star Wars", "jdunbar@gmail.com", "0"]
["2", "Bob", "Down", "25/02/1919", "1601 Willow Rd.\nMenlo Park, CA 94025\nUnited States", "Y", "0", "\"Frankly, my dear, I don't give a damn.\" - Gone With The Wind", "bobdown@hotmail.com", "123456"]
["3", "Alice", "Wunderland", "08/08/1985", "One Microsoft Way\nRedmond, WA 98052-6399\nUnited States", "Y", "0", "\"Play it, Sam. Play \"As Time Goes By.\"\" - Casablanca", "throughthelookingglass@yahoo.com", "2255887799"]
["4", "Bill", "Jobs", "10/07/1973", "2701 San Tomas Expressway\nSanta Clara, CA 95050\nUnited States", "Y", "3", "\"You've got to ask yourself one question: \"Do I feel lucky?\" Well, do ya, punk?\" - Dirty Harry", "billy34@hotmail.com", "36"]
Options can be passed to method reader, changing the behavior of the reader. 'strings_as_key' when true will not convert headers to symbol. Note also, that headers are read imediately by method reader without the need to call any other methods:
> reader = Jcsv.reader("../data/customer.csv", strings_as_keys: true)
+ p reader.headers
["customerNo", "firstName", "lastName", "birthDate", "mailingAddress", "married", "numberOfKids", "favouriteQuote", "email", "loyaltyPoints"]
One very interesting feature of Ruby CSV libraries is the ability to give a block to the parser and process the data as it is being read. In jCSV this can also be accomplished:
# read lines and pass them to a block for processing. The block receives line_no (last line
# of the record),
# row_no, row and the headers.
# Read file 'customer.csv'. File has headers (this is the default) and we keep the keys as string
reader = Jcsv.reader("../data/customer.csv", headers: true, strings_as_keys: true)
reader.read do |line_no, row_no, row, headers|
puts "line number: #{line_no}, row number: #{row_no}"
headers.each_with_index do |head, i|
puts "#{head}: #{row[i]}"
end
puts
end
line number: 4, row number: 2
customerNo: 1
firstName: John
lastName: Dunbar
birthDate: 13/06/1945
mailingAddress: 1600 Amphitheatre Parkway
Mountain View, CA 94043
United States
married:
numberOfKids:
favouriteQuote: "May the Force be with you." - Star Wars
email: jdunbar@gmail.com
loyaltyPoints: 0
line number: 7, row number: 3
customerNo: 2
firstName: Bob
lastName: Down
birthDate: 25/02/1919
mailingAddress: 1601 Willow Rd.
Menlo Park, CA 94025
United States
married: Y
numberOfKids: 0
favouriteQuote: "Frankly, my dear, I don't give a damn." - Gone With The Wind
email: bobdown@hotmail.com
loyaltyPoints: 123456
line number: 10, row number: 4
customerNo: 3
firstName: Alice
lastName: Wunderland
birthDate: 08/08/1985
mailingAddress: One Microsoft Way
Redmond, WA 98052-6399
United States
married: Y
numberOfKids: 0
favouriteQuote: "Play it, Sam. Play "As Time Goes By."" - Casablanca
email: throughthelookingglass@yahoo.com
loyaltyPoints: 2255887799
line number: 13, row number: 5
customerNo: 4
firstName: Bill
lastName: Jobs
birthDate: 10/07/1973
mailingAddress: 2701 San Tomas Expressway
Santa Clara, CA 95050
United States
married: Y
numberOfKids: 3
favouriteQuote: "You've got to ask yourself one question: "Do I feel lucky?" Well, do ya, punk?" - Dirty Harry
email: billy34@hotmail.com
loyaltyPoints: 36
A powerful feature of jCSV is the ability to filter and transform data cells. In the next example we define a default_filter for every cell in the dataset: 'default_filter: Jcsv.not_nil'. With this filter, no cell can be nil. Looking back at our data, we can see that row number 2 has an empty field for 'married' and 'number of kids". So, we should expect this reading to fail.
> parser = Jcsv.reader("../data/customer.csv", default_filter: Jcsv.not_nil)
+ parser.read
Jcsv::ConstraintViolation: Empty value found:
{lineNo=4, rowNo=2, columnNo=6, rowSource=[1, John, Dunbar, 13/06/1945, 1600 Amphitheatre Parkway
Mountain View, CA 94043
United States, null, null, "May the Force be with you." - Star Wars, jdunbar@gmail.com, 0]}
As we can see, this parsing dies on record number 2 with a Constraint violation, since we have a default filter of not_nil and in this record two fields are nil (empty). In order to properly handle this issue, we will add filters to our parser. First we make :number_of_kids and :married as optional; however, if this field is filled, then :number_of_kids should be and integer and :married should be a boolean. In order to add this filters, we chain them with: Jcsv.optional >> Jcsv.int and Jcsv.optional >> Jcsv.bool:
parser = Jcsv.reader("../data/customer.csv", default_filter: Jcsv.not_nil)
# Add filters, so that we get 'objects' instead of strings for filtered fields
parser.filters = {:number_of_kids => Jcsv.optional >> Jcsv.int,
:married => Jcsv.optional >> Jcsv.bool,
:customer_no => Jcsv.int,
:birth_date => Jcsv.date("dd/MM/yyyy")}
content = parser.read
Let's take a look at the second record. We can see that :customer_no is an integer, in this case 2, since we have selected the second record, :married is now true and the :number_of_kids is the integer 0.
> pp content[1]
[2,
"Bob",
"Down",
#<DateTime: 1919-02-25T00:00:00+00:00 ((2422015j,0s,0n),+0s,2299161j)>,
"1601 Willow Rd.\nMenlo Park, CA 94025\nUnited States",
true,
0,
"\"Frankly, my dear, I don't give a damn.\" - Gone With The Wind",
"bobdown@hotmail.com",
"123456"]
As with super_csv, jCSV also supports data chunking, by passing as argument the chunk_size.
# Read chunks of the file. In this case, we are breaking the file in chunks of 2
reader = Jcsv.reader("../data/customer.csv", chunk_size: 2)
# Add filters, so that we get 'objects' instead of strings for filtered fields
reader.filters = {:number_of_kids => Jcsv.optional >> Jcsv.int,
:married => Jcsv.optional >> Jcsv.bool,
:customer_no => Jcsv.int}
content = reader.read
> pp content[0]
[[1,
"John",
"Dunbar",
"13/06/1945",
"1600 Amphitheatre Parkway\nMountain View, CA 94043\nUnited States",
nil,
nil,
"\"May the Force be with you.\" - Star Wars",
"jdunbar@gmail.com",
"0"],
[2,
"Bob",
"Down",
"25/02/1919",
"1601 Willow Rd.\nMenlo Park, CA 94025\nUnited States",
true,
0,
"\"Frankly, my dear, I don't give a damn.\" - Gone With The Wind",
"bobdown@hotmail.com",
"123456"]]
Note now that content[0] is an array of array. The external array is of size 2, the chunk size and each sub array is a row. Note also, that the filters have properly converted strings into objects. Remember that method read can receive a block. Let's show an example with chunk_size of 3. In this example, the first chunk will be of size 3 and the second of size 1, since there are no more records after that. When reading chunks, in the given blocks, 'line_no' and 'row_no' are the last line and row read respectively.
# Read chunks of the file. In this case, we are breaking the file in chunks of 3
reader = Jcsv.reader("../data/customer.csv", chunk_size: 3)
# Add filters, so that we get 'objects' instead of strings for filtered fields
reader.filters = {:number_of_kids => Jcsv.optional >> Jcsv.int,
:married => Jcsv.optional >> Jcsv.bool,
:customer_no => Jcsv.int}
> reader.read do |line_no, row_no, chunk, headers|
+ puts "line number: #{line_no}, row number: #{row_no}"
+ pp chunk
+ puts
+ end
line number: 10, row number: 4
[[1,
"John",
"Dunbar",
"13/06/1945",
"1600 Amphitheatre Parkway\nMountain View, CA 94043\nUnited States",
nil,
nil,
"\"May the Force be with you.\" - Star Wars",
"jdunbar@gmail.com",
"0"],
[2,
"Bob",
"Down",
"25/02/1919",
"1601 Willow Rd.\nMenlo Park, CA 94025\nUnited States",
true,
0,
"\"Frankly, my dear, I don't give a damn.\" - Gone With The Wind",
"bobdown@hotmail.com",
"123456"],
[3,
"Alice",
"Wunderland",
"08/08/1985",
"One Microsoft Way\nRedmond, WA 98052-6399\nUnited States",
true,
0,
"\"Play it, Sam. Play \"As Time Goes By.\"\" - Casablanca",
"throughthelookingglass@yahoo.com",
"2255887799"]]
line number: 13, row number: 5
[[4,
"Bill",
"Jobs",
"10/07/1973",
"2701 San Tomas Expressway\nSanta Clara, CA 95050\nUnited States",
true,
3,
"\"You've got to ask yourself one question: \"Do I feel lucky?\" Well, do ya, punk?\" - Dirty Harry",
"billy34@hotmail.com",
"36"]]
Ruby has a very interesting feature called Enumerator. jCSV supports the use of enumerators, allowing for partial file read. Let's first give an example of using enumerators, and then we will show an example of partially reading a CSV file. In order to get an enumerator on the reader we call method each without any blocks:
reader = Jcsv.reader("../data/customer.csv", chunk_size: 2)
# Add filters, so that we get 'objects' instead of strings for filtered fields
reader.filters = {:number_of_kids => Jcsv.optional >> Jcsv.int,
:married => Jcsv.optional >> Jcsv.bool,
:customer_no => Jcsv.int}
# Method each without a block returns an enumerator
enum = reader.each
# read the first chunk. Chunk is of size 2
chunk = enum.next
The 'chunk' variable above is an array with the following elements: line_no, row_no, chunk data, and headers. In this example, at this point we have read only two records.
> pp chunk
[7,
3,
[[1,
"John",
"Dunbar",
"13/06/1945",
"1600 Amphitheatre Parkway\nMountain View, CA 94043\nUnited States",
nil,
nil,
"\"May the Force be with you.\" - Star Wars",
"jdunbar@gmail.com",
"0"],
[2,
"Bob",
"Down",
"25/02/1919",
"1601 Willow Rd.\nMenlo Park, CA 94025\nUnited States",
true,
0,
"\"Frankly, my dear, I don't give a damn.\" - Gone With The Wind",
"bobdown@hotmail.com",
"123456"]],
[:customer_no,
:first_name,
:last_name,
:birth_date,
:mailing_address,
:married,
:number_of_kids,
:favourite_quote,
:email,
:loyalty_points]]
In order to read the other two records we need to call method 'next' again:
> pp enum.next
[13,
5,
[[3,
"Alice",
"Wunderland",
"08/08/1985",
"One Microsoft Way\nRedmond, WA 98052-6399\nUnited States",
true,
0,
"\"Play it, Sam. Play \"As Time Goes By.\"\" - Casablanca",
"throughthelookingglass@yahoo.com",
"2255887799"],
[4,
"Bill",
"Jobs",
"10/07/1973",
"2701 San Tomas Expressway\nSanta Clara, CA 95050\nUnited States",
true,
3,
"\"You've got to ask yourself one question: \"Do I feel lucky?\" Well, do ya, punk?\" - Dirty Harry",
"billy34@hotmail.com",
"36"]],
[:customer_no,
:first_name,
:last_name,
:birth_date,
:mailing_address,
:married,
:number_of_kids,
:favourite_quote,
:email,
:loyalty_points]]
We now write a small script that will look for a record that has "Bob" on the :firstname.
When this happens, the script terminates and no more reading needs to be done. For large
CSV files, breaking reading when the required data is read is a very useful feature. Note
that we could do the same thing using a block and breaking out of the block; however, using
enumerator is more flexible than blocks as we could read part of the file, do some
processing, wait for user input, and then continue reading.
Remember that row is an array with line_no, row_no, row data, and headers. So, to get :firstname we need to read row[2][1].
reader = Jcsv.reader("../data/customer.csv")
# Add filters, so that we get 'objects' instead of strings for filtered fields
reader.filters = {:number_of_kids => Jcsv.optional >> Jcsv.int,
:married => Jcsv.optional >> Jcsv.bool,
:customer_no => Jcsv.int}
# Method each without a block returns an enumerator
enum = reader.each
begin
row = enum.next
end while row[2][1] != "Bob"
p row
[7, 3, [2, "Bob", "Down", "25/02/1919", "1601 Willow Rd.\nMenlo Park, CA 94025\nUnited States", true, 0, "\"Frankly, my dear, I don't give a damn.\" - Gone With The Wind", "bobdown@hotmail.com", "123456"], [:customer_no, :first_name, :last_name, :birth_date, :mailing_address, :married, :number_of_kids, :favourite_quote, :email, :loyalty_points]]
Sometimes, a CSV file contains columns that are of no interest, and thus, reading them just consumes memory without any benefit. jCSV allows skipping such columns, by defining a mapping. Bellow an example where the columns :customer_no, :mailing_address and :favourite_quote are not read:
reader = Jcsv.reader("../data/customer.csv")
# Add mapping. When column is mapped to false, it will not be retrieved from the
# file, improving time and speed efficiency
reader.mapping = {:customer_no => false, :mailing_address => false,
:favourite_quote => false}
Note that the headers in the block do not show any of the removed columns, although the reader.headers still has all headers.
> p reader.headers
[:customer_no, :first_name, :last_name, :birth_date, :mailing_address, :married, :number_of_kids, :favourite_quote, :email, :loyalty_points]
> reader.read do |line_no, row_no, row, headers|
+ p headers
+ p row
+ end
[:first_name, :last_name, :birth_date, :married, :number_of_kids, :email, :loyalty_points]
["John", "Dunbar", "13/06/1945", nil, nil, "jdunbar@gmail.com", "0"]
[:first_name, :last_name, :birth_date, :married, :number_of_kids, :email, :loyalty_points]
["Bob", "Down", "25/02/1919", "Y", "0", "bobdown@hotmail.com", "123456"]
[:first_name, :last_name, :birth_date, :married, :number_of_kids, :email, :loyalty_points]
["Alice", "Wunderland", "08/08/1985", "Y", "0", "throughthelookingglass@yahoo.com", "2255887799"]
[:first_name, :last_name, :birth_date, :married, :number_of_kids, :email, :loyalty_points]
["Bill", "Jobs", "10/07/1973", "Y", "3", "billy34@hotmail.com", "36"]
jCSV also allows for reordering the columns of the CSV file. This is also done through a mapping:
reader = Jcsv.reader("../data/customer.csv")
# Add filters...
reader.filters = {:number_of_kids => Jcsv.optional >> Jcsv.int,
:married => Jcsv.optional >> Jcsv.bool,
:customer_no => Jcsv.int}
# Mapping allows reordering of columns. In this example, column 0 (:customer_no)
# in the csv file will be loaded in position 2 (3rd column); column 1 (:first_name)
# in the csv file will be loaded in position 0 (1st column); column 2 on the csv file
# will not be loaded (false); column 4 (:birth_date) will be loaded on position 3,
# and so on.
# When reordering columns, care should be taken to get the mapping right or unexpected
# behaviour could result.
reader.mapping = {:customer_no => 2, :first_name => 0, :last_name => false,
:birth_date => 3, :mailing_address => false, :married => false,
:number_of_kids => false, :favourite_quote => false, :email => 1,
:loyalty_points => 4}
> reader.read do |line_no, row_no, row, headers|
+ p headers
+ p row
+ end
[:first_name, :email, :customer_no, :birth_date, :loyalty_points]
["John", "jdunbar@gmail.com", 1, "13/06/1945", "0"]
[:first_name, :email, :customer_no, :birth_date, :loyalty_points]
["Bob", "bobdown@hotmail.com", 2, "25/02/1919", "123456"]
[:first_name, :email, :customer_no, :birth_date, :loyalty_points]
["Alice", "throughthelookingglass@yahoo.com", 3, "08/08/1985", "2255887799"]
[:first_name, :email, :customer_no, :birth_date, :loyalty_points]
["Bill", "billy34@hotmail.com", 4, "10/07/1973", "36"]
In this section we show how to read data into an array of maps (hashes) instead as into an array of arrays. Reading to map is very easy and only requires passing one argument: 'format: :map'.
reader = Jcsv.reader("../data/customer.csv", format: :map)
# map is an array of hashes
map = reader.read
pp map
[{:customer_no=>"1",
:first_name=>"John",
:last_name=>"Dunbar",
:birth_date=>"13/06/1945",
:mailing_address=>
"1600 Amphitheatre Parkway\nMountain View, CA 94043\nUnited States",
:married=>nil,
:number_of_kids=>nil,
:favourite_quote=>"\"May the Force be with you.\" - Star Wars",
:email=>"jdunbar@gmail.com",
:loyalty_points=>"0"},
{:customer_no=>"2",
:first_name=>"Bob",
:last_name=>"Down",
:birth_date=>"25/02/1919",
:mailing_address=>"1601 Willow Rd.\nMenlo Park, CA 94025\nUnited States",
:married=>"Y",
:number_of_kids=>"0",
:favourite_quote=>
"\"Frankly, my dear, I don't give a damn.\" - Gone With The Wind",
:email=>"bobdown@hotmail.com",
:loyalty_points=>"123456"},
{:customer_no=>"3",
:first_name=>"Alice",
:last_name=>"Wunderland",
:birth_date=>"08/08/1985",
:mailing_address=>"One Microsoft Way\nRedmond, WA 98052-6399\nUnited States",
:married=>"Y",
:number_of_kids=>"0",
:favourite_quote=>"\"Play it, Sam. Play \"As Time Goes By.\"\" - Casablanca",
:email=>"throughthelookingglass@yahoo.com",
:loyalty_points=>"2255887799"},
{:customer_no=>"4",
:first_name=>"Bill",
:last_name=>"Jobs",
:birth_date=>"10/07/1973",
:mailing_address=>
"2701 San Tomas Expressway\nSanta Clara, CA 95050\nUnited States",
:married=>"Y",
:number_of_kids=>"3",
:favourite_quote=>
"\"You've got to ask yourself one question: \"Do I feel lucky?\" Well, do ya, punk?\" - Dirty Harry",
:email=>"billy34@hotmail.com",
:loyalty_points=>"36"}]
In order to get the :loyalty_points for the second customer we do:
> p map[1][:loyalty_points]
"123456"
Reading to maps support most of the same arguments as reading to lists. Bellow we read with chunk_size 2, and strings as key:
reader = Jcsv.reader("../data/customer.csv", format: :map, chunk_size: 2,
strings_as_keys: true)
map = reader.read
With chunk_size 2, we have in variable map two chunks, each of size 2. Let's take a look at the first element of the second chunk, i.e., the third row of data:
> pp map[1][0]
{"customerNo"=>"3",
"firstName"=>"Alice",
"lastName"=>"Wunderland",
"birthDate"=>"08/08/1985",
"mailingAddress"=>"One Microsoft Way\nRedmond, WA 98052-6399\nUnited States",
"married"=>"Y",
"numberOfKids"=>"0",
"favouriteQuote"=>"\"Play it, Sam. Play \"As Time Goes By.\"\" - Casablanca",
"email"=>"throughthelookingglass@yahoo.com",
"loyaltyPoints"=>"2255887799"}
Filters and mappings are also supported for maps. Note that we introduce some new filters: Jcsv.long and Jcsv.date. jCSV support filters Jcsv.int, Jcsv.long and Jcsv.double although 'int', 'long' and 'double' are not Ruby types. jCSV also support filters Jcsv.float and Jcsv.fixnum. A Jcsv.int filter will convert the data to a fixnum but it will raise and exception if the size of the int is larger that a Java int. The same happens with Jcsv.long, and Jcsv.double.
We also show how to rename columns by using a mapping, for instance, we want column :number_of_kids to be mapped to :numero_criancas which is the same label but in portuguese. Note also that we map :loyalty_points to the string with white spaces "pontos fielidade". Finally, columns :customer_no, :mailing_address and :favourite_quote are droped.
# type is :map. Rows are hashes. Set the default filter to not_nil. That is, all
# fields are required unless explicitly set to optional.
reader = Jcsv.reader("../data/customer.csv", format: :map, default_filter: Jcsv.not_nil)
# Set numberOfKids and married as optional, otherwise an exception will be raised
reader.filters = {:number_of_kids => Jcsv.optional >> Jcsv.int,
:married => Jcsv.optional >> Jcsv.bool,
:loyalty_points => Jcsv.long,
:customerno => Jcsv.int,
:birth_date => Jcsv.date("dd/MM/yyyy")}
# When parsing to map, it is possible to make a mapping. If column name is :false
# the column will be removed from the returned row
reader.mapping = {:number_of_kids => :numero_criancas,
:married => "casado",
:loyalty_points => "pontos fidelidade",
:customer_no => false,
:mailing_address => false,
:favourite_quote => false}
reader.read do |line_no, row_no, row|
pp row
end
{:first_name=>"John",
:last_name=>"Dunbar",
:birth_date=>
#<DateTime: 1945-06-13T00:00:00+00:00 ((2431620j,0s,0n),+0s,2299161j)>,
"casado"=>nil,
:numero_criancas=>nil,
:email=>"jdunbar@gmail.com",
"pontos fidelidade"=>0}
{:first_name=>"Bob",
:last_name=>"Down",
:birth_date=>
#<DateTime: 1919-02-25T00:00:00+00:00 ((2422015j,0s,0n),+0s,2299161j)>,
"casado"=>true,
:numero_criancas=>0,
:email=>"bobdown@hotmail.com",
"pontos fidelidade"=>123456}
{:first_name=>"Alice",
:last_name=>"Wunderland",
:birth_date=>
#<DateTime: 1985-08-08T00:00:00+00:00 ((2446286j,0s,0n),+0s,2299161j)>,
"casado"=>true,
:numero_criancas=>0,
:email=>"throughthelookingglass@yahoo.com",
"pontos fidelidade"=>2255887799}
{:first_name=>"Bill",
:last_name=>"Jobs",
:birth_date=>
#<DateTime: 1973-07-10T00:00:00+00:00 ((2441874j,0s,0n),+0s,2299161j)>,
"casado"=>true,
:numero_criancas=>3,
:email=>"billy34@hotmail.com",
"pontos fidelidade"=>36}
Reading as map also supports reading as enumerator:
reader = Jcsv.reader("../data/customer.csv", chunk_size: 2, format: :map)
# Add filters, so that we get 'objects' instead of strings for filtered fields
reader.filters = {"numberOfKids" => Jcsv.optional >> Jcsv.int,
"married" => Jcsv.optional >> Jcsv.bool,
"customerNo" => Jcsv.int}
enum = reader.each
chunk = enum.next
> pp chunk[2][1]
{:customer_no=>"2",
:first_name=>"Bob",
:last_name=>"Down",
:birth_date=>"25/02/1919",
:mailing_address=>"1601 Willow Rd.\nMenlo Park, CA 94025\nUnited States",
:married=>true,
:number_of_kids=>"0",
:favourite_quote=>
"\"Frankly, my dear, I don't give a damn.\" - Gone With The Wind",
:email=>"bobdown@hotmail.com",
:loyalty_points=>"123456"}
From Wikipedia:
"A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. Commonly used dimensions are people, products, place and time.
In a data warehouse, dimensions provide structured labeling information to otherwise unordered numeric measures. The dimension is a data set composed of individual, non-overlapping data elements. The primary functions of dimensions are threefold: to provide filtering, grouping and labelling."
Data often has dimensions, but they are just treated as labels for the data in a column of the CSV file.
The following excerpt shows data from an experiment in which patients with epilepsy were given either a placebo or Progabide to check the effect of this medicament in their seizure rate during a four week treatment period (data from R).
Clearly, treatment is a dimension in this data, as a patient is either given a placebo or Progabide. The patient id (first column) can also be considered a dimension. In this experiment there were 59 patients, during a 4 week period, thus this dataset has 236 rows.
"patient","treatment","base","age","seizure.rate","period","subject"
"1","placebo",11,31,5,"1","1"
"110","placebo",11,31,3,"2","1"
"112","placebo",11,31,3,"3","1"
"114","placebo",11,31,3,"4","1"
"29","Progabide",76,18,11,"1","29"
"291","Progabide",76,18,14,"2","29"
"292","Progabide",76,18,9,"3","29"
"293","Progabide",76,18,8,"4","29"
"30","Progabide",38,32,8,"1","30"
Let's now read this dataset and see how dimensions can help understand this data and organize it. Four dimensions are set for this dataset: patient, subject, treatment and period and it will be read as an array of maps:
reader = Jcsv.reader("../data/epilepsy.csv", format: :map,
dimensions: [:patient, :subject, :treatment, :period])
treatment = reader.read
> treatment.first(5).each do |row|
+ pp row
+ end
["1.1.placebo.1", {:base=>"11", :age=>"31", :"seizure.rate"=>"5"}]
["110.1.placebo.2", {:base=>"11", :age=>"31", :"seizure.rate"=>"3"}]
["112.1.placebo.3", {:base=>"11", :age=>"31", :"seizure.rate"=>"3"}]
["114.1.placebo.4", {:base=>"11", :age=>"31", :"seizure.rate"=>"3"}]
["2.2.placebo.1", {:base=>"11", :age=>"30", :"seizure.rate"=>"3"}]
Observe that each row has a 'key' composed of all the dimensions concatenated with '.' and the other columns are still processed as in a regular map reader. Note that dimensions allows retrieval of rows by keys:
> p treatment["112.1.placebo.3"]
+ p treatment["481.48.Progabide.2"]
{:base=>"11", :age=>"31", :"seizure.rate"=>"3"}
{:base=>"11", :age=>"25", :"seizure.rate"=>"1"}
Dimensions' elements can be accessed by accessing reader's 'dimensions' instance variable and getting the labels variable. We will not show dimensions label's for :patient and :subject as those are large sets.
> pp reader.dimensions[:treatment].labels
+ pp reader.dimensions[:period].labels
{"placebo"=>0, "Progabide"=>1}
{"1"=>0, "2"=>1, "3"=>2, "4"=>3}
Since getting the labels for dimensions is quite often necessary, there is a shortcut for getting it. Note that dimension :data represents the headers of the data dimension.
> pp reader[:treatment]
+ pp reader[:period]
+ pp reader[:_data_]
["placebo", "Progabide"]
["1", "2", "3", "4"]
[:base, :age, :"seizure.rate"]
It is also important to note that the dimensions we have defined on the epilepsy data are not ideal since :patient is actually dimensions by itself and :subject is numbered from 1 to 59; it would be better if subjects were numbered as first, second, third, etc. receiving placebo, then first, second, third, etc. receiving Progabide. If this was the case, then we would know how to retrieve a patient's data. For instance we could get treatment['placebo.4.1'] would be the first week of the fourth patient taking placebo.
Let's now take a look a balanced panel data from Wikipedia (https://en.wikipedia.org/wiki/Panel_data)
person,year,income,age,sex
1,2001,1300,27,1
1,2002,1600,28,1
1,2003,2000,29,1
2,2001,2000,38,2
2,2002,2300,39,2
2,2003,2400,40,2
We observe that person and year are dimensions of the data and income, age and sex are actual data. Next we show an unbalaced panel data:
person,year,income,age,sex
1,2001,1600,23,1
1,2002,1500,24,1
2,2001,1900,41,2
2,2002,2000,42,2
2,2003,2100,43,2
3,2002,3300,34,1
We can read both the balanced and the unbalanced panel data without any problem:
reader = Jcsv.reader("../data/balanced_panel.csv", format: :map,
dimensions: [:person, :year])
bp = reader.read
pp bp
{"1.2001"=>{:income=>"1300", :age=>"27", :sex=>"1"},
"1.2002"=>{:income=>"1600", :age=>"28", :sex=>"1"},
"1.2003"=>{:income=>"2000", :age=>"29", :sex=>"1"},
"2.2001"=>{:income=>"2000", :age=>"38", :sex=>"2"},
"2.2002"=>{:income=>"2300", :age=>"39", :sex=>"2"},
"2.2003"=>{:income=>"2400", :age=>"40", :sex=>"2"}}
reader = Jcsv.reader("../data/unbalanced_panel.csv", format: :map,
dimensions: [:person, :year])
bp = reader.read
pp bp
Warning reading row: [2, 2003, 2100, 43, 2] in field 'year'. Dimension 'year' is frozen when adding label '2003'.
{"1.2001"=>{:income=>"1600", :age=>"23", :sex=>"1"},
"1.2002"=>{:income=>"1500", :age=>"24", :sex=>"1"},
"2.2001"=>{:income=>"1900", :age=>"41", :sex=>"2"},
"2.2002"=>{:income=>"2000", :age=>"42", :sex=>"2"},
"2.2003"=>{:income=>"2100", :age=>"43", :sex=>"2"},
"3.2002"=>{:income=>"3300", :age=>"34", :sex=>"1"}}
Note that in the last example, we got a 'warning' saying that dimension 'year' is frozen. We will talk more about frozen dimensions in a later section of this document. Although we got a warning, reading proceeded without any problem.
As we've seen on the previous section, dimensions help us access data by keys; however, if we wanted to see all the data from patients taking 'Progabide', we would probably have to look at our dimensions and write a loop to get the desired data.
jCSV provides another way of reading the data the helps with this problem: deep_map. Bellow we read the epilepsy data seting deep_map to true and chunk_size to :all. When chunk_size is :all, the whole data file is read in one large chunk.
reader = Jcsv.reader("../data/epilepsy.csv", format: :map, chunk_size: :all,
dimensions: [:treatment, :subject, :period], deep_map: true)
# remove the :patient field from the data, as this field is already given by the
# :subject field.
reader.mapping = {:patient => false}
treatment = reader.read[0]
First let's understand the directive chunk_size :all: this indicates that the file should be read in one large chunk. When reading chunks, each chunk is an array, of arrays of the given chunk size. When reading chunk_size :all, the returned data is in an array that has an array with all row, this is why we have treatment above to be reader.read[0].
The attentive reader might ask: "why do we need chunk_size :all, since when no chunk size is given the whole file is read anyway?". When no chunk_size is given, the data is read one line at a time and the reader has "no memory" of what was read previously. With chunk_size :all all the data is part of one large dataset and this allows the construction of deep maps.
The treatment variable above is a hash that has two entries: 'placebo' and 'Progabide'. The placebo entry has, as entries, the elements from the second dimension, which is subject. So treatment["placebo"]["1"] shows the data for all four periods os treatment for subject "1".
> pp treatment["placebo"]["1"]
{"1"=>{:base=>"11", :age=>"31", :"seizure.rate"=>"5"},
"2"=>{:base=>"11", :age=>"31", :"seizure.rate"=>"3"},
"3"=>{:base=>"11", :age=>"31", :"seizure.rate"=>"3"},
"4"=>{:base=>"11", :age=>"31", :"seizure.rate"=>"3"}}
If we wanted to get all patient's data that took 'Progabide' then treatment['Progabide'] would do the trick. We will not print it as this a rather large output.
As we can see, on the first period, subject "1" had base = "11", age = "31", seizure.rate = "5". On the second period it's seizure.rate was "3". We can get this from our treatment variable with:
> p treatment["placebo"]["1"]["2"][:"seizure.rate"]
"3"
jCSV assumes that dimensions should be organized from slowest to fast changing in the file. The example bellow shows a CSV file and the proper way of organizing dimensions. Note that Dim 1 is the slowest to change, then Dim 2 and finally Dim3:
# File GoodOrder.csv
Dim 1 Dim 2 Dim 3 Data
A X K 1
A X J 2
A Y H 3
A Y G 4
B X K 5
B X J 6
B Y H 7
B Y G 8
C X K 9
C X J 10
C Y H 11
C Y G 12
reader = Jcsv.reader("../data/GoodOrder.csv", format: :map, chunk_size: :all,
col_sep: ";", dimensions: [:dim_1, :dim_2, :dim_3],
deep_map: true)
table = reader.read[0]
> pp table
{"A"=>
{"X"=>{"K"=>{:data=>"1"}, "J"=>{:data=>"2"}},
"Y"=>{"H"=>{:data=>"3"}, "G"=>{:data=>"4"}}},
"B"=>
{"X"=>{"K"=>{:data=>"5"}, "J"=>{:data=>"6"}},
"Y"=>{"H"=>{:data=>"7"}, "G"=>{:data=>"8"}}},
"C"=>
{"X"=>{"K"=>{:data=>"9"}, "J"=>{:data=>"10"}},
"Y"=>{"H"=>{:data=>"11"}, "G"=>{:data=>"12"}}}}
Note that we can input the dimensions in any order in the dimension directive. In the next example, we have dim_2 as the first dimension. If dim_2 were the slowest changing dimension, then this would be the "right" way of writing the dimensions directive. Note, however, that since dim_2 is not the slowest changing dimensions when reading this file we will get some warnings:
reader = Jcsv.reader("../data/GoodOrder.csv", format: :map, chunk_size: :all, col_sep: ";",
dimensions: [:dim_2, :dim_1, :dim_3], deep_map: true)
table = reader.read[0]
Warning reading row: [C, X, K, 9] in field 'dim_1'. Dimension 'dim_1' is frozen when adding label 'C'.
In this example we get a message saying that dimension 'dim_1' is frozen. What does that mean? As we explained above, the CSV reader expects slower changing dimensions to be read first. The proper order of reading dimensions is thus, dim_1, dim_2, dim_3. A dimension becomes frozen whenever it cycles back to its first element. In this example, dim_2 is X, X, Y, Y when it cycles back to X it becomes frozen indicating that X and Y are the only two elements in this dimension. When a dimension if frozen all dimensions after it are also frozen. In this case, dim_1 and dim_3 also become frozen.
When dim_2 cycles back to X the values of dim_1 that were read are A and B. When it becomes frozen, no other element can be added to this dimension. When label C is read form dim_1, it generates the warning saying that label C cannot be added to dim_1. Although the warning says that label C cannot be added, it is actually added and everything works fine at the end. Then, if everything works fine, why does a dimension become frozen on the first hand? The answer will come shortly!
We will now read the file bellow, BadOrder.csv. It contains the same data as above but dimension dim_2 is the first column:
# File BadOrder.csv
Dim_2 Dim_1 Dim_3 Data
X A K 1
X A J 2
Y A H 3
Y A G 4
X B K 5
X B J 6
Y B H 7
Y B G 8
X C K 9
X C J 10
Y C H 11
Y C G 12
reader = Jcsv.reader("../data/BadOrder.csv", format: :map, chunk_size: :all, col_sep: ";",
dimensions: [:dim_2, :dim_1, :dim_3], deep_map: true)
table = reader.read[0]
Warning reading row: [X, B, K, 5] in field 'dim_1'. Dimension 'dim_1' is frozen when adding label 'B'.
Warning reading row: [X, C, K, 9] in field 'dim_1'. Dimension 'dim_1' is frozen when adding label 'C'.
Note now that the warning happens earlier in the file. Again, as we read dim_2 we get X, X, Y, Y. When we cycle back to X, dim_2 is frozen, freezing dim_1 and dim_3 in the sequence. Now when the first B is read, dim_1 is already frozen and a warning is issued.
Even though a warning is issued, reading continues normally and the table can be printed:
> pp table
{"X"=>
{"A"=>{"K"=>{:data=>"1"}, "J"=>{:data=>"2"}},
"B"=>{"K"=>{:data=>"5"}, "J"=>{:data=>"6"}},
"C"=>{"K"=>{:data=>"9"}, "J"=>{:data=>"10"}}},
"Y"=>
{"A"=>{"H"=>{:data=>"3"}, "G"=>{:data=>"4"}},
"B"=>{"H"=>{:data=>"7"}, "G"=>{:data=>"8"}},
"C"=>{"H"=>{:data=>"11"}, "G"=>{:data=>"12"}}}}
If reading continues normally, why is a warning issued? For large datasets, when data is organized with the slowest changing dimension first, it becomes easier to identify missing or duplicated data. It is also a necessary condition for reading data into a vector, as we will show in the the next section ("Reading into a Vector").
We now show a data file in which there is some missing data. Note that we removed the fourth line from the file, and note also that this is not easily identified. In a larger dataset, seeing this would be very hard.
# File missing_data.csv
Dim_1 Dim_2 Dim_3 Data
A X K 1
A X J 2
A Y H 3
B X K 5
B X J 6
B Y H 7
B Y G 8
C X K 9
C X J 10
C Y H 11
C Y G 12
reader = Jcsv.reader("../data/missing_data.csv", format: :map, chunk_size: :all,
col_sep: ";", dimensions: [:dim_1, :dim_2, :dim_3],
deep_map: true)
table = reader.read[0]
Warning reading row: [B, Y, G, 8] in field 'dim_3'. Dimension 'dim_3' is frozen when adding label 'G'.
Again, we get a warning with 'dimension frozen'. This happens when reading the 8th row. Dimension 3 was frozen after reading element K, since this dimension cycled from H back to K. When reaching the 8th row a new element G is seen and indicates that something is wrong in the file.
Let's again take a look at what was read:
> pp table
{"A"=>{"X"=>{"K"=>{:data=>"1"}, "J"=>{:data=>"2"}}, "Y"=>{"H"=>{:data=>"3"}}},
"B"=>
{"X"=>{"K"=>{:data=>"5"}, "J"=>{:data=>"6"}},
"Y"=>{"H"=>{:data=>"7"}, "G"=>{:data=>"8"}}},
"C"=>
{"X"=>{"K"=>{:data=>"9"}, "J"=>{:data=>"10"}},
"Y"=>{"H"=>{:data=>"11"}, "G"=>{:data=>"12"}}}}
Bellow we show another CSV file. Note that there is a missing row in this data. Can you quickly see it? Which row is it?
# File missing_data2.csv
Dim_1 Dim_2 Dim_3 Data
A X K 1
A X J 2
A Y H 3
A Y G 4
A Z F 5
A Z D 6
B X K 7
B X J 8
B Y H 9
B Z F 10
B Z D 11
reader = Jcsv.reader("../data/missing_data2.csv", format: :map, chunk_size: :all,
col_sep: ";", dimensions: [:dim_1, :dim_2, :dim_3],
deep_map: true)
table = reader.read[0]
Warning reading row: [B, Z, F, 10] in field 'dim_3'. Missing data: next expected label was 'G' but read 'F'.
This last example shows how dimensions can help identify duplicate data. A set of dimensions should be unique, as a key in a database. If the key is duplicate, an error is issued. The same missing_data2.csv file is read, but passing only two dimensions, dim_1 and dim2:
reader = Jcsv.reader("../data/missing_data2.csv", format: :map,
chunk_size: :all, col_sep: ";",
dimensions: [:dim_1, :dim_2], deep_map: true)
table = reader.read[0]
RuntimeError: Key not unique for this dataset. {:dim_3=>"J", :data=>"2"}
Since warnings are shown but data is still read, if the user knows she doesn't want to be notified of warnings, she could add the suppress_warnings directive:
reader = Jcsv.reader("../data/missing_data2.csv", format: :map, chunk_size: :all,
col_sep: ";", dimensions: [:dim_1, :dim_2, :dim_3],
deep_map: true, suppress_warnings: true)
table = reader.read[0]
As can be seen, the code above does not generate any warnings any more.
Reading data with dimensions to lists is also possible, and will generate arrays of arrays:
reader = Jcsv.reader("../data/GoodOrder.csv", chunk_size: :all, col_sep: ";",
dimensions: [:dim_1, :dim_2, :dim_3])
table = reader.read
> pp table
[[[["A", "X", "K"], "1"],
[["A", "X", "J"], "2"],
[["A", "Y", "H"], "3"],
[["A", "Y", "G"], "4"],
[["B", "X", "K"], "5"],
[["B", "X", "J"], "6"],
[["B", "Y", "H"], "7"],
[["B", "Y", "G"], "8"],
[["C", "X", "K"], "9"],
[["C", "X", "J"], "10"],
[["C", "Y", "H"], "11"],
[["C", "Y", "G"], "12"]]]
A crit bit tree, also known as a Binary Patricia Trie (https://en.wikipedia.org/wiki/Trie), sometimes called digital tree, radix tree or prefix tree (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings.
Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.
The above description might seem a bit complex (and it is), suffice to say that JRuby has a Critbit implementation that follows the same API as a hash. So, in simple words, a Critbit is a data structure that is similar to a hash, but that "automagically" sorts all the keys.
The Critbit CSV reader will thous create a map similar to the maps we've seen in the examples above, however, data will be sorted by key instead of organized by reading order. Let's see some examples by reading again our customer.csv file and using :last_name and :first_name as dimensions
reader = Jcsv.reader("../data/customer.csv", format: :critbit,
dimensions: [:last_name, :first_name])
reader.filters = {:customer_no => Jcsv.int }
reader.mapping = {:favourite_quote => false,
:mailing_address => false,
:birth_date => false,
:email => false }
customers = reader.read
> pp customers
{"Down.Bob"=>
{:customer_no=>2,
:married=>"Y",
:number_of_kids=>"0",
:loyalty_points=>"123456"},
"Dunbar.John"=>
{:customer_no=>1, :married=>nil, :number_of_kids=>nil, :loyalty_points=>"0"},
"Jobs.Bill"=>
{:customer_no=>4,
:married=>"Y",
:number_of_kids=>"3",
:loyalty_points=>"36"},
"Wunderland.Alice"=>
{:customer_no=>3,
:married=>"Y",
:number_of_kids=>"0",
:loyalty_points=>"2255887799"}}
Note now that records were sorted by :last_name, :first_name. In this case, the first record is "Down.Bob", then "Dunbar.John", etc. Let's now change the order of the dimensions in the directive and see the results:
reader = Jcsv.reader("../data/customer.csv", format: :critbit,
dimensions: [:first_name, :last_name])
reader.filters = {:customer_no => Jcsv.int }
reader.mapping = {:favourite_quote => false,
:mailing_address => false,
:birth_date => false,
:email => false }
customers = reader.read
> pp customers
{"Alice.Wunderland"=>
{:customer_no=>3,
:married=>"Y",
:number_of_kids=>"0",
:loyalty_points=>"2255887799"},
"Bill.Jobs"=>
{:customer_no=>4,
:married=>"Y",
:number_of_kids=>"3",
:loyalty_points=>"36"},
"Bob.Down"=>
{:customer_no=>2,
:married=>"Y",
:number_of_kids=>"0",
:loyalty_points=>"123456"},
"John.Dunbar"=>
{:customer_no=>1, :married=>nil, :number_of_kids=>nil, :loyalty_points=>"0"}}
Now, as can be seen, "Alice.Wunderland" is the first record, then comes "Bill.Jobs", etc.
Another advantage of the critbit reader is that it is possible to retrieve records according to their prefix. For example, let's retrieve all customers whose names start with "B". Two records will be retrieved "Bill Jobs" and "Bob Down". Let's do it now:
> customers.each_pair("B") do |key, value|
+ puts "#{key} => #{value}"
+ end
Bill.Jobs => {:customer_no=>4, :married=>"Y", :number_of_kids=>"3", :loyalty_points=>"36"}
Bob.Down => {:customer_no=>2, :married=>"Y", :number_of_kids=>"0", :loyalty_points=>"123456"}
It is also possible to use deep_map with the critbit reader. We will not show an example here.
CSV files are often used to convey scientific data. Scientific data is numerical
data that is the result of some experiment or data collection and is collected for
further analysis. In general, scientific data is stored in an array or dataframe and
analyzed statistically or mathematically. The MDArray reader will read data and
store it directly in an MDArray. MDArray is a JRuby library for multidimensional arrays
similar to NumPy that integrates with Parallel Colt (through MDMatrix) for strong
statistical analysis and integrates also with SciCom (an R interpreter for the
JVM).
To show how to use the MDArray reader we will come back to our balanced and unbalanced panel data from the "Dimension" section. Let's start by reading the balanced data that we show again here:
person,year,income,age,sex
1,2001,1300,27,1
1,2002,1600,28,1
1,2003,2000,29,1
2,2001,2000,38,2
2,2002,2300,39,2
2,2003,2400,40,2
In order to read an MDArray we need to pass :mdarray to the format: parameter and also give it the MDArray data type 'dtype'. MDArray supports the following data types: :byte, :char, :short, :int, :long, :float, :double. However, in this version of jCSV we do not support filters for all those types, so the user needs to be careful to implement her own filters when necessary.
In this example, we have two dimensions: person and year.
reader = Jcsv.reader("../data/balanced_panel.csv", format: :mdarray, dtype: :double,
dimensions: [:person, :year])
balanced_panel = reader.read
First let's check again our dimensions' labels
> pp reader[:person]
+ pp reader[:year]
+ pp reader[:_data_]
["1", "2"]
["2001", "2002", "2003"]
[:income, :age, :sex]
And let's now see our MDArray with the balanced data:
> balanced_panel.print
[[[1300.0 27.0 1.0]
[1600.0 28.0 1.0]
[2000.0 29.0 1.0]]
[[2000.0 38.0 2.0]
[2300.0 39.0 2.0]
[2400.0 40.0 2.0]]]
Note that this is a multidimensional array with rank 3. The array rank is the number of "dimensions" in the array. As we have seen above, for this example there are three dimensions: 'person', 'year' and 'data'. The 'data' dimension is a special dimension that has the actual data columns. The 'shape' of the array is the number of elements per dimensions.
> p balanced_panel.shape
[2, 3, 3]
MDArray has many methods to slice and dice the array. Let's use method slice, that gets a slice of the array without copying, to get only the data from person 1, which is in index 0 of the MDArray:
> balanced_panel.slice(0, 0).print
[[1300.0 27.0 1.0]
[1600.0 28.0 1.0]
[2000.0 29.0 1.0]]
Another such method is method 'section'. This method takes two parameters: the first is the beginning index for the given dimension, and the second parameter is the number of elements to be taken in that dimension. In the example bellow, we will again get all elements from person 1. The first parameter is [0, 0, 0] since we want all elements for person 1 and all indices start with 0. For the second parameter we use [1, reader[:year].size, reader[:data].size]. The first element is '1' indicating that we only want 1 element for the persons' dimension. Then we got the size of the year dimension and the size of the data dimension:
> balanced_panel.section([0, 0, 0],
+ [1, reader[:year].size, reader[:_data_].size]).print
[[[1300.0 27.0 1.0]
[1600.0 28.0 1.0]
[2000.0 29.0 1.0]]]
Let's now get a section for only the year 2002 for this same person. Also, let's only get columns 'age' and 'sex'. Since the year 2002 is the second row in our data, then we want to start in index 1 and get 1 element. For the data part, we want to start on index 1 and get 2 elements:
> balanced_panel.section([0, 1, 1], [1, 1, 2]).print
[[[28.0 1.0]]]
Note that the result above is still a rank 3 array. If we pass as a third parameter of 'true' to the section method, empty ranks are removed from the array, giving us:
> balanced_panel.section([0, 1, 1], [1, 1, 2], true).print
[28.0 1.0]
We started this section saying that scientific data is collected for further analysis. We will now read data from the 'sleep.csv' file from R. This dataset shows the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients:
reader = Jcsv.reader("../data/sleep.csv", format: :mdarray, col_sep: ";",
comment_starts: "#", dtype: :double,
dimensions: [:group, :id])
reader.mapping = {:row => false}
ssleep = reader.read
ssleep.print
group1 = ssleep.slice(0, 0)
group2 = ssleep.slice(0, 1)
[[[0.7]
[-1.6]
[-0.2]
[-1.2]
[-0.1]
[3.4]
[3.7]
[0.8]
[0.0]
[2.0]]
[[1.9]
[0.8]
[1.1]
[0.1]
[-0.1]
[4.4]
[5.5]
[1.6]
[4.6]
[3.4]]]
And now, let's get many interesting statistics on this data. We will focus on 'group1'. Need to call reset_statistics on group1 to prepare it for calculations and clear all the caches. Calculations are cached, for example, when the 'mean' is calculated it will be cached. When the standard deviation is calculated, there is no need to calculate the mean again, since it has already been calculated. There problem with this approach that if the array is changed in any way, then reset_statistics needs to be called again or all cached values will be returned and wrong results will be obtained.
> group1.reset_statistics
+
+ puts "correlation group1 vs group2: " + group1.correlation(group2).to_s
+ puts "auto correlation: " + group1.auto_correlation(1).to_s
+ puts "durbin watson: " + group1.durbin_watson.to_s
+ puts "geometric mean: " + group1.geometric_mean.to_s
+ puts "harmonic mean: " + group1.harmonic_mean.to_s
+ puts "kurtosis: " + group1.kurtosis.to_s
+ puts "lag1: " + group1.lag1.to_s
+ puts "max: " + group1.max.to_s
+ puts "mean: " + group1.mean.to_s
+ puts "mean deviation: " + group1.mean_deviation.to_s
+ puts "median: " + group1.median.to_s
+ puts "min: " + group1.min.to_s
+ puts "moment3: " + group1.moment3.to_s
+ puts "moment4: " + group1.moment4.to_s
+ puts "product: " + group1.product.to_s
+ puts "quantile(0.2): " + group1.quantile(0.2).to_s
+ puts "quantile inverse(35.0): " + group1.quantile_inverse(35.0).to_s
+ puts "rank interpolated(33.0): " + group1.rank_interpolated(33.0).to_s
+ puts "rms: " + group1.rms.to_s
+ puts "sample kurtosis: " + group1.sample_kurtosis.to_s
+ puts "sample kurtosis standard error: " + group1.sample_kurtosis_standard_error.to_s
+ puts "sample skew: " + group1.sample_skew.to_s
+ puts "sample skew standard error: " + group1.sample_skew_standard_error.to_s
+ puts "sample standard deviation: " + group1.sample_standard_deviation.to_s
+ puts "sample variance: " + group1.sample_variance.to_s
+ puts "skew: " + group1.skew.to_s
+ puts "standard deviation: " + group1.standard_deviation.to_s
+ puts "standard error: " + group1.standard_error.to_s
+ puts "sum: " + group1.sum.to_s
+ puts "sum of inversions: " + group1.sum_of_inversions.to_s
+ puts "sum of logarithms: " + group1.sum_of_logarithms.to_s
+ puts "sum of power deviations: " + group1.sum_of_power_deviations(2, group1.mean).to_s
+ puts "sum of powers(3): " + group1.sum_of_powers(3).to_s
+ puts "sum of squares: " + group1.sum_of_squares.to_s
+ puts "sum of squared deviations: " + group1.sum_of_squared_deviations.to_s
+ puts "trimmed_mean(2, 2): " + group1.trimmed_mean(2, 2).to_s
+ puts "variance: " + group1.variance.to_s
+ puts "winsorized mean: " + group1.winsorized_mean(1, 1).to_s
correlation group1 vs group2: 0.7951702058335778
auto correlation: 0.40878319736156915
durbin watson: 1.0121986639558525
geometric mean: NaN
harmonic mean: 0.0
kurtosis: -0.9017177044248927
lag1: 0.36790487762541224
max: 3.7
mean: 0.75
mean deviation: 1.38
median: 0.35
min: -1.6
moment3: 2.3949
moment4: 17.41003625
product: 0.0
quantile(0.2): -0.39999999999999997
quantile inverse(35.0): 1.0
rank interpolated(33.0): 10.0
rms: 1.8555322686496185
sample kurtosis: -0.6298223703225774
sample kurtosis standard error: 1.334248769989982
sample skew: 0.5809206737398493
sample skew standard error: 0.6870429186215167
sample standard deviation: 1.8392973833255464
sample variance: 3.2005555555555554
skew: 0.48987532503933046
standard deviation: 1.6972035823671832
standard error: 0.5367028973277488
sum: 7.5
sum of inversions: Infinity
sum of logarithms: NaN
sum of power deviations: 28.805000000000003
sum of powers(3): 92.979
sum of squares: 34.43
sum of squared deviations: 25.924500000000002
trimmed_mean(2, 2): 0.5333333333333334
variance: 2.8805
winsorized mean: 0.76
We have already seen many filters in action during this whole document. We now present all filters available.
- bool
- convert_nil_to
- optional
- char
- collector
- ipaddr
- dynamic
- gsub
- str
- httpdate
- iso8601
- jd
- jisx0301
- date
- rfc2822
- rfc3339
- rfc822
- strptime
- xmlschema
- int
- long
- double
- fixnum
- float
- complex
- rational
- bignun
- bigdecimal: Convert a String to a BigDecimal. It uses the String constructor of BigDecimal (new BigDecimal("0.1")) as it yields predictable results (see BigDecimal). If the data uses a character other than "." as a decimal separator (Germany uses "," for example), then pass to it a Locale.
- in_range
- equals
- ascii_only?
- not_ascii?
- empty
- end_with?
- include
- start_with
- not_nil
- forbid_substrings
- is_element_of
This text was Markdown formated with CodeWriter:
- gem install CodeWriter
- jruby -S gem install CodeWriter
- https://github.com/rbotafogo/CodeWriter