Dataframe is a Torch7 class to load and manipulate tabular data (e.g. Kaggle-style CSVs), inspired by R's and pandas' data frames.
As of release 1.5 it fully supports the torchnet data structure. It also has custom iterators for convenient integration with torchnet's engines; see the mnist example. As of release 1.6 the internal storage has changed to tensors.
For a more detailed look at the changes between versions, see the NEWS file.
You can clone this repository or install it directly through luarocks:
git clone https://github.com/AlexMili/torch-dataframe
cd torch-dataframe
luarocks make rocks/torch-dataframe-scm-1.rockspec
Or do the same in one line:
luarocks install torch-dataframe scm-1
or
luarocks install torch-dataframe
- Added faster torch.Tensor functions to fill/stat functions for speed
- Added mutate function to Dataseries
- __index__ access for Df_Array
- More complete documentation for Df_Array and specs
- Df_Dict elements can be accessed using myDict[index] or myDict["$colname"]
- Df_Dict key property available. It lists the Df_Dict's keys
- Df_Dict length property available. It lists, by key, the length of its content
- Df_Dict check_length() checks if all elements have the same length
- Df_Dict set_keys(table) replaces all keys with those in the given table (must be the same size)
- More complete documentation for Df_Dict and specs
- More complete documentation for Df_Tbl and specs
- Internal methods _infer_csvigo_schema() and _infer_data_schema() renamed to _infer_schema()
- Type inference is now based on type frequencies, but if it encounters a single double/float in an integer column it will consider the column as double/float
- It is now possible to set a schema for a Dataframe directly, without any checks, with set_schema(). Use it wisely
- Possibility to init a Dataframe with a schema, a column order and a number of rows with the internal method _init_with_schema()
- Added bulk_load_csv() method which loads large CSV files using threads but without checking for missing values or data integrity. Use with caution. See #28
- Added load_threadcsv()
- Added the possibility to create empty Dataseries
- Added Dataseries load() method to directly load a tensor or tds.Vec in memory without any checks
- Added iris dataset in /specs/data
- New specs structure
- Fixed CSV loading when there is no header, with a corresponding test case
- Changed assert_is_index return value to true on success instead of self
See the NEWS.md file for previous changes.
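The type-inference rule mentioned in the list above (most frequent type wins, except that any double/float in an otherwise integer column makes the column double/float) can be illustrated with a small plain-Lua sketch. This is not the library's `_infer_schema()` code; the function name `infer_type` and the type labels are assumptions made for illustration, and the sketch assumes the values contain no `nil` holes.

```lua
-- Illustrative only: a simplified version of frequency-based type inference.
-- `infer_type` is a hypothetical helper, not part of the Dataframe API.
local function infer_type(values)
  local counts = {integer = 0, double = 0, string = 0, boolean = 0}
  for _, v in ipairs(values) do
    if type(v) == "number" then
      if v == math.floor(v) then
        counts.integer = counts.integer + 1
      else
        counts.double = counts.double + 1
      end
    elseif type(v) == "boolean" then
      counts.boolean = counts.boolean + 1
    else
      counts.string = counts.string + 1
    end
  end

  -- Pick the most frequent type; the fixed order makes ties deterministic.
  local best, best_n = nil, -1
  for _, name in ipairs({"integer", "double", "string", "boolean"}) do
    if counts[name] > best_n then
      best, best_n = name, counts[name]
    end
  end

  -- A single non-integer number promotes an integer column to double.
  if best == "integer" and counts.double > 0 then
    best = "double"
  end
  return best
end

print(infer_type({1, 2, 3}))
print(infer_type({1, 2, 3.5}))
```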
The Dataframe relies on argcheck for parsing arguments. This means that you can use named parameters with the function{arg_name=value} syntax. Named arguments are supported by all functions except the constructor, and in certain functions they are mandatory in order to avoid ambiguity.
The argcheck package also works as the API documentation. It checks arguments and if you happen to provide the function with invalid arguments it will automatically output the function documentation.
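As a reminder of what the named-parameter syntax means in plain Lua: calling a function with braces passes a single table, so `f{a=1}` is shorthand for `f({a=1})`. The sketch below uses a hypothetical function `describe` (not part of Dataframe or argcheck) and hand-written checks where argcheck would normally do the validation.

```lua
-- Hypothetical function for illustration; argcheck is not used here.
local function describe(args)
  -- A call such as describe{...} arrives as one table argument.
  assert(type(args) == "table", "expected a table of named arguments")
  local header = args.header
  if header == nil then header = true end  -- default when omitted
  return string.format("path=%s header=%s",
                       tostring(args.path), tostring(header))
end

-- Both calls are equivalent: the brace form is sugar for a one-table call.
local a = describe{path = "data.csv", header = false}
local b = describe({path = "data.csv", header = false})
print(a)
print(a == b)
```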
Important: Due to limitations in the Lua language, the package uses helper classes to separate tables that hold named arguments from tables that are passed in as argument values. The three classes are:
- Df_Array - contains only values and no keys
- Df_Dict - a dictionary table that has named keys that map to all values
- Df_Tbl - a raw table wrapper that does a shallow argument copy
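To see why such wrappers help, note that a plain Lua table carries no marker saying whether it is an array, a dictionary, or a bundle of named arguments. A minimal sketch of the idea, using metatables as tags, might look like the following. The names `new_array`, `new_dict` and `kind` are hypothetical; the real Df_Array, Df_Dict and Df_Tbl classes are implemented differently.

```lua
-- Hypothetical wrappers for illustration only.
local ArrayArg = {}
ArrayArg.__index = ArrayArg
local function new_array(...)
  return setmetatable({data = {...}}, ArrayArg)
end

local DictArg = {}
DictArg.__index = DictArg
local function new_dict(tbl)
  return setmetatable({data = tbl}, DictArg)
end

-- The metatable tells the receiver how the caller meant the table to be read.
local function kind(arg)
  local mt = getmetatable(arg)
  if mt == ArrayArg then return "array" end
  if mt == DictArg then return "dict" end
  if type(arg) == "table" then return "named arguments" end
  return type(arg)
end

print(kind(new_array(1, 2, 3)))
print(kind(new_dict{a = {1, 2}}))
print(kind{data = {1, 2, 3}})
```

Without the wrapper, the last call is indistinguishable from a function receiving a single named argument called `data`, which is the ambiguity the helper classes avoid.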
Initiate the object:
require 'Dataframe'
df = Dataframe()
Load CSV file:
df:load_csv{path='./data/training.csv', header=true}
Load from table:
df:load_table{data=Df_Dict{firstColumn={1,2,3},
secondColumn={4,5,6}}}
You can also instantiate the object with a csv-filename or a table by passing the table or filename as an argument:
require 'Dataframe'
df = Dataframe('./data/training.csv')
You can discover your dataset with the following functions:
-- you can either view the data as a plain text output or itorch html table
df:output() -- prints html if in itorch otherwise prints plain table
df:output{html=true} -- forces html output
df:show() -- prints the head + tail of the table
-- You can also directly call print() on the object
-- and it will print the ascii-table
print(df)
General dataset information can be found using:
df:shape() -- print {rows=3, cols=3}
#df -- gets the number of rows
df:size() -- returns a tensor with the size rows, columns
df.column_order -- table of columns names
df:count_na() -- print all the missing values by column name
If you want to inspect random elements you can use get_random():
df:get_random(10):output()
You can manipulate it:
df:insert(Df_Dict({['first_column']={7,8,9},['second_column']={10,11,12}}))
df:remove_index(3) -- remove line 3 of the entire dataset
df:has_column('x') -- return true if the column exists
df:get_column('y') -- return column y as table
df["$y"] -- alias for get_column
df:add_column('z', 0) -- Add column with default value 0 at the end (right side of the table)
df:add_column('first_column', 1, 2) -- Add column with default value 2 at the beginning (left side of the table)
df:drop('x') -- delete column
df:rename_column('x', 'y') -- rename column 'x' to 'y'
df:reset_column('my_col', 0) -- reset the given column with 0
df:fill_na('x', 0) -- replace missing values in 'x' column with 0
df:fill_all_na(0) -- replace all missing values with the value 0
df:unique('col_name') -- return table with unique values of the given column
df:unique('col_name', true) -- return table with unique values of the given column as keys
df:where('column_name','my_value') -- find the first row where the column has the given value
-- Update all rows that fulfill the condition defined in the first function
df:update(function(row) return row['column'] == 'test' end,
          function(row) row['other_column'] = 'new_value' return row end)
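The two-function pattern used by update (a condition that selects rows and a function that returns the modified row) can be sketched in plain Lua as follows. `update_rows` is a hypothetical stand-in operating on a list of row tables, not the Dataframe method itself. Note that a Lua function returns nothing unless it uses `return`, so the condition has to return its result explicitly.

```lua
-- Hypothetical helper: applies `modify` to every row for which `condition` is true.
local function update_rows(rows, condition, modify)
  for i, row in ipairs(rows) do
    if condition(row) then
      rows[i] = modify(row)
    end
  end
  return rows
end

local rows = {
  {column = "test", other_column = "old"},
  {column = "keep", other_column = "old"},
}

update_rows(rows,
  function(row) return row.column == "test" end,
  function(row) row.other_column = "new_value"; return row end)

print(rows[1].other_column, rows[2].other_column)
```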
You can define categorical variables that will be treated internally as numbers ranging from 1 to n levels while displayed as strings. The numeric representation is retained when exporting with to_tensor, which makes a classifier's output easier to interpret:
df:as_categorical('my string column') -- converts a column to categorical
df:get_cat_keys('my string column') -- retrieves the keys used for the conversion
df:to_categorical(Df_Array({1,2,1}), 'my string column') -- converts numbers to the categories
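The idea behind categorical columns can be sketched in plain Lua: each distinct string gets an integer level from 1 to n, and the mapping is kept so that numbers can be turned back into labels. The helpers `encode_levels` and `decode_levels` are hypothetical, and assigning levels in sorted order is an assumption of this sketch; the library may order its levels differently.

```lua
-- Hypothetical helpers; assumes all values are strings.
local function encode_levels(values)
  local seen, levels = {}, {}
  for _, v in ipairs(values) do
    if not seen[v] then
      seen[v] = true
      levels[#levels + 1] = v
    end
  end
  table.sort(levels)  -- assumption: levels are numbered in sorted order

  local keys = {}
  for i, name in ipairs(levels) do keys[name] = i end

  local codes = {}
  for i, v in ipairs(values) do codes[i] = keys[v] end
  return codes, keys, levels
end

-- Turns numeric codes (e.g. a classifier's output) back into labels.
local function decode_levels(codes, levels)
  local out = {}
  for i, c in ipairs(codes) do out[i] = levels[c] end
  return out
end

local codes, keys, levels = encode_levels({"cat", "dog", "cat", "bird"})
print(table.concat(codes, ","))
print(table.concat(decode_levels({1, 3}, levels), ","))
```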
You can subset your data using:
df:head(20) -- print 20 first elements (10 by default)
df:tail(5) -- print 5 last elements (10 by default)
df:show() -- print 10 first and 10 last elements
df[13] -- returns a table with the row values
df["13:17"] -- returns a Dataframe with values in that span
df["13:"] -- returns a Dataframe with values starting from index 13
df[Df_Array(1,3,4)] -- returns a Dataframe with values index 1,3 and 4
Finally, you can save your dataset to tensor (only numerical/categorical columns will be taken):
df:to_tensor{filename = './data/train.th7'} -- saves data
data = df:to_tensor{columns = Df_Array('first_column', 'my string column')} -- Converts the two columns into tensor
or to CSV:
df:to_csv('data.csv')
The Dataframe provides a built-in system for handling batch loading. It also has an extensive set of samplers that you can use. See the API docs for details on which samplers are available.
The gist of it is:
- The main Dataframe is initialized for batch loading by calling create_subsets. This creates random subsets that have their own samplers. The default is a 70% train, 20% validate, and 10% test split of the data, but you can choose any split and any names.
- Each subset is a separate dataframe subclass that has two columns: (1) indexes with the corresponding index in the main dataframe, and (2) labels that some of the samplers require.
- When you want to retrieve a batch from a subset you call the subset using my_dataframe:get_subset('train'):get_batch(30) or my_dataframe['/train']:get_batch(30).
- The batch returned is also a subclass that has a custom to_tensor function that returns the data and corresponding label tensors. You can provide custom functions that receive the full row as an argument, allowing you to use e.g. a filename to load an external resource.
A simple example:
local df = Dataframe('my_csv'):
  create_subsets()
local batch = df["/train"]:get_batch(10)
local data, label = batch:to_tensor{
  load_data_fn = my_image_loader
}
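To make the subset idea concrete, here is a minimal plain-Lua sketch of a random 70/20/10 split that stores, for each subset, the row indexes of the main dataset. It is not how create_subsets is implemented; `create_split`, its argument layout and the rounding rule (the last subset takes whatever rows remain) are assumptions made for illustration.

```lua
-- Hypothetical helper: splits row indexes 1..n_rows into named random subsets.
local function create_split(n_rows, proportions, seed)
  math.randomseed(seed or 42)

  local idx = {}
  for i = 1, n_rows do idx[i] = i end

  -- Fisher-Yates shuffle of the row indexes.
  for i = n_rows, 2, -1 do
    local j = math.random(i)
    idx[i], idx[j] = idx[j], idx[i]
  end

  local subsets, start = {}, 1
  for k, p in ipairs(proportions) do
    local remaining = n_rows - start + 1
    local count
    if k == #proportions then
      count = remaining  -- the last subset takes the remainder
    else
      count = math.min(math.floor(p.fraction * n_rows + 0.5), remaining)
    end
    local rows = {}
    for i = start, start + count - 1 do rows[#rows + 1] = idx[i] end
    subsets[p.name] = rows
    start = start + count
  end
  return subsets
end

local subsets = create_split(10, {
  {name = "train", fraction = 0.7},
  {name = "validate", fraction = 0.2},
  {name = "test", fraction = 0.1},
}, 1)

print(#subsets.train, #subsets.validate, #subsets.test)
```

Keeping only indexes in each subset, rather than copies of the rows, mirrors the description above where each subset refers back to rows in the main dataframe.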
As of version 1.5 you may also want to consider using the iterators that integrate with the torchnet infrastructure. Take a look at the iterator API and the mnist example to see what an implementation may look like.
The package contains an extensive test suite and tries to apply a behavior driven development approach. All features should be accompanied by a test-case.
To launch the tests you need to install busted (see Olivine-Labs/busted) via luarocks:
luarocks install busted
then you can run all tests via command line:
cd specs/
./run_all.sh
The package relies on functions that document themselves via the argcheck package; the generated documentation resides in the doc folder. The GitHub Wiki is intended for more extensive, in-depth documentation.
To generate the documentation please run:
th doc.lua > /dev/null
See CONTRIBUTING.md for further details.