Sparsity


Sparse data processing toolbox. It builds on top of pandas and scipy to provide a DataFrame-like API for working with sparse categorical data.

It also provides an extremely fast C-level interface to read from traildb databases. This makes it a highly performant package for data processing jobs, especially log processing and clickstream or click-through data.

In combination with dask it supports executing complex operations concurrently or distributed across a cluster.
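For example, one way to parallelize per-file aggregation is to wrap the calls shown later in this README in dask.delayed. This is a hedged sketch, not an official dask integration of sparsity; the file names are placeholders, and only read_traildb and groupby_sum from this README are assumed:

import dask
from sparsity import SparseFrame

@dask.delayed
def aggregate(path):
    # read one traildb file into a SparseFrame and aggregate it;
    # read_traildb and groupby_sum are the calls demonstrated below
    sdf = SparseFrame.read_traildb(path, field="title")
    return sdf.groupby_sum()

# placeholder file names; dask schedules the per-file work concurrently
results = dask.compute(*[aggregate(p) for p in ["day1.tdb", "day2.tdb"]])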

Attention

Not ready for production

Motivation

Many tasks, especially in the data analytics and machine learning domain, make use of sparse data structures to handle high-dimensional input data.

This project was started to build an efficient, homogeneous sparse data processing pipeline. As of today dask has no support for anything like a sparse DataFrame. We process large amounts of high-dimensional data on a daily basis at datarevenue, and our favourite language and ETL framework are Python and dask. After chaining many function calls on scipy.sparse CSR matrices, all involving manual handling of indices and column names, to produce a sparse data pipeline, I decided to start this project.

This package might be especially useful to you if you have very large amounts of sparse data, such as clickstream data, categorical time series, log data, or similarly sparse data.

Traildb access?

Traildb is an amazing log-style database that was released recently by AdRoll. It compresses event-like data extremely efficiently, and it provides a fast C-level API to query it.

Traildb also has Python bindings, but you might still need to iterate over many millions of users or trails (or both), which carries quite some overhead in Python. Therefore sparsity provides high-speed access to the database in the form of SparseFrame objects. These are fast, efficient, and intuitive enough to do further processing on.

At the moment uuid and timestamp information is lost, but it will be provided as a pandas.MultiIndex handled by the SparseFrame in a (very soon) future release.

In [1]: from sparsity import SparseFrame

In [2]: sdf = SparseFrame.read_traildb('pydata.tdb', field="title")

In [3]: sdf.head()
Out[3]: 
   0      1      2      3      4      ...    37388  37389  37390  37391  37392
0    1.0    0.0    0.0    0.0    0.0  ...      0.0    0.0    0.0    0.0    0.0
1    1.0    0.0    0.0    0.0    0.0  ...      0.0    0.0    0.0    0.0    0.0
2    1.0    0.0    0.0    0.0    0.0  ...      0.0    0.0    0.0    0.0    0.0
3    1.0    0.0    0.0    0.0    0.0  ...      0.0    0.0    0.0    0.0    0.0
4    1.0    0.0    0.0    0.0    0.0  ...      0.0    0.0    0.0    0.0    0.0

[5 rows x 37393 columns]

In [6]: %%timeit
   ...: sdf = SparseFrame.read_traildb("/Users/kayibal/Code/traildb_to_sparse/traildb_to_sparse/traildb_to_sparse/sparsity/test/pydata.tdb", field="title")
   ...: 
10 loops, best of 3: 73.8 ms per loop

In [4]: sdf.shape
Out[4]: (109626, 37393)

But wait, pandas has SparseDataFrames and SparseSeries

Pandas has its own implementation of sparse data structures. Unfortunately, these structures perform quite badly with a groupby-sum aggregation, which we use often. Furthermore, doing a groupby on a pandas SparseDataFrame returns a dense DataFrame. This makes chaining many groupby operations over multiple files cumbersome and less efficient. Consider the following example:

In [1]: import sparsity
   ...: import pandas as pd
   ...: import numpy as np
   ...: 

In [2]: data = np.random.random(size=(1000,10))
   ...: data[data < 0.95] = 0
   ...: uids = np.random.randint(0,100,1000)
   ...: combined_data = np.hstack([uids.reshape(-1,1),data])
   ...: columns = ['id'] + list(map(str, range(10)))
   ...: 
   ...: sdf = pd.SparseDataFrame(combined_data, columns = columns, default_fill_value=0)
   ...: 

In [3]: %%timeit
   ...: sdf.groupby('id').sum()
   ...: 
1 loop, best of 3: 462 ms per loop

In [4]: res = sdf.groupby('id').sum()
   ...: res.values.nbytes
   ...: 
Out[4]: 7920

In [5]: data = np.random.random(size=(1000,10))
   ...: data[data < 0.95] = 0
   ...: uids = np.random.randint(0,100,1000)
   ...: sdf = sparsity.SparseFrame(data, columns=np.asarray(list(map(str, range(10)))), index=uids)
   ...: 

In [6]: %%timeit
   ...: sdf.groupby_sum()
   ...: 
The slowest run took 4.20 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.25 ms per loop

In [7]: res = sdf.groupby_sum()
   ...: res.__sizeof__()
   ...: 
Out[7]: 6128

I'm not quite sure whether an intermediate result is being cached, but I don't think so. The operation simply uses a smart CSR matrix multiplication to compute the groupby sum.
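To illustrate the idea, here is a minimal sketch in plain numpy/scipy (not sparsity's internal code): a groupby sum can be written as a single multiplication of a sparse group-indicator matrix with the CSR data matrix.

import numpy as np
import scipy.sparse as sp

# toy data: sparse 1000x10 matrix plus a group id per row
data = sp.random(1000, 10, density=0.05, format='csr')
uids = np.random.randint(0, 100, 1000)

# indicator[g, i] == 1 iff row i belongs to group g;
# multiplying it with the data sums the rows of each group
groups, row_to_group = np.unique(uids, return_inverse=True)
indicator = sp.csr_matrix(
    (np.ones(len(uids)), (row_to_group, np.arange(len(uids)))),
    shape=(len(groups), len(uids)),
)

grouped_sums = indicator @ data  # one row per unique id, still sparse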
