coo-builder

A tool for iteratively building a COO matrix from a large corpus of text data

What is a COO matrix?

A COO matrix (aka 'ijv' or triplet format) is a way of storing a matrix as three arrays, which uses less memory when the matrix is sparse.

For example, if you are tracking word occurrences in 100,000 (M) documents and your set of words of interest is 10,000 (N), that is 1,000,000,000 cells of information (or integers) if stored as an MxN array.

However, if most of these documents contain only a small subset, say about 200 (S), of the words of interest, you can exploit the fact that this matrix will be very sparse (most values equal to zero). Since only 20,000,000 (SxM) cells will be non-zero, you can store the ijv values (the row index, the column index and the cell value) in three separate arrays of size 20,000,000, meaning you only need to store 60,000,000 integers. This is a 94% reduction in memory usage ((100 x [1 - 3S/N])%).
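The arithmetic above can be checked directly (the variable names M, N and S mirror the example and are illustrative only):

```python
# Memory arithmetic for the dense-vs-COO example above.
M = 100_000   # documents (rows)
N = 10_000    # words of interest (columns)
S = 200       # distinct words of interest per document, on average

dense_cells = M * N     # integers stored as a full MxN array
nnz = S * M             # non-zero cells
coo_cells = 3 * nnz     # three arrays: row indices, column indices, values

reduction = 100 * (1 - coo_cells / dense_cells)
print(dense_cells, coo_cells, f"{reduction:.0f}%")  # 1000000000 60000000 94%
```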


The scenario

You're doing a bag-of-words analysis on some large number (M) of documents.

You want to build a matrix where:

  • each row corresponds to a document
  • each column corresponds to a word
  • the value in each cell is the number of times that word occurred in that document

You want to build this matrix iteratively

The problem

  • How many columns do you need?
  • How do you efficiently build this object iteratively?

For small datasets this is fine: you can use a list of lists, and when you find a new word in a document and need to create a "new column", you just create a new list. For large datasets this becomes a huge memory issue.

The solution

  • Use a COO matrix.

The coo-builder module is based on this guide to building a COO, but it solves the additional problem of how to build a COO matrix iteratively, i.e. how to build it when you don't know ex ante what shape it will be (which will often be the case if your list of "words of interest" grows as you iteratively read more documents).
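The core idea can be sketched in a few lines (this is an illustrative sketch of the technique, not the actual COOBuilder internals): grow a word-to-column dictionary as new words appear, and append one ijv triplet per (document, word) pair.

```python
from collections import Counter

terms = {}                  # word -> column index, grows as new words are seen
rows, cols, vals = [], [], []

# Two toy document counters for illustration
docs = [Counter({"apple": 2, "pear": 1}), Counter({"pear": 3, "banana": 1})]

for i, counter in enumerate(docs):
    for word, count in counter.items():
        j = terms.setdefault(word, len(terms))  # new word -> next free column
        rows.append(i)       # 'i' part: document index
        cols.append(j)       # 'j' part: word's column index
        vals.append(count)   # 'v' part: occurrence count

print(terms)                 # {'apple': 0, 'pear': 1, 'banana': 2}
print(rows, cols, vals)      # [0, 0, 1, 1] [0, 1, 1, 2] [2, 1, 3, 1]
```

Because the column dictionary grows on demand, the final matrix shape (M x len(terms)) only needs to be known when the three arrays are handed to a sparse-matrix constructor.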

Usage

Create a builder object

from coo_builder import COOBuilder

builder = COOBuilder()

Iterate over documents

Iterate over each file/document in your dataset and for each one build a Counter object that maps words to counts, e.g. {"hello": 3, "world": 5}. Then add this counter to the builder using the add_doc_counter method:

from collections import Counter

for file_ind, file in enumerate(some_iterable_of_files):

	### Read/process the file

	### Do any sort of formatting, filtering or removal of stop words

	### End up with a mapping from words (str) to counts (int)
	term_counter = Counter(words)  # a Counter object (or dictionary)

	### Add to the builder
	builder.add_doc_counter(file_ind, term_counter)

The file_ind is assumed to be an incremental integer from 0 to M-1, so you can use enumerate over an iterable to generate it (but you should keep track of the file_ind yourself if you need to later look up specific files/words). The file_ind is used to keep track of the 'i' part of the ijv. The words are added to a dictionary of terms in the COOBuilder, mapping each unique word to a unique integer (the 'j' part).

Exporting a COO matrix

When the build process is finished, the builder can generate a COO matrix, from SciPy's sparse matrix family, by using the to_coo method:

coo_matrix = builder.to_coo()
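Assuming to_coo returns a scipy.sparse.coo_matrix, the result can be consumed like any other SciPy sparse matrix. The sketch below builds one directly from example ijv arrays for illustration:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Example ijv arrays, standing in for what the builder accumulates
rows = np.array([0, 0, 1])
cols = np.array([0, 1, 1])
vals = np.array([2, 1, 3])
mat = coo_matrix((vals, (rows, cols)), shape=(2, 2))

print(mat.toarray())   # dense view, only sensible for small matrices
csr = mat.tocsr()      # convert for fast row slicing and arithmetic
print(csr[0].toarray())
```

Note that SciPy's COO format sums duplicate (i, j) entries when converting, which is often exactly what you want when accumulating counts.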

To see the mapping from word index (column) to actual words, you can inspect the terms dictionary:

>>> builder.terms
{'apple': 0, 'banana': 1, 'pear': 2, ...}
