A text processing pipeline for turning unstructured text data into hierarchical datasets.
The Data Science Campus has been exploring how to process unlabelled list data that is collected manually in an uncontrolled fashion with no supplementary information to allow aggregation of data. Please note that this project is intended to work on short descriptions, of no more than around 10 words. For longer text descriptions you may need to fork the repository and optimise some of the metrics.
For further information on the methodology please read our blog.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Documentation on the methods utilised and how Optimus functions is pending. This README will be updated to include links to this material once it is made available.
You will need the following tools in order to be able to set up and use optimus:
- A modern macOS or Linux installation. Windows is not supported, so you are on your own if you try it there
- curl
- zsh
- Python 3.6 or later
- git
First, clone this git repository:
git clone https://github.com/datasciencecampus/optimus.git
Within the repo is a file named setup.zsh. This is a command line tool to install all of the other things you need. For help using this, invoke the script as
. setup.zsh -h
This script can download the fastText Wikipedia word embeddings model, which it places in the optimus directory. If your project lives elsewhere and you are not working in the optimus directory directly, we recommend using this script to download the model and then moving it into your working directory.
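As a minimal sketch, moving the downloaded model into a separate working directory could be done from Python as below. The helper name move_model and the directory names are hypothetical, and the demonstration uses a small placeholder file in a temporary directory rather than a real model.

```python
import shutil
import tempfile
from pathlib import Path


def move_model(model_file, working_dir):
    """Move a downloaded model file into working_dir and return its new path."""
    source = Path(model_file)
    if not source.is_file():
        raise FileNotFoundError(f"No model file found at {source}")
    target_dir = Path(working_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.move returns the destination path as a string.
    return Path(shutil.move(str(source), str(target_dir / source.name)))


# Demonstration with a placeholder file standing in for the real model.
demo_root = Path(tempfile.mkdtemp())
downloaded = demo_root / "optimus" / "wiki.en.bin"
downloaded.parent.mkdir()
downloaded.write_bytes(b"placeholder")

new_path = move_model(downloaded, demo_root / "my_project")
print(new_path)
```

Remember to point the "model" entry of your configuration at the new location afterwards.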
A quick start example script, example.py, in the root directory demonstrates how to use the pipeline. The final dataset is written to optimus_results.csv, also in the root directory.
To make the tool more accessible, a web app based UI was developed. This user interface helps you process data without needing to write any Python code.
If this is something that interests you please read this README.md file for more info.
Import Optimus into Python either as a whole module
import optimus
or by importing the Optimus classes
from optimus import Optimus
The pipeline is configured with a config.json file in the following format:
{
"data":"location/to/data.csv",
"model":"location/to/wiki.en.bin"
...
}
After creating a config.json file, the location can be passed when creating an instance of Optimus:
o = Optimus(config_path='path/to/config.json', ...)
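As a minimal sketch, a config file containing just the two keys shown above could be written from Python as follows. The helper name write_config is hypothetical, the paths are placeholders, and any further keys the pipeline accepts are not shown here. The demonstration writes into a temporary directory so that it does not touch your project.

```python
import json
import tempfile
from pathlib import Path


def write_config(path, data_path, model_path):
    """Write a config file with the "data" and "model" keys shown above."""
    config = {"data": data_path, "model": model_path}
    Path(path).write_text(json.dumps(config, indent=4))
    return config


# Placeholder paths: replace these with the real locations of your files.
config_file = Path(tempfile.mkdtemp()) / "config.json"
config = write_config(config_file, "location/to/data.csv", "location/to/wiki.en.bin")

# Read the file back to confirm it is valid JSON.
loaded = json.loads(config_file.read_text())
print(loaded)

# The file location can then be passed to Optimus as described above:
# o = Optimus(config_path=str(config_file))
```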
Further settings can be added on an ad hoc basis and will overwrite any previous settings. To do so, pass valid arguments to the Optimus class upon construction like so:
o = Optimus(
config_path='path/to/config.json',
data="path/to/new_data.csv",
cutoff=6,
...
)
Optimus has a default settings file to fall back on if none of this is provided. However, relying on the default settings alone might cause issues, mainly because the paths to the data and models in the default settings may not match your machine.
The file etc/config.json stores the default arguments used by Optimus. Please do not edit this file.
Shortened reference:
obj = Optimus()
-> Uses default settings
obj = Optimus(config_path='path/to/user/config.json')
-> Uses custom config file
obj = Optimus(distance=10, stepsize=2, cutoff=16 ...)
-> Replaces specific parameter values instead of those defined in the config file.
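The precedence described above (default settings, then a custom config file, then keyword arguments) can be pictured as a layered dictionary merge. The sketch below illustrates that idea only. It is not the Optimus implementation, and resolve_settings and the values used are hypothetical.

```python
def resolve_settings(defaults, user_config=None, **overrides):
    """Merge settings so that later layers overwrite earlier ones."""
    settings = dict(defaults)            # lowest priority: default settings
    settings.update(user_config or {})   # values from a custom config file
    settings.update(overrides)           # highest priority: keyword arguments
    return settings


# Hypothetical values, chosen only to show the precedence.
defaults = {"data": "default/data.csv", "cutoff": 10, "stepsize": 1}
user_config = {"data": "path/to/data.csv", "cutoff": 8}

only_defaults = resolve_settings(defaults)
with_config = resolve_settings(defaults, user_config)
with_override = resolve_settings(defaults, user_config, cutoff=6)

print(only_defaults)
print(with_config)
print(with_override)
```

Copying the defaults before updating keeps the default dictionary itself unchanged, which mirrors the advice not to edit the default settings file.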
Optimus takes in pandas.core.series.Series objects. In order to run a configured Optimus object on a series, simply call the object and enclose the desired series in the brackets. For example, for a pandas series called text:
from optimus import Optimus
O = Optimus()
results = O(text)
NOTE: If no data is passed into the Optimus object, the data defined in the config file will be used.
- save_csv: One can pass save_csv as an optional keyword argument. If the value is set to save_csv=True, Optimus will save the output DataFrame, which includes all the labels from each iteration, in the working directory as labelled.csv.
- full: Similarly, if one just needs a dataframe to be returned and not saved, use the full=True setting to receive back the dataframe containing the mapped labels.
- verbose: A boolean value which dictates how much is printed to the console as the code runs. Some outputs are still printed even if verbose=False, as this gives some idea of the progress of the processing.
The fastText model is large and requires a sizeable amount of RAM. Each instance of Optimus will load its own fastText model on the first processing call. It does this by checking whether the model has already been loaded and, if not, performing a ft.load_model() operation. Once it is loaded, all subsequent runs (based on the same instance of Optimus) should not reload the model.
The Optimus object has a replace_model method. This method aims to provide a way to control the memory usage of the Optimus object. It allows a user to load a new model in place of the current one, or just to remove the loaded model from the Optimus object.
The method takes a string or a fastText loaded model and assigns it to the Optimus object. If no model parameter is passed, the method will simply delete and garbage collect the existing loaded model.
o = Optimus(args, kwargs)
output = o(some_data)
# Load from a path
o.replace_model('string/path/to/model')
# Provide an already loaded model
o.replace_model(fastText.load_model('string/path/to/model'))
# Delete the existing model in the Optimus object
o.replace_model()
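The loading behaviour described in this section can be sketched with a small stand-in class. This is an illustration of the pattern under stated assumptions, not the Optimus source: LazyModelHolder and fake_loader are hypothetical names, and the loader returns a plain dictionary instead of a fastText model so that the example runs without the model file.

```python
import gc


class LazyModelHolder:
    """Hypothetical stand-in showing lazy loading and model replacement."""

    def __init__(self, model_path, loader):
        self.model_path = model_path
        self.loader = loader    # callable that turns a path into a model
        self.model = None       # nothing is loaded until the first call
        self.load_count = 0     # how many times the loader has run

    def __call__(self, data):
        if self.model is None:  # load once, on the first processing call
            self.model = self.loader(self.model_path)
            self.load_count += 1
        return [(item, self.model["name"]) for item in data]

    def replace_model(self, model=None):
        if model is None:
            self.model = None   # drop the reference to the loaded model
            gc.collect()        # and ask Python to reclaim the memory
        elif isinstance(model, str):
            self.model_path = model
            self.model = self.loader(model)  # load from the given path
            self.load_count += 1
        else:
            self.model = model  # an already loaded model object


def fake_loader(path):
    """Hypothetical loader; a real one would read the model from disk."""
    return {"name": path}


holder = LazyModelHolder("path/to/model", fake_loader)
holder(["a"])
holder(["b"])
loads_after_two_calls = holder.load_count   # the second call reuses the model

holder.replace_model("path/to/other_model")
result = holder(["c"])

holder.replace_model()                      # remove the model entirely
model_after_delete = holder.model
```

In this sketch a deleted model is simply reloaded on the next processing call. Whether Optimus behaves the same way after replace_model() is not stated here, so treat that detail as an assumption.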
This pipeline comes with a helpful embedding visualiser module. This set of functions allows users to pass in a pandas series of text entries and a fastText model. The model embeds these strings into an n-dimensional space, which is then reduced to a 2-dimensional space using t-SNE.
The result is then plotted and exported to 'embedding_plot.html', which is fully interactive.
import pandas as pd
from lib.emplot import plot
series = pd.Series(['string1', ..., 'string2'])
plot(series=series,
model='path/to/model.bin',
output_path='output_vectors.csv')
Ward linkage is computationally expensive. The process needs to calculate a pairwise distance matrix for all of the embedded vectors, and this is of order O(n^2) in the number of data points.
Where data starts to push the boundaries of what is available to the process, we currently recommend sampling your data points, using Optimus to categorise the sampled points, and then using (for example) a k-nearest neighbours (kNN) classifier to 'smear' the generated labels across the points nearby.
Example code to do this is provided in the sampling/ directory. The program performs a simple random sample of the content of your list and then embeds these words before using the approach outlined above to generate labels for the out of sample words. This approach is naive, but can provide a starting point for more complex sampling mechanisms such as the use of apricot.
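The sample-then-smear idea can be sketched in plain Python. This is not the code in the sampling/ directory: the 2-dimensional points, the label_sample stand-in for the Optimus labelling step, and the use of a single nearest neighbour (k = 1) are all assumptions made to keep the example short.

```python
import math
import random


def label_sample(points):
    """Hypothetical stand-in for running Optimus on the sampled points."""
    return ["left" if x < 0 else "right" for x, _ in points]


def nearest_label(point, labelled_points, labels):
    """Return the label of the closest labelled point (1-nearest neighbour)."""
    distances = [
        math.hypot(point[0] - other[0], point[1] - other[1])
        for other in labelled_points
    ]
    return labels[distances.index(min(distances))]


rng = random.Random(0)
# Two well separated clusters standing in for embedded vectors.
points = [(rng.uniform(-2, -1), rng.uniform(0, 1)) for _ in range(50)]
points += [(rng.uniform(1, 2), rng.uniform(0, 1)) for _ in range(50)]

# 1. Take a simple random sample of the data.
sample_idx = set(rng.sample(range(len(points)), 20))
sample = [points[i] for i in sorted(sample_idx)]

# 2. Label only the sample (this is the expensive step in practice).
sample_labels = label_sample(sample)

# 3. Smear the labels onto the out of sample points.
smeared = {
    i: nearest_label(points[i], sample, sample_labels)
    for i in range(len(points))
    if i not in sample_idx
}
print(len(smeared), "out of sample points labelled")
```

Only the 20 sampled points go through the expensive labelling step. Each remaining point needs distances to the sample only, which avoids building the full pairwise matrix.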
- Steven Hopkins
- Gareth Clews
- Arturas Eidukas
- Lucy Gwilliam
- Tom Hopkinson
This project is licensed under the MIT License - see the LICENSE.md file for details
[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@InProceedings{joulin2017bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
month={April},
year={2017},
publisher={Association for Computational Linguistics},
pages={427--431},
}