# Welcome to the Python-based topic modeling pipeline

This set of Python scripts has been developed for teaching purposes. Its fundamental aim is to provide a minimal, working implementation of a Python-based processing pipeline for Topic Modeling.

These scripts are meant to complement a slide deck that explains Topic Modeling to an audience of scholars from the (Digital) Humanities. See: https://christofs.github.io/riga/#/.

## Requirements

You will need a computer on which you can install and run Python. Most modern laptops running a reasonably recent version of Windows, macOS, or Linux (e.g. Ubuntu 18.04+) will be fine; smaller devices such as tablets running Android or iOS will not be sufficient.

Please install the following: a recent version of Python 3 and the Python libraries the pipeline relies on, in particular TextBlob (together with its NLTK corpora) for the linguistic annotation and gensim for the topic modeling (both are discussed in the usage notes below).

## Installation

Once you have installed the above-mentioned software and Python libraries, it is sufficient to download or clone this GitHub repository. You can then run the scripts and access the sample dataset.

## Testing your installation

Before trying to use the scripts, you should test your installation. For this test, please follow the instructions provided in the folder called "test".
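
In addition to the instructions in the "test" folder, a quick way to check that the main libraries mentioned in this document can be imported is a few lines of Python like the following (a sketch; it only checks imports, not the pipeline itself):

```python
# Sketch: verify that the libraries mentioned in this HOWTO (TextBlob, NLTK,
# gensim) can be imported in your Python environment.
import sys

for name in ("textblob", "nltk", "gensim"):
    try:
        module = __import__(name)
        print(name, getattr(module, "__version__", "(version unknown)"), "-- OK")
    except ImportError as err:
        print(name, "is missing:", err, file=sys.stderr)

print("Python", sys.version.split()[0])
```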

## Usage notes

- For simplicity's sake, this pipeline relies on TextBlob for the linguistic annotation. TextBlob is an interface to NLTK and provides annotation resources for English, French and German. If you want to work with a different language, you need to supply the annotation step yourself, essentially by plugging in an alternative to the module called `preprocessing.py` (see the first sketch after this list).
- The pipeline assumes that your corpus is available as plain text files encoded in UTF-8, with each document in its own file. The filename serves as the text's identifier (see the corpus-loading sketch after this list).
- The `make_heatmap.py` visualization requires a metadata file in CSV format called `metadata.csv`. You need to adjust the variable `cats` in `run_pipeline.py`, and possibly make minor adjustments in the `make_heatmap.py` script, to account for the metadata categories present in your metadata file (see the metadata sketch after this list). If you don't need the heatmap or don't have useful metadata, simply comment out the corresponding line like so: `#make_heatmap.main(workdir, identifier, cats)`.
- If you want to use your own datasets, simply add a folder in the datasets directory and replicate the folder structure of the example datasets.
- The script is not optimized for speed and will run for a long time on larger datasets; it mostly needs a fast CPU. To help you monitor the progress of the preprocessing and modeling steps, which can take considerable time depending on the size of your dataset, there are two logging mechanisms. In the preprocessing step, the output shows how many of the text files have already been preprocessed. In the modeling step, a file called `gensim.log` at the top level of the repository is updated continuously by gensim; open it in a text editor and scroll to the bottom from time to time to monitor progress. Among other things, it shows the number of passes already completed (see the log-monitoring sketch after this list).
- It is a good idea to start with a small dataset (one of the test datasets) and a small number of passes to check that everything works as expected before modeling a larger, more realistic dataset.
- Best results are obtained with a large number of short texts (more than 5000 texts of 1000-5000 words each is a good target). Generally speaking, the larger the dataset, the higher the number of topics you can usefully obtain. If you want to work with longer texts, such as novels, it makes sense to split them into smaller chunks first, using the separate `split_text.py` script (the idea is illustrated after this list).
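
As a very rough illustration of the first note above, a replacement for the annotation step could look something like the following sketch. The function name `preprocess_text` and its return value (a list of lowercased lemmas) are assumptions made for this example and need to be adapted to the interface that `preprocessing.py` actually uses.

```python
# Sketch of an alternative annotation step (illustrative only).
# The function name and return type are assumptions; adapt them to the
# interface expected by preprocessing.py in this pipeline.
from textblob import TextBlob

def preprocess_text(text):
    """Return a list of lowercased lemmas for one document (English default)."""
    blob = TextBlob(text)
    # For a language not covered by TextBlob/NLTK, replace this line with
    # your own tokenizer and lemmatizer.
    return [word.lemmatize().lower() for word in blob.words]

if __name__ == "__main__":
    print(preprocess_text("Topics were modeled from the documents."))
```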
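
Reading a corpus laid out as described above (one UTF-8 text file per document, filename as identifier) only takes a few lines of standard-library Python; the folder name `datasets/mycorpus` below is a placeholder, not a folder shipped with the repository:

```python
# Sketch: load a corpus stored as one UTF-8 .txt file per document,
# using the filename (without extension) as the document identifier.
from pathlib import Path

def load_corpus(folder):
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        corpus[path.stem] = path.read_text(encoding="utf-8")
    return corpus

if __name__ == "__main__":
    docs = load_corpus("datasets/mycorpus")  # placeholder path
    print(len(docs), "documents loaded")
```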
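
If you are unsure which category names to put into `cats`, it can help to print the header of your `metadata.csv` first; the small helper below is not part of the pipeline, just a convenience sketch:

```python
# Sketch: print the column names of metadata.csv so you can decide which
# categories to list in the cats variable in run_pipeline.py.
import csv

with open("metadata.csv", encoding="utf-8", newline="") as infile:
    header = next(csv.reader(infile))
print("Columns found in metadata.csv:", header)
```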
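
Instead of reopening `gensim.log` in an editor over and over, you can also print its last lines from Python. The helper below is a convenience sketch, not part of the pipeline; it only assumes that the log file sits at the top level of the repository, as described above:

```python
# Sketch: print the last lines of gensim.log to check how many passes
# the modeling step has completed so far.
from pathlib import Path

def tail(logfile="gensim.log", n=10):
    lines = Path(logfile).read_text(encoding="utf-8", errors="replace").splitlines()
    for line in lines[-n:]:
        print(line)

if __name__ == "__main__":
    tail()
```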
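
The splitting of long texts is done by the repository's `split_text.py`; purely to illustrate the idea behind it, cutting a long text into fixed-size word chunks can be as simple as this sketch (the chunk size of 1000 words is an arbitrary example):

```python
# Sketch of the chunking idea: split a long text into consecutive segments
# of roughly chunk_size words each (split_text.py is the script actually
# intended for this step in the pipeline).
def split_into_chunks(text, chunk_size=1000):
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]
```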

## Licence

This software is distributed under the so-called "Unlicense", that is, with no restrictions or conditions on re-use, re-distribution, or modification. However, it also comes with no warranty whatsoever. See: https://choosealicense.com/licenses/unlicense/

## Maintainer and contact

This code has been put together by Christof Schöch, University of Trier, Germany, in July 2019. If you have suggestions for improvement or run into problems, please use the GitHub issue tracker. Last (minor) update December 14, 2022.