Scripts to automatically generate submission bids for PC members. This version uses a fairly naive matching algorithm based on the Toronto Paper Matching System. The `prior_bids` branch uses a more sophisticated matching algorithm based on StarSpace, and it also has machinery to incorporate prior bidding information.
The scripts are written for Python 2 and rely on a number of Python packages, as well as a few external tools. Here's a summary of how I installed all of the necessary dependencies on Mac OS X. YMMV on other platforms.
```
sudo pip install -U subprocess32
sudo pip install -U nltk
sudo pip install -U --ignore-installed gensim
brew install wget
brew cask install pdftotext
```
I experimented with using the Python library `slate` to parse PDFs, but found it to be annoyingly slow. The code is still available in `analysis.py`. If you want to enable it, you'll need to run:
```
pip install pdfminer
pip install git+https://github.com/timClicks/slate.git
```
Each script depends on the shared definitions in `common.py`. Each one will describe its command-line options when run with `-h`. Note that the scripts use the Python library `pickle` to save information, so if you alter the classes in `common.py`, you will be unlikely to be able to load previously saved data.
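For context, the caching is plain `pickle` serialization. Here is a minimal sketch of the pattern; the function names are illustrative, not the exact code in `common.py`:

```python
import pickle

def save_cache(pc, path="pc.dat"):
    # Serialize the PC state; classes defined in common.py that are
    # reachable from `pc` are pickled by reference to their definitions.
    with open(path, "wb") as f:
        pickle.dump(pc, f)

def load_cache(path="pc.dat"):
    # Unpickling re-imports the classes from common.py; if their
    # definitions have changed incompatibly, this can raise an error
    # or yield objects missing the newer attributes.
    with open(path, "rb") as f:
        return pickle.load(f)
```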
The current pipeline consists of the following steps:
1. Fetch publications for each PC member:

   ```
   ./fetch.py --cache pc.dat --csv pc-info.csv
   ```

   This parses the CSV file `pc-info.csv` to find a name, email, and URL for each PC member (see `parse_csv` in `common.py` for formatting details; if your CSV has different header names, just change the relevant labels used as indices into the `row` variable, as in the first sketch after this list). The script fetches each URL and parses it for direct links to PDF files. It then retrieves those files (using `wget`) and stores them in a directory for each PC member. Information about the PC is saved into the `pc.dat` file. If you later update `pc-info.csv` with new members, you can run the command above, and it will only do additional work for the new members. If an existing reviewer's email or URL changes, then that PC member's status will be reset, and the script will attempt to fetch publications again.
2. Analyze the PC publications:

   ```
   ./analysis.py --cache pc.dat
   ```

   This extracts and normalizes text from the PC's publications. By default, this script will spawn threads equal to the number of your CPUs. Use `-j` to choose another value.
3. Analyze the submissions:

   ```
   ./analysis.py --submissions submission_dir
   ```

   This saves the results in `submission_dir/submissions.dat`.
4. Build a topic model that maps word occurrences to different topics:

   ```
   ./analysis.py --corpus corpus_dir
   ```

   The `corpus_dir` can be any directory of representative publications. You can use PDFs from a previous year, or pool all of the PC PDFs and/or submissions into the directory. The model is saved in `corpus_dir/lda_model.*`. You can change the parameters used to learn the model (e.g., how many topics to extract, how many passes to make, etc.) by editing the call to `gensim.models.ldamodel.LdaModel` (see the second sketch after this list).
5. [Optional] Associate PC reviewers with SQL IDs:

   ```
   ./util.py --cache pc.dat --pcids pc-info.csv
   ```

   where the CSV file contains a comma-separated first name, last name, and SQL ID. For HotCRP, try:

   ```
   select firstName, lastName, contactId from ContactInfo
   ```

   This will allow the scripts to create MySQL scripts that can feed bid info directly into the database.
6. Generate bids for submissions:

   ```
   ./bid.py --cache pc.dat --submissions submission_dir --corpus corpus_dir
   ```

   For each PC member, this will generate a bid for each submission in `submission_dir` (the third sketch after this list illustrates the general scoring idea). The bids will be output in a `bid.csv` file in each member's directory, as well as a `bid.mysql` file if you completed step 5 above. To make use of the SQL commands, try something like:

   ```
   for b in `ls */bid.mysql`; do cat $b >> bids.mysql; done
   mysql db_name -u user_name -p < bids.mysql
   ```
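Regarding step 1, `parse_csv` just indexes into each row by header label, so adapting it to a different CSV layout means changing those labels. A minimal sketch of the pattern; the labels shown (`name`, `email`, `url`) are assumptions, so check `common.py` for the actual ones:

```python
import csv

def parse_csv(path):
    # Each row is looked up by header label; if your CSV uses different
    # column names, change the keys below to match your header line.
    members = []
    with open(path) as f:
        for row in csv.DictReader(f):
            members.append((row["name"], row["email"], row["url"]))
    return members
```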
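The tuning mentioned in step 4 amounts to changing keyword arguments in the `gensim.models.ldamodel.LdaModel` call. A sketch of what that call looks like; the tiny corpus and the specific parameter values here are illustrative only:

```python
from gensim import corpora, models

# `texts` is a list of tokenized documents, one per publication.
texts = [["paper", "matching", "topic"], ["reviewer", "bidding"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Knobs worth experimenting with: num_topics (how many topics to
# extract) and passes (how many sweeps over the corpus to make).
lda = models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=100,
    passes=10,
)
```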
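For intuition about step 6: a TPMS-style bid is essentially a similarity score between a reviewer's publications and a submission in the learned topic space. The sketch below shows that general idea, not the exact computation in `bid.py`; the helper names are hypothetical:

```python
from gensim import matutils

def topic_vector(lda, dictionary, tokens):
    # Project a tokenized document into the LDA topic space.
    return lda[dictionary.doc2bow(tokens)]

def affinity(lda, dictionary, reviewer_docs, submission_tokens):
    # Score a reviewer against a submission by the best cosine
    # similarity over the reviewer's publications.
    sub_vec = topic_vector(lda, dictionary, submission_tokens)
    return max(
        matutils.cossim(topic_vector(lda, dictionary, doc), sub_vec)
        for doc in reviewer_docs
    )
```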
At any time, you can use the `util.py` script to check on the status of the PC and to reset the status of an individual PC member or the entire PC, in case you want to rerun only a portion of the pipeline above.