'Sokrates' is an ML-powered assistant to help write better questions!
- Esteban Lopez: esteban@factored.ai
- David Stiles: david@factored.ai
- The `app` directory contains all code necessary to run the HTTP service. Within it, the `app_core` package handles all core logic such as model inference, while the `api` package handles HTTP requests for model inference. The two communicate through the `app_core.handlers` module.
- Most code is contained in the `app/app_core` package.
- The `app_core.data_processing` package contains data extraction and preprocessing functionality. Within it:
  - The `text_extract` package contains classes used to extract features from text. They should all follow the `Extractor` interface (see the sketch after this list).
  - The `XMLparser` module contains functionality to parse the StackExchange `.xml` files and convert them to dataframes.
  - The `make_dataset_csv` module uses `text_extract` and `XMLparser` to process the StackExchange `.xml` files and export them as CSV files.
- The `app_core.ml_models` package contains managers (wrappers) that handle the ML models themselves.
- The `basic_nlp_model` package contains code to quickly build and test neural network models on the dataset.
- The `notebooks` directory contains several Jupyter notebooks with data exploration and model testing.
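For orientation, here is a minimal sketch of what an `Extractor`-style class could look like. The method name `extract`, the abstract-base-class approach, and the `WordCountExtractor` example are illustrative assumptions only; refer to `app_core.data_processing.text_extract` for the actual interface.

```python
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Hypothetical sketch of the feature-extractor contract (names assumed)."""

    @abstractmethod
    def extract(self, text: str) -> dict:
        """Return a mapping of feature names to feature values for the given text."""


class WordCountExtractor(Extractor):
    """Toy extractor: counts whitespace-separated tokens."""

    def extract(self, text: str) -> dict:
        return {"word_count": len(text.split())}
```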
In order to run the project, first ensure that the required packages are installed, which you can do with:
pip install -r requirements.txt
You must also install the nltk dependencies. To do this, run the following in a `python` session:
import nltk
nltk.download("punkt") # Punkt for tokenizing
To build the dataset, you must first download and decompress the data files from the Stack Exchange data dump. After this you will have a collection of directories (one per topic) containing `.xml` files. If `mydir` is the directory that contains these folders and `outdir` is the directory where you want to store the output CSVs, you can generate them with:
python -m data_processing mydir outdir
You can also add an optional third argument (`True` or `False`) to force the re-processing of existing CSVs.
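For example, to force re-processing of CSVs that already exist (assuming `True` enables it, as described above):

```
python -m data_processing mydir outdir True
```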
To run the first simple baseline of the model, run:
python -m ml_models
This will then prompt you for the title and body of your question; the body can also be given as a path to a file where the question is stored as rendered HTML.
To start the HTTP server with Docker, do the following:
- First, install Docker.
- Second, prepare your `.env` file. It must follow this template:
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
BUCKET_NAME=
MODEL_PATH=
ENV=development
- Note that you must have access to the S3 bucket where we are storing our models! For production deployment, the `ENV` variable must be set to `production` and the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` variables SHOULD NOT BE SET. The credentials should be handled via an AWS IAM role! (See the example production `.env` sketched after these steps.)
- Third, navigate to the `app` directory and build the docker image with:
docker build -t sokrates:<version> .
- Finally, run the container with:
docker run -p 3000:3000 --env-file <path-to-.env-file> -d sokrates:<version>
This may take a minute or two to initialize while it downloads the model.
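For reference, a production-style `.env` under the constraints above would look something like the following. The placeholder values are illustrative, not real configuration:

```
BUCKET_NAME=<your-model-bucket>
MODEL_PATH=<path-to-model-within-bucket>
ENV=production
```

No AWS keys appear in the file; the container is expected to obtain credentials from its IAM role.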
If you have installed `docker-compose` and you prefer one-liners, you can also start the server by running
docker-compose up
You can add the `--build` flag to update the image.
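For example, to rebuild the image and then start the server in one step:

```
docker-compose up --build
```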
If you want a faster startup time LOCALLY, you can persist the downloaded model from the container in a Docker volume or bind mount.
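A minimal sketch of the named-volume approach, assuming the container stores the downloaded model under a directory such as `/app/models`. That path is an assumption for illustration; it must match the directory portion of your `MODEL_PATH` setting.

```
# 'sokrates-model' is a hypothetical volume name; '/app/models' is an assumed model directory.
docker volume create sokrates-model
docker run -p 3000:3000 --env-file <path-to-.env-file> \
  -v sokrates-model:/app/models \
  -d sokrates:<version>
```

If the model persists in the volume between runs, subsequent startups should be faster, as noted above.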