
Sokrates


About

'Sokrates' is an ML-powered assistant that helps you write better questions!


Repository Contents

  • The app directory contains all code necessary to run the HTTP service. Within it, the app_core package handles all core logic such as model inference, while the api package deals with handling HTTP requests for model inference. They communicate through the app_core.handlers module.

  • Most code is contained in the app/app_core package.

  • The app_core.data_processing package contains data extraction and preprocessing functionalities. Within it:

    • The text_extract package contains classes used to extract features from text. They should all follow the Extractor interface.

    • The XMLparser module contains functionality to parse the StackExchange .xml files and convert them to dataframes.

    • The make_dataset_csv module uses text_extract and XMLparser to process the StackExchange .xml files and export them as csv.

  • The app_core.ml_models package contains managers (wrappers) to handle the ML models themselves.

  • The basic_nlp_model package contains code to quickly build and test neural network models on the dataset.

  • The notebooks directory contains several Jupyter notebooks with data exploration and model testing.
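The Extractor interface mentioned above is not spelled out in this README. As a rough illustration only, classes in text_extract might follow a shape like the sketch below (the names `Extractor`, `extract`, and `WordCountExtractor` are assumptions for illustration, not the repository's actual API):

```python
from abc import ABC, abstractmethod

class Extractor(ABC):
    """Hypothetical base class; the repository's real interface may differ."""

    @abstractmethod
    def extract(self, text: str) -> dict:
        """Return a mapping of feature names to values for one document."""

class WordCountExtractor(Extractor):
    """Toy example: counts whitespace-separated tokens."""

    def extract(self, text: str) -> dict:
        return {"word_count": len(text.split())}
```

The point of a shared interface is that make_dataset_csv can iterate over a list of extractors and concatenate their outputs without caring how each feature is computed.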

Instructions and Usage

Setup

To run the project, first make sure the required packages are installed:

pip install -r requirements.txt

You must also download the required nltk data. To do this, run the following in a Python session:

import nltk
nltk.download("punkt")  # Punkt for tokenizing

Building the dataset

To build the dataset, first download and decompress the data files from the StackExchange data dump. You will then have a collection of directories (one per topic) containing .xml files. If mydir is the directory that contains these folders and outdir is the directory where you want to store the output CSVs, you can generate them with:

python -m data_processing mydir outdir

You can also add an optional third argument (True or False) to force the re-processing of existing CSVs.

Running Simple Baseline

To run the first simple baseline of the model, run:

python -m ml_models

This will prompt you for the title and body of your question; the body can also be a path to a file where the question is stored as rendered HTML.

Running the Server

Run with Docker

To start the HTTP server with docker, do the following:

  • First, install Docker.
  • Second, prepare your .env file. It must follow this template:
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
BUCKET_NAME=
MODEL_PATH=
ENV=development
  • Note that you must have access to the S3 bucket where we are storing our models! For production deployment, the ENV variable must be set to production and the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables SHOULD NOT BE SET. The credentials should be handled via an AWS IAM role!
  • Third, navigate to the app directory and build the docker image with:
docker build -t sokrates:<version> .
  • Finally, run the container with
docker run -p 3000:3000 --env-file <path-to-.env-file> -d sokrates:<version>

This may take a minute or two to initialize while it downloads the model.
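The credential rule above (explicit keys in development, IAM role in production) can be sketched as a small piece of configuration logic. This is an illustration only: the function name `s3_client_kwargs` is hypothetical, and how the app actually constructs its S3 client is not shown in this README.

```python
def s3_client_kwargs(env: dict) -> dict:
    """Build keyword arguments for an S3 client from environment variables.

    Hypothetical sketch: in production, return no explicit credentials so
    they are resolved through the AWS IAM role; in development, pass the
    keys supplied via the .env file.
    """
    if env.get("ENV") == "production":
        return {}  # credentials come from the IAM role
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
```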

Run with Docker Compose

If you have installed docker-compose and you prefer one-liners, you can also start the server by running

docker-compose up

You can add the --build flag to update the image.
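The compose file itself is not reproduced here. A minimal sketch of what it might contain, mirroring the docker run flags above (the service name and build path are assumptions; the repository's actual file may differ):

```yaml
# Hypothetical docker-compose.yml sketch, not the repository's actual file.
services:
  sokrates:
    build: ./app
    ports:
      - "3000:3000"
    env_file:
      - .env
```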

Note on Docker Startup Time

If you want a faster startup time locally, you can persist the downloaded model from the container in a Docker volume or bind mount.
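For example, with a bind mount the model survives container restarts. The container-side path below is an assumption; match it to wherever your MODEL_PATH points inside the container.

```shell
# Sketch: persist the downloaded model across restarts with a bind mount.
# /app/models is an assumed container path; adjust to match MODEL_PATH.
docker run -p 3000:3000 --env-file .env \
  -v "$(pwd)/models:/app/models" \
  -d sokrates:<version>
```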
