Natural Language Processing pipeline.

justinhchae/nlp-pipeline

NLP pipeline

This is an NLP pipeline.

Overview

This is a natural-language processing pipeline. Currently it supports these stages:

  1. Scrape articles from Wikipedia.
  2. Clean the scraped text.
  3. Perform a rudimentary analysis on the cleaned text.
  4. Split the text into training / testing / validation files.
  5. Perform frequency filtering.
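As a rough illustration of what stages 4 and 5 involve, here is a minimal Python sketch. It is not the repository's implementation: the function names `split_lines` and `filter_rare_tokens`, the 80/10/10 ratios, the `min_count` threshold, and the `<unk>` placeholder are all assumptions made for this example, and the README does not say how the pipeline actually performs either step.

```python
import random
from collections import Counter


def split_lines(lines, train=0.8, test=0.1, seed=0):
    """Shuffle lines and split them into train / test / validation lists.

    Hypothetical helper: whatever is left after the train and test
    portions goes to the validation list.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )


def filter_rare_tokens(lines, min_count=2, unk="<unk>"):
    """Replace whitespace-separated tokens seen fewer than min_count times.

    Hypothetical helper: one common form of frequency filtering, assumed
    here because the README does not define the term.
    """
    counts = Counter(token for line in lines for token in line.split())
    return [
        " ".join(tok if counts[tok] >= min_count else unk for tok in line.split())
        for line in lines
    ]
```

In a setup like this, the token counts would typically be computed on the training portion only, so that rare words in the test and validation files do not influence the vocabulary.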

Executing pipeline / workflow

Edit the pipeline config files to select the stages you want to run, then run the following command:

make run

To scrape Wikipedia only:

make wikipedia-scraping

To run only the SRILM model (this works only if you have already run a scraper pipeline):

make srilm-model

To clean the directory:

make clean

Logging

This project uses the logging library. The workflow generates log files, which can be found in the logs folder. When creating new stages, use logger.info / debug / error / warning instead of print so that messages are logged properly.
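The sketch below shows one way a new stage could obtain a logger that writes to the logs folder. It assumes the project relies on Python's standard `logging` module, which the README does not state explicitly. The helper name `get_stage_logger`, the per-stage file naming, and the message format are hypothetical and may differ from the repository's own setup.

```python
import logging
from pathlib import Path


def get_stage_logger(name, log_dir="logs"):
    """Return a logger that writes to <log_dir>/<name>.log (hypothetical helper)."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(Path(log_dir) / f"{name}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A stage would then call, for example, `logger = get_stage_logger("my_stage")` followed by `logger.info("stage started")`, rather than `print("stage started")`; `my_stage` is a placeholder name.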
