This is a natural-language processing (NLP) pipeline. It currently supports the following stages:
- Scrape articles from Wikipedia.
- Clean the scraped text.
- Perform a rudimentary analysis on the cleaned text.
- Split the text into training / testing / validation files.
- Perform frequency filtering (a sketch of this step is shown after this list).
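
The frequency-filtering stage is not documented in detail here. Purely as an illustration, a minimal sketch of token frequency filtering might look like the following; the function name, threshold, and `<unk>` placeholder are assumptions, not the pipeline's actual implementation:

```python
from collections import Counter

def filter_rare_tokens(lines, min_count=5):
    """Replace tokens seen fewer than min_count times with <unk>.

    Illustrative sketch only; the real stage may use a different
    threshold, placeholder token, or tokenization.
    """
    counts = Counter(token for line in lines for token in line.split())
    return [
        " ".join(tok if counts[tok] >= min_count else "<unk>" for tok in line.split())
        for line in lines
    ]

# Example: filter_rare_tokens(["a b a", "a c"], min_count=2)
# returns ["a <unk> a", "a <unk>"] because only "a" occurs twice or more.
```
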
Edit the pipeline config files to enable the stages you want, then run:
make run
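
The exact format of the config files is not shown above. Purely as an illustration, a stage-toggle section could be read like this; the file name `pipeline.yaml`, the `stages` key, and the flag names are all hypothetical, and PyYAML is assumed to be available:

```python
import yaml  # PyYAML; assumed, since the real config format is not documented here

# Hypothetical layout: a top-level "stages" mapping of booleans, e.g.
#   stages:
#     wikipedia_scraping: true
#     cleaning: true
#     frequency_filtering: false
with open("pipeline.yaml") as f:  # file name is an assumption
    config = yaml.safe_load(f)

enabled = [name for name, on in config.get("stages", {}).items() if on]
print(f"Stages enabled in this run: {enabled}")
```
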
To run only the Wikipedia scraping stage:
make wikipedia-scraping
To run only the SRILM model stage (this works only if the scraping stage has been run first):
make srilm-model
To clean the directory:
make clean
This project uses Python's logging library. The workflow writes log files to the logs folder. When creating new stages, use logger.info / logger.debug / logger.warning / logger.error instead of print so that output ends up in the logs.
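
As an illustration, a new stage might obtain and use its logger as sketched below. The logger name, function name, and file handling are assumptions; the project's actual logging configuration (handlers, formatters, log file paths) may differ:

```python
import logging

# Module-level logger for a hypothetical new stage. If the project configures
# handlers centrally, getLogger() alone is enough here.
logger = logging.getLogger("pipeline.my_new_stage")

def run_my_new_stage(input_path: str) -> None:
    """Hypothetical stage that just counts lines, to show the logging calls."""
    logger.info("Starting my_new_stage on %s", input_path)
    try:
        with open(input_path, encoding="utf-8") as f:
            n_lines = sum(1 for _ in f)
        logger.debug("Read %d lines from %s", n_lines, input_path)
    except FileNotFoundError:
        logger.error("Input file not found: %s", input_path)
        raise
    logger.warning("my_new_stage is a placeholder and does no real work yet")
```
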