This is a natural-language processing (NLP) pipeline. It currently supports the following stages:
- Scrape articles from Wikipedia.
- Clean the scraped text.
- Perform a rudimentary analysis on the cleaned text.
- Split the text into training / testing / validation files.
- Perform frequency filtering (a sketch of this step is shown after this list).
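
The frequency-filtering stage is not documented in detail here. Purely as an illustration, a minimal sketch of token frequency filtering might look like the following; the function name, threshold, and `<unk>` placeholder are assumptions, not the pipeline's actual implementation:

```python
from collections import Counter

def filter_rare_tokens(lines, min_count=5):
    """Replace tokens seen fewer than min_count times with <unk>.

    Illustrative sketch only; the real stage may use a different
    threshold, placeholder token, or tokenization.
    """
    counts = Counter(token for line in lines for token in line.split())
    return [
        " ".join(tok if counts[tok] >= min_count else "<unk>" for tok in line.split())
        for line in lines
    ]

# Example: filter_rare_tokens(["a b a", "a c"], min_count=2)
# returns ["a <unk> a", "a <unk>"] because only "a" occurs twice or more.
```
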
Edit the pipeline config files to enable the stages you want, then run:
make run
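
The exact format of the config files is not shown above. Purely as an illustration, a stage-toggle section could be read like this; the file name `pipeline.yaml`, the `stages` key, and the flag names are all hypothetical, and PyYAML is assumed to be available:

```python
import yaml  # PyYAML; assumed, since the real config format is not documented here

# Hypothetical layout: a top-level "stages" mapping of booleans, e.g.
#   stages:
#     wikipedia_scraping: true
#     cleaning: true
#     frequency_filtering: false
with open("pipeline.yaml") as f:  # file name is an assumption
    config = yaml.safe_load(f)

enabled = [name for name, on in config.get("stages", {}).items() if on]
print(f"Stages enabled in this run: {enabled}")
```
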
To run only the Wikipedia scraping stage:
make wikipedia-scraping
To run only the SRILM model stage (this works only if the scraping stage has been run first):
make srilm-model
To clean the directory:
make clean
This project uses Python's logging library. The workflow writes log files to the logs folder. When creating new stages, use logger.info / logger.debug / logger.warning / logger.error instead of print so that output ends up in the logs.
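
As an illustration, a new stage might obtain and use its logger as sketched below. The logger name, function name, and file handling are assumptions; the project's actual logging configuration (handlers, formatters, log file paths) may differ:

```python
import logging

# Module-level logger for a hypothetical new stage. If the project configures
# handlers centrally, getLogger() alone is enough here.
logger = logging.getLogger("pipeline.my_new_stage")

def run_my_new_stage(input_path: str) -> None:
    """Hypothetical stage that just counts lines, to show the logging calls."""
    logger.info("Starting my_new_stage on %s", input_path)
    try:
        with open(input_path, encoding="utf-8") as f:
            n_lines = sum(1 for _ in f)
        logger.debug("Read %d lines from %s", n_lines, input_path)
    except FileNotFoundError:
        logger.error("Input file not found: %s", input_path)
        raise
    logger.warning("my_new_stage is a placeholder and does no real work yet")
```
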