GitHub - Dipesh13/doc-classification: Project contains a web scraper to download files that serve as training data for the doc classification model, along with the train and test scripts.

# TO-DO

Currently the scraper for guardian.com extracts articles only for the month of May; make the date, month and year command line arguments that the user can specify.
Make a new version to scrape all the articles from the beginning.
Combine create_dataset.py and build_dataset.py into one script.

Industry-wise/company-wise extraction of PDFs from annualreports.com in a separate repo.
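
The first TO-DO item (date, month and year as command line arguments) could be approached roughly as follows. This is only a sketch using the standard library: the flag names `--year`, `--month` and `--day` and the helpers `parse_args` and `date_range` are hypothetical, not taken from `guardian.py`, and how the resulting dates are passed to the scraper is left open.

```python
import argparse
import datetime


def parse_args(argv=None):
    """Parse the year, month and optional day to scrape (hypothetical flags)."""
    parser = argparse.ArgumentParser(
        description="Select which articles to scrape by date."
    )
    parser.add_argument("--year", type=int, required=True,
                        help="four-digit year, e.g. 2019")
    parser.add_argument("--month", type=int, required=True,
                        choices=range(1, 13), metavar="MONTH",
                        help="month number, 1-12")
    parser.add_argument("--day", type=int, default=None,
                        help="day of month; omit to cover the whole month")
    return parser.parse_args(argv)


def date_range(year, month, day=None):
    """Return inclusive (start, end) dates for one day or a whole month."""
    if day is not None:
        # datetime.date raises ValueError for impossible dates such as 31 April.
        single = datetime.date(year, month, day)
        return single, single
    start = datetime.date(year, month, 1)
    # The first day of the following month minus one day is the last day
    # of the requested month; December rolls over into the next year.
    next_month = datetime.date(year + month // 12, month % 12 + 1, 1)
    return start, next_month - datetime.timedelta(days=1)
```

Assuming the scraper were wired to these helpers, a call such as `parse_args(["--year", "2019", "--month", "5"])` followed by `date_range(args.year, args.month, args.day)` would yield the first and last day of May 2019, which reproduces the current hard-coded behaviour while allowing other months.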

README.md
annualreports.py
bob.ipynb
build_dataset.py
config.json
config.py
create_dataset.py
extract_text.py
guardian.py
model.py
model_embeddings.py
model_v1.py
model_v2.py
template_countvectorizer.py
template_voting_classifier.py
test.py
train.csv
train.py
