
This project contains a web scraper that downloads files to serve as training data for the document classification model, along with the train and test scripts.


Dipesh13/doc-classification


doc-classification

# TO-DO

  1. The scraper for guardian.com currently extracts articles only for the month of May. Make the date, month, and year command-line arguments that the user can specify.

  2. Make a new version that scrapes all the articles from the beginning.

  3. Combine the create-dataset and build-dataset steps into one script.

Industry-wise/company-wise extraction of PDFs from annualreports.com, in a separate repo.
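
As a starting point for to-do item 1, here is a minimal sketch of how the date could be taken from the command line with `argparse`. The names `parse_target_date` and `date_path`, the flag names, and the `YYYY/mon/DD` path format are all assumptions made for illustration. They are not taken from the existing scraper, so adjust them to whatever the scraper actually expects.

```python
import argparse
import datetime

# Lowercase month abbreviations, kept explicit so the output does not
# depend on the system locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def parse_target_date(argv=None):
    """Parse --year, --month and --day into a validated datetime.date.

    Hypothetical helper: the flag names are illustrative only.
    """
    parser = argparse.ArgumentParser(
        description="Choose the date whose articles should be scraped."
    )
    parser.add_argument("--year", type=int, required=True,
                        help="four-digit year, e.g. 2019")
    parser.add_argument("--month", type=int, required=True,
                        choices=range(1, 13), metavar="1-12",
                        help="month number")
    parser.add_argument("--day", type=int, default=1,
                        help="day of the month (defaults to 1)")
    args = parser.parse_args(argv)
    try:
        # datetime.date rejects impossible dates such as 30 February.
        return datetime.date(args.year, args.month, args.day)
    except ValueError as exc:
        # parser.error prints the message and exits with status 2.
        parser.error(str(exc))


def date_path(target):
    """Format a date as 'YYYY/mon/DD' (assumed path format, not confirmed)."""
    return f"{target.year}/{MONTHS[target.month - 1]}/{target.day:02d}"
```

For example, `date_path(parse_target_date(["--year", "2019", "--month", "5", "--day", "3"]))` gives `"2019/may/03"`. Passing `argv=None` makes `argparse` read the real command line, so the same function works both from a script entry point and from tests.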
