News-Popularity-Prediction

Step1 web scraping

web scraping.py

Use Scrapy to scrape data (news title, date, content, number of reviews, number of participants, and the first 10 reviews) from www.sina.com.cn. Note: the code can easily be adapted to other websites.

Step2 data preprocessing

word segmentation.py

Because Chinese text has no spaces between words, it must be segmented before any text analysis. The code uses 'jieba' to segment words.
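As a minimal sketch of this step (not the contents of word segmentation.py), the snippet below wraps `jieba.lcut`, which returns a list of tokens. The names `segment`, `cut`, and `to_word2vec_line` are hypothetical, and the `jieba` package is assumed to be installed.

```python
def segment(text, cut=None):
    """Split a string into word tokens, dropping whitespace-only tokens.

    `cut` is any callable that maps a string to a list of tokens. It
    defaults to jieba.lcut, imported lazily so the helper can also be
    used with another segmenter.
    """
    if cut is None:
        import jieba  # assumption: third-party package, `pip install jieba`
        cut = jieba.lcut
    return [token for token in cut(text) if token.strip()]


def to_word2vec_line(tokens):
    """Join tokens with single spaces, the whitespace-separated form word2vec reads."""
    return " ".join(tokens)
```

Calling `segment(title)` on each scraped title and writing `to_word2vec_line(...)` per line would give a corpus with explicit word boundaries.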

delete stop words.R

Remove stop words using a reference list of Chinese stop words.
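The repository does this step in R. The sketch below shows the same idea in Python, purely for illustration. It assumes a stop word file with one word per line. The function names and the file layout are hypothetical.

```python
def load_stop_words(path, encoding="utf-8"):
    """Read a stop word list (assumed: one word per line) into a set."""
    with open(path, encoding=encoding) as handle:
        return {line.strip() for line in handle if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop word set, preserving order."""
    return [token for token in tokens if token not in stop_words]
```

A set is used so that each membership check takes constant time on average, which matters when filtering many titles and reviews.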

Step3 text vectorization

Use the word2vec tool provided by Google (available at http://word2vec.googlecode.com/svn/trunk/) to train word vectors and group the words into clusters.
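One way to drive this step from Python is sketched below. It assumes the compiled `word2vec` binary sits in the current directory and that clustering is done with the tool's `-classes` option, which writes one `word cluster_id` pair per line instead of vectors. The paths, parameter values, and function names are illustrative assumptions, not settings taken from this repository.

```python
def word2vec_command(train_path, output_path, size=200, window=5, classes=500):
    """Build the argument list for the word2vec command-line tool.

    With -classes > 0 the tool runs K-means on the trained vectors and
    writes word classes rather than the vectors themselves.
    """
    return [
        "./word2vec",            # assumption: binary built in the current directory
        "-train", train_path,    # segmented, space-separated corpus
        "-output", output_path,
        "-size", str(size),      # dimensionality of the word vectors
        "-window", str(window),  # context window size
        "-classes", str(classes),
    ]


def parse_classes(lines):
    """Parse 'word cluster_id' lines into a dict, skipping malformed lines."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        word, cluster = parts
        mapping[word] = int(cluster)
    return mapping
```

The command can be run with `subprocess.run(word2vec_command("corpus.txt", "classes.txt"), check=True)`, after which `parse_classes(open("classes.txt", encoding="utf-8"))` yields the word-to-cluster mapping.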

Step4 clustering

clustering.c

Transform titles and reviews into high-dimensional numeric vectors based on the word clusters.
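The repository implements this step in C. As a hedged illustration of one plausible reading, the Python sketch below represents a text as a histogram over word clusters (a "bag of clusters"). Whether clustering.c uses counts, binary indicators, or another scheme is not stated here, so treat the function and its name as assumptions.

```python
from collections import Counter


def cluster_histogram(tokens, word_to_cluster, n_clusters):
    """Count how many tokens of a text fall into each word cluster.

    Tokens missing from `word_to_cluster` are ignored. Cluster ids are
    assumed to lie in range(n_clusters); ids outside that range are dropped.
    """
    counts = Counter(
        word_to_cluster[token] for token in tokens if token in word_to_cluster
    )
    return [counts.get(cluster_id, 0) for cluster_id in range(n_clusters)]
```

With 500 clusters, every title or review becomes a 500-dimensional vector regardless of its length, which gives the classifier a fixed-size input.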

Step5 prediction

prediction.py

Use a random forest to train a classifier that predicts the popularity of a given news title.
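A minimal sketch of the training step with scikit-learn follows. How "popularity" is labeled is not specified above, so the sketch assumes a binary label derived from the number of reviews and a hypothetical threshold. The function names and hyperparameters are illustrative, not taken from prediction.py.

```python
def popularity_labels(review_counts, threshold):
    """Label a news item 1 (popular) if it has at least `threshold` reviews.

    Assumption: popularity is a binary label based on review count.
    """
    return [1 if count >= threshold else 0 for count in review_counts]


def train_popularity_classifier(features, labels, n_estimators=100, seed=0):
    """Fit a random forest on per-title feature vectors and return the model."""
    # Imported lazily; assumes scikit-learn is installed.
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    model.fit(features, labels)
    return model
```

Given the cluster-based vectors as `features`, `train_popularity_classifier(features, labels).predict(new_vectors)` would return a predicted label for each new title.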
