GitHub - s4sarath/sentimental_analysis



#### SENTIMENTAL ANALYSIS ASSIGNMENT

1. The folder named ancient- contains the Flask application.
A version has been deployed to Heroku, but requests there time out because of the complexity of the model. It works fine locally.

Run "python routes.py"

2. sentimental_analysis.ipynb is the IPython notebook version, which is fairly self-explanatory. It is trained on the STS-Gold data. Because that data set is small, the model will not be perfect on new test samples.

It does reach an accuracy of 77% when elongated strings such as "happyyy" and
"yesss" are added as features. This is also mentioned in the IPython notebook.
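As an illustration of the elongated-word idea, here is a minimal sketch of how such features could be extracted. The function names, the feature names, and the threshold of three repeated letters are assumptions made for this example, not the notebook's actual code.

```python
import re

# A letter repeated three or more times in a row, e.g. the "yyy" in "happyyy".
# The threshold of three is an assumption: ordinary English words such as
# "happy" or "good" only double a letter.
ELONGATED = re.compile(r"([^\W\d_])\1{2,}")


def elongation_features(tweet):
    """Return simple counts describing elongated words in a tweet."""
    tokens = tweet.lower().split()
    elongated = [t for t in tokens if ELONGATED.search(t)]
    return {
        "n_elongated": len(elongated),
        "has_elongated": int(bool(elongated)),
    }


def squeeze(token):
    """Collapse runs of 3+ identical letters down to two letters."""
    return ELONGATED.sub(r"\1\1", token)
```

The counts can be appended to whatever word-level features the classifier already uses, and `squeeze` can be applied before tokenization so that "happyyy" and "happyyyyy" map to the same token.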

3. Sentimental_new.ipynb is the notebook trained on the Stanford data. As that data has more than 16 lakh (1.6 million) rows, I was not able to run it on my system. I tried with 2 lakh (200,000) rows and it still would not run. All the model creation steps are nevertheless written out in that notebook.

4. Heroku link is "https://ancient-peak-6490.herokuapp.com/"

In the search box, add a tweet. By default the app waits for 30 seconds. If it picks up no tweet in that time, the request is rejected. It uses the tweepy module and multithreading.
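The wait-then-reject behaviour can be sketched with a background thread and a queue. This is only an illustration under assumptions: `wait_for_first_tweet` and `fake_listener` are hypothetical names, and the real application would pass a tweepy-based listener in place of the stand-in.

```python
import queue
import threading


def wait_for_first_tweet(listen, timeout=30.0):
    """Run `listen(on_tweet)` in a background thread and return the first
    tweet it reports, or None if nothing arrives within `timeout` seconds."""
    tweets = queue.Queue()
    worker = threading.Thread(target=listen, args=(tweets.put,), daemon=True)
    worker.start()
    try:
        return tweets.get(timeout=timeout)
    except queue.Empty:
        return None  # the caller can reject the request at this point


# Hypothetical stand-in for a tweepy-based listener: it reports one tweet
# through the callback and then returns.
def fake_listener(on_tweet):
    on_tweet("what a lovely day")
```

Calling `wait_for_first_tweet(fake_listener)` returns the tweet immediately, while a listener that never calls the callback makes the function return `None` after the timeout.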

5. For feature extraction I have used the following:

a.) Count vectorizer, which counts word occurrences after stemming and tokenization.

b.) Tf-idf, which does not change the accuracy much.

c.) word2vec features from the gensim library, an implementation of deep-learning-based word-to-context features. This results in 71% accuracy.
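To make items a.) and b.) concrete, here is a small standard-library sketch of count and tf-idf features. It is a stand-in for a library vectorizer, not the project's code: the function names are hypothetical, the suffix rule is only a placeholder for a real stemmer, and the idf formula shown is one common variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase, keep runs of letters, and crudely strip a trailing 's'.

    The suffix rule is only a placeholder for a real stemmer.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]


def count_vectors(docs):
    """Return (vocabulary, one row of term counts per document)."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]


def tfidf_vectors(docs):
    """Reweight the counts by idf = ln(N / df) + 1."""
    vocab, rows = count_vectors(docs)
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n / d) + 1.0 for d in df]
    return vocab, [[tf * w for tf, w in zip(row, idf)] for row in rows]
```

Terms that occur in every document keep a weight equal to their raw count, while rarer terms are scaled up, which is why tf-idf mostly reorders feature importance rather than adding new information.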

More could be done and experimented with on a more powerful system.

System used: Mac OS X Yosemite with 8 GB RAM.