The purpose of this project was to investigate different methods of adding sentiments to social media posts and implement the preferrred method. Specifically, I wanted to add a sentiment section to an SQL database of posts, and populate those with sentiment polaritys based off the content of the post. Additionally, I wanted to update them in a batch format to make the program run faster.
Tools Used:
tensorflow
andnumpy
for machine learning and scientific computingspaCY
,NLTK
andTransformers
for Natural Language ProcessingPostgreSQL
for storing dataasyncpg
for getting data from SQL to Pythonasyncio
for sending concurrent requests to an event loop, which delivers them to be processed (not useful here but could be scaled up in the future)docker
for containerizing the project.toml
for storing configuration data.yml
for storing configuration data for docker and postgresmakefile
for creating shortcuts for rerunning and rebuilding the docker containerpdm
for installing and managing dependencies
First, I researched different methods of NLP for sentiment analysis on sentences. I choose spaCY
, NLTK
, and HuggingFace's Transformers
. Specifically, the model I used for Transformers was "finiteautomata/bertweet-base-sentiment-analysis"
.
I tested the accuracy of the sentiment for some random sentences I constructed.
spaCY | NLTK | HuggingFace Transformers | Content for Polarity Measurement |
---|---|---|---|
0.18 | -0.4215 | NEG', 'score': 0.50719153881073} | A very, very, very slow-moving, aimless movie about a distressed, drifting young man. |
0.01458333333 | -0.5507 | NEG', 'score': 0.8558710813522339}] | Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out. |
-0.1229166667 | -0.7178 | NEG', 'score': 0.9835329055786133}] | Attempting artiness with black & white and clever camera angles, the movie disappointed. |
-0.24375 | 0 | NEG', 'score': 0.5657512545585632}] | Very little music or anything to speak of. |
1 | 0.6369 | POS', 'score': 0.9857653975486755}] | The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head. |
-0.1 | -0.25 | NEU', 'score': 0.61434406042099}] | The rest of the movie lacks art, charm, meaning... If it's about emptiness, it works I guess because it's empty. |
-0.2 | -0.4939 | NEG', 'score': 0.9729021787643433}] | Wasted two hours |
0.7 | 0.7003 | POS', 'score': 0.9911397099494934}] | Saw the movie today and thought it was a good effort, good messages for kids. |
0 | 0 | NEU', 'score': 0.9503368139266968}] | A bit |
0.7 | 0.5994 | POS', 'score': 0.9787101745605469}] | Loved the casting of Jimmy Buffet as the lead. |
-0.25 | -0.2411 | NEU', 'score': 0.7688046097755432}] | i'm not sure what those guys are up to. |
-0.4 | -0.4404 | NEG', 'score': 0.9838539958000183}] | I'm tired of these people. |
0 | 0 | NEG', 'score': 0.9542802572250366}] | America just isn't what it used to be. |
0 | 0 | 'POS', 'score': 0.9923742413520813}] | Can't wait for this weekend! |
0 | 0 | NEU', 'score': 0.8812552690505981}] | What were these people thinking? |
0 | 0 | 'POS', 'score': 0.9917930960655212}] | I'm stoked about the concert. |
0.2 | 0 | 'NEG', 'score': 0.961385190486908}] | I can't believe people stood in front of me the whole time. |
0.5 | 0.4404 | POS', 'score': 0.971764326095581}] | This is something that gets better with age. |
-0.4 | -0.4767 | NEG', 'score': 0.9818474650382996}] | This is something that gets worse with age. |
0 | 0 | 'NEU', 'score': 0.9516934752464294}] | Something feels different about this. |
Here is my implementation of a polarity() function for each tool.
Transformers polarity()
from transformers import pipeline
from transformers import AutoTokenizer
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
model = "finiteautomata/bertweet-base-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model)
def polarity(sentence) -> float:
"""returns exact polarity score of a sentence"""
MAX_LEN = 725
if len(sentence) > MAX_LEN:
sentence = sentence[:MAX_LEN]
return sentiment_pipeline(sentence) # type: ignore
sentiment_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
NLTK polarity()
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
nltk.download([ "vader_lexicon", ])
def polarity(sentence) -> dict:
"""returns exact polarity score of a sentence"""
return sia.polarity_scores(sentence)
spaCY polarity()
from spacytextblob.spacytextblob import SpacyTextBlob
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("spacytextblob")
def polarity(sentence) -> float:
"""returns exact polarity score of a sentence"""
doc = nlp(sentence)
return doc._.blob.polarity
Second, I measured to see the speed of each model with different amounts of sentences.
Sentences | spaCY (sec) | NLTK (sec) | Transformers (sec) |
---|---|---|---|
10 | 1.196 | 0.00226 | 1.196 |
1000 | 7.9415 | 0.23229 | 102.747 |
10000 | 73.708 | 2.6488 | 1130.609 |
The way polarity
is measured in spaCY
, you are given a composite polarity, meaning the polarity's value would be a spectrum between -1 and 1. -1 would be considered extremely negative, 1 would be considered extremely positive, and 0 would be considered neutral. The way polarity
is measured in NLTK
is very similar, except you are given the positive, negative, neutral and composite values. The way the polarity
is calculated for Transformers
, it has three different sentiments, positive, negative, and neutral and returns the largest polarity of the three. I investigated meticulously and could not find any way to achieve a composite polarity value in Transformer. While I found Transformers to be the most accurate of the three, I decided to go with NLTK because I wanted to be able to distinguish between "slightly positive" and "slightly negative" values when looking at individual posts and averages of large numbers of posts. Plus, there is the added bonus that it was the fastest of the three and more accurate than spaCY based on my prompts.
Things I noted while working through this project:
- When deciding on how many to process at once in a batch, it needs to be calculated dynamically since the size of a database will change if we want to reuse this
- The size of the entry matters for some NLP tools like Transformers, but it doesn't for NLTK, which I ended up using
- Using the Type() function is helpful when using
Asyncpg
because sometimes its not clear what you are fetching - In
Asyncpg
, you usefetchrow()
for one row, and you usefetch()
for sql queries that get multiple rows
Originally, I added the sentiment column with the following command.
ALTER TABLE TESTING.POST
ADD COLUMN sentiment NUMERIC
This is what one of the records looked like before the program runs.
-[ RECORD 1 ]+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id | 10112943025779381
type | 4
author | 4
text | "Interesting milestone: Workplace, our business communication tool, has hit 7 million paid subscribers -- a 40% increase in the last year."
last_seen_at | 2021-08-17 14:55:26.348302
sentiment | Null
Here is the same post after running the program, this ran for all posts in the postgres database. The sentiment for this is 0.905, very positive.
-[ RECORD 1 ]+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id | 10112943025779381
type | 4
author | 4
text | "Interesting milestone: Workplace, our business communication tool, has hit 7 million paid subscribers -- a 40% increase in the last year."
last_seen_at | 2021-08-17 14:55:26.348302
sentiment | 0.905
Here are some of the psql and asyncpg related functions:
async def get_post_text(id: int, conn):
row = await conn.fetchrow("SELECT details FROM testing.post WHERE ID = $1", id)
json_row = json.loads(str(row[0]))
text = json_row["text"]
return text
async def update_post_sentiment(id: int, text, conn):
sentiment = polarity(text)
QUERY = f"UPDATE testing.post SET sentiment = {sentiment} WHERE ID = {id};"
await conn.execute(QUERY)
async def get_posts(i, window, conn):
posts = await conn.fetch(
f"SELECT * FROM testing.post WHERE ID BETWEEN {i} AND {i+window}"
)
posts_list = []
for post in posts:
json_row = json.loads(str(post["details"]))
text = json_row["text"]
id = post["id"]
posts_list.append([id, text])
return posts_list
async def get_max_id(conn):
max_query = await conn.fetchrow(f"SELECT MAX(ID) from testing.post;")
id = json.loads(str(max_query["max"]))
return id
async def get_min_id(conn):
min_query = await conn.fetchrow(f"SELECT MIN(ID) from testing.post;")
id = json.loads(str(min_query["min"]))
return id
```