Author: Igor Tannus Correa
This is a Java tool that runs several pre-processing steps on a database. It was originally written as a tweet pre-processor, but it can be adapted to other types of data.
The pre-processing steps it can perform are:
- remove
  - hashtag and mention symbols (#lalaland, @user -> lalaland, user)
  - tweets unrelated to the theme, according to a list of words (add words to unrelated.txt)
  - links
  - special characters (e.g. ~!@#$%^*&), numbers, and the query term
  - stopwords (e.g. a, the, you, with)
  - repeated spaces (when there's more than one in a row)
- translate
  - slang and abbreviations (e.g. omg, ily, brb -> oh my god, i love you, be right back; add words to dictionary.txt)
  - emoticons (e.g. :], <3 -> happy, love; add words to emoticons.txt)
- replace
  - uppercase letters with lowercase
  - accented characters (ã, ê, ñ) with unaccented characters (a, e, n)
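
As a rough illustration of how several of the steps above can be chained, here is a minimal sketch. It is not the code from this repository: the class name `TweetCleanerSketch`, the method names, the regular expressions, and the small in-memory `SLANG` and `STOPWORDS` collections (stand-ins for dictionary.txt and a stopword list) are all assumptions made for the example.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class TweetCleanerSketch {
    // Hypothetical in-memory stand-ins for dictionary.txt and a stopword list.
    static final Map<String, String> SLANG = Map.of(
            "omg", "oh my god", "ily", "i love you", "brb", "be right back");
    static final Set<String> STOPWORDS = Set.of("a", "the", "you", "with");

    // Remove http/https links.
    static String removeLinks(String s) {
        return s.replaceAll("https?://\\S+", " ");
    }

    // Drop the leading # or @ but keep the word (#lalaland -> lalaland).
    static String stripHashtagAndMentionSymbols(String s) {
        return s.replaceAll("[#@](\\w+)", "$1");
    }

    // Decompose accented letters (NFD) and delete the combining marks.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Replace everything that is not a lowercase ASCII letter or whitespace.
    // Assumes the text was already lowercased and unaccented.
    static String removeSpecialCharsAndNumbers(String s) {
        return s.replaceAll("[^a-z\\s]", " ");
    }

    // Expand each token that has an entry in the slang dictionary.
    static String translateSlang(String s) {
        return Arrays.stream(s.trim().split("\\s+"))
                .map(w -> SLANG.getOrDefault(w, w))
                .collect(Collectors.joining(" "));
    }

    // Drop tokens that appear in the stopword list.
    static String removeStopwords(String s) {
        return Arrays.stream(s.trim().split("\\s+"))
                .filter(w -> !STOPWORDS.contains(w))
                .collect(Collectors.joining(" "));
    }

    // Collapse runs of whitespace into a single space.
    static String collapseSpaces(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // One possible ordering of the steps; the order matters, because the
    // special-character step assumes lowercase, unaccented input.
    static String clean(String tweet) {
        String s = removeLinks(tweet);
        s = stripHashtagAndMentionSymbols(s);
        s = s.toLowerCase(Locale.ROOT);
        s = stripAccents(s);
        s = removeSpecialCharsAndNumbers(s);
        s = translateSlang(s);
        s = removeStopwords(s);
        return collapseSpaces(s);
    }

    public static void main(String[] args) {
        // "\u00e3" is the letter a with a tilde, written as an escape to keep the source ASCII.
        System.out.println(clean("OMG I loved #LaLaLand with @user!!! S\u00e3o 10/10 https://t.co/abc"));
        // prints "oh my god i loved lalaland user sao"
    }
}
```

Emoticon translation and the unrelated-tweet filter are left out of this sketch. Emoticons would need to be translated before the special-character step, since that step would otherwise delete them.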
You can write new steps according to what you need, or comment out or delete the methods you don't want to use.
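
One way to picture adding or disabling steps is to keep them in an ordered list of functions. The sketch below is only an illustration under that assumption and does not reflect how this repository is organized. `StepPipelineSketch`, `defaultSteps`, `run`, the in-memory `EMOTICONS` map (a stand-in for emoticons.txt), and the laughter-removal step are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

final class StepPipelineSketch {
    // Hypothetical in-memory stand-in for emoticons.txt.
    static final Map<String, String> EMOTICONS = new LinkedHashMap<>();
    static {
        EMOTICONS.put(":]", "happy");
        EMOTICONS.put("<3", "love");
    }

    // Replace each known emoticon with its word, padded with spaces.
    static String translateEmoticons(String s) {
        for (Map.Entry<String, String> e : EMOTICONS.entrySet()) {
            s = s.replace(e.getKey(), " " + e.getValue() + " ");
        }
        return s;
    }

    // Steps run in list order; disabling one is a matter of commenting out a line.
    static List<UnaryOperator<String>> defaultSteps() {
        List<UnaryOperator<String>> steps = new ArrayList<>();
        steps.add(StepPipelineSketch::translateEmoticons);
        steps.add(s -> s.toLowerCase(Locale.ROOT));
        // steps.add(s -> s.replaceAll("[0-9]", " "));   // disabled: numbers are kept
        steps.add(s -> s.trim().replaceAll("\\s+", " ")); // collapse spaces last
        return steps;
    }

    // Apply every step to the text, feeding each result into the next step.
    static String run(List<UnaryOperator<String>> steps, String text) {
        String result = text;
        for (UnaryOperator<String> step : steps) {
            result = step.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> steps = defaultSteps();
        // A new, user-defined step inserted before the final space-collapsing
        // step: drop laughter tokens such as "haha" or "hahaha".
        steps.add(steps.size() - 1, s -> s.replaceAll("\\b(ha){2,}\\b", " "));
        System.out.println(run(steps, "Great movie <3 hahaha :]"));
        // prints "great movie love happy"
    }
}
```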
This algorithm is part of a paper I wrote, "Sentiment analysis of tweets related to the movies nominated for the 2017 Academy Awards".
You can read it (in Portuguese or English) to see how I used this tool.
If you use this tool, please cite the paper 😛