igoortc/tweets-preprocessor

Tweets preprocessor

Author: Igor Tannus Correa

This is a Java program that runs several pre-processing steps on a database. It was originally written as a tweet pre-processor, but it can be adapted to other types of data.

It supports the following pre-processing steps:

  • remove

    • hashtag and mention markers (#lalaland, @user -> lalaland, user)

    • tweets unrelated to the topic, identified by a list of words (add words to unrelated.txt)

    • links

    • special characters (e.g. ~!@#$%^*&), numbers, and the query term

    • stopwords (e.g. a, the, you, with)

    • extra spaces (when there is more than one in a row)

  • translate

    • slang and abbreviations (e.g. omg, ily, brb -> oh my god, i love you, be right back; add entries to dictionary.txt)

    • emoticons (e.g. :], <3 -> happy, love; add entries to emoticons.txt)

  • replace

    • uppercase letters with lowercase ones

    • accented characters (ã, ê, ñ) with unaccented characters (a, e, n)
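
To make a few of the steps above concrete, here is a minimal sketch of how the link, case, accent, special-character, and spacing steps could be written with the Java standard library. It is not the code from this repository: the class name `TweetCleanerSketch`, its method names, the regular expressions, and the order of the steps are all assumptions chosen for illustration.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch: names, regexes, and step order are assumptions,
// not taken from the repository.
class TweetCleanerSketch {

    // Replace http(s):// and www. links with a space.
    static String removeLinks(String s) {
        return s.replaceAll("(https?://|www\\.)\\S+", " ");
    }

    // Decompose accented letters (NFD) and drop the combining marks, e.g. ã -> a.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Replace everything that is not an ASCII letter or whitespace with a space.
    // This also drops the # and @ markers and all digits.
    static String removeSpecialCharsAndNumbers(String s) {
        return s.replaceAll("[^A-Za-z\\s]", " ");
    }

    // Trim the ends and collapse runs of whitespace into a single space.
    static String collapseSpaces(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // Links are removed first, while their punctuation is still intact.
    static String clean(String tweet) {
        String s = removeLinks(tweet);
        s = s.toLowerCase(Locale.ROOT);
        s = stripAccents(s);
        s = removeSpecialCharsAndNumbers(s);
        return collapseSpaces(s);
    }

    public static void main(String[] args) {
        System.out.println(clean("Loved #LaLaLand with @user!! Café 10/10 https://t.co/abc"));
    }
}
```

In this sketch the order matters: removing special characters before links would break URLs into stray word fragments, and stripping accents before the ASCII-only filter keeps letters such as ã from being deleted outright.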

You can write new steps to suit your needs, or comment out or delete the methods you don't want to use.
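
One way to keep steps easy to add or remove is to treat each one as a function from text to text and run them from a list. The sketch below illustrates that idea with a dictionary-based translation step and a stopword step. It is an assumption about a possible design, not a description of how this repository is organized: `StepPipelineSketch`, `translate`, `removeStopwords`, and `run` are hypothetical names, and the word lists are hard-coded here rather than read from dictionary.txt or emoticons.txt, whose file format is not described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names come from the repository.
class StepPipelineSketch {

    // Replace each whitespace-separated token that has a dictionary entry.
    // The same step could serve slang, abbreviations, and emoticons.
    static UnaryOperator<String> translate(Map<String, String> dictionary) {
        return text -> {
            List<String> out = new ArrayList<>();
            for (String token : text.trim().split("\\s+")) {
                out.add(dictionary.getOrDefault(token, token));
            }
            return String.join(" ", out);
        };
    }

    // Drop every token that appears in the stopword set.
    static UnaryOperator<String> removeStopwords(Set<String> stopwords) {
        return text -> {
            List<String> out = new ArrayList<>();
            for (String token : text.trim().split("\\s+")) {
                if (!stopwords.contains(token)) {
                    out.add(token);
                }
            }
            return String.join(" ", out);
        };
    }

    // Apply the steps in list order; removing a step means removing a list entry.
    static String run(List<UnaryOperator<String>> steps, String text) {
        for (UnaryOperator<String> step : steps) {
            text = step.apply(text);
        }
        return text;
    }

    public static void main(String[] args) {
        Map<String, String> slang = Map.of("omg", "oh my god", "<3", "love");
        Set<String> stopwords = Set.of("a", "the", "with");
        List<UnaryOperator<String>> steps =
                List.of(translate(slang), removeStopwords(stopwords));
        System.out.println(run(steps, "omg i <3 the movie"));
    }
}
```

With this layout, emoticon translation would need to run before any step that strips special characters, since tokens such as <3 would otherwise be gone before they can be looked up.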

This tool was developed for my paper, "Sentiment analysis of tweets related to the movies nominated for the 2017 Academy Awards".

The paper is available in Portuguese and English and explains how I used this tool.

If you use this tool, please cite the paper 😛