Note: this project is under active development. Contributors (especially to documentation) are welcome.
Author: Chris Gibb
A command line interface (CLI) based Twitter scraper and natural language processing platform for high volume Twitter analytics.
Requirements:
- A Linux based operating system (OS)
- NodeJS
- Java (optional; only needed for the built-in text analytics)
- Familiarity with the CLI

Minimum hardware:
- Dual-core Intel Celeron @ 1.4GHz
- 2GB RAM
- 500GB - 1TB of disk space (depending on length of use and intentions)
From the directory into which the source was cloned:
Install (local) dependencies
bash install.bash
This assumes that javac, g++ and node are installed globally and available on your PATH.
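You can verify this first with an optional sanity check (not part of the install script):
javac -version
g++ --version
node --version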
Build everything
bash build.bash
The platform requires Twitter developer credentials in order to make requests to the Twitter API. It uses Twitter's application-only authorization. See https://dev.twitter.com/oauth/application-only for more details.
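For reference, application-only authorization exchanges the consumer key and secret for a bearer token. The platform performs this exchange itself; the following curl call is only an illustration, assuming the CONSUMER_KEY and CONSUMER_SECRET environment variables hold your credentials:
curl -s -u "$CONSUMER_KEY:$CONSUMER_SECRET" --data 'grant_type=client_credentials' https://api.twitter.com/oauth2/token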
Once you have acquired credentials from Twitter, a file named keys.json must be created in the dep directory (or wherever you have placed the built application). It should look like this:
{
"consumerKey" : "",
"consumerSecret" : "",
"requestToken" : "",
"requestTokenSecret" : "",
"accessToken" : "",
"accessTokenSecret" : ""
}
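To quickly confirm that the file parses as valid JSON (an optional check, not part of the platform), run the following from the directory containing keys.json:
node -e "require('./keys.json'); console.log('keys.json parsed OK')"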
In order to run a round of mining:
node tweetScheduler --dataDir=data --threads=1 --iterations=1
See tweetScheduler documentation for more information.
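The same flags can be scaled up for longer or more parallel runs; for example (values are illustrative only):
node tweetScheduler --dataDir=data --threads=4 --iterations=10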
By default, the platform comes with a wrapper over the Stanford Natural Language Processing Group's Named Entity Recognizer (SNER) in the form of nerLearner.jar. See http://nlp.stanford.edu/software/CRF-NER.shtml for more information. It builds a database of words recognized by SNER and allows other utilities to apply them to tweets in order to filter out words of interest. Some defaults for running nerLearner.jar are provided in runNerLearner.sh, which takes a single argument: a path to a file containing a list of tweet bins to operate on.
In order to learn about words of interest (people, places, organizations, mentions and hashtags) in tweets collected between January 1, 2017 12:00am and January 1, 2017 11:59pm, first generate a bin listing for that range. Assuming tweets have been saved into a directory named data:
./genListing data 2017 Jan 01 > Jan012017Listing
sh runNerLearner.sh Jan012017Listing
rm Jan012017Listing
By default, runNerLearner.sh writes its results to classifiers/learned. If that output directory does not exist, it will produce no output, so create it first.
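If necessary, create the directory before the first run:
mkdir -p classifiers/learned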
To run a round of mining and then learn about only the new tweets acquired in that round:
node tweetScheduler --dataDir=data --threads=1 --iterations=1
sh runNerLearner.sh modBinsdata
rm modBinsdata
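Since tweets are usually acquired continuously, this sequence is a natural candidate for scheduling. A hypothetical crontab entry running one round every hour might look like the following (the installation path is a placeholder):
# hypothetical schedule: mine and learn once per hour from the platform's install directory
0 * * * * cd /path/to/dep && node tweetScheduler --dataDir=data --threads=1 --iterations=1 && sh runNerLearner.sh modBinsdata && rm modBinsdata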
In order to visualize and analyze tweets, they need to be tagged with the words of interest generated from nerLearner.
If you are acquiring tweets continuously, always generate a fresh listing and run tagging before generating visualization files, so that the freshest data is included.
./tagApplicator Jan2017Listing classifiers/learned
This assumes that words of interest are being written to classifiers/learned (the default).
Extract words of interest and sentiment information
node extract --listing=Jan2017Listing > Jan2017Dump
Average out word occurrences and average sentiment
node average --file=Jan2017Dump --date=Jan2017 > Jan2017.json
Jan2017.json can now be visualized and queried.
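Putting the preparation steps together for a single day of data, the full pipeline from listing to visualization-ready JSON looks roughly like this (the listing, dump and date names are illustrative):
./genListing data 2017 Jan 01 > Jan012017Listing
./tagApplicator Jan012017Listing classifiers/learned
node extract --listing=Jan012017Listing > Jan012017Dump
node average --file=Jan012017Dump --date=Jan012017 > Jan012017.json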