Reads streaming data from twitter and displays aggregated bar plots based on hashtags

anirbankonar123/PySparkStreaming

PySparkStreaming

Twitter account setup:
Set up an account at developer.twitter.com, accept the developer agreement, keep the default options, and state the reason you need the account when asked.
Then create an app and set up its credentials at https://developer.twitter.com/en/apps (example app URL: https://anirbank.twitter.com).
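
Once the app exists, its credentials (consumer key, consumer secret, access token, access token secret) have to reach your Python code without being hard-coded. A minimal sketch, assuming they are exported as environment variables; the variable names and the helpers `load_credentials` and `make_tweepy_auth` are hypothetical and not part of this repository:

```python
import os
from typing import Mapping, NamedTuple


class TwitterCredentials(NamedTuple):
    consumer_key: str
    consumer_secret: str
    access_token: str
    access_token_secret: str


# Hypothetical environment variable names; rename them to suit your setup.
ENV_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(environ: Mapping[str, str] = os.environ) -> TwitterCredentials:
    """Read the four app credentials, failing loudly if any is missing."""
    missing = [name for name in ENV_NAMES if not environ.get(name)]
    if missing:
        raise KeyError("Missing Twitter credentials: " + ", ".join(missing))
    return TwitterCredentials(*(environ[name] for name in ENV_NAMES))


def make_tweepy_auth(creds: TwitterCredentials):
    """Build a tweepy OAuth 1a handler (needs `pip install tweepy`)."""
    import tweepy  # imported lazily so load_credentials works without tweepy

    auth = tweepy.OAuthHandler(creds.consumer_key, creds.consumer_secret)
    auth.set_access_token(creds.access_token, creds.access_token_secret)
    return auth
```

Keeping the secrets in the environment means they never end up in a notebook or in version control. Newer tweepy releases may prefer a different handler class, so check the documentation of the version you installed.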

Install the following libraries:

sudo apt-get install default-jre (the default Linux JRE; make sure version 1.8.x is installed)
sudo apt-get install scala (check that the version is 2.11.6-6)
sudo pip3 install py4j
Download Spark (version 2.1 built for Hadoop 2.7 works best) and extract the tar file:
wget -q https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz
pip install pyspark
pip install findspark
pip install python-twitter (a Python library for connecting to your Twitter developer account)
pip install tweepy
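
To confirm that the Python packages listed above are visible to the interpreter you plan to use, a quick check along these lines may help. It is a sketch only; the helper `missing_modules` is hypothetical, and the module names are the import names of the pip packages (python-twitter is imported as `twitter`):

```python
import importlib.util

# Import names for the pip packages listed above.
# Note that python-twitter is imported as `twitter`.
REQUIRED_MODULES = ["py4j", "pyspark", "findspark", "twitter", "tweepy"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required Python packages are importable.")
```

Run it with the same `python3` that Jupyter uses, since `pip` and `pip3` can install into different interpreters.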

Set up the following environment variables:
export SPARK_HOME=.... (path where spark-2.1.1-bin-hadoop2.7 was extracted)
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

In addition, you might have to set JAVA_HOME. I do not set it because I use the default JRE on Linux.
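
A typo in one of these variables tends to surface only later as an obscure import or launch error, so it can be worth checking them from Python first. The sketch below is illustrative: `check_spark_env` and `init_spark` are hypothetical helpers, and the checks simply mirror the variables named above:

```python
import os
from typing import List, Mapping


def check_spark_env(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return human-readable problems; an empty list means the setup looks fine."""
    problems = []

    spark_home = environ.get("SPARK_HOME", "")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(spark_home):
        problems.append("SPARK_HOME is not a directory: " + spark_home)

    # These two make the `pyspark` launcher start a Jupyter notebook.
    for name in ("PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS"):
        if not environ.get(name):
            problems.append(name + " is not set")

    return problems


def init_spark():
    """Make pyspark importable via findspark (needs `pip install findspark`)."""
    import findspark  # imported lazily so check_spark_env works without it

    findspark.init()  # with no argument, findspark reads SPARK_HOME


if __name__ == "__main__":
    for problem in check_spark_env():
        print("Problem:", problem)
```

If the list comes back empty, calling `init_spark()` at the top of a notebook before `import pyspark` is one way to pick up the extracted Spark distribution.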
