This lab builds on the activities in Labs 1 and 2. In addition to setting up the EMR cluster, you will need to set up access to Twitter through its Application Programming Interface (API) and use Apache Kafka to monitor a certain hashtag.
These are the tasks you’ll need to complete the lab:
- Launch an EMR Cluster with some additional configurations and settings
- Set up Apache Kafka server
- Set up Twitter API access
- Read Twitter Feed into Kafka via Spark Streaming
- Shut down and terminate the cluster
Amazon EMR provides a managed Hadoop framework. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
Amazon EMR securely and reliably handles a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.
- Sign in to the AWS Management Console and open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
- Choose Create cluster
- Choose Go to advanced options
- Choose Hadoop, Hive, and Spark for this lab
- Copy the following block of code (all on one line) into the Edit software settings (optional) text area:
[{"configurations":[{"classification":"export","properties":{"PYSPARK_PYTHON":"python34"}}],"classification":"spark-env","properties":{}}]
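If you want to sanity-check the one-line configuration before pasting it in, the short stdlib sketch below (an aside, not a lab step) parses it and confirms what it does: the spark-env classification exports PYSPARK_PYTHON=python34 so PySpark runs under Python 3.4.

```python
import json

# The one-line EMR configuration block from above.
raw = ('[{"configurations":[{"classification":"export",'
       '"properties":{"PYSPARK_PYTHON":"python34"}}],'
       '"classification":"spark-env","properties":{}}]')

config = json.loads(raw)  # raises ValueError if the JSON is malformed

# The outer entry targets the spark-env classification...
print(config[0]["classification"])  # spark-env

# ...and its nested "export" block sets the Python used by PySpark.
print(config[0]["configurations"][0]["properties"]["PYSPARK_PYTHON"])  # python34
```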
- Click Next
- For hardware settings, 1 master node will suffice for this lab. Select the m3.xlarge EC2 instance type for the master node and an instance count of 0 for all other node types.
- Click Next
- Provide a cluster name
- Disable Termination Protection
- Under "Additional Options", expand Bootstrap Actions, select Custom action under Add bootstrap action, and click Configure and add
Script Location:
s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh
Optional Arguments:
--port 8880 --copy-samples
- Click Next
- Choose the EC2 key pair aws_emr_key previously created in Lab 1
AWS will begin spinning up the EC2 instances and configuring the selected Hadoop applications on them. This can take a while because of the bootstrap actions.
Important! Note down the Master public DNS, as you will need it often later on. It looks something like ec2-54-164-153-7.compute-1.amazonaws.com
In Lab 1 we previously allowed inbound traffic on the following ports:
- Port 22 (SSH) to access the command line
- Port 8888 (Hue Web Server) to access the Hue web-based interface
Now open port 8880 on the master node to access the Jupyter notebook over the internet:
- Access the EC2 Dashboard by clicking Services -> Compute -> EC2
- In the sidebar of the Amazon EC2 console, under NETWORK & SECURITY, choose Security Groups
- On the right, select the security group named ElasticMapReduce-master
- Click Actions and select Edit Inbound rules
- Click Add Rule. In the new row that appears, enter 8880 under Port Range
- Under Source, select Anywhere; the text box on the right will be filled in with 0.0.0.0/0 automatically
- Click Save. You should then see the new rule the next time you view Edit Inbound rules
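The Anywhere source is simply the CIDR block 0.0.0.0/0, which matches every IPv4 address. A quick illustration with Python's stdlib ipaddress module (an aside, not a lab step):

```python
import ipaddress

anywhere = ipaddress.ip_network("0.0.0.0/0")

# A /0 prefix leaves no fixed network bits, so the block spans all 2**32
# possible IPv4 addresses.
print(anywhere.num_addresses == 2 ** 32)  # True

# Any address, e.g. a typical EC2 public IP, falls inside it -- which is
# why "Anywhere" makes the Jupyter port reachable from the whole internet.
print(ipaddress.ip_address("54.164.153.7") in anywhere)  # True
```

For a real deployment you would normally restrict the source to your own IP range rather than 0.0.0.0/0; Anywhere is used here only to keep the lab simple.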
Note: A list of other default incoming ports you may need to allow can be found here - http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
Important! You'll need to have opened port 8880 on the master node to access the web interface. If you have not, go back to the Authorize Inbound Traffic to your master node section and carry out the steps.
- Open a modern browser like Chrome, Safari, Edge or Firefox
- Browse to http://your-master-node-public-dns:8880, for example, http://ecX-XX-XXX-XXX-X.compute-1.amazonaws.com:8880
We need to connect to the master node over SSH for administration. Follow the instructions for PC or Mac detailed in Lab 1. Once you have an SSH session, run the following commands:
# Install Tweepy, Kafka-Python, findspark
sudo python3 -m pip install tweepy kafka-python findspark
# Change owner of the /usr/lib folder to “hadoop”
sudo chown -R hadoop:hadoop /usr/lib
# Navigate to the /usr/lib folder
cd /usr/lib
# Download the Apache Kafka binary archive
sudo wget http://www-us.apache.org/dist/kafka/0.10.2.1/kafka_2.11-0.10.2.1.tgz
# Extract the TAR Archive
sudo tar -xzf kafka_2.11-0.10.2.1.tgz
# Navigate to the extracted Kafka directory
cd kafka_2.11-0.10.2.1
# Start the zookeeper server in the background
sudo /usr/lib/kafka_2.11-0.10.2.1/bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
# Start the Apache Kafka server in the background
sudo /usr/lib/kafka_2.11-0.10.2.1/bin/kafka-server-start.sh -daemon config/server.properties
# Create a Kafka Topic with the topic name “twitterstream”
sudo /usr/lib/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic twitterstream
You have now created a Kafka topic that is ready to receive events. Leave the SSH session open.
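As an aside, Kafka is strict about topic names: as documented for Kafka 0.10.x, a name may use up to 249 characters drawn from ASCII letters, digits, '.', '_' and '-'. If you later script topic creation, a small hypothetical checker like the one below can catch bad names before kafka-topics.sh rejects them:

```python
import re

# Legal Kafka topic names: 1-249 chars from ASCII letters, digits, '.',
# '_' and '-'; the names "." and ".." are reserved. (Rules as documented
# for Kafka 0.10.x -- treat this helper as an illustrative sketch.)
_TOPIC_RE = re.compile(r"^[A-Za-z0-9._-]{1,249}$")

def is_valid_topic(name):
    """Return True if `name` is a legal Kafka topic name."""
    return name not in (".", "..") and bool(_TOPIC_RE.match(name))

print(is_valid_topic("twitterstream"))   # True
print(is_valid_topic("twitter stream"))  # False: spaces are not allowed
```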
- Create a Twitter account at http://www.twitter.com
- Go to https://apps.twitter.com and sign in with your Twitter account
- Click the Create New App button on the upper right, fill in anything, and leave the Callback URL blank
- Click the Keys and Access Tokens tab
- Click Create Access Token
- Copy the following four important pieces of information: Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret
- On the Jupyter notebook server hosted on your EMR cluster master node (http://your-master-node-public-dns:8880), upload the Lab 3_Jupyter Notebook 1_Twitter to Kafka and Lab 3_Jupyter Notebook 2_Using Spark Streaming notebooks found on elearn
- Open the "Lab 3_Jupyter Notebook 1_Twitter to Kafka" notebook. At the top of the notebook, enter the Twitter credentials you saved previously.
- Run the following commands in the SSH session running from the "Set up Kafka Server" section.
# To view messages sent to the topic, execute the following command. Keep the PuTTY window open for now:
sudo /usr/lib/kafka_2.11-0.10.2.1/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic twitterstream --from-beginning
# To pipe the messages sent to the topic into both the terminal and a file, execute the following command:
sudo /usr/lib/kafka_2.11-0.10.2.1/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic twitterstream --from-beginning | tee ~/twitterstream.txt
Nothing will appear until you run the code in the Lab 3_Jupyter Notebook 1_Twitter to Kafka notebook.
- Run the code in the first three steps. After you run the third step, you will see the test message appear in the PuTTY terminal
- Before running the code in the fourth block, you may add or remove the hashtags to retrieve tweets for. Separate multiple hashtags with commas. Once ready, run the code.
# To open and read the file, open a separate SSH session and run the following command:
tail -f ~/twitterstream.txt
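Assuming the notebook forwards raw statuses, each line the console consumer writes to ~/twitterstream.txt is a JSON tweet object in the Twitter v1.1 format, with the message under text and hashtags under entities.hashtags. A stdlib-only sketch (the field names are the standard v1.1 ones; the sample line below is made up) of how you might pull those fields out later:

```python
import json

# A trimmed-down example line in the shape the Twitter v1.1 streaming API
# uses (real lines carry many more fields).
line = json.dumps({
    "text": "Learning #bigdata on EMR",
    "entities": {"hashtags": [{"text": "bigdata"}]},
})

def parse_tweet(raw_line):
    """Extract the tweet text and hashtag list from one JSON line."""
    tweet = json.loads(raw_line)
    tags = [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]
    return tweet.get("text", ""), tags

text, tags = parse_tweet(line)
print(text)  # Learning #bigdata on EMR
print(tags)  # ['bigdata']
```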
- Open the “Lab 3_Jupyter Notebook 2_Using Spark Streaming” notebook. Read the description and run through the steps as described in the notebook.
To end the lab, perform the following steps in sequence:
- Shut down both Jupyter notebooks
- In the PuTTY Terminal (PC) or Terminal (Mac), press Ctrl+C to terminate the consumer application
- Execute the following commands to shut down Apache Kafka server and Zookeeper server:
/usr/lib/kafka_2.11-0.10.2.1/bin/kafka-server-stop.sh
/usr/lib/kafka_2.11-0.10.2.1/bin/zookeeper-server-stop.sh
- Close PuTTY Terminal and terminate the AWS EMR Cluster.
You will be charged according to the number of hours your cluster is left running, so it is very important to terminate your cluster after you have finished using it. NOT TERMINATING THE CLUSTER CAN RESULT IN HUGE BILLS.
Tip: If you forget to terminate your cluster and accumulate a large bill, you can try writing in to Amazon Web Services customer service to request a waiver.
You can terminate one or more clusters using the Amazon EMR console.
- Sign in to the AWS Management Console and open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
- Select the cluster to terminate. Note that you can select multiple clusters and terminate them at the same time.
- Click Terminate.
- When prompted, click Terminate.
If you have enabled termination protection, there are a few more steps to carry out before you can terminate your cluster.
- Sign in to the AWS Management Console and open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
- On the Cluster List page, select the cluster to terminate. Note that you can select multiple clusters and terminate them at the same time.
- Click Terminate.
- When prompted, click Change to turn termination protection off. Note, if you selected multiple clusters, click Turn off all to disable termination protection for all the clusters at once.
- In the Terminate clusters dialog, for Termination Protection, click Off and then click the check mark to confirm.
- Click Terminate.