This lab builds on the activities in Labs 1 and 2. In addition to setting up the EMR cluster, you will need to set up access to Twitter through its Application Programming Interface (API) and use Apache Kafka to monitor a certain hashtag.
These are the tasks you’ll need to complete the lab:
- Launch an EMR Cluster with some additional configurations and settings
- Set up Apache Kafka server
- Set up Twitter API access
- Read Twitter Feed into Kafka via Spark Streaming
- Shut down and terminate the cluster
Amazon EMR provides a managed Hadoop framework. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
Amazon EMR securely and reliably handles a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.
- Sign in to the AWS Management Console and open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
- Choose Create cluster
- Choose Go to advanced options
- Choose Hadoop, Hive, and Spark for this lab
- Copy the following block of code (all on one line) into the Edit software settings (optional) text area:
[{"configurations":[{"classification":"export","properties":{"PYSPARK_PYTHON":"python34"}}],"classification":"spark-env","properties":{}}]
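If you want to sanity-check the one-line configuration before pasting it in, the short stdlib sketch below (an aside, not a lab step) parses it and confirms what it does: the spark-env classification exports PYSPARK_PYTHON=python34 so PySpark runs under Python 3.4.

```python
import json

# The one-line EMR configuration block from above.
raw = ('[{"configurations":[{"classification":"export",'
       '"properties":{"PYSPARK_PYTHON":"python34"}}],'
       '"classification":"spark-env","properties":{}}]')

config = json.loads(raw)  # raises ValueError if the JSON is malformed

# The outer entry targets the spark-env classification...
print(config[0]["classification"])  # spark-env

# ...and its nested "export" block sets the Python used by PySpark.
print(config[0]["configurations"][0]["properties"]["PYSPARK_PYTHON"])  # python34
```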
- Click Next
- For hardware settings, 1 master node will suffice for this lab. Select the m3.xlarge EC2 instance type for the master node and an instance count of 0 for all other node types.
- Click Next
- Provide a cluster name
- Disable Termination Protection
- Under "Additional Options", expand Bootstrap Actions, select Custom action under Add bootstrap action, and click Configure and add
Script Location:
s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh
Optional Arguments:
--port 8880 --copy-samples
- Click Next
- Choose the EC2 key pair aws_emr_key previously created in Lab 1
AWS will begin spinning up the EC2 instances and configuring the selected Hadoop applications on them. This can take a while because of the bootstrap actions.
Important! Note down the Master public DNS, as you will need it often later on. It looks something like ec2-54-164-153-7.compute-1.amazonaws.com
In Lab 1 we previously allowed inbound traffic on the following ports:
- Port 22 (SSH) to access the command line
- Port 8888 (Hue Web Server) to access the Hue web-based interface
Now open port 8880 on the master node to access the Jupyter notebook over the internet:
- Access the EC2 Dashboard by clicking Services -> Compute -> EC2
- In the sidebar of the Amazon EC2 console, under NETWORK & SECURITY, choose Security Groups
- On the right, select the security group named ElasticMapReduce-master
- Click Actions and select Edit Inbound rules
- Click Add Rule. In the new row that appears, enter 8880 under Port Range
- Under Source, select Anywhere; the text box on the right will be filled in with 0.0.0.0/0 automatically
- Click Save. You should then see the new rule the next time you view Edit Inbound rules
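The Anywhere source is simply the CIDR block 0.0.0.0/0, which matches every IPv4 address. A quick illustration with Python's stdlib ipaddress module (an aside, not a lab step):

```python
import ipaddress

anywhere = ipaddress.ip_network("0.0.0.0/0")

# A /0 prefix leaves no fixed network bits, so the block spans all 2**32
# possible IPv4 addresses.
print(anywhere.num_addresses == 2 ** 32)  # True

# Any address, e.g. a typical EC2 public IP, falls inside it -- which is
# why "Anywhere" makes the Jupyter port reachable from the whole internet.
print(ipaddress.ip_address("54.164.153.7") in anywhere)  # True
```

For a real deployment you would normally restrict the source to your own IP range rather than 0.0.0.0/0; Anywhere is used here only to keep the lab simple.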
Note: A list of other default incoming ports you may need to allow can be found here - http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
Important! You'll need to have opened port 8880 on the master node to access the web interface. If you have not, go back to the Authorize Inbound Traffic to your master node section and carry out the steps.
- Open a modern browser like Chrome, Safari, Edge or Firefox
- Browse to http://your-master-node-public-dns:8880, for example, http://ecX-XX-XXX-XXX-X.compute-1.amazonaws.com:8880
We need to connect to the master node over SSH for administration. Follow the instructions for PC or Mac detailed in Lab 1. Once you have an SSH session, run the following commands:
# Install Tweepy, Kafka-Python, findspark
sudo python3 -m pip install tweepy kafka-python findspark
# Change owner of the /usr/lib folder to “hadoop”
sudo chown -R hadoop:hadoop /usr/lib
# Navigate to the /usr/lib folder
cd /usr/lib
# Download the Apache Kafka binary archive
sudo wget http://www-us.apache.org/dist/kafka/0.10.2.1/kafka_2.11-0.10.2.1.tgz
# Extract the TAR Archive
sudo tar -xzf kafka_2.11-0.10.2.1.tgz
# Navigate to the extracted Kafka directory
cd kafka_2.11-0.10.2.1
# Start the zookeeper server in the background
sudo /usr/lib/kafka_2.11-0.10.2.1/bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
# Start the Apache Kafka server in the background
sudo /usr/lib/kafka_2.11-0.10.2.1/bin/kafka-server-start.sh -daemon config/server.properties
# Create a Kafka Topic with the topic name “twitterstream”
sudo /usr/lib/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic twitterstream
You have now created a Kafka topic that is ready to receive events. Leave the SSH session open.
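As an aside, Kafka is strict about topic names: as documented for Kafka 0.10.x, a name may use up to 249 characters drawn from ASCII letters, digits, '.', '_' and '-'. If you later script topic creation, a small hypothetical checker like the one below can catch bad names before kafka-topics.sh rejects them:

```python
import re

# Legal Kafka topic names: 1-249 chars from ASCII letters, digits, '.',
# '_' and '-'; the names "." and ".." are reserved. (Rules as documented
# for Kafka 0.10.x -- treat this helper as an illustrative sketch.)
_TOPIC_RE = re.compile(r"^[A-Za-z0-9._-]{1,249}$")

def is_valid_topic(name):
    """Return True if `name` is a legal Kafka topic name."""
    return name not in (".", "..") and bool(_TOPIC_RE.match(name))

print(is_valid_topic("twitterstream"))   # True
print(is_valid_topic("twitter stream"))  # False: spaces are not allowed
```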
- Create a Twitter account at http://www.twitter.com
- Go to https://apps.twitter.com and sign in with your Twitter account
- Click the Create New App button on the upper right, fill in anything, and leave the Callback URL blank
- Click the Keys and Access Tokens tab
- Click Create Access Token
- Copy the following four important pieces of information: Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret
- On the Jupyter notebook server hosted on your EMR cluster master node (http://your-master-node-public-dns:8880), upload the Lab 3_Jupyter Notebook 1_Twitter to Kafka and Lab 3_Jupyter Notebook 2_Using Spark Streaming notebooks found on elearn
- Open the "Lab 3_Jupyter Notebook 1_Twitter to Kafka" notebook. At the top of the notebook, enter the Twitter credentials you saved previously.
- Run the following commands in the SSH session running from the "Set up Kafka Server" section.
# To view messages sent to the topic, execute the following command. Keep the PuTTY window open for now:
sudo /usr/lib/kafka_2.11-0.10.2.1/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic twitterstream --from-beginning
# To pipe the messages sent to the topic into both the terminal and a file, execute the following command:
sudo /usr/lib/kafka_2.11-0.10.2.1/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic twitterstream --from-beginning | tee ~/twitterstream.txt
Nothing will appear until you run the code in the Lab 3_Jupyter Notebook 1_Twitter to Kafka notebook.
- Run the code in the first three steps. After you run the third step, you will see the test message appear in the PuTTY terminal
- Before running the code in the fourth block, you may add or remove the hashtags to retrieve tweets for. Separate multiple hashtags with commas. Once ready, run the code.
# To open and read the file, open a separate SSH session and run the following command:
tail -f ~/twitterstream.txt
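Assuming the notebook forwards raw statuses, each line the console consumer writes to ~/twitterstream.txt is a JSON tweet object in the Twitter v1.1 format, with the message under text and hashtags under entities.hashtags. A stdlib-only sketch (the field names are the standard v1.1 ones; the sample line below is made up) of how you might pull those fields out later:

```python
import json

# A trimmed-down example line in the shape the Twitter v1.1 streaming API
# uses (real lines carry many more fields).
line = json.dumps({
    "text": "Learning #bigdata on EMR",
    "entities": {"hashtags": [{"text": "bigdata"}]},
})

def parse_tweet(raw_line):
    """Extract the tweet text and hashtag list from one JSON line."""
    tweet = json.loads(raw_line)
    tags = [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]
    return tweet.get("text", ""), tags

text, tags = parse_tweet(line)
print(text)  # Learning #bigdata on EMR
print(tags)  # ['bigdata']
```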
- Open the “Lab 3_Jupyter Notebook 2_Using Spark Streaming” notebook. Read the description and run through the steps as described in the notebook.
To end the lab, perform the following steps in sequence:
- Shut down both Jupyter notebooks
- In the PuTTY Terminal (PC) or Terminal (Mac), press Ctrl+C to terminate the consumer application
- Execute the following commands to shut down Apache Kafka server and Zookeeper server:
/usr/lib/kafka_2.11-0.10.2.1/bin/kafka-server-stop.sh
/usr/lib/kafka_2.11-0.10.2.1/bin/zookeeper-server-stop.sh
- Close PuTTY Terminal and terminate the AWS EMR Cluster.
You will be charged according to the number of hours your cluster is left running, so it is very important to terminate your cluster after you have finished using it. NOT TERMINATING THE CLUSTER CAN RESULT IN HUGE BILLS.
Tip: If you forget to terminate your cluster and accumulate a large bill, you can try writing in to Amazon Web Services customer service to request a waiver.
You can terminate one or more clusters using the Amazon EMR console.
- Sign in to the AWS Management Console and open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
- Select the cluster to terminate. Note that you can select multiple clusters and terminate them at the same time.
- Click Terminate.
- When prompted, click Terminate.
If you have enabled termination protection, there are a few more steps to carry out before you can terminate your cluster.
- Sign in to the AWS Management Console and open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
- On the Cluster List page, select the cluster to terminate. Note that you can select multiple clusters and terminate them at the same time.
- Click Terminate.
- When prompted, click Change to turn termination protection off. Note, if you selected multiple clusters, click Turn off all to disable termination protection for all the clusters at once.
- In the Terminate clusters dialog, for Termination Protection, click Off and then click the check mark to confirm.
- Click Terminate.