
A comprehensive tutorial with steps and scripts for setting up Apache Kafka on an Amazon EC2 instance, streaming logs to S3, and querying the data with AWS Glue and Amazon Athena. Includes Zookeeper configuration, producer and consumer setup, and automated data catalog creation.


Install and Set Up Kafka on Amazon EC2 for Streaming Data to S3 with AWS Glue and Athena

Architecture

Setting up Apache Kafka on an Amazon EC2 instance is a powerful way to handle real-time data streaming. This guide walks you through installing Kafka on an EC2 instance, configuring Zookeeper, a producer, and a consumer, and streaming logs to an S3 bucket. We then use AWS Glue to create a data catalog and run SQL queries with Amazon Athena to analyze the data.

Prerequisites

  • An AWS account
  • An EC2 instance (t2.medium recommended)
  • SSH access to your EC2 instance
  • Basic knowledge of Kafka and AWS services

Step 1: Launch an EC2 Instance

  1. Create an EC2 Instance:

    • Type: t2.medium
    • OS: Amazon Linux 2 AMI
    • Security Group: Allow inbound traffic on ports 22 (SSH), 2181 (Zookeeper), 9092 (Kafka)
  2. SSH into Your EC2 Instance:

    ssh -i "path/to/your-key.pem" ec2-user@your-ec2-public-ip

Step 2: Install Java and Kafka

  1. Install Java:

    sudo yum install java-1.8.0-openjdk -y
    java -version
  2. Download and Extract Kafka:

    wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.12-3.3.1.tgz
    tar -xvf kafka_2.12-3.3.1.tgz
    cd kafka_2.12-3.3.1

Step 3: Start Zookeeper

  1. Start Zookeeper:
    bin/zookeeper-server-start.sh config/zookeeper.properties

Step 4: Start Kafka Server

  1. Open a New SSH Session and Connect to EC2 Instance:

    ssh -i "path/to/your-key.pem" ec2-user@your-ec2-public-ip
  2. Start Kafka Server:

    export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
    cd kafka_2.12-3.3.1
    bin/kafka-server-start.sh config/server.properties
  3. Configure Kafka for Public Access:

    • Edit server.properties:
      sudo nano config/server.properties
    • Update advertised.listeners to your EC2 instance's public IP, then restart the Kafka server so the change takes effect:
      advertised.listeners=PLAINTEXT://your-ec2-public-ip:9092
      

Step 5: Create a Kafka Topic

  1. Open a New SSH Session and Connect to EC2 Instance:

    ssh -i "path/to/your-key.pem" ec2-user@your-ec2-public-ip
  2. Create the Topic:

    cd kafka_2.12-3.3.1
    bin/kafka-topics.sh --create --topic SAPDEMOTESTING --bootstrap-server your-ec2-public-ip:9092 --replication-factor 1 --partitions 1
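If you prefer to manage the topic from Python instead of the shell, the sketch below uses the kafka-python admin client to create (or confirm) the same topic. This script is not part of the repository, and the bootstrap address is a placeholder for your EC2 public IP.

# Optional sketch: create the SAPDEMOTESTING topic with kafka-python's admin client.
# Assumes kafka-python is installed and the broker's advertised listener is reachable.
from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError

BOOTSTRAP = 'your-ec2-public-ip:9092'  # placeholder, same address as in the command above

admin = KafkaAdminClient(bootstrap_servers=[BOOTSTRAP])
try:
    admin.create_topics([NewTopic(name='SAPDEMOTESTING', num_partitions=1, replication_factor=1)])
    print('Topic created')
except TopicAlreadyExistsError:
    print('Topic already exists')
finally:
    admin.close()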

Step 6: Start Producer

  1. Open a New SSH Session and Connect to EC2 Instance:

    ssh -i "path/to/your-key.pem" ec2-user@your-ec2-public-ip
  2. Start Kafka Producer:

    cd kafka_2.12-3.3.1
    bin/kafka-console-producer.sh --topic SAPDEMOTESTING --bootstrap-server your-ec2-public-ip:9092

Step 7: Start Consumer

  1. Open a New SSH Session and Connect to EC2 Instance:

    ssh -i "path/to/your-key.pem" ec2-user@your-ec2-public-ip
  2. Start Kafka Consumer:

    cd kafka_2.12-3.3.1
    bin/kafka-console-consumer.sh --topic SAPDEMOTESTING --bootstrap-server your-ec2-public-ip:9092

Step 8: Stream Consumer Logs to S3

  1. Configure Consumer to Stream Logs to S3:
    • Create an S3 bucket and grant the consumer's AWS credentials permission to write to it (for example, s3:PutObject on the bucket).
    • Use a script to write consumed messages to the S3 bucket; the repository's kafka_consumer_s3.py does this message by message, and a batched variant is sketched below.
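One way to implement this step, sketched below under the assumption that messages are small JSON records, is a kafka-python consumer that buffers records and writes them to S3 in batches with boto3. The bucket name, key prefix, and batch size are placeholders; the repository's kafka_consumer_s3.py (described later) shows a simpler one-object-per-message variant.

# Hedged sketch: buffer consumed records and write them to S3 in batches, which
# keeps the number of small S3 objects (and later Glue/Athena overhead) down.
import json
import time

import boto3
from kafka import KafkaConsumer

BUCKET = 'your-s3-bucket'   # placeholder
PREFIX = 'kafka-logs/'      # placeholder key prefix
BATCH_SIZE = 100            # flush after this many records

s3 = boto3.client('s3')
consumer = KafkaConsumer(
    'SAPDEMOTESTING',
    bootstrap_servers=['your-ec2-public-ip:9092'],  # placeholder
    auto_offset_reset='earliest',
    value_deserializer=lambda b: json.loads(b.decode('utf-8')),
)

buffer = []
for message in consumer:
    buffer.append(message.value)
    if len(buffer) >= BATCH_SIZE:
        key = f'{PREFIX}batch-{int(time.time())}.json'
        # One JSON object per line (JSON Lines), a layout Glue and Athena handle well.
        body = '\n'.join(json.dumps(record) for record in buffer)
        s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode('utf-8'))
        buffer = []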

Step 9: Create AWS Glue Data Catalog

  1. Set Up AWS Glue Crawler:

    • Point the crawler at your S3 bucket so it can infer a schema and populate the data catalog; the repository's glue_crawler_setup.py does this with boto3.
  2. Run the Crawler:

    • Verify that the crawler has created tables in the Glue Data Catalog; a boto3 sketch for starting the crawler and checking its output follows below.
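To run and verify the crawler from code rather than the console, a minimal boto3 sketch is shown below. The crawler and database names assume the glue_crawler_setup.py configuration described later.

# Minimal sketch: start the Glue crawler, wait for it to finish, and list the tables it created.
import time

import boto3

glue = boto3.client('glue')

glue.start_crawler(Name='kafka-crawler')
time.sleep(10)  # give the crawler a moment to enter the RUNNING state

# Poll until the crawler returns to the READY state.
while glue.get_crawler(Name='kafka-crawler')['Crawler']['State'] != 'READY':
    time.sleep(30)

# List the tables the crawler created in the Glue database.
for table in glue.get_tables(DatabaseName='kafka-db')['TableList']:
    print(table['Name'])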

Step 10: Query Data with Amazon Athena

  1. Configure Athena to Use Glue Data Catalog:

    • In the Athena console, set the Glue Data Catalog as the data source.
  2. Run SQL Queries:

    • Use SQL queries in Athena to analyze the Kafka consumer logs stored in S3.

Repository Files and Their Explanation

The repository KAFKA-PYTHON-AWS-CRAWLER-AMAZON-ATHENA contains scripts and configurations that automate the steps mentioned above. Here’s an explanation of the key files and their content:

1. kafka_producer.py

This script sets up a Kafka producer that sends data to the specified Kafka topic.

from kafka import KafkaProducer
from json import dumps
import time

# Replace the IP with your EC2 instance's public IP (the advertised listener).
producer = KafkaProducer(
    bootstrap_servers=['13.201.13.205:9092'],
    value_serializer=lambda x: dumps(x).encode('utf-8')  # serialize dicts as JSON
)

# Send one small JSON message per second to the SAPDEMOTESTING topic.
for j in range(999):
    data = {'number': j}
    producer.send('SAPDEMOTESTING', value=data)
    time.sleep(1)

# Flush so buffered messages are delivered before the script exits.
producer.flush()
producer.close()

2. kafka_consumer_s3.py

This script sets up a Kafka consumer that reads data from a specified Kafka topic and writes it to an S3 bucket.

from kafka import KafkaConsumer
from json import dumps, loads
import boto3

s3 = boto3.client('s3')

# Replace the IP with your EC2 instance's public IP (the advertised listener).
consumer = KafkaConsumer(
    'SAPDEMOTESTING',
    bootstrap_servers=['13.201.13.205:9092'],
    auto_offset_reset='earliest',
    enable_auto_commit=True,
    group_id='my-group',
    value_deserializer=lambda x: loads(x.decode('utf-8'))  # messages are JSON
)

# Write each consumed message to S3 as its own JSON object.
for message in consumer:
    record = message.value
    s3.put_object(
        Bucket='your-s3-bucket',
        Key=f'{record["number"]}.json',
        Body=dumps(record)
    )
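Both scripts depend on the kafka-python and boto3 packages (for example, pip install kafka-python boto3). The consumer also needs AWS credentials with s3:PutObject permission on the target bucket; on EC2, an instance profile role is the usual choice.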

3. glue_crawler_setup.py

This script sets up an AWS Glue Crawler that creates a data catalog from the files stored in an S3 bucket.

import boto3

client = boto3.client('glue')

# The Role must be an existing IAM role that Glue can assume, with read access to the bucket.
response = client.create_crawler(
    Name='kafka-crawler',
    Role='your-iam-role',
    DatabaseName='kafka-db',
    Description='Crawler for Kafka logs in S3',
    Targets={
        'S3Targets': [
            {
                'Path': 's3://your-s3-bucket/'
            },
        ]
    },
    # Glue cron format: run at 15 minutes past every hour.
    Schedule='cron(15 * * * ? *)'
)

4. athena_query.py

This script uses Amazon Athena to run SQL queries on the data catalog created by AWS Glue.

import boto3

client = boto3.client('athena')

response = client.start_query_execution(
    # The table name is whatever the Glue crawler generated for your S3 path.
    QueryString='SELECT * FROM kafka_logs',
    QueryExecutionContext={
        'Database': 'kafka-db'
    },
    ResultConfiguration={
        # Athena writes query results to this S3 location.
        'OutputLocation': 's3://your-athena-query-results/'
    }
)
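start_query_execution only submits the query. A minimal follow-up sketch, assuming the response object returned by the script above, polls for completion and prints the result rows.

# Hedged sketch: wait for the Athena query started above and print its rows.
import time

import boto3

client = boto3.client('athena')
query_id = response['QueryExecutionId']  # from the start_query_execution call above

# Poll the execution status until Athena reports a terminal state.
while True:
    state = client.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(2)

if state == 'SUCCEEDED':
    # The first row holds the column headers; the rest are data rows.
    for row in client.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])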

By following these steps and utilizing the provided scripts, you can set up a Kafka environment on an Amazon EC2 instance, stream logs to S3, create a data catalog with AWS Glue, and query the data with Amazon Athena. This setup enables real-time data processing and powerful data analytics capabilities, leveraging the scalability and flexibility of AWS services.
