This repository contains two Python scripts for log analytics: `kafka_producer.py` and `spark_stream.py`. These scripts are designed to work together to process log data using Apache Kafka and Apache Spark.
To initiate the process, we need to feed the input file into the Kafka producer. This can be achieved through various approaches; one such approach is injecting the data line by line (feel free to come up with other ideas). The producer is responsible for publishing the data to a specific Kafka topic.
After the data is injected, the Kafka consumer, which is subscribed to the same topic, will periodically check for new data. In this case, it checks for available data every 10 seconds.
Once the data is received by the consumer, we will utilize Spark to perform transformations and conduct exploratory data analysis (EDA), employing techniques similar to those discussed in the article.
Finally, the results (e.g., DataFrames or RDDs from the final transformations) obtained from each analysis will be converted into the Parquet file format, and the resulting Parquet files will be stored in HDFS.
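As a rough sketch of this end-to-end flow, the snippet below assumes a Spark Structured Streaming job acting as the consumer: it reads the topic in 10-second micro-batches and writes the raw lines as Parquet to HDFS. The topic name, checkpoint location, and output path are placeholders, not necessarily the values used in `spark_stream.py`.

```python
from pyspark.sql import SparkSession

# Placeholder topic and HDFS paths -- adjust to match your setup.
KAFKA_TOPIC = "logs"
HDFS_OUTPUT = "hdfs://127.0.0.1:9000/logs/raw_parquet"
CHECKPOINT = "hdfs://127.0.0.1:9000/logs/checkpoints/raw"

spark = (SparkSession.builder
         .appName("log-stream-sketch")
         .getOrCreate())

# Subscribe to the Kafka topic; each record's value holds one log line.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", KAFKA_TOPIC)
       .load()
       .selectExpr("CAST(value AS STRING) AS log_line"))

# Micro-batch every 10 seconds and persist each batch as Parquet on HDFS.
query = (raw.writeStream
         .trigger(processingTime="10 seconds")
         .format("parquet")
         .option("path", HDFS_OUTPUT)
         .option("checkpointLocation", CHECKPOINT)
         .start())

query.awaitTermination()
```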
The `kafka_producer.py` script reads a log file and produces log entries to a Kafka topic. It uses the `KafkaProducer` class from the `kafka` library to connect to a Kafka broker and send log entries.
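A minimal sketch of such a producer, assuming the `kafka-python` package and placeholder values for `log_file_path` and `kafka_topic`:

```python
from kafka import KafkaProducer

# Placeholder configuration -- the real script defines these variables itself.
log_file_path = "access.log"
kafka_topic = "logs"

# Connect to the local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Publish the log file line by line, one Kafka message per log entry.
with open(log_file_path, "r") as log_file:
    for line in log_file:
        producer.send(kafka_topic, value=line.strip().encode("utf-8"))

# Make sure all buffered messages are delivered before exiting.
producer.flush()
producer.close()
```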
Before running the `kafka_producer.py` script, make sure you have the following:
- Kafka broker running on `localhost:9092`.
- Log file path specified in the `log_file_path` variable.
- Kafka topic name specified in the `kafka_topic` variable.
To run the Kafka producer script, execute the following command:
`python kafka_producer.py`
In this section, we will be working with a combination of Big Data technologies, including Kafka (producer and consumer), Spark Streaming, Parquet files, and HDFS File System. The objective is to process and analyze log files by establishing a flow that begins with a Kafka producer, followed by a Kafka consumer that performs transformations using Spark. The transformed data will then be converted into Parquet file format and stored in the HDFS.
The `spark_stream.py` script reads log entries from a Kafka topic and performs various analytics on the log data using Apache Spark Streaming.
Before running the `spark_stream.py` script, ensure you have the following:
- Apache Spark installed and configured.
- Kafka broker running on `localhost:9092`.
- Kafka topic name specified in the `kafka_topic` variable.
- HDFS (Hadoop Distributed File System) configured and running on `hdfs://127.0.0.1:9000`.
Start the HDFS and YARN daemons before launching the job:

`$HADOOP_HOME/sbin/start-all.sh`
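If Spark runs outside a Hadoop-managed environment, the session needs to know where the namenode lives. A minimal sketch, assuming the address above (the actual script may simply use full `hdfs://` URIs instead):

```python
from pyspark.sql import SparkSession

# Point Spark's Hadoop configuration at the local HDFS namenode so that
# output paths resolve against hdfs://127.0.0.1:9000.
spark = (SparkSession.builder
         .appName("spark_stream")
         .config("spark.hadoop.fs.defaultFS", "hdfs://127.0.0.1:9000")
         .getOrCreate())
```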
To run the Spark Streaming script, execute the following command:
`spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0 spark_stream.py`
Please note that you may need to adjust the Spark package version (`spark-sql-kafka-0-10_2.12:3.4.0`) based on your Spark installation.
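If you are unsure which version to use, checking the installed PySpark version from the same environment is usually enough:

```python
import pyspark

# The Kafka connector artifact should match the Spark version, e.g.
# Spark 3.4.0 pairs with spark-sql-kafka-0-10_2.12:3.4.0 (Scala 2.12 build).
print(pyspark.__version__)
```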
The `spark_stream.py` script performs several analytics on the log data received from Kafka and writes the results to Parquet files in the HDFS output path specified in the script. Here are some of the analytics performed:
- Content size statistics
- Missing values analysis
- HTTP status code distribution
- Frequent hosts
- Frequent endpoints
- Top ten error endpoints
- Unique hosts
- Unique daily hosts
- Average number of daily requests per host
- 404 response codes analysis
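As an illustration, the HTTP status code distribution could be computed and persisted along these lines; the input path, output path, and `status` column name are assumptions made for the sketch rather than the exact values used in `spark_stream.py`.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-analytics-sketch").getOrCreate()

# Assume the parsed log entries were already written to HDFS as Parquet
# with (at least) a `status` column holding the HTTP response code.
logs_df = spark.read.parquet("hdfs://127.0.0.1:9000/logs/parsed")

# HTTP status code distribution: count of log entries per response code.
status_counts = (logs_df
                 .groupBy("status")
                 .count()
                 .orderBy(F.desc("count")))

# Persist the result as Parquet in the HDFS output path.
status_counts.write.mode("overwrite").parquet(
    "hdfs://127.0.0.1:9000/logs/analytics/status_code_distribution")
```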
The `SectionB_demo.ipynb` notebook demonstrates the results of the log analytics performed using the Kafka producer and Spark Streaming. It provides a step-by-step walkthrough of the analytics process and presents visualizations and analysis of the log data.
To run the notebook, ensure you have Jupyter Notebook installed and execute the following command:
`jupyter notebook SectionB_demo.ipynb`
This project is based on the article "Scalable Log Analytics with Apache Spark: A Comprehensive Case Study" by Dipanjan (DJ) Sarkar. You can find the article here.
This project is licensed under the MIT License.