haozhang-x/log-analysis-spark


Structured Streaming Log Analysis

Project Introduction

A Python script simulates website access logs and sends them to Kafka as messages. Spark Structured Streaming then processes the log data from Kafka and calculates the total page views (PV), the PV of each IP, the PV of each search engine, the PV of each keyword, and the PV of each terminal, and writes the final results to an RDBMS.
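To make the PV metrics concrete, here is a minimal plain-Python sketch of the same kind of counting on a small batch of lines. It is not the project's Spark job. The log layout (a combined-log-style line), the `q` query parameter used for keywords, and the choice to treat the referer host as the "search engine" are all assumptions made for illustration; the format actually produced by sample_web_log.py may differ.

```python
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Assumed combined-log-style layout (hypothetical, not taken from the project):
# ip - - [time] "request" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_page_views(lines):
    """Count total PV, PV per IP, PV per referer host, and PV per keyword."""
    total = 0
    by_ip = Counter()
    by_engine = Counter()
    by_keyword = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        total += 1
        by_ip[match.group("ip")] += 1
        referer = urlparse(match.group("referer"))
        if referer.netloc:
            # Assumption: the referer host stands in for the search engine.
            by_engine[referer.netloc] += 1
            # Assumption: the search keyword is carried in the "q" parameter.
            for keyword in parse_qs(referer.query).get("q", []):
                by_keyword[keyword] += 1
    return {
        "total": total,
        "by_ip": by_ip,
        "by_engine": by_engine,
        "by_keyword": by_keyword,
    }


# Hand-written sample lines in the assumed layout.
SAMPLE_LINES = [
    '10.0.0.1 - - [01/Jan/2024:00:00:00 +0000] "GET /index.html HTTP/1.1" '
    '200 512 "http://www.example.com/s?q=spark" "Mozilla/5.0"',
    '10.0.0.1 - - [01/Jan/2024:00:00:05 +0000] "GET /about.html HTTP/1.1" '
    '200 256 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [01/Jan/2024:00:00:09 +0000] "GET /index.html HTTP/1.1" '
    '200 512 "http://www.example.com/s?q=kafka" "Mozilla/5.0"',
]
```

Calling `count_page_views(SAMPLE_LINES)` returns the four counters in one dictionary. In the streaming job, each of these counters would correspond to a grouped aggregation over the Kafka stream rather than an in-memory `Counter`.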

Sample log data

You can find some examples of logs generated in Python here.

The generated logs are sent to Kafka as messages.

sample_web_log.py is used to generate the logs. You can use the following command to produce Kafka messages:

python sample_web_log.py | kafka-console-producer.sh --broker-list your_broker_list --topic your_topic
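The pipe above works because the generator writes one log line per message to standard output. The following is a minimal sketch of such a generator, not the project's sample_web_log.py; the log layout, paths, referers, and user agents are hypothetical placeholders.

```python
import random
import sys
from datetime import datetime, timezone

# Hypothetical placeholder values; the real script may use different ones.
PATHS = ["/index.html", "/about.html", "/search"]
REFERERS = ["-", "http://www.example.com/s?q=spark", "http://www.example.com/s?q=kafka"]
USER_AGENTS = ["Mozilla/5.0 (Windows NT 10.0)", "Mozilla/5.0 (iPhone)"]


def random_ip(rng):
    """Return a random dotted-quad IPv4 address string."""
    return ".".join(str(rng.randint(1, 254)) for _ in range(4))


def make_log_line(rng, now=None):
    """Build one log line in an assumed combined-log-style layout."""
    if now is None:
        now = datetime.now(timezone.utc)
    timestamp = now.strftime("%d/%b/%Y:%H:%M:%S %z")
    return '{ip} - - [{ts}] "GET {path} HTTP/1.1" 200 {size} "{ref}" "{ua}"'.format(
        ip=random_ip(rng),
        ts=timestamp,
        path=rng.choice(PATHS),
        size=rng.randint(100, 5000),
        ref=rng.choice(REFERERS),
        ua=rng.choice(USER_AGENTS),
    )


def generate(count, seed=None, out=sys.stdout):
    """Write `count` log lines to `out`, one line per Kafka message."""
    rng = random.Random(seed)
    for _ in range(count):
        out.write(make_log_line(rng) + "\n")


if __name__ == "__main__":
    generate(100)
```

Because kafka-console-producer.sh treats each input line as one message, printing one record per line is enough; no Kafka client library is needed on the Python side.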

You can also use crontab to generate Kafka messages at regular intervals, for example every five minutes:

crontab -e
*/5 * * * * python sample_web_log.py | kafka-console-producer.sh --broker-list your_broker_list --topic your_topic

Other

There are two files, application.properties and mysql.sql, under the resources folder. application.properties holds the database connection information, and mysql.sql contains the SQL statements that create the database and its tables.