A simple HTTPd log (a.k.a. access.log) parser for Spark SQL.
Currently, the Combined and Common log formats are supported.
When starting spark-sql:
spark-sql --packages net.sanori.spark:access-log_2.11:0.1.0
In SQL, you can register the user-defined function and then use it:
-- attach ToCombined as to_combined(text_line)
CREATE OR REPLACE FUNCTION to_combined
AS "net.sanori.spark.ToCombined";
-- read raw log file as one column table
CREATE OR REPLACE TEMP VIEW accessLogText
USING text
OPTIONS (path "access.log");
-- create parsed log as a table
CREATE OR REPLACE TEMP VIEW accessLog
AS SELECT log.*
FROM (
SELECT to_combined(value) AS log
FROM accessLogText
);
When starting spark-shell:
spark-shell --packages net.sanori.spark:access-log_2.11:0.1.0
Or in build.sbt:
libraryDependencies += "net.sanori.spark" %% "access-log" % "0.1.0"
// DataFrame API: parse each line with to_combined, then flatten the resulting struct
import net.sanori.spark.accessLog.to_combined
import org.apache.spark.sql.functions._
val lineDf = spark.read.text("access.log")
val logDf = lineDf
  .select(to_combined(col("value")).as("log"))
  .select(col("log.*"))
// Dataset API: map each line through toCombinedLog
import net.sanori.spark.accessLog.toCombinedLog
val lineDs = spark.read.textFile("access.log")
val logDs = lineDs.map(toCombinedLog)
// RDD API: the same function applied to an RDD of lines
import net.sanori.spark.accessLog.toCombinedLog
val lines = sc.textFile("access.log")
val rdd = lines.map(toCombinedLog)
Combined or Common logs are parsed into a table with the following columns:
name | type | default value |
---|---|---|
remoteAddr | String | "" |
remoteUser | String | "" |
time | Timestamp | 1970-01-01T00:00:00Z |
request | String | "" |
status | String | "" |
bytesSent | Long | null |
httpReferer | String | "" |
httpUserAgent | String | "" |
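
To make the columns and their defaults concrete, here is a minimal sketch in plain Scala of how one log line could map onto them. It is not this library's implementation: `LogRecord`, `LogSketch`, and `parseLine` are hypothetical names, the regular expression is an assumption about the Combined and Common layouts, and falling back to the default values when a line or field cannot be parsed is likewise an assumption. `Option[Long]` stands in for the nullable `bytesSent` column.

```scala
import java.sql.Timestamp
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale
import scala.util.Try

// Hypothetical record type mirroring the column table (not the library's class).
case class LogRecord(
  remoteAddr: String = "",
  remoteUser: String = "",
  time: Timestamp = new Timestamp(0L), // 1970-01-01T00:00:00Z
  request: String = "",
  status: String = "",
  bytesSent: Option[Long] = None,      // None plays the role of SQL null
  httpReferer: String = "",
  httpUserAgent: String = ""
)

object LogSketch {
  // Common format, optionally followed by the two quoted Combined fields.
  private val Line =
    """^(\S+) \S+ (\S+) \[([^\]]+)\] "([^"]*)" (\S+) (\S+)(?: "([^"]*)" "([^"]*)")?$""".r

  // Timestamp layout such as 10/Oct/2000:13:55:36 -0700
  private val TimeFormat =
    DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)

  def parseLine(line: String): LogRecord = line match {
    case Line(addr, user, time, request, status, bytes, referer, agent) =>
      LogRecord(
        remoteAddr = addr,
        remoteUser = user,
        // Assumption: an unparsable timestamp falls back to the epoch default.
        time = Try(Timestamp.from(OffsetDateTime.parse(time, TimeFormat).toInstant))
          .getOrElse(new Timestamp(0L)),
        request = request,
        status = status,
        // "-" (no body sent) is not a number, so it becomes None.
        bytesSent = Try(bytes.toLong).toOption,
        // The optional groups are null for Common-format lines.
        httpReferer = Option(referer).getOrElse(""),
        httpUserAgent = Option(agent).getOrElse("")
      )
    // Assumption: a line that does not match yields all default values.
    case _ => LogRecord()
  }
}
```

Under these assumptions, a Common-format line fills the first six columns and leaves `httpReferer` and `httpUserAgent` at `""`, while a Combined-format line fills all eight.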
sbt clean package
generates access-log_2.11-0.1.0.jar in target/scala-2.11.
- To simplify analysis of web server logs
- Most web (HTTP) server logs are in Combined or Common log format.
- To provide a user-defined function that can be used from the spark-sql command
If you want to view access.log as a table in Hive rather than Spark, or need to process other log formats, nielsbasjes/logparser might be a better solution.
Suggestions, ideas, comments, and pull requests are welcome.