spark-streaming-data-persistence-simulations

Examples showing how Spark streaming applications can be simulated and data persisted to Azure blob, Hive table and Azure SQL Table with Azure Servicebus Eventhubs as flow control manager.

Usage

EventhubsEventCount

spark-submit --master yarn-cluster <...SparkConfigurations...> --class com.microsoft.spark.streaming.simulations.workloads.EventHubsEventCount spark-streaming-data-persistence-simulations-2.0.0.jar --eventhubs-namespace --eventhubs-name --eventSizeInChars <EventSizeInChars --partition-count $partitionCount --batch-interval-in-seconds --checkpoint-directory --event-count-folder --job-timeout-in-minutes $timeoutInMinutes

EventhubsToAzureBlobAsJSON

spark-submit --master yarn-cluster <...SparkConfigurations...> --class com.microsoft.spark.streaming.simulations.workloads.EventhubsToAzureBlobAsJSON spark-streaming-data-persistence-simulations-2.0.0.jar --eventhubs-namespace --eventhubs-name --eventSizeInChars --partition-count $partitionCount --batch-interval-in-seconds --checkpoint-directory --event-count-folder --event-store-folder --job-timeout-in-minutes

EventhubsToHiveTable

spark-submit --master yarn-cluster <...SparkConfigurations...> --class com.microsoft.spark.streaming.simulations.workloads.EventhubsToHiveTable spark-streaming-data-persistence-simulations-2.0.0.jar --eventhubs-namespace --eventhubs-name --eventSizeInChars --partition-count --batch-interval-in-seconds --checkpoint-directory --event-count-folder --event-hive-table --job-timeout-in-minutes

EventhubsToSQLTable

spark-submit --master yarn-cluster <...SparkConfigurations...> --class com.microsoft.spark.streaming.simulations.workloads.EventhubsToSQLTable spark-streaming-data-persistence-simulations-2.0.0.jar --eventhubs-namespace --eventhubs-name --eventSizeInChars --partition-count --batch-interval-in-seconds --checkpoint-directory --event-count-folder --sql-server-fqdn --sql-database-name --database-username --database-password --event-sql-table --job-timeout-in-minutes

Example:

  spark-submit --master yarn-cluster --class com.microsoft.spark.streaming.simulations.workloads.EventhubsEventCount --num-executors 24 
  --executor-memory 2G --executor-cores 1 --driver-memory 4G spark-streaming-data-persistence-simulations-2.0.0.jar 
  --eventhubs-namespace 'sparkeventhubswestus' --eventhubs-name 'eventhubs8westus' --partition-count 8 --batch-interval-in-seconds 15 
  --checkpoint-directory 'hdfs://mycluster/EventCheckpoint-15-8-16' --event-count-folder '/EventCount-15-8-16/EventCount15' 
  --job-timeout-in-minutes -1

Note:

Use any string for --eventhubs-namespace and --eventhubs-name for launching the Spark streaming application.
Use the same strings to restart the Spark streaming application to use the saved checkpoint.
Use any integer for --partition-count and use at least double that number for --num-executors

Build Prerequisites

In order to build and run the examples, you need to have:

Java 1.8 SDK.
Maven 3.x
Scala 2.11

Build Command

mvn clean
mvn package

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

spark-streaming-data-persistence-simulations

Usage

EventhubsEventCount

EventhubsToAzureBlobAsJSON

EventhubsToHiveTable

EventhubsToSQLTable

Example:

Note:

Build Prerequisites

Build Command

Files

README.md

Latest commit

History

README.md

File metadata and controls

spark-streaming-data-persistence-simulations

Usage

EventhubsEventCount

EventhubsToAzureBlobAsJSON

EventhubsToHiveTable

EventhubsToSQLTable

Example:

Note:

Build Prerequisites

Build Command