Skip to content

Latest commit

 

History

History
62 lines (48 loc) · 3.7 KB

File metadata and controls

62 lines (48 loc) · 3.7 KB

spark-streaming-data-persistence-simulations

Examples showing how Spark streaming applications can be simulated and data persisted to Azure blob, Hive table and Azure SQL Table with Azure Servicebus Eventhubs as flow control manager.

Usage

EventhubsEventCount

spark-submit --master yarn-cluster <...SparkConfigurations...> --class com.microsoft.spark.streaming.simulations.workloads.EventHubsEventCount spark-streaming-data-persistence-simulations-2.0.0.jar --eventhubs-namespace --eventhubs-name --eventSizeInChars <EventSizeInChars --partition-count $partitionCount --batch-interval-in-seconds --checkpoint-directory --event-count-folder --job-timeout-in-minutes $timeoutInMinutes

EventhubsToAzureBlobAsJSON

spark-submit --master yarn-cluster <...SparkConfigurations...> --class com.microsoft.spark.streaming.simulations.workloads.EventhubsToAzureBlobAsJSON spark-streaming-data-persistence-simulations-2.0.0.jar --eventhubs-namespace --eventhubs-name --eventSizeInChars --partition-count $partitionCount --batch-interval-in-seconds --checkpoint-directory --event-count-folder --event-store-folder --job-timeout-in-minutes

EventhubsToHiveTable

spark-submit --master yarn-cluster <...SparkConfigurations...> --class com.microsoft.spark.streaming.simulations.workloads.EventhubsToHiveTable spark-streaming-data-persistence-simulations-2.0.0.jar --eventhubs-namespace --eventhubs-name --eventSizeInChars --partition-count --batch-interval-in-seconds --checkpoint-directory --event-count-folder --event-hive-table --job-timeout-in-minutes

EventhubsToSQLTable

spark-submit --master yarn-cluster <...SparkConfigurations...> --class com.microsoft.spark.streaming.simulations.workloads.EventhubsToSQLTable spark-streaming-data-persistence-simulations-2.0.0.jar --eventhubs-namespace --eventhubs-name --eventSizeInChars --partition-count --batch-interval-in-seconds --checkpoint-directory --event-count-folder --sql-server-fqdn --sql-database-name --database-username --database-password --event-sql-table --job-timeout-in-minutes

Example:

  spark-submit --master yarn-cluster --class com.microsoft.spark.streaming.simulations.workloads.EventhubsEventCount --num-executors 24 
  --executor-memory 2G --executor-cores 1 --driver-memory 4G spark-streaming-data-persistence-simulations-2.0.0.jar 
  --eventhubs-namespace 'sparkeventhubswestus' --eventhubs-name 'eventhubs8westus' --partition-count 8 --batch-interval-in-seconds 15 
  --checkpoint-directory 'hdfs://mycluster/EventCheckpoint-15-8-16' --event-count-folder '/EventCount-15-8-16/EventCount15' 
  --job-timeout-in-minutes -1

Note:

  1. Use any string for --eventhubs-namespace and --eventhubs-name for launching the Spark streaming application.
  2. Use the same strings to restart the Spark streaming application to use the saved checkpoint.
  3. Use any integer for --partition-count and use at least double that number for --num-executors

Build Prerequisites

In order to build and run the examples, you need to have:

  1. Java 1.8 SDK.
  2. Maven 3.x
  3. Scala 2.11

Build Command

mvn clean
mvn package