This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Python Bindings for launching PySpark Jobs from the JVM (v1) #351

Closed
Changes from 16 commits
20 commits
d3cf58f
Adding PySpark Submit functionality. Launching Python from JVM
ifilonenko Jun 16, 2017
bafc13c
Addressing scala idioms related to PR351
ifilonenko Jun 17, 2017
59d9f0a
Removing extends Logging which was necessary for LogInfo
ifilonenko Jun 17, 2017
4daf634
Refactored code to leverage the ContainerLocalizedFileResolver
ifilonenko Jun 20, 2017
51105ca
Modified Unit tests so that they would pass
ifilonenko Jun 20, 2017
bd30f40
Modified Unit Test input to pass Unit Tests
ifilonenko Jun 20, 2017
720776e
Setup working environent for integration tests for PySpark
ifilonenko Jun 21, 2017
4b5f470
Comment out Python thread logic until Jenkins has python in Python
ifilonenko Jun 21, 2017
1361a26
Modifying PythonExec to pass on Jenkins
ifilonenko Jun 21, 2017
0abc3b1
Modifying python exec
ifilonenko Jun 21, 2017
0869b07
Added unit tests to ClientV2 and refactored to include pyspark submis…
ifilonenko Jun 23, 2017
38d48ce
Merge branch 'branch-2.1-kubernetes' of https://github.com/apache-spa…
ifilonenko Jun 23, 2017
9bf7b9d
Modified unit test check
ifilonenko Jun 23, 2017
4561194
Scalastyle
ifilonenko Jun 23, 2017
2cf96cc
Merged with PR 348 and added further tests and minor documentation
ifilonenko Jun 23, 2017
eb1079a
PR 348 file conflicts
ifilonenko Jun 23, 2017
4a6b779
Refactored unit tests and styles
ifilonenko Jun 28, 2017
363919a
further scala stylzing and logic
ifilonenko Jun 28, 2017
9c7adb1
Modified unit tests to be more specific towards Class in question
ifilonenko Jun 28, 2017
0388aa4
Removed space delimiting for methods
ifilonenko Jun 28, 2017
1 change: 1 addition & 0 deletions README.md
@@ -24,6 +24,7 @@ We've been asked by an Apache Spark Committer to work outside of the Apache infr

This is a collaborative effort by several folks from different companies who are interested in seeing this feature be successful. Companies active in this project include (alphabetically):

- Bloomberg
- Google
- Haiwen
- Hyperpilot
14 changes: 10 additions & 4 deletions core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@@ -335,8 +335,8 @@ object SparkSubmit {
(clusterManager, deployMode) match {
case (KUBERNETES, CLIENT) =>
printErrorAndExit("Client mode is currently not supported for Kubernetes.")
case (KUBERNETES, CLUSTER) if args.isPython || args.isR =>
printErrorAndExit("Kubernetes does not currently support python or R applications.")
case (KUBERNETES, CLUSTER) if args.isR =>
printErrorAndExit("Kubernetes does not currently support R applications.")
case (STANDALONE, CLUSTER) if args.isPython =>
printErrorAndExit("Cluster deploy mode is currently not supported for python " +
"applications on standalone clusters.")
@@ -620,8 +620,14 @@ object SparkSubmit {

if (isKubernetesCluster) {
childMainClass = "org.apache.spark.deploy.kubernetes.submit.Client"
childArgs += args.primaryResource
childArgs += args.mainClass
if (args.isPython) {
childArgs += args.primaryResource
Comment (Member):
nit: could factor out childArgs += args.primaryResource

Comment (Member Author):
Client.main() reads val mainAppResource = args(0), so the first argument has to point to the mainAppResource.
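
The argument order discussed in this thread can be sketched as follows. This is a hedged illustration, not the PR's code: `buildChildArgs` and `parse` are hypothetical helpers, and reading the main class from `args(1)` is an assumption inferred from the `run(sparkConf, mainAppResource, mainClass, appArgs)` call in the diff. It factors out the shared primary-resource argument while keeping it at position 0.

```scala
object ChildArgsSketch {
  // Runner class name copied from the diff above.
  val PythonRunnerClass = "org.apache.spark.deploy.PythonRunner"

  // Hypothetical helper: builds the positional arguments handed to Client.main().
  // pyFiles may be null when --py-files is not given.
  def buildChildArgs(
      isPython: Boolean,
      primaryResource: String,
      mainClass: String,
      pyFiles: String,
      appArgs: Seq[String]): Seq[String] = {
    // Only the middle of the list differs between the two branches.
    val languageSpecific =
      if (isPython) Seq(PythonRunnerClass, pyFiles) else Seq(mainClass)
    (primaryResource +: languageSpecific) ++ appArgs
  }

  // Assumed positional parsing on the receiving side: resource, main class, rest.
  def parse(args: Array[String]): (String, String, Array[String]) =
    (args(0), args(1), args.drop(2))

  def main(args: Array[String]): Unit = {
    val built = buildChildArgs(
      isPython = true,
      primaryResource = "local:///opt/spark/examples/src/main/python/pi.py",
      mainClass = null,
      pyFiles = null,
      appArgs = Seq("100"))
    val (resource, mainClass, rest) = parse(built.toArray)
    println(s"resource=$resource mainClass=$mainClass rest=${rest.mkString("[", ", ", "]")}")
  }
}
```

With this layout the Python branch leaves the `--py-files` value (possibly null) as the first element of the remaining arguments, which is the situation the later review threads discuss.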

childArgs += "org.apache.spark.deploy.PythonRunner"
childArgs += args.pyFiles
} else {
childArgs += args.primaryResource
childArgs += args.mainClass
}
childArgs ++= args.childArgs
}

26 changes: 26 additions & 0 deletions docs/running-on-kubernetes.md
@@ -180,6 +180,32 @@ The above mechanism using `kubectl proxy` can be used when we have authenticatio
kubernetes-client library does not support. Authentication using X509 Client Certs and OAuth tokens
is currently supported.

### Running PySpark

Running PySpark on Kubernetes uses the same spark-submit logic as launching on YARN and Mesos.
Additional Python files can be distributed by passing them to spark-submit with `--py-files`.

Below is an example submission:


```
bin/spark-submit \
--deploy-mode cluster \
--master k8s://http://127.0.0.1:8001 \
--kubernetes-namespace default \
--conf spark.executor.memory=500m \
--conf spark.driver.memory=1G \
--conf spark.driver.cores=1 \
--conf spark.executor.cores=1 \
--conf spark.executor.instances=1 \
--conf spark.app.name=spark-pi \
--conf spark.kubernetes.driver.docker.image=spark-driver-py:latest \
--conf spark.kubernetes.executor.docker.image=spark-executor-py:latest \
--conf spark.kubernetes.initcontainer.docker.image=spark-init:latest \
--py-files local:///opt/spark/examples/src/main/python/sort.py \
local:///opt/spark/examples/src/main/python/pi.py 100
```

## Dynamic Executor Scaling

Spark on Kubernetes supports Dynamic Allocation with cluster mode. This mode requires running
@@ -67,6 +67,8 @@ package object constants {
private[spark] val ENV_DRIVER_ARGS = "SPARK_DRIVER_ARGS"
private[spark] val ENV_DRIVER_JAVA_OPTS = "SPARK_DRIVER_JAVA_OPTS"
private[spark] val ENV_MOUNTED_FILES_DIR = "SPARK_MOUNTED_FILES_DIR"
private[spark] val ENV_PYSPARK_FILES = "PYSPARK_FILES"
private[spark] val ENV_PYSPARK_PRIMARY = "PYSPARK_PRIMARY"

// Bootstrapping dependencies with the init-container
private[spark] val INIT_CONTAINER_ANNOTATION = "pod.beta.kubernetes.io/init-containers"
@@ -47,11 +47,11 @@ private[spark] class Client(
appName: String,
kubernetesResourceNamePrefix: String,
kubernetesAppId: String,
mainAppResource: String,
pythonResource: Option[PythonSubmissionResources],
mainClass: String,
sparkConf: SparkConf,
appArgs: Array[String],
sparkJars: Seq[String],
sparkFiles: Seq[String],
waitForAppCompletion: Boolean,
kubernetesClient: KubernetesClient,
initContainerComponentsProvider: DriverInitContainerComponentsProvider,
@@ -82,9 +82,10 @@ private[spark] class Client(
org.apache.spark.internal.config.DRIVER_JAVA_OPTIONS)

def run(): Unit = {
validateNoDuplicateFileNames(sparkJars)
validateNoDuplicateFileNames(sparkFiles)

val arguments = pythonResource match {
case Some(p) => p.arguments
case None => appArgs
}
val driverCustomLabels = ConfigurationUtils.combinePrefixedKeyValuePairsWithDeprecatedConf(
sparkConf,
KUBERNETES_DRIVER_LABEL_PREFIX,
@@ -136,7 +137,7 @@ private[spark] class Client(
.endEnv()
.addNewEnv()
.withName(ENV_DRIVER_ARGS)
.withValue(appArgs.mkString(" "))
.withValue(arguments.mkString(" "))
.endEnv()
.withNewResources()
.addToRequests("cpu", driverCpuQuantity)
@@ -182,9 +183,14 @@ private[spark] class Client(
.map(_.build())

val containerLocalizedFilesResolver = initContainerComponentsProvider
.provideContainerLocalizedFilesResolver()
.provideContainerLocalizedFilesResolver(mainAppResource)
val resolvedSparkJars = containerLocalizedFilesResolver.resolveSubmittedSparkJars()
val resolvedSparkFiles = containerLocalizedFilesResolver.resolveSubmittedSparkFiles()
val resolvedPySparkFiles = containerLocalizedFilesResolver.resolveSubmittedPySparkFiles()
val resolvedPrimaryPySparkResource = pythonResource match {
case Some(p) => p.primarySparkResource(containerLocalizedFilesResolver)
case None => ""
}

val initContainerBundler = initContainerComponentsProvider
.provideInitContainerBundle(maybeSubmittedResourceIdentifiers.map(_.ids()),
@@ -221,7 +227,7 @@ private[spark] class Client(
val resolvedDriverJavaOpts = resolvedSparkConf.getAll.map {
case (confKey, confValue) => s"-D$confKey=$confValue"
}.mkString(" ") + driverJavaOptions.map(" " + _).getOrElse("")
val resolvedDriverPod = podWithInitContainerAndMountedCreds.editSpec()
val resolvedDriverPodBuilder = podWithInitContainerAndMountedCreds.editSpec()
.editMatchingContainer(new ContainerNameEqualityPredicate(driverContainer.getName))
.addNewEnv()
.withName(ENV_MOUNTED_CLASSPATH)
@@ -233,7 +239,16 @@ private[spark] class Client(
.endEnv()
.endContainer()
.endSpec()
.build()
val resolvedDriverPod = pythonResource match {
case Some(p) => p.driverPod(
initContainerComponentsProvider,
resolvedPrimaryPySparkResource,
resolvedPySparkFiles.mkString(","),
driverContainer.getName,
resolvedDriverPodBuilder
)
case None => resolvedDriverPodBuilder.build()
}
Utils.tryWithResource(
kubernetesClient
.pods()
@@ -271,17 +286,6 @@ private[spark] class Client(
}
}
}

private def validateNoDuplicateFileNames(allFiles: Seq[String]): Unit = {
val fileNamesToUris = allFiles.map { file =>
(new File(Utils.resolveURI(file).getPath).getName, file)
}
fileNamesToUris.groupBy(_._1).foreach {
case (fileName, urisWithFileName) =>
require(urisWithFileName.size == 1, "Cannot add multiple files with the same name, but" +
s" file name $fileName is shared by all of these URIs: $urisWithFileName")
}
}
}

private[spark] object Client {
@@ -292,22 +296,38 @@ private[spark] object Client {
val appArgs = args.drop(2)
run(sparkConf, mainAppResource, mainClass, appArgs)
}

def run(
sparkConf: SparkConf,
mainAppResource: String,
mainClass: String,
appArgs: Array[String]): Unit = {
val isPython = mainAppResource.endsWith(".py")
Comment:
Alluding to a remark from earlier - we might want to treat these arguments differently. For example we could take a command line argument that is a "language mode" and expect SparkSubmit to give us the right language mode and handle accordingly - e.g. Scala, Python, R. We have control over the arguments that SparkSubmit.scala sends us and so we should encode the arguments clearly if we can.

Comment (Member Author):
True, but R will also have MainResource be the Python file, and there are no --r-files, so the arguments logic is only for Python. I think it is simple enough that refactoring the arguments might not be necessary. It is something to consider, and I agree with what you are saying.

Comment:
I'm more wary of the fact that we're matching against Nil here when we could be doing this in a type-safe way.

Comment (Member Author):
The reason I match on Nil is that the appArgs passed in from spark-submit are (null 500) when --py-files is null. The first value will always be the --py-files value, and the rest are the arguments passed into the file itself. I don't see a problem with that exactly.
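
One way to read the type-safety concern in this thread is to convert the raw argument array into a small typed value once, so later code never has to match on a null or Nil head. The sketch below is only an illustration under that assumption: `PythonArgs` and `parse` are hypothetical names and are not part of the PR.

```scala
object PythonArgsSketch {
  // Hypothetical type: the --py-files entries and the script's own arguments.
  final case class PythonArgs(pyFiles: Seq[String], scriptArgs: Seq[String])

  // The first element carries the comma-separated --py-files value (possibly null);
  // everything after it is passed to the Python script itself.
  def parse(appArgs: Array[String]): PythonArgs = {
    val pyFiles = appArgs.headOption
      .flatMap(Option(_))        // a null reference becomes None
      .filter(_.nonEmpty)        // an empty string means no files
      .map(_.split(",").toSeq)
      .getOrElse(Seq.empty)
    PythonArgs(pyFiles, appArgs.drop(1).toSeq)
  }

  def main(args: Array[String]): Unit = {
    println(parse(Array[String](null, "500")))
    println(parse(Array("a.py,b.py", "500")))
  }
}
```

The null check happens in exactly one place, and callers receive an empty `Seq` rather than a null when no `--py-files` value was given.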

val pythonResource: Option[PythonSubmissionResources] =
if (isPython) {
Option(new PythonSubmissionResources(mainAppResource, appArgs))
} else None
// Since you might need jars for SQL UDFs in PySpark
def sparkJarFilter() : Seq[String] = pythonResource match {
case Some(p) => p.sparkJars
case None =>
Option(mainAppResource)
.filterNot(_ == SparkLauncher.NO_RESOURCE)
.toSeq
}
val sparkJars = sparkConf.getOption("spark.jars")
.map(_.split(","))
.getOrElse(Array.empty[String]) ++
Option(mainAppResource)
.filterNot(_ == SparkLauncher.NO_RESOURCE)
.toSeq
.getOrElse(Array.empty[String]) ++ sparkJarFilter()
val launchTime = System.currentTimeMillis
val sparkFiles = sparkConf.getOption("spark.files")
.map(_.split(","))
.getOrElse(Array.empty[String])
val pySparkFiles: Array[String] = pythonResource match {
case Some(p) => p.pySparkFiles
case None => Array.empty[String]
}
validateNoDuplicateFileNames(sparkJars)
validateNoDuplicateFileNames(sparkFiles)
if (pythonResource.isDefined) {validateNoDuplicateFileNames(pySparkFiles)}
Comment:
Use Option.foreach
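
A minimal sketch of this suggestion, assuming it refers to Scala's `Option.foreach`. The validator below is a simplified stand-in for the PR's `validateNoDuplicateFileNames` that uses `java.net.URI` instead of Spark's `Utils.resolveURI`, and `validateIfPython` is a hypothetical wrapper added only for illustration.

```scala
import java.io.File
import java.net.URI

object ForeachSketch {
  // Simplified stand-in for the PR's check: rejects two URIs with the same file name.
  def validateNoDuplicateFileNames(allFiles: Seq[String]): Unit = {
    val byName = allFiles.groupBy(f => new File(new URI(f).getPath).getName)
    byName.foreach { case (name, uris) =>
      require(uris.size == 1, s"File name $name is shared by all of these URIs: $uris")
    }
  }

  // Runs the check only when a Python resource is present, without calling isDefined.
  def validateIfPython(pythonResource: Option[String], pySparkFiles: Seq[String]): Unit =
    pythonResource.foreach(_ => validateNoDuplicateFileNames(pySparkFiles))

  def main(args: Array[String]): Unit = {
    // No Python resource: the check is skipped even though the names collide.
    validateIfPython(None, Seq("local:///a/x.py", "local:///b/x.py"))
    // Python resource present and names are distinct: the check passes.
    validateIfPython(Some("pi.py"), Seq("local:///a/x.py", "local:///b/y.py"))
    println("validation finished")
  }
}
```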

val appName = sparkConf.getOption("spark.app.name").getOrElse("spark")
// The resource name prefix is derived from the application name, making it easy to connect the
// names of the Kubernetes resources from e.g. Kubectl or the Kubernetes dashboard to the
@@ -326,6 +346,7 @@ private[spark] object Client {
namespace,
sparkJars,
sparkFiles,
pySparkFiles,
sslOptionsProvider.getSslOptions)
Utils.tryWithResource(SparkKubernetesClientFactory.createKubernetesClient(
master,
@@ -346,16 +367,26 @@ private[spark] object Client {
appName,
kubernetesResourceNamePrefix,
kubernetesAppId,
mainAppResource,
pythonResource,
mainClass,
sparkConf,
appArgs,
sparkJars,
sparkFiles,
waitForAppCompletion,
kubernetesClient,
initContainerComponentsProvider,
kubernetesCredentialsMounterProvider,
loggingPodStatusWatcher).run()
}
}
private def validateNoDuplicateFileNames(allFiles: Seq[String]): Unit = {
val fileNamesToUris = allFiles.map { file =>
(new File(Utils.resolveURI(file).getPath).getName, file)
}
fileNamesToUris.groupBy(_._1).foreach {
case (fileName, urisWithFileName) =>
require(urisWithFileName.size == 1, "Cannot add multiple files with the same name, but" +
s" file name $fileName is shared by all of these URIs: $urisWithFileName")
}
}
}
@@ -24,13 +24,19 @@ private[spark] trait ContainerLocalizedFilesResolver {
def resolveSubmittedAndRemoteSparkJars(): Seq[String]
def resolveSubmittedSparkJars(): Seq[String]
def resolveSubmittedSparkFiles(): Seq[String]
def resolveSubmittedPySparkFiles(): Seq[String]
def resolvePrimaryResourceFile(): String
}

private[spark] class ContainerLocalizedFilesResolverImpl(
sparkJars: Seq[String],
sparkFiles: Seq[String],
pySparkFiles: Seq[String],
primaryPyFile: String,
jarsDownloadPath: String,
filesDownloadPath: String) extends ContainerLocalizedFilesResolver {
filesDownloadPath: String
) extends ContainerLocalizedFilesResolver {
Comment:
Move this line up - notice the diff from what this had before.



override def resolveSubmittedAndRemoteSparkJars(): Seq[String] = {
sparkJars.map { jar =>
@@ -53,16 +59,33 @@ private[spark] class ContainerLocalizedFilesResolverImpl(
resolveSubmittedFiles(sparkFiles, filesDownloadPath)
}

private def resolveSubmittedFiles(files: Seq[String], downloadPath: String): Seq[String] = {
files.map { file =>
val fileUri = Utils.resolveURI(file)
Option(fileUri.getScheme).getOrElse("file") match {
case "file" =>
val fileName = new File(fileUri.getPath).getName
s"$downloadPath/$fileName"
case _ =>
file
}
override def resolveSubmittedPySparkFiles(): Seq[String] = {
def filterMainResource(x: String) = x match {
case `primaryPyFile` => None
case _ => Some(resolveFile(x, filesDownloadPath))
}
pySparkFiles.flatMap(x => filterMainResource(x))
}

override def resolvePrimaryResourceFile(): String = {
Option(primaryPyFile) match {
Comment:
Use Option.map. Never use match on Options:

Comment (Member Author):
Why not? Is that a Spark-specific Scala practice?

Comment:
Not just spark-specific but seems to be the standard across all of Scala.

Comment:
See https://www.scala-lang.org/api/current/scala/Option.html

"The most idiomatic way to use an scala.Option instance is to treat it as a collection or monad and use map,flatMap, filter, or foreach... A less-idiomatic way to use scala.Option values is via pattern matching"
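
As a hedged illustration of the idiom quoted above, the method under review could be written with `Option.map` and `getOrElse` instead of a pattern match. This is a sketch, not the PR's final code: `resolveFile` here is a simplified stand-in that uses `java.net.URI` rather than Spark's `Utils.resolveURI`, and the download path is passed as a parameter so the example has no Spark dependency.

```scala
import java.io.File
import java.net.URI

object OptionMapSketch {
  // Simplified stand-in for the PR's resolveFile: files with a "file" scheme (or no
  // scheme) are mapped into the download directory, other URIs are returned unchanged.
  def resolveFile(file: String, downloadPath: String): String = {
    val uri = new URI(file)
    Option(uri.getScheme).getOrElse("file") match {
      case "file" => s"$downloadPath/${new File(uri.getPath).getName}"
      case _ => file
    }
  }

  // Option(...) turns a null primary file into None; map resolves it when present,
  // and getOrElse supplies the empty-string default used in the diff.
  def resolvePrimaryResourceFile(primaryPyFile: String, filesDownloadPath: String): String =
    Option(primaryPyFile).map(resolveFile(_, filesDownloadPath)).getOrElse("")

  def main(args: Array[String]): Unit = {
    println(resolvePrimaryResourceFile("file:///tmp/pi.py", "/var/spark-data/spark-files"))
    println(resolvePrimaryResourceFile("local:///opt/spark/pi.py", "/var/spark-data/spark-files"))
  }
}
```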

case None => ""
case Some(p) => resolveFile(p, filesDownloadPath)
}
}

private def resolveFile(file: String, downloadPath: String) = {
val fileUri = Utils.resolveURI(file)
Option(fileUri.getScheme).getOrElse("file") match {
case "file" =>
val fileName = new File(fileUri.getPath).getName
s"$downloadPath/$fileName"
case _ =>
file
}
}

private def resolveSubmittedFiles(files: Seq[String], downloadPath: String): Seq[String] = {
files.map { file => resolveFile(file, downloadPath) }
}
}