Python Bindings for launching PySpark Jobs from the JVM (v1) #351
Conversation
def run(
    sparkConf: SparkConf,
    mainAppResource: String,
    mainClass: String,
    appArgs: Array[String]): Unit = {
  val sparkJars = sparkConf.getOption("spark.jars")
  val isPython = mainAppResource.endsWith(".py")
  val sparkJars = if (isPython) Array.empty[String] else {
Would it be the case that people upload spark.jars when running a PySpark job?
It could happen for SQL UDFs, perhaps?
I think staging jars is a possible use case, although I have never done it. Seems OK to allow jar list to be set non-empty for now
(y) This is on my radar to look at - thanks a lot for submitting this.
</resource>
</resources>
</configuration>
</execution>
This part is what is incomplete for making fully fledged integration tests. It seems that we need to mimic the environment created by make_distribution.sh, which is done with the following script:
# Make pip package
if [ "$MAKE_PIP" == "true" ]; then
echo "Building python distribution package"
pushd "$SPARK_HOME/python" > /dev/null
python setup.py sdist
popd > /dev/null
else
echo "Skipping building python distribution package"
fi
A similar environment will need to be mimicked for testing R bindings. Any recommendations would be great :)
What do the Pyspark tests do?
They modify the mainClass to PythonRunner and pass in custom arguments to test against a locally baked PySpark test file. The file in the test environment is pi.py.
Further tests I wish to add include calling other Python files, but I am bottlenecked by the environment atm.
IIRC, you had concerns about additional container size - but should we consider folding the python-specific images into the standard spark images, to avoid the need for specifying special images? What is the size impact?
@erikerlandson That could be something to look into. The Python environment that I am loading in doubles the size of the driver image to 573 MB from the original 258 MB.
childArgs += args.primaryResource
childArgs += args.mainClass
if (args.isPython) {
  childArgs += args.primaryResource
nit: could factor out childArgs += args.primaryResource
Client.main() does val mainAppResource = args(0), so it is necessary to point to the mainAppResource there.
childArgs += "org.apache.spark.deploy.PythonRunner" | ||
childArgs += args.pyFiles | ||
} | ||
else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think scala-style expected to be } else {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted
@@ -83,7 +86,14 @@ private[spark] class Client(
  def run(): Unit = {
    validateNoDuplicateFileNames(sparkJars)
    validateNoDuplicateFileNames(sparkFiles)

    if (isPython) {validateNoDuplicateFileNames(pySparkFiles)}
    val arguments = if (isPython) pySparkFiles match {
Wonder if this could be factored to scale out more cleanly. For example, if we add R next, is that going to be a new third layer of arguments?
I don't see --r-files as a submission type so it is something to look into.
val launchTime = System.currentTimeMillis
val sparkFiles = sparkConf.getOption("spark.files")
  .map(_.split(","))
  .getOrElse(Array.empty[String])
val pySparkFiles: Array[String] = if (isPython) {
  appArgs(0) match {
    case null => Array(mainAppResource)
Keying in on a null return is a Scala no-no. Recommend if (appArgs.isEmpty) ... or some other test.
Noted
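For illustration, a minimal sketch of the suggested shape (this is an assumption about the refactor, not the code that landed in the PR):

```scala
// Sketch only: key off the argument list instead of matching on null.
// appArgs(0) is assumed to carry the comma-separated --py-files value
// when spark-submit sets it.
val pySparkFiles: Array[String] = if (isPython) {
  appArgs.headOption
    .flatMap(Option(_))                                  // guards the possibly-null first slot
    .map(files => mainAppResource +: files.split(","))
    .getOrElse(Array(mainAppResource))
} else {
  Array.empty[String]
}
```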
@@ -302,12 +326,17 @@ private[spark] object Client {
  val namespace = sparkConf.get(KUBERNETES_NAMESPACE)
  val master = resolveK8sMaster(sparkConf.get("spark.master"))
  val sslOptionsProvider = new ResourceStagingServerSslOptionsProviderImpl(sparkConf)
  // No reason to distribute python files that are locally baked into Docker image
  def filterByFile(pFiles: Array[String]) : Array[String] = {
    val LocalPattern = "(local://)(.*)"
Is local:// purely reserved for things that would already be on the image?
yes, if there are other patterns, let me know
 * environmental variables in the driver-pod.
 */
private[spark] trait DriverPodKubernetesFileMounter {
  def addPySparkFiles(mainAppResource: String, pythonFiles: List[String],
This seems to be Python specific, but its trait name is not. Should the trait name include ...Python... somewhere? Or should this be folded into a more general file mounter?
This will be a more general file mounter for R files as well.
Have you considered including this in SubmittedDependencyUploader or re-using that somehow?
Never mind - I misunderstood the name of the class and thought it was also uploading the files to the resource staging server as well.
@@ -301,11 +301,14 @@ class ClientV2Suite extends SparkFunSuite with BeforeAndAfter {
  APP_NAME,
  APP_RESOURCE_PREFIX,
  APP_ID,
  null,
Use Option[T] for things that may or may not be set.
Noted. Modifying nulls to Nil and empty strings where appropriate.
 * the filesDownloadPath has been defined. The file-names are then stored in the
 * environmental variables in the driver-pod.
 */
private[spark] trait DriverPodKubernetesFileMounter {
With the V2 file staging server, do we need special code for staging python files?
cc/ @mccheah
They're just added to spark.files and that should suffice.
I think there's a legitimate question of whether we want these files to be deployed in the same location as where spark.files is deployed. I think this is fine for now, but we have split the jars out, so it seems strange that we're splitting in one instance but bundling into the same directory in this case.
Really excited to see this. Thanks @ifilonenko!
Yes, an awesome contribution! @ifilonenko
private[spark] class DriverPodKubernetesFileMounterImpl(filesDownloadPath: String)
  extends DriverPodKubernetesFileMounter {
  val LocalPattern = "(local://)(.*)".r
We've been using Utils.resolveURI(uri).getScheme match - see KubernetesFileUtils.
Noted
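A rough sketch of the scheme-based approach (hedged: the helper names are illustrative; Utils is Spark's org.apache.spark.util.Utils):

```scala
import org.apache.spark.util.Utils

// Sketch only: resolve the URI scheme instead of matching a local:// regex.
def isContainerLocalFile(file: String): Boolean = {
  Option(Utils.resolveURI(file).getScheme).contains("local")
}

// No reason to distribute python files that are locally baked into the Docker image.
def filterByFile(pFiles: Array[String]): Array[String] = {
  pFiles.filterNot(isContainerLocalFile)
}
```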
Main change I would like to see is to try to merge the container file resolution logic with ContainerLocalizedFilesResolver.
  .map(_.split(","))
  .getOrElse(Array.empty[String]) ++
  Option(mainAppResource)
    .filterNot(_ == SparkLauncher.NO_RESOURCE)
    .toSeq
    .toSeq }
} should go on the next line.
val launchTime = System.currentTimeMillis
val sparkFiles = sparkConf.getOption("spark.files")
  .map(_.split(","))
  .getOrElse(Array.empty[String])
val pySparkFiles: Array[String] = if (isPython) {
I'm starting to wonder if we want our arguments to be similar to CLI arguments. We can then adjust SparkSubmit.scala to conform to the contract here. For example, we could expect our main method to have arguments like this:
org.apache.spark.deploy.kubernetes.Client --primary-resource <resource> --py-files <pyfiles>
Basically we can reformat Client.main's contract to expect named arguments, make SparkSubmit pass us named arguments, and parse them here accordingly.
@ifilonenko thoughts on this?
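A rough sketch of what parsing such a named-argument contract could look like (the flag names and parser are illustrative assumptions, not the interface this PR implements):

```scala
// Sketch only: parse "--primary-resource <resource> --py-files <pyfiles>" style
// arguments passed from SparkSubmit into a simple lookup table.
def parseClientArguments(args: Array[String]): Map[String, String] = {
  args.sliding(2, 2).collect {
    case Array(flag, value) if flag.startsWith("--") =>
      flag.stripPrefix("--") -> value
  }.toMap
}

// parseClientArguments(Array("--primary-resource", "pi.py", "--py-files", "deps.py"))
// => Map("primary-resource" -> "pi.py", "py-files" -> "deps.py")
```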
/**
 * Trait that is responsible for providing full file-paths dynamically after
 * the filesDownloadPath has been defined. The file-names are then stored in the
 * environmental variables in the driver-pod.
Given this description, can we use ContainerLocalizedFileResolver?
FileMounter is also a misleading name since this doesn't actually mount the files themselves, but rather it resolves the paths to them.
@@ -301,11 +301,14 @@ class ClientV2Suite extends SparkFunSuite with BeforeAndAfter {
  APP_NAME,
  APP_RESOURCE_PREFIX,
  APP_ID,
  "",
We should add tests that specifically check the Python logic.
Added unit tests
Can you update ClientV2Suite to ensure that we're getting an instance of the file mounter and using it to mount the PySpark files?
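One hedged way to check that in the suite (a Mockito sketch; the mock and constant names are illustrative, not the suite's actual identifiers):

```scala
import org.mockito.Mockito.verify
import org.mockito.Matchers.{any, eq => mockitoEq}

// Sketch only: assert the Client handed the resolved PySpark files to the file mounter.
verify(fileMounter).addPySparkFiles(
  mockitoEq(RESOLVED_PYSPARK_PRIMARY_FILE),
  mockitoEq(RESOLVED_PYSPARK_FILES),
  mockitoEq(DRIVER_CONTAINER_NAME),
  any[PodBuilder])
```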
    case FilePattern(_, file_name) => filesDownloadPath + "/" + getName(file_name, '/')
    case _ => filesDownloadPath + "/" + getName(file, '/')
  }
  def pythonFileLocations(pFiles: List[String], mainAppResource: String) : String = {
These operations are mostly shared with ContainerLocalizedFileResolver. Can we use that instead?
I have refactored to do this
    recFileLoc(pFiles).mkString(",")
  }
  override def addPySparkFiles(mainAppResource: String, pythonFiles: List[String],
    mainContainerName: String,
Indentation here - put each argument on its own line. See argument lists for the constructor of DriverInitContainerComponentsProviderImpl.
noted
    originalPodSpec: PodBuilder): PodBuilder = {
  originalPodSpec
    .editSpec()
    .editMatchingContainer(new ContainerNameEqualityPredicate(mainContainerName))
We've been indenting these to make it easier to track where the objects begin and end. See for example SparkPodInitContainerBootstrapImpl.
ENV PYSPARK_DRIVER_PYTHON python
ENV PYTHONPATH ${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${PYTHONPATH}

CMD SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && \
The classpath environment variables may mean something different in PySpark, but given that we might also want to be shipping jars for UDFs, this might still apply.
Exactly, I was confused about whether this would be necessary. This is an extension to the question about whether spark.jars should be empty
My understanding now is that we actually do need jars for SQL UDFs. cc @robert3005
Okay I will account for that by allowing submission of spark.jars in Client.scala
val TIMEOUT = PatienceConfiguration.Timeout(Span(2, Minutes))
val INTERVAL = PatienceConfiguration.Interval(Span(2, Seconds))
val SPARK_PI_MAIN_CLASS = "org.apache.spark.deploy.kubernetes" +
  ".integrationtest.jobs.SparkPiWithInfiniteWait"
val PYSPARK_PI_MAIN_CLASS = "org.apache.spark.deploy.PythonRunner"
val PYSPARK_PI_CONTAINER_LOCAL_FILE_LOCATION = "local:///opt/spark/" +
  "examples/src/main/python/pi.py"
Move the entire string to this line.
noted
rerun integration test please
Keep in mind that I'm glossing over some of the details of the submission client design because, as discussed in the SIG meeting this morning, it probably needs to be refactored and restructured anyway.
val exitCode = process.waitFor()
if (exitCode != 0) {
  // scalastyle:off println
  println(s"exitCode: $exitCode")
We can use logging here also.
    .getOrElse("/usr/bin/python")
  val builder = new ProcessBuilder(
    Seq(pythonExec, "setup.py", "sdist").asJava)
  builder.directory(new java.io.File(s"$DOCKER_BUILD_PATH/python"))
Use new File(DOCKER_BUILD_PATH.toFile(), "python").
test("Run PySpark Job on file from CONTAINER with spark.jar defined") { | ||
assume(testBackend.name == MINIKUBE_TEST_BACKEND) | ||
|
||
sparkConf.setJars(Seq(CONTAINER_LOCAL_HELPER_JAR_PATH)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A more interesting test could be to put a jar on the classpath that contains a UDF that the job needs. That could be left for follow-up work though.
launchStagingServer(SSLOptions(), None)
sparkConf.set(DRIVER_DOCKER_IMAGE,
  System.getProperty("spark.docker.test.driverImage", "spark-driver-py:latest"))
Chain the .set calls together.
  .map(_.split(","))
  .getOrElse(Array.empty[String]) ++
  Option(mainAppResource)
val isPython = mainAppResource.endsWith(".py")
Alluding to a remark from earlier - we might want to treat these arguments differently. For example we could take a command line argument that is a "language mode" and expect SparkSubmit to give us the right language mode and handle accordingly - e.g. Scala, Python, R. We have control over the arguments that SparkSubmit.scala sends us, and so we should encode the arguments clearly if we can.
True, but for R the MainResource will also be the primary file, and there are no --r-files, so the arguments logic is only for Python. I think it is simple enough that refactoring the arguments might not be necessary. Something to consider, but I agree with what you are saying.
I'm more wary of the fact that we're matching against Nil here when we could be doing this in a type-safe way.
The reason I match on Nil is because the appArgs being passed in are (null 500) from spark-submit, because --py-files is null. The first value will always be the py-files, and the rest are the arguments passed into the file itself. I don't see problems with that exactly.
}

private[spark] class DriverPodKubernetesFileMounterImpl()
  extends DriverPodKubernetesFileMounter {
I like the idea of making this generic - we might want to put the submitted jars in here too. It's worth noting that for the next refactor pass.
In the above case, I only load file:// into spark-files, but resolve paths and mount them for the purpose of the Docker image environment variables using this trait.
…rk-on-k8s/spark into branch-2.1-kubernetes
rerun integration tests please
rerun integration tests please
rerun unit tests please
override def answer(invocation: InvocationOnMock) : PodBuilder = {
  invocation.getArgumentAt(3, classOf[PodBuilder])
    .editSpec()
    .editMatchingContainer(new ContainerNameEqualityPredicate(
There's no need to write out all of the specific PySpark logic here. We should prefer making this part of the test as simple as possible, and only checking the specifics in the file mounter suite.
How about just having this answer:
override def answer(invocation: InvocationOnMock): PodBuilder = {
  invocation.getArgumentAt(3, classOf[PodBuilder]).editMetadata().addToLabels("pyspark", "true").endMetadata()
}
Then just check that the pod has the given label.
I understand. But why not test the specific functionality that the properly resolved file names are mounted into the proper environment variables? It seems to be the entire purpose of this unit test, no?
That should be tested in the file mounter unit test. The Client class doesn't decide the environment variables to set - it trusts the submodule is providing the right configurations.
    any[String],
    any[PodBuilder])).thenAnswer( new Answer[Pod] {
  override def answer(invocation: InvocationOnMock) : Pod = {
    invocation.getArgumentAt(0, classOf[DriverInitContainerComponentsProvider])
Use the mock file mounter directly from here instead of calling to the init container components provider.
It was because PythonSubmissionResource wasn't a trait. That is a good point.
It's possible to use mocks to stub classes, but it's certainly not preferred. Mocks also can't be used to stub final methods or final classes so if we're mocking classes we have to make an assumption that we aren't doing those things - best to avoid the uncertainty entirely.
import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}

class PythonSubmissionResources(
Make this a trait and write a unit test. This will allow us to mock this out and simplify the ClientV2Suite.
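A rough sketch of the trait extraction being asked for, using the method names that appear in this diff (the exact signatures are assumptions):

```scala
// Sketch only: a trait over the Python submission resources so ClientV2Suite
// can mock it instead of constructing the concrete class.
private[spark] trait PythonSubmissionResources {
  def pySparkFiles: Array[String]
  def arguments: Array[String]
  def primarySparkResource(
    containerLocalizedFilesResolver: ContainerLocalizedFilesResolver): String
}
```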
  private val mainAppResource: String,
  private val appArgs: Array[String] ) {

  private val pyFiles: Array[String] = Option(appArgs(0)) match {
Use Option.getOrElse instead of match. Avoid matching with Options.
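For example, a minimal sketch of the map/getOrElse form (the fallback value is assumed from the earlier null case; this mirrors the direction the later revision of this diff takes):

```scala
// Sketch only: no pattern match on the Option.
private val pyFiles: Array[String] =
  Option(appArgs(0))
    .map(a => mainAppResource +: a.split(","))
    .getOrElse(Array(mainAppResource))
```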
def pySparkFiles: Array[String] = pyFiles

def arguments: Array[String] =
Always wrap multi-line methods with curly braces, even if it's just a single statement.
    resolvedPySparkFiles: String,
    driverContainerName: String,
    driverPodBuilder: PodBuilder) : Pod = {
  initContainerComponentsProvider
I'm uncertain about the design where the init container components provider is used from here. The indirection is becoming tricky to follow. Shouldn't the init container components provider be creating this class, with the addPySparkFiles method then defined on this class?
We still wanted to make this generic, though. After seeing this, I think we can make this generic later - it's probably simpler to put everything Python related into this class, including the file mounting / resolution.
Good point actually - I should pass in the FileMounter instead of the InitContainer. I will refactor for that.
  }
  validateNoDuplicateFileNames(sparkJars)
  validateNoDuplicateFileNames(sparkFiles)
  if (pythonResource.isDefined) {validateNoDuplicateFileNames(pySparkFiles)}
Use Option.foreach.
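A one-line sketch of the suggestion (assuming pythonResource is an Option):

```scala
// Sketch only: run the validation only when a Python resource is present.
pythonResource.foreach { _ => validateNoDuplicateFileNames(pySparkFiles) }
```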
  jarsDownloadPath: String,
  filesDownloadPath: String) extends ContainerLocalizedFilesResolver {
  filesDownloadPath: String
  ) extends ContainerLocalizedFilesResolver {
Move this line up - notice the diff from what this had before.
}

override def resolvePrimaryResourceFile(): String = {
  Option(primaryPyFile) match {
Use Option.map. Never use match on Options:
Why not? Is that a Spark-specific Scala practice?
Not just spark-specific but seems to be the standard across all of Scala.
See https://www.scala-lang.org/api/current/scala/Option.html
"The most idiomatic way to use an scala.Option instance is to treat it as a collection or monad and use map,flatMap, filter, or foreach... A less-idiomatic way to use scala.Option values is via pattern matching"
def primarySparkResource (containerLocalizedFilesResolver: ContainerLocalizedFilesResolver)
  : String = containerLocalizedFilesResolver.resolvePrimaryResourceFile()

def driverPod(
There might be a better name here - try for something that indicates this is Pyspark specific.
@@ -169,30 +189,85 @@ class ClientV2Suite extends SparkFunSuite with BeforeAndAfter {
    .endMetadata()
  }
})
when(initContainerComponentsProvider.provideContainerLocalizedFilesResolver())
when(initContainerComponentsProvider.provideContainerLocalizedFilesResolver(any[String]))
Don't match on any here - since it's a simple type (String) we should be able to capture the specific argument we're looking for.
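A sketch of stubbing against the concrete value instead (the constant and mock names here are illustrative, not the suite's actual identifiers):

```scala
import org.mockito.Mockito.when

// Sketch only: capture the exact primary-resource string rather than any[String].
when(initContainerComponentsProvider
    .provideContainerLocalizedFilesResolver(RESOLVED_PYSPARK_PRIMARY_FILE))
  .thenReturn(containerLocalizedFilesResolver)
```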
@erikerlandson @mccheah PTAL
    containerLocalizedFilesResolver: ContainerLocalizedFilesResolver) : String =
  containerLocalizedFilesResolver.resolvePrimaryResourceFile()

override def driverPod(
Can we rename this method?
  .editSpec()
    .editMatchingContainer(new ContainerNameEqualityPredicate(
      invocation.getArgumentAt(2, classOf[String])))
    .addNewEnv()
This is still too complex - just add to the labels or the annotations in the metadata.
But the test is checking if the matching container has its environment variables changed as a result of the function, which is what I am doing here.
The Client is unaware that the underlying implementation is specifically setting environment variables - so if the Client is unaware, the Client's test should be agnostic to that as well. All the test cares about is "Did the client use the files mounter to alter the driver pod in some way?".
private val SPARK_JARS = Seq.empty[String]
private val JARS_DOWNLOAD_PATH = "/var/data/spark-jars"
private val FILES_DOWNLOAD_PATH = "/var/data/spark-files"
private val localizedFilesResolver = new ContainerLocalizedFilesResolverImpl(
Don't use the impl - use a mock object.
Generally a given test should only use the concrete implementation for the class that is under test. All other logical units should be mocks.
Noted, addressed in most recent commit
  .withNewSpec()
    .addToContainers(driverContainer)
  .endSpec()
private val driverFileMounter = new DriverInitContainerComponentsProviderImpl(
Don't use the impl - use a mock object.
rerun unit tests please
  private val appArgs: Array[String] ) extends PythonSubmissionResources {

  private val pyFiles: Array[String] = {
    (Option(appArgs(0)) map (a => mainAppResource +: a.split(",")))
Put dots between (Option(appArgs(0)) and map. Space delimiting for methods is discouraged in general and I think there are a few other places where this is done.
Should this be changed in Client.scala as well, or just here?
Change it everywhere
PR moved to #364
What changes were proposed in this pull request?
The changes proposed in this pull request add PySpark support and new PySpark-specific Docker images. These images differ by including Python and PySpark specific environment variables. The user entry point also differs for driver-py, as you must include the location of the primary PySpark file and the distributed py-files in addition to the driver args.
Example Spark Submit
This is an example spark-submit that uses the custom PySpark Docker images and distributes the staged sort.py file across the cluster. The entry point for the driver is: org.apache.spark.deploy.PythonRunner <FILE_DOWNLOADS_PATH>/pi.py <FILE_DOWNLOADS_PATH>/sort.py 100
How was this patch tested?
This was fully tested by building a make_distribution environment and running on a local minikube cluster with a single executor. The following command is an example submission:
Integration and Unit tests have been added.
Future Versions of this PR
Launching JVM from Python (log issue)
MemoryOverhead testing (OOMKilled errors)