Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Spark Job #1467

Merged
merged 12 commits into from
Feb 14, 2019
Merged

Add Spark Job #1467

merged 12 commits into from
Feb 14, 2019

Conversation

holdenk
Copy link
Contributor

@holdenk holdenk commented Sep 3, 2018

Initial work in progress attempt at adding Spark to kubeflow using the spark-on-k8s-operator as a base starting point. This is super early but I'd love peoples feedback on the direction with this.

cc @texasmichelle

Known TODOs:

  • Tests that run on more than just my specific directory layout
  • Documentation
  • Cleanup

This change is Reviewable

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

"name": "spark",
"apiVersion": "0.0.1",
"kind": "ksonnet.io/parts",
"description": "An empty package used as a stub for new packages.\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Holden's awesome Spark Job prototype\n"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<3 :D

@pdmack
Copy link
Member

pdmack commented Sep 4, 2018

/ok-to-test

@holdenk did you autoformat the *sonnet files yet? Guess we'll find out...

@pdmack
Copy link
Member

pdmack commented Sep 4, 2018

winner, winner...

Please use scripts/autoformat_jsonnet.sh to fix up the sonnet files.

Copy link

@inc0 inc0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just few nits from me. Question, these are for spark worker, is controller cluster also in plans? Awesome work:)

},
labels: {
"app.kubernetes.io/name": name + "-sparkoperator",
"app.kubernetes.io/version": "v2.3.1-v1alpha1",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a parameter with default ver?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch,

{
kind: "ServiceAccount",
name: name + "-spark",
namespace: "default",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be param

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for catching.

@googlebot
Copy link

CLAs look good, thanks!

@holdenk
Copy link
Contributor Author

holdenk commented Oct 19, 2018

Hey @texasmichelle & @kunmingg I'm wondering about the best way to test this operator - looking at the test_deploy.py script it seems to use KF components @ master but since this is a PR I'm not sure that would work well.

@holdenk holdenk changed the title [WIP] Add Spark Job - not ready for merge / full review [WIP] Add Spark Job - not ready for merge Oct 19, 2018
@kunmingg
Copy link
Contributor

@holdenk

  1. You can edit tmp logic change in test_deploy.py like changing version to use.
  2. presubmit test will then test under your new logic.
  3. When test looks good you can revert change in 1. and merge this PR.

@holdenk
Copy link
Contributor Author

holdenk commented Oct 19, 2018 via email

@kunmingg
Copy link
Contributor

@holdenk
For tests running on GKE I believe now they are covered by https://github.com/kubeflow/kubeflow/blob/master/prow_config.yaml#L15
which should deploy components from current PR code.
It would be easier to add new test here.

test_deploy.py is more focusing on minikube integration now.

"nodes",
],
verbs: [
"get",
Copy link

@krisnova krisnova Dec 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good girl...

While Nodes are technically read-only this is still decent practice as the object is so weird in general. Do we need specific Node information? If so what?

I am wondering if this is why we are using ClusterRole instead of a Role

Just a nit/question - non blocking because IDGAF

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So ClusterRole versus Role is now user configurable, if folks don't need to run jobs outside of the namespace where they created the operator we'll just do a Role, but if they want to have the operator and jobs sit in different namespaces we use clusterrole.

},
},
operatorClusterRole:: {
kind: "ClusterRole",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ClusterRole instead of Role? Looks like we are binding on a single namespace only, and opening up the broader permissions might be unnecessary?

https://kubernetes.io/docs/reference/access-authn-authz/rbac/#role-and-clusterrole

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not super sure since the RBACS are based on the ones from the spark-operator project from GCP folks, my gut is that if we wanted to support having jobs in different namesapces wed need the operator have a ClusterRole but I'm not super sure. The RBAC file for the spark-operator is at https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/8c7fdbb306dfd656093c1b2a4ede901d651c9bd5/manifest/spark-operator-rbac.yaml , I could try and scope it down though and see if it still works?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I didn't realize it was a port from the spark-operator project. I can ask there. It's not necessarily a concern, just more of wandering why we needed the broader scope. If we are spanning namespaces that makes sense. No change needed, thanks for clarifying.

@holdenk
Copy link
Contributor Author

holdenk commented Jan 3, 2019

/retest

1 similar comment
@holdenk
Copy link
Contributor Author

holdenk commented Jan 4, 2019

/retest

@holdenk holdenk changed the title Add Spark Job - not ready for merge Add Spark Job Feb 9, 2019
@holdenk
Copy link
Contributor Author

holdenk commented Feb 9, 2019

Looks like the spark operator is applying successfully now, so it should be ready for review again. cc @jlewi @texasmichelle

Copy link
Contributor

@jlewi jlewi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 7 files at r1, 1 of 5 files at r4, 1 of 5 files at r10, 1 of 7 files at r11, 8 of 8 files at r13.
Reviewable status: all files reviewed, 17 unresolved discussions (waiting on @gaocegege, @holdenk, @inc0, @jlewi, @kris-nova, and @pdmack)


kubeflow/spark/all.libsonnet, line 101 at r4 (raw file):

Previously, kris-nova (Kris Nova) wrote…

Ah I didn't realize it was a port from the spark-operator project. I can ask there. It's not necessarily a concern, just more of wandering why we needed the broader scope. If we are spanning namespaces that makes sense. No change needed, thanks for clarifying.

I think the pattern we want is to install the operator in one namespace e.g. "kubeflow-system" and users will use a different namesapce.

So I do think we need a ClusterRole because the operator will want to claim jobs in other namespaces.


kubeflow/spark/all.libsonnet, line 26 at r12 (raw file):

Previously, holdenk (Holden Karau) wrote…

Chatted with @texasmichelle about this and the weird behaviour I had scene back with the minikube tests and it make sense now so I'll keep as-is

Getting namespace from params is mostly a legacy. There was a time when ksonnet didn't support getting the namespace from the environment. So as a workaround we got namespace from params.

The current pattern is to always get namespace from environment and if users want to deploy in a specific namespace they should create a new environment.


kubeflow/spark/parts.yaml, line 5 at r1 (raw file):

Previously, holdenk (Holden Karau) wrote…

<3 :D

Maybe add a link to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator ?


kubeflow/spark/README.md, line 2 at r13 (raw file):

A very early attempt at allowing Apache Spark to be used with Kubeflow.
Starts a container to run the driver program in, and the rest is up to the Spark on K8s integration.

Add a link to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator if that's what its based on?


testing/workflows/components/workflows.libsonnet, line 281 at r12 (raw file):

Previously, holdenk (Holden Karau) wrote…

So this part seems to be triggered outside of the minikube tests.

That being said, I think it might make sense, in the future, to test this and other operator one-by-one on minikube, what do you think?

It might make sense to test on minikube one by one. That said the minikube test is probably in need of some major updating. So I don't know how useful this will be.

But I don't have a strong opinion either way.


testing/deploy_utils.py, line 100 at r10 (raw file):

Previously, holdenk (Holden Karau) wrote…

So I figured that hard coding master was a bad idea and installing Spark as part of the e2e minikube tests make sure it at least can be installed. I don't feel strongly about this since we don't need it for the full workflow tests so I'm happy to revert if this complicates matters.

The minikube test isn't using kfctl so it really isn't testing what we want anymore. So I'd probably recommend not worrying about it.


testing/deploy_utils.py, line 115 at r12 (raw file):

Previously, holdenk (Holden Karau) wrote…

I wish I knew why, but it is unrelated so I'll get rid of this.

Make sense; per comment above minikube test is in't using kfctl so its not really testing what we want.

I don't know if this comment is even still relevant.


testing/test_deploy.py, line 129 at r12 (raw file):

Previously, holdenk (Holden Karau) wrote…

TODO: revert this

Still planning on reverting this?

@jlewi
Copy link
Contributor

jlewi commented Feb 10, 2019

Thanks @holdenk. It looks like the spark apply test is still failing

+ ks apply default -c spark-operator --verbose
level=debug msg="setting log verbosity" verbosity-level=1
level=debug msg="loading application configuration from /mnt/test-data-volume/kubeflow-presubmit-kfctl-1467-39b503c-5575-5aca/kfctl-5aca/ks_app"
level=debug msg="creating ks pipeline for environment \"default\""
level=debug msg="building objects" action=pipeline module-name=/
level=debug msg="jsonnet evaluate snippet" elapsed=59.03316ms name=applyGlobals
level=error msg="find objects: open /mnt/test-data-volume/kubeflow-presubmit-kfctl-1467-39b503c-5575-5aca/kfctl-5aca/ks_app/environments/default/params.li
bsonnet: no such file or directory"

I think the problem is you aren't creating the default environment.
The default environment is created by the kfctl-apply-k8s step. So I think you want to make the spark-operator step depend on that step.

@jlewi
Copy link
Contributor

jlewi commented Feb 11, 2019

/ok-to-test

@holdenk
Copy link
Contributor Author

holdenk commented Feb 11, 2019

@jlewi so I think the apply operator is succeeding, the part which is failing is the part which depended on the Python script so I don't think it the env issue (although that was possibly the issue before).

…link to upstream base operator in doc, remove downstream job test since it's triggered in both minikube and kfctl tests and we don't want to test it in minikube right now
Copy link
Contributor Author

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 6 files reviewed, 13 unresolved discussions (waiting on @gaocegege, @holdenk, @inc0, @jlewi, @kris-nova, and @pdmack)


kubeflow/spark/all.libsonnet, line 82 at r2 (raw file):

Previously, holdenk (Holden Karau) wrote…

thanks for catching.

Done.


kubeflow/spark/all.libsonnet, line 247 at r2 (raw file):

Previously, holdenk (Holden Karau) wrote…

Yes, good catch,

Done.


kubeflow/spark/all.libsonnet, line 101 at r4 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

I think the pattern we want is to install the operator in one namespace e.g. "kubeflow-system" and users will use a different namesapce.

So I do think we need a ClusterRole because the operator will want to claim jobs in other namespaces.

Ok, I'll switch it to clusterrole


kubeflow/spark/parts.yaml, line 5 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Maybe add a link to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator ?

Done

Copy link
Contributor Author

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 6 files reviewed, 10 unresolved discussions (waiting on @gaocegege, @inc0, @jlewi, @kris-nova, and @pdmack)


kubeflow/spark/all.libsonnet, line 140 at r4 (raw file):

Previously, holdenk (Holden Karau) wrote…

So ClusterRole versus Role is now user configurable, if folks don't need to run jobs outside of the namespace where they created the operator we'll just do a Role, but if they want to have the operator and jobs sit in different namespaces we use clusterrole.

Resolved from @jlewi's comment


kubeflow/spark/README.md, line 2 at r13 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Add a link to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator if that's what its based on?

Done.


testing/workflows/components/kfctl_test.jsonnet, line 221 at r10 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Did you modify kfctl to add this?

Backed out this change anyways so it shouldn't matter.


testing/workflows/components/kfctl_test.jsonnet, line 235 at r12 (raw file):

Previously, holdenk (Holden Karau) wrote…

Yeah I can revert those if we want, I figured it made sense to see Spark installed on minikube even if we only used the operator on the full version in e2e workflow.

Reverted changes to Python helper scripts.


testing/spark_temp/simple_test.sh, line 1 at r12 (raw file):

Previously, holdenk (Holden Karau) wrote…

This was for local testing, I can remove it.

Removed


testing/workflows/components/workflows.libsonnet, line 281 at r12 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

It might make sense to test on minikube one by one. That said the minikube test is probably in need of some major updating. So I don't know how useful this will be.

But I don't have a strong opinion either way.

Done. For now I took this out, it is used beyond e2e minikube but since I don't have this wired up to also work on e2e minikube tests doesn't make sense to put this in.

Copy link
Contributor

@jlewi jlewi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 0 of 6 files reviewed, 5 unresolved discussions (waiting on @gaocegege, @inc0, @jlewi, @kris-nova, and @pdmack)

@jlewi
Copy link
Contributor

jlewi commented Feb 12, 2019

Woo Hoo!
Thanks for pushing this through.
/lgtm
/approve

@holdenk
Copy link
Contributor Author

holdenk commented Feb 13, 2019

Test failures look unrelated, /retest

@jlewi
Copy link
Contributor

jlewi commented Feb 13, 2019

We were having quota issues earlier. Should be fixed now.

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@holdenk
Copy link
Contributor Author

holdenk commented Feb 14, 2019

It's still failing before getting to the Spark specific code, and looks like a quota failure so lets try:
/retest

@holdenk
Copy link
Contributor Author

holdenk commented Feb 14, 2019

Although IDK if the bot listens to me might need someone else to tell it to retest

@holdenk
Copy link
Contributor Author

holdenk commented Feb 14, 2019

Ok the spark jobs passed but the notebooks test failed ?

@holdenk
Copy link
Contributor Author

holdenk commented Feb 14, 2019

/retest

@holdenk
Copy link
Contributor Author

holdenk commented Feb 14, 2019

/meow

@k8s-ci-robot k8s-ci-robot merged commit f5dc021 into kubeflow:master Feb 14, 2019
saffaalvi pushed a commit to StatCan/kubeflow that referenced this pull request Feb 11, 2021
* Add a Spark operator to Kubeflow along with integration tests.

Start adding converted spark operator elements

Can generate empty service account for Spark

Create the service account for the spark-operator.

Add clusterRole for Spark

Add cluster role bindings for Spark

Add deployment (todo cleanup name/image)

Can now launch spark operator

Put in a reasonable default for namespace (e.g default not null) and make the image used for spark-operator configurable

Start working to add job type

We can now launch and operator and launch a job, but the service accounts don't quite line up. TODO(holden) refactor the service accounts for the job to only be created in the job and move sparkJob inside of all.json as well then have an all / operator / job entry point in all.json maybe?

Add two hacked up temporary test scripts for use during dev (TODO refactor later into proper workflow)

Able to launch a job fixed coreLimit and added job arguments. Remaining TODOs are handling of nulls & svc account hack + test cleanup.

Start trying to re-organize the operator/job

Fix handling of optional jobArguments and mainClass and now it _works_ :)

Auto format the ksonnet.

Reviewer feedback: switch description of Spark operator to something meaningful, use sparkVersion param instead of hard coded v2.3.1-v1alpha1, and fix hardcoded namespace.

Clarify jobName param, remove Fix this since it has been integrated into all.libsonnet as intended.

CR feedback: change description typo and add opitonal param to spark operator for sparkVersion

Start trying to add spark tests to test_deploy.py

At @kunmingg suggestion Revert "Start trying to add spark tests to test_deploy.py" to focus on prow tests.

This reverts commit 912a763.

Start trying to add Spark to the e2e workflow for testing

Looks like the prow tests call into the python tests normally so Revert "At @kunmingg suggestion Revert "Start trying to add spark tests to test_deploy.py" to focus on prow tests."

This reverts commit 6c4c81f.

autoformat jsonnet

s/core/common/ and /var/log/syslog to README

Race condition on first deployment

Start adding SparkPI job to the workflow test.

Generate spark operator during CI as well.

Fix deploy kf indent

Already covered by deploy.

Install spark operator

Revert "Install spark operator"

This reverts commit cc559dd.

Test against the PR not master.

Fix string concat

Take spark-deploy out of workflows since cover in kf presub anyways.

Debug commit revert later.

idk whats going on for real.

hax

Ok lets use where the sym link was coming from idk.

Debug deploy kubeflow call...

Pritn in maint oo.

Specify a name.

name

Get all.

More debugging also why do we eddit app.yaml; directly.

don't gen common

import for debug

Just do spark-operator as verbose.

spelling

hmm namespace looked weird, lets run pytorch in verbose too so I can compare

put verbose at the end

Autoformat the json

Add a deployment scope and give more things a namespace

Format.

Gen pytorch and spark ops as verbose

idk wtf this is.

Don't deploy the spark job in the releaser test

no kfctl test either.

Just use name

We don't append any junk anymore

format json

Don't do spark in deploy_kubeflow anymore

Spark job deployment with workflows

Apply spark operator.

Add a sleep hack

Fix multi-line

add a working dir for the ks app

temp debug

garbage

specify working dir

Working dir was not happy, just cd cause why not

testdir not appDir

change to tests.testDir

Move operator deployment

Make sure we are in the ks_app?

Remove debugging and YOLO

90% less YOLO

Add that comma

Change deps

well CD seems to work in the other command so uhhh who knows?

Use runpath + pushd instead of kfctl generate

Just generate for now

Do both

Generate k8s

Install operator

Break down setting up the spark operator into different steps

We are in default rather than ghke

Use the run script to do the dpeloy

Change the namespace to stepsNamespace and add debug step cauise idk

Append the params to generate cmd

Remove params_str since we're doing list and a param of namespace

s/extends/extend/

Move params to the right place

Remove debug cluster step

Remove local test since we now use the regular e2e argo triggered tests.

Respond to the CR feedback

Fix paramterization of spark executor config.

Plumb through spark version to executor version label

Remove unecessary whitespace change in otherwise unmodified file.

* re-run autoformat

* default doesn't seem to exists anymore

* Debug the env list cause it changed

* re-run autoformat again

* Specify the env since env list shows default env is the only env present.

* Remove debug env list since the operator now works

* autofrmat and indent default

* Address CR feedback: remove deploymentscope and just use clusterole, link to upstream base operator in doc, remove downstream job test since it's triggered in both minikube and kfctl tests and we don't want to test it in minikube right now

* Take out the spark job from ther workflows in components test we just test the operator applies for now.

* Remove namespace as a param and just use the env.

* Fix end of line on namespace from ; to ,
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants