-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Spark Job #1467
Add Spark Job #1467
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here (e.g. What to do if you already signed the CLAIndividual signers
Corporate signers
|
kubeflow/spark/parts.yaml
Outdated
"name": "spark", | ||
"apiVersion": "0.0.1", | ||
"kind": "ksonnet.io/parts", | ||
"description": "An empty package used as a stub for new packages.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Holden's awesome Spark Job prototype\n"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<3 :D
/ok-to-test @holdenk did you autoformat the *sonnet files yet? Guess we'll find out... |
winner, winner... Please use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just few nits from me. Question, these are for spark worker, is controller cluster also in plans? Awesome work:)
kubeflow/spark/all.libsonnet
Outdated
}, | ||
labels: { | ||
"app.kubernetes.io/name": name + "-sparkoperator", | ||
"app.kubernetes.io/version": "v2.3.1-v1alpha1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be a parameter with default ver?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good catch,
kubeflow/spark/all.libsonnet
Outdated
{ | ||
kind: "ServiceAccount", | ||
name: name + "-spark", | ||
namespace: "default", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be param
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for catching.
CLAs look good, thanks! |
Hey @texasmichelle & @kunmingg I'm wondering about the best way to test this operator - looking at the |
|
Ok so I can point it at this branch. Is there a reason this isn't auto
pointed at the branch in PR builds (seems like it could give us bad results
on PRs in general)? Unless I'm missing something which is quite possible
since it's my first pass through this code.
…On Fri, Oct 19, 2018, 3:43 PM Kunming Qu ***@***.***> wrote:
@holdenk <https://github.com/holdenk>
1. You can edit tmp logic change in test_deploy.py like changing
version to use.
2. presubmit test will then test under your new logic.
3. When test looks good you can revert change in 1. and merge this PR.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1467 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AADp9Yti_u4x1R_FDkw4E8_59xz5UbZZks5umlWugaJpZM4WX-c_>
.
|
@holdenk
|
"nodes", | ||
], | ||
verbs: [ | ||
"get", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good girl...
While Node
s are technically read-only this is still decent practice as the object is so weird in general. Do we need specific Node
information? If so what?
I am wondering if this is why we are using ClusterRole
instead of a Role
Just a nit/question - non blocking because IDGAF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So ClusterRole versus Role is now user configurable, if folks don't need to run jobs outside of the namespace where they created the operator we'll just do a Role, but if they want to have the operator and jobs sit in different namespaces we use clusterrole.
kubeflow/spark/all.libsonnet
Outdated
}, | ||
}, | ||
operatorClusterRole:: { | ||
kind: "ClusterRole", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ClusterRole
instead of Role
? Looks like we are binding on a single namespace only, and opening up the broader permissions might be unnecessary?
https://kubernetes.io/docs/reference/access-authn-authz/rbac/#role-and-clusterrole
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not super sure since the RBACS are based on the ones from the spark-operator project from GCP folks, my gut is that if we wanted to support having jobs in different namesapces wed need the operator have a ClusterRole
but I'm not super sure. The RBAC file for the spark-operator is at https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/8c7fdbb306dfd656093c1b2a4ede901d651c9bd5/manifest/spark-operator-rbac.yaml , I could try and scope it down though and see if it still works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I didn't realize it was a port from the spark-operator
project. I can ask there. It's not necessarily a concern, just more of wandering why we needed the broader scope. If we are spanning namespaces that makes sense. No change needed, thanks for clarifying.
/retest |
1 similar comment
/retest |
Looks like the spark operator is applying successfully now, so it should be ready for review again. cc @jlewi @texasmichelle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 7 files at r1, 1 of 5 files at r4, 1 of 5 files at r10, 1 of 7 files at r11, 8 of 8 files at r13.
Reviewable status: all files reviewed, 17 unresolved discussions (waiting on @gaocegege, @holdenk, @inc0, @jlewi, @kris-nova, and @pdmack)
kubeflow/spark/all.libsonnet, line 101 at r4 (raw file):
Previously, kris-nova (Kris Nova) wrote…
Ah I didn't realize it was a port from the
spark-operator
project. I can ask there. It's not necessarily a concern, just more of wandering why we needed the broader scope. If we are spanning namespaces that makes sense. No change needed, thanks for clarifying.
I think the pattern we want is to install the operator in one namespace e.g. "kubeflow-system" and users will use a different namesapce.
So I do think we need a ClusterRole because the operator will want to claim jobs in other namespaces.
kubeflow/spark/all.libsonnet, line 26 at r12 (raw file):
Previously, holdenk (Holden Karau) wrote…
Chatted with @texasmichelle about this and the weird behaviour I had scene back with the minikube tests and it make sense now so I'll keep as-is
Getting namespace from params is mostly a legacy. There was a time when ksonnet didn't support getting the namespace from the environment. So as a workaround we got namespace from params.
The current pattern is to always get namespace from environment and if users want to deploy in a specific namespace they should create a new environment.
kubeflow/spark/parts.yaml, line 5 at r1 (raw file):
Previously, holdenk (Holden Karau) wrote…
<3 :D
Maybe add a link to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator ?
kubeflow/spark/README.md, line 2 at r13 (raw file):
A very early attempt at allowing Apache Spark to be used with Kubeflow. Starts a container to run the driver program in, and the rest is up to the Spark on K8s integration.
Add a link to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator if that's what its based on?
testing/workflows/components/workflows.libsonnet, line 281 at r12 (raw file):
Previously, holdenk (Holden Karau) wrote…
So this part seems to be triggered outside of the minikube tests.
That being said, I think it might make sense, in the future, to test this and other operator one-by-one on minikube, what do you think?
It might make sense to test on minikube one by one. That said the minikube test is probably in need of some major updating. So I don't know how useful this will be.
But I don't have a strong opinion either way.
testing/deploy_utils.py, line 100 at r10 (raw file):
Previously, holdenk (Holden Karau) wrote…
So I figured that hard coding master was a bad idea and installing Spark as part of the e2e minikube tests make sure it at least can be installed. I don't feel strongly about this since we don't need it for the full workflow tests so I'm happy to revert if this complicates matters.
The minikube test isn't using kfctl so it really isn't testing what we want anymore. So I'd probably recommend not worrying about it.
testing/deploy_utils.py, line 115 at r12 (raw file):
Previously, holdenk (Holden Karau) wrote…
I wish I knew why, but it is unrelated so I'll get rid of this.
Make sense; per comment above minikube test is in't using kfctl so its not really testing what we want.
I don't know if this comment is even still relevant.
testing/test_deploy.py, line 129 at r12 (raw file):
Previously, holdenk (Holden Karau) wrote…
TODO: revert this
Still planning on reverting this?
Thanks @holdenk. It looks like the spark apply test is still failing
I think the problem is you aren't creating the default environment. |
/ok-to-test |
@jlewi so I think the apply operator is succeeding, the part which is failing is the part which depended on the Python script so I don't think it the env issue (although that was possibly the issue before). |
…link to upstream base operator in doc, remove downstream job test since it's triggered in both minikube and kfctl tests and we don't want to test it in minikube right now
… test the operator applies for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 6 files reviewed, 13 unresolved discussions (waiting on @gaocegege, @holdenk, @inc0, @jlewi, @kris-nova, and @pdmack)
kubeflow/spark/all.libsonnet, line 82 at r2 (raw file):
Previously, holdenk (Holden Karau) wrote…
thanks for catching.
Done.
kubeflow/spark/all.libsonnet, line 247 at r2 (raw file):
Previously, holdenk (Holden Karau) wrote…
Yes, good catch,
Done.
kubeflow/spark/all.libsonnet, line 101 at r4 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
I think the pattern we want is to install the operator in one namespace e.g. "kubeflow-system" and users will use a different namesapce.
So I do think we need a ClusterRole because the operator will want to claim jobs in other namespaces.
Ok, I'll switch it to clusterrole
kubeflow/spark/parts.yaml, line 5 at r1 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Maybe add a link to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator ?
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 6 files reviewed, 10 unresolved discussions (waiting on @gaocegege, @inc0, @jlewi, @kris-nova, and @pdmack)
kubeflow/spark/all.libsonnet, line 140 at r4 (raw file):
Previously, holdenk (Holden Karau) wrote…
So ClusterRole versus Role is now user configurable, if folks don't need to run jobs outside of the namespace where they created the operator we'll just do a Role, but if they want to have the operator and jobs sit in different namespaces we use clusterrole.
Resolved from @jlewi's comment
kubeflow/spark/README.md, line 2 at r13 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Add a link to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator if that's what its based on?
Done.
testing/workflows/components/kfctl_test.jsonnet, line 221 at r10 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Did you modify kfctl to add this?
Backed out this change anyways so it shouldn't matter.
testing/workflows/components/kfctl_test.jsonnet, line 235 at r12 (raw file):
Previously, holdenk (Holden Karau) wrote…
Yeah I can revert those if we want, I figured it made sense to see Spark installed on minikube even if we only used the operator on the full version in e2e workflow.
Reverted changes to Python helper scripts.
testing/spark_temp/simple_test.sh, line 1 at r12 (raw file):
Previously, holdenk (Holden Karau) wrote…
This was for local testing, I can remove it.
Removed
testing/workflows/components/workflows.libsonnet, line 281 at r12 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
It might make sense to test on minikube one by one. That said the minikube test is probably in need of some major updating. So I don't know how useful this will be.
But I don't have a strong opinion either way.
Done. For now I took this out, it is used beyond e2e minikube but since I don't have this wired up to also work on e2e minikube tests doesn't make sense to put this in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 6 files reviewed, 5 unresolved discussions (waiting on @gaocegege, @inc0, @jlewi, @kris-nova, and @pdmack)
Woo Hoo! |
Test failures look unrelated, /retest |
We were having quota issues earlier. Should be fixed now. /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlewi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
It's still failing before getting to the Spark specific code, and looks like a quota failure so lets try: |
Although IDK if the bot listens to me might need someone else to tell it to retest |
Ok the spark jobs passed but the notebooks test failed ? |
/retest |
/meow |
* Add a Spark operator to Kubeflow along with integration tests. Start adding converted spark operator elements Can generate empty service account for Spark Create the service account for the spark-operator. Add clusterRole for Spark Add cluster role bindings for Spark Add deployment (todo cleanup name/image) Can now launch spark operator Put in a reasonable default for namespace (e.g default not null) and make the image used for spark-operator configurable Start working to add job type We can now launch and operator and launch a job, but the service accounts don't quite line up. TODO(holden) refactor the service accounts for the job to only be created in the job and move sparkJob inside of all.json as well then have an all / operator / job entry point in all.json maybe? Add two hacked up temporary test scripts for use during dev (TODO refactor later into proper workflow) Able to launch a job fixed coreLimit and added job arguments. Remaining TODOs are handling of nulls & svc account hack + test cleanup. Start trying to re-organize the operator/job Fix handling of optional jobArguments and mainClass and now it _works_ :) Auto format the ksonnet. Reviewer feedback: switch description of Spark operator to something meaningful, use sparkVersion param instead of hard coded v2.3.1-v1alpha1, and fix hardcoded namespace. Clarify jobName param, remove Fix this since it has been integrated into all.libsonnet as intended. CR feedback: change description typo and add opitonal param to spark operator for sparkVersion Start trying to add spark tests to test_deploy.py At @kunmingg suggestion Revert "Start trying to add spark tests to test_deploy.py" to focus on prow tests. This reverts commit 912a763. Start trying to add Spark to the e2e workflow for testing Looks like the prow tests call into the python tests normally so Revert "At @kunmingg suggestion Revert "Start trying to add spark tests to test_deploy.py" to focus on prow tests." This reverts commit 6c4c81f. autoformat jsonnet s/core/common/ and /var/log/syslog to README Race condition on first deployment Start adding SparkPI job to the workflow test. Generate spark operator during CI as well. Fix deploy kf indent Already covered by deploy. Install spark operator Revert "Install spark operator" This reverts commit cc559dd. Test against the PR not master. Fix string concat Take spark-deploy out of workflows since cover in kf presub anyways. Debug commit revert later. idk whats going on for real. hax Ok lets use where the sym link was coming from idk. Debug deploy kubeflow call... Pritn in maint oo. Specify a name. name Get all. More debugging also why do we eddit app.yaml; directly. don't gen common import for debug Just do spark-operator as verbose. spelling hmm namespace looked weird, lets run pytorch in verbose too so I can compare put verbose at the end Autoformat the json Add a deployment scope and give more things a namespace Format. Gen pytorch and spark ops as verbose idk wtf this is. Don't deploy the spark job in the releaser test no kfctl test either. Just use name We don't append any junk anymore format json Don't do spark in deploy_kubeflow anymore Spark job deployment with workflows Apply spark operator. Add a sleep hack Fix multi-line add a working dir for the ks app temp debug garbage specify working dir Working dir was not happy, just cd cause why not testdir not appDir change to tests.testDir Move operator deployment Make sure we are in the ks_app? Remove debugging and YOLO 90% less YOLO Add that comma Change deps well CD seems to work in the other command so uhhh who knows? Use runpath + pushd instead of kfctl generate Just generate for now Do both Generate k8s Install operator Break down setting up the spark operator into different steps We are in default rather than ghke Use the run script to do the dpeloy Change the namespace to stepsNamespace and add debug step cauise idk Append the params to generate cmd Remove params_str since we're doing list and a param of namespace s/extends/extend/ Move params to the right place Remove debug cluster step Remove local test since we now use the regular e2e argo triggered tests. Respond to the CR feedback Fix paramterization of spark executor config. Plumb through spark version to executor version label Remove unecessary whitespace change in otherwise unmodified file. * re-run autoformat * default doesn't seem to exists anymore * Debug the env list cause it changed * re-run autoformat again * Specify the env since env list shows default env is the only env present. * Remove debug env list since the operator now works * autofrmat and indent default * Address CR feedback: remove deploymentscope and just use clusterole, link to upstream base operator in doc, remove downstream job test since it's triggered in both minikube and kfctl tests and we don't want to test it in minikube right now * Take out the spark job from ther workflows in components test we just test the operator applies for now. * Remove namespace as a param and just use the env. * Fix end of line on namespace from ; to ,
Initial work in progress attempt at adding Spark to kubeflow using the spark-on-k8s-operator as a base starting point. This is super early but I'd love peoples feedback on the direction with this.
cc @texasmichelle
Known TODOs:
This change is