Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create deploy job #673

Merged
merged 6 commits into from
Apr 28, 2018
Merged

Create deploy job #673

merged 6 commits into from
Apr 28, 2018

Conversation

inc0
Copy link

@inc0 inc0 commented Apr 17, 2018

Once image is available somewhere (for testing you can use inc0/bootstraper:latest, just edit the job def) whole setup of kubeflow will be as easy as kubectl create -f deploy_job.yaml.

One missing thing is RBAC role bindings, to avoid them for now you can use https://kubernetes.io/docs/admin/authorization/rbac/#permissive-rbac-permissions


This change is Reviewable

@inc0
Copy link
Author

inc0 commented Apr 17, 2018

/assign @jlewi
/assign @nickchase


ARG github_token

RUN export GITHUB_TOKEN=${github_token} && cd / && ks init kubeflow && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably pin to a version of Kubeflow

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should pin in release - like in kubeless (example below) - either way works for me

@@ -0,0 +1,31 @@
# Deploy kubeflow to namespace "kubeflow"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a job?
Why isn't kubectl run sufficient?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kubectl run would require person to deal with problems that this wont have:

  1. GITHUB_TOKEN - first hurdle that's hard to debug
  2. ksonnet -> k8s configuration - it's not trivial to run ksonnet from within kubernetes
  3. multiple commands, first kubectl run <> bash -> ks generate, ks apply (again, figure out how to run ks in cluster)

vs

kubectl create -f https://raw.githubusercontent.com/kubeflow/kubeflow/master/bootstrap/deploy_job.yaml

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickchase

I think you misunderstand me. All a K8s job is doing is running a docker container. So everything you are doing with a K8s job you could probably do just by doing kubectl run.

Either way you end up running the same container. So if the container has the Kubeflow registry baked in then they don't need to worry about GITHUB_TOKEN.

All your deploy_job is doing is running a shell script so the same thing could be done via kubectl run.

The reason I like kubectl run as opposed to
kubectl create -f ....

is because I think a YAML file is an invitation to end up back where we started. i.e. with a YAML file its tempting to keep adding environment variables to expose more parameters options. Which eventually leads to us picking an appropriate template and packing solution and we're back where we started.

If we do kubectl run and/or docker run then I think its more likely that we will keep it simple; i.e. restrict the number of parameters to the point its easy to type out the command line.

For anything more complex, I think the solution should be 'here's the ksonnet app created by the bootstraper' now you can customize it a hundred different ways.

So instead of having the bootstrapper expose more parameters to the user (e.g. via a config map).

I would like to make the bootstrapper

  1. better at picking good parameters for the app based on the user's setup
  2. making the resulting ksonnet app available to the user (e.g. by pushing to git or a shared filesystem)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is difference in kubectl run vs job - kubectl run is defined by user 100% and job is defined by us - therefore easier.
It won't invalidate kubectl run path - but it will provide very quick, no sweat, way to deploy default environment - perfect for quickstart. kubectl run will require few non-trivial steps.
For kubectl run it's connecting ks with underlying cluster
For docker run it's 1. having docker locally and 2. Exposing kubernetes creds to docker.

Job definition is not replacement of your bootstrapper approach, it's alternative or improvement.
If we run bootstraper to do what it's meant to be doing - determining params for deploy, then we can put these params into configmap or something and run this job with it, still less clicks than kubectl run, ks init, ... especially if you don't want to have ksonnet repos baked in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My contention is there is no good default environment.

The default environment is dependent on the cluster. This the lesson of trying to set the PV for JupyterHub. Not all clusters have default storage class.

So the point is to run the bootstrapper to determine what the optimal defaults should be.

Please see my other comment. If you want to create a job/statefulset to automatically deploy a cluster I'm fine with it but subject to the following

  1. The job should emit the ksonnet app to allow advanced customization
  2. The ksonnet app should be accessible after the job finishes
  3. The ksonnet app should be created by running the bootstraper.

- name: kubeflow-deploy
image: gcr.io/kubeflow-images-staging/bootstraper:latest
env:
- name: PVC_MOUNT_PATH
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this just a layer of indirection? e.g. you're taking the parameters exposed by ksonnet and just turning them into environment variables? Why not just tell users to run ks param set if they want to set it.

The point of the bootstraper is to solve the problem of Kubeflow exposes X parameters how do I set those parameters optimally for my deployment?

Forcing users to manually select those parameters (whether as environment variables or ks parameters) doesn't solve that problem. All this is doing is telling users that instead of doing ks param set they need to edit key-value paris in a YAML file.

The bootstrapper is trying to do something fundamentally different to create a different experience. In particular, the boot strapper figures out based on the users setup how to set different values.

The current bootstrapper has a pretty simplistic example. The bootstrapper will check if the user has PVC and depending on the result set the value of jupyterNotebookPVCMount accordingly.

So setting jupyterNotebookPVCMount should be unnecessary except in the case the user wants to override it for some reason.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are few options how to approach that. We can create set of ENV like here, easiest. Better way would be to create configmap + shim layer to translate optiosn from configmap to params. Best way would be to have ksonnet app definition with all the configurations available (can be created by bootstrapper) and then consume it in job, so user can do sth like that:

  1. bootstrapper - create template ksonnet app in app.yaml
  2. edit app.yaml to your taste
  3. kubectl create -f app.yaml
  4. kubectl create -f https://raw.githubusercontent.com/kubeflow/kubeflow/master/bootstrap/deploy_job.yaml

And, in case you just want to roll with defaults, only step 4 is needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need another shim layer?

What is the long term plan? Would there be a 1:1 mapping from environment variables to ksonnet parameters?

Is the fact that the environment variable name is different from the ksonnet parameter intentional? Or is it just an oversight?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oversight/trying to make it look like ENV variable rather than camel-cased one.

envs is just one way to pass params to deployment and I'm by no means saying it's best one. I think best one would be bootstraper-driven CRD or configmap.

End goal I think would be something like this:
run boostraper locally or in container (since it's golang app it can totally be just wget->run).
bootstraper will create CRD or configmap or any other k8s resource definition (ksonnet/app?) that will contain all the params it figures out are best - I'd say it should output editable file or something to allow operator modify them prior to running deployment.
operator runs deploy_job which will look for bootstraper-driven configmap and deploy accordingly.

Now if we also add safe defaults to deploy_job, bootstraper steps are optional for super-quickstart


RUN export GITHUB_TOKEN=${github_token} && cd / && ks init kubeflow && \
cd kubeflow && \
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should bake the ksonnet app dir into the container. We ultimately want to expose the app to users so they can do more advanced customization and check into source.

I think a better solution is that ksonnet 0.10 will support using a file location as a repository. Should we just wait for ksonnet 0.10?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try to eliminate the need for people to manually install additional packages as much as possible. For that reason I'm also going to push back on "why don't we just tell them to set ksonnet variables".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, right now GITHUB_TOKEN is huge hurdle I think. Also makes it virtually impossible to deploy kubeflow on non-internet-connected env (finantial, military...) - we might not have these users yet but we might in the future.

Pulling stuff on runtime is long and it's antipattern really. I think if we can bake static files into container, we should. If user wants to create their own customized image they can always FROM this one and only override this one dir.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not suggesting we don't bake the Kubeflow registry into the container; I'm just suggesting that doing it by creating a ksonnet app may not be the best approach.

@nickchase We are providing them a docker container that contains ksonnet so regardless of whether they run ks param set or you use environment variables they don't have to install any additional tools.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inc0 @nickchase regarding ""why don't we just tell them to set ksonnet variables"

My goal with the bootstrapper is to create a different experience than what we have today.
Specifically, right now a user has 12 different parameters for Kubeflow and this will only grow.

I think the big problem for users is "What is the correct value for these parameters".

I agree there's friction over ksonnet and we should fix that and I like a lot of what this PR does in that regard (e.g. by eliminating GITHUB_TOKEN and baking in the registry).

But whether a user sets 12 parameters via ks param set or setting 12 different environment variables that's syntactic sugar (perhaps ks param should just support taking a file) and absent that maybe we should write a shell script to do that.

But for the bootstrapper the problem I'd like to solve is trying figure out what are the correct values for the parameters automatically based on the user's setup so the user doesn't have to think about it.

If we're going to get info from the user I think it should be of a fundamentally different kind (from what the ksonnet parameters currently expose) so that we create a fundamentally different experience from what the user gets by calling ks param set.

For example, instead of just turning the ksonnet parameters into environment variables we should provide a wizard that ask a series of questions

  1. Are you working in a team or single user?
  2. How big is your data?
  3. Do you use deep learning?

The bootstraper can then tune the ksonnet parameters based on the user input.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it would be terrific to have a wizard that asks questions rather than having people manually tune parameters, but in the meantime I think it's even BETTER to have something that can install Kubeflow with a single command -- at least in the short run. Ideally, you could install it with a single command, and THEN we can add the ability to tune things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the data scientist end user, grad student in a lab, or ML user with little dev ops experience, simplicity is key and avoiding jargon related to k8 and systems. These users need a solution that: a) let's them get to work with ML tools quickly, b) does not place an upfront burden on them to make decisions or settings to get a running environment, c) allows for additional configuration after a while or alternatively an easy way to spin up a new system tailored to their growing needs and familiarity with the systems that run beneath the ML tools.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willingc I agree and that was basis for my environment proposal - kubeflow/community#62 - Operator creates environments for students from common template and just gives namespace for them

@jlewi I agree we shouldn't do ks param set like this. It's a quick PoC and we should solve it asap. Look at my comment above for one potential solution

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved


# TODO(inc0): Create rbac role bindings
---
apiVersion: v1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running in cluster as a job seems like it creates RBAC issues which is one of the current problems I'd like the bootstrapper to deal with.
https://github.com/kubeflow/kubeflow/blob/master/user_guide.md#rbac-clusters

Deploying Kubeflow requires high level permissions e.g. because you need to create things like RBAC roles. You also might want to create keys to use as secrets for GCP/S3.

So if you run on cluster asynchronously then you're forcing users to go through the hurdle of creating appropriate service accounts and provisioning them; which adds the kind of complexity we're trying to avoid.

The bootstraper aims to avoid this by being designed to run interactively. If it runs interactively then the bootstraper can use the user's credentials. This allows us to use the end user's credentials to boot strap the cluster.

For example, in the current execution mode the bootstrapper runs locally in docker on the user's computer so we can have the user go through whatever login flow is neccessary to get a credential and then use that. So we can use the user's credential to create cluster roles, Cloud secrets etc.

We can eventually allow this to be run on cluster e.g. via kubectl exec -ti but we need to be careful about how we handle the user credential because we wouldn't want to store the user credential in a pod unless the pod is properly locked down.

One solution is to make it a web app. In that case the refresh token can be stored client side and the web app can just get short lived access tokens to forward on to various services.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current pattern is to create RBAC bindings in bootstrap. Look at this: https://github.com/kubeless/kubeless/releases/download/v0.6.0/kubeless-v0.6.0.yaml

I think kubeless is good example of what I'd like to see - they specify required ClusterRoles and make it work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inc0 @jlewi We've been doing a lot of prototyping on separating bootstrapper into specific phases:

  • create cluster level resources: clusterroles, clusterbindings, namespace, pod security contexts, service accounts, PVs. In short, isolate ksonnet libs into those which require cluster-admin equivalent RBAC roles.
  • create a single-signon that uses IAP or other authentication providers. This auth provider will use web-token authentication (we've experimented with oauth/openid tokens as well) and will likely be based on or use https://github.com/appscode/guard.
  • define RBAC roles based on the provider. For example roles for 'data scientist' users, leads, etc. One thing guard does is import github users in an organization including teams. Depending on the auth provider, a RBAC mapping of teams could be automated so that data scientists have a 'zero-admin' experience.

We see bootstrapper as having to address 2 concerns: 'cloud-providers' and 'authentication providers'. Each are distinct and could be bundled as ksonnet libs similarly to IAP. Having a framework in bootstrapper that can mix and match cloud providers with auth providers is how we've been thinking about this problem. Part of the difficulty we've been having is under GKE it's not straightforward to update an api-server with the equivalent of boot flags like --authentication-token-webhook-config-file or --oidc-issuer-url. We've done a fair amount of prototyping using kops, dex, and guard. @jlewi it looks like currently there is an option to use cloud endpoints and ESP to delegate to a provider that does IAP under GKE. We've gone down this path as well to use other providers. We're working on a document you both can review (and others) but wanted to have enough concrete prototyping done to have answers for questions. Is there a particular preference for how the bootstrapper is implemented? It seems like the go implementation has the most potential to get at features not normally exposed in CLI tools like ks, gcloud, kubectl, etc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also another problem is on prem which can be any mixture of this. I'm not sure if we should go as far as try to autodetect this. I's going to be really tricky as every on prem is different.

RBAC is slightly different problem tho (imho). One is managing users (I'm user X) and another is managing roles (user X is allowed to do Y). 1st is external to k8s (but also we shouldn't solve this ourselves, there are projects doing this), another is just k8s - RBAC.

In any case I don't think we should discuss it here, it's broader discussion that we need to have and will affect much more (auth to jupyterhub?). I'd just say we that we can for now create namespace and rolebindings to this namespace, which user gets there roles - let's discuss that separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kkasravi I agree with you there are multiple phases. My current thinking is

  1. Cluster prep
  2. tune ksonnet parameters
  3. deploy

What's the point of dividing manifests into ones requiring cluster admin and ones that don't? My expectation is that whoever is deploying the app has appropriate privileges to deploy Kubeflow in its entirety.

Why is creating RBAC roles a separate phase? RBAC roles for datascientists/users/ etc... should be defined in your ksonnet app and deployed when you deploy Kubeflow.

I think Auth (e.g. IAP) should be handled as a combination of the cluster prep and ksonnet parameter stages.

Ideally to enable certain kinds of auth e.g. IAP you would just add appropriate manifests to your ksonnet APP and these would then be deployed when you deploy your app. This is what we currently do for IAP. i.e if you set the appropriate parameters for IAP we will deploy envoy and some other endpoints.

But there are likely some things that you can't manage as K8s resources. For example, on GCP you might want to create a static IP. So in bootstrapper we'd just make an RPC to the appropriate GCP API.

I think this matches what you are saying about cloud providers and auth providers and mixing and matching. If you have an OAuth provider that is sufficiently general that it can run on multiple clouds then a user could select it and ksonnet parameters would be set to enable that manifest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify a bit on @jlewi's suggestion to make sure that I am understanding it correctly:

  1. kubectl create -f kubeflow_cli.yaml where kubeflow_cli.yaml has defaults so a data scientist does not have to decide them to deploy.
  2. kubectl exec -ti kubeflow-cli-0 opens a shell for the user to interact with kubectl or ks.

a) If the user doesn't wish to interact with the shell, what is displayed to them?
b) Kubeflow is fully deployed at this point (i.e. I can open a notebook or start a ML job)
c) ks is still available if I am familiar with how use it to make modifications

Forgive me, I'm not sure what "PD" stands for. Assuming its some sort of persistent storage or persistent disk (PD?).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubectl create -f kubeflow_cli.yaml where kubeflow_cli.yaml has defaults so a data scientist does not have to decide them to deploy.

Correct in that user doesn't have to specify anything.

a) If the user doesn't wish to interact with the shell, what is displayed to them?
I think it depends on what we decide to do.
We could either require the user to always login via kubectl exec and then do "quick_start.sh" to deploy Kubeflow with automatic defaults.

Or we could just have kubectl crete -f kubeflow_cli.yaml automatically run quick_start.sh so user would only have to start a shell and use kubectl/ks if they weren't happy with the auto-configured deployment.

Forgive me, I'm not sure what "PD" stands for. Assuming its some sort of persistent storage or persistent disk (PD?).

Correct; persistent disk.

I'm basically just suggesting a couple modifications to @inc0's original PR

  1. We should by default store the ksonnet app on the K8s cluster using the cluster default PVC (if cluster doesn't have a default, that could be handled as an exceptional case requiring the user to make a change to the YAML file).

  2. Instead of running the container as a job and exiting, we should have the container enter an idle loop so that users can just do kubectl exec to start a shell and interact with the app and kubeflow deployment after the initial deployment is complete

  3. I don't think we should expose ksonnet parameters in kubeflow_cli.yaml. Instead, we should try to make the bootstrap go program more intelligent and pick better defaults for the ksonnet app. If users want to customize this deployment then we should tell customers how to do this using ksonnet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jlewi for the clarifications.

I'll let you iterate on the details. If possible #3, does seem appealing for an end user. Overall, simplicity and stability are key for the end user data scientist persona. By stability, I mean, what the user needs to do to troubleshoot when/if something goes wrong on initial install.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlewi for 2. above could kubectl exec ... also act as a kind of REPL? That is the data scientist doesn't have to explicitly enter a shell but could do something like kubectl exec ... tfjob start job1. This would operate as a type of CLI if the bootstrap go program is given the REPL tfjob start job1 and would know that it needs to call the ks api to generate and apply the tfjob package. Maybe this is what you had in mind with 3. with making the go program more intelligent and could accommodate customizations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inc0 Did you see my comment above

@kkasravi That's an interesting idea. It seems like that would just work since kubectl exec allows executing binaries in a container that is already running. So it would work for any CLI we install.

Copy link
Contributor

@willingc willingc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this PR folks :-)


# TODO(inc0): Create rbac role bindings
---
apiVersion: v1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify a bit on @jlewi's suggestion to make sure that I am understanding it correctly:

  1. kubectl create -f kubeflow_cli.yaml where kubeflow_cli.yaml has defaults so a data scientist does not have to decide them to deploy.
  2. kubectl exec -ti kubeflow-cli-0 opens a shell for the user to interact with kubectl or ks.

a) If the user doesn't wish to interact with the shell, what is displayed to them?
b) Kubeflow is fully deployed at this point (i.e. I can open a notebook or start a ML job)
c) ks is still available if I am familiar with how use it to make modifications

Forgive me, I'm not sure what "PD" stands for. Assuming its some sort of persistent storage or persistent disk (PD?).

@inc0
Copy link
Author

inc0 commented Apr 24, 2018

So I'm thinking how to incorporate bootstrapper to this. Problem I have with bootstrapper (as it stands today) is that it, in the runtime, pulls packages etc. That means you can't have immutable image that will always deploy the same env.

How about we move bootstrapper to different role?
What it could do is to detect existing environment - like check storageclass and write all the proposed configs to configmap or other k8s resource. I'd like configmap because user could then run kubectl edit and modify options there.

Next step would be to run deploy job which would use this image, have kubeflow pulled down already, then for every param in configmap run ks param set it and run ks apply.
This approach would give us best of all the worlds there:

  1. bootstrapper would be focused on determining defaults, low bar
  2. kubernetes offers ability to edit these defaults.
  3. power users could have their own configs and manually create configmaps skipping step 1 entirely
  4. whole deployment would be just simple job run

Thoughts?

@jlewi
Copy link
Contributor

jlewi commented Apr 24, 2018

If you always want to get the same environment you do this by checking in the ksonnet app into source control.

If you want to consistently deploy the same Kubeflow environment you would do that by pulling the app from source control and deploying it. This is GitOps. There are a number of tools attempting to solve it (e.g. Argo CD, Weave etc...).

I don't think consistently deploying the same environment is the goal of the bootstrapper;there are other tools for that problem.

The goal of the bootstraper is to make it as simple as possible for users to get an optimized Kubeflow deployment.

For this there are two problems we want to solve

  1. Eliminating the friction of downloading ks and running X commands.
  2. Customizing Kubeflow for their K8s cluster

We solve 1. by giving users a YAML manifest that runs a prebuilt image that invokes "quick_start.sh".

We solve 2. by having quick_start.sh invokes the bootstrapper to produce a ksonnet app.

At this point if users want to consistently get the same environment they can and should check their ksonnet app into source control.

@inc0
Copy link
Author

inc0 commented Apr 24, 2018

I think there are more friction points tho:

  1. deploying without internet connection or with poor internet connection
    1.1 not wanting to use github token for whatever reason
  2. repeatable builds - just adding app to source control doesn't mean repeatable builds - what about dependencies? ksonnet deps? etc etc. Having baked image with pinned everything is best way to provide repeatability

now what will be pretty hard to model is config file management - if whole app is essentially a config it's going to be really hard to maintain - imagine diffing 2 setups of kubeflow... I think we should be able to spawn kubeflow from single config file which can be extended and maintained by operator. That's why configmaps are for. Having to maintain local git repo just to have repeatable builds seems like overkill to be honest.
In any case, I think downloading things on container runtime is antipattern that breaks container immutability and should be avoided. If github dies, and it die every now and then, we still want to have deployable project...

@jlewi
Copy link
Contributor

jlewi commented Apr 24, 2018

I think you are conflating issues.

Nothing I'm suggesting requires GitHub

  • We can address the dependency on GitHub and the requirement for a GitHub token by baking
    the Kubeflow ksonnet repository into the image (more or less like you propose)

  • ksonnet requires a registry from which to fetch packages. That can be a git repository or with 0.10 it can be a file location. We can stage that registry however we need to in order to get the desired behavior; e.g. we can use a git repository baked into the image or a file repository baked into the image.

repeatable builds - just adding app to source control doesn't mean repeatable builds - what about dependencies? ksonnet deps? etc etc. Having baked image with pinned everything is best way to provide repeatability

We're not building anything. What we have is a set of config files describing the Kubeflow deployment as ksonnet. This is checked into source control. Those manifests are fully specified; they do not have any dependencies on tools outside source control.

You do not need to check in ksonnet to make these configs deterministic anymore then you need to check in kubectl to make a YAML K8s manifest repeatable.

I think we should be able to spawn kubeflow from single config file which can be extended and maintained by operator. That's why configmaps are for. Having to maintain local git repo just to have repeatable builds seems like overkill to be honest.

This is expanding the scope of the PR well beyond what I'm suggesting. You are effectively using a different mechanism from ksonnet to specify Kubeflow deployments. This is going beyond what I'm suggesting of

  1. eliminating friction around getting started with ksonnet
  2. optimizing Kubeflow for a particular deployment

now what will be pretty hard to model is config file management - if whole app is essentially a config it's going to be really hard to maintain - imagine diffing 2 setups of kubeflow...

This isn't hard for me to imagine at all. Since the K8s community is embracing GitOps, I expect there are or will be numerous tools that will support intelligent diffs of YAML manifests specifying applications. WeaveWorks for example provides kubediff. These tools should operate on the final YAML output; not whatever templates(jsonnet, jinja2, go templates) were used to generate them.

In any case, I think downloading things on container runtime is antipattern that breaks container immutability and should be avoided. If github dies, and it die every now and then, we still want to have deployable project...

See comment above

Here are the changes I'd like to see so we can check it in and iterate

  1. Make it a statefulset

  2. Create a quick_start.sh that

    • Runs bootstraper
    • Emit the ksonnet app to PV
    • Deploys Kubeflow
    • Runs forever to give users easy access to the app via ks exec
  3. Create a YAML file to launch the statefulset so users can just do kubectl create -f

    • Use the default storage class as PV and give users option to comment out if they don't have PV

I suggest moving the following changes to follow on PRs just to scope it down and narrow the discussion.

  1. Eliminate the dependency on GitHub

    • My preferred approach would be cloning kubeflow into the image and then using ksonnet's support for file repositories
    • The bootstrapper is already using a 0.10 version of ksonnet so I think this will work
    • I would suggest doing this in a follow on PR so we can discuss it without conflating with other issues.
  2. Using a configmap to pass parameters to quick_start.sh

  • I would prefer to do this in a follow on PR so we can discuss how to make it easy for users to pass ksonnet parameters without conflating that with other issues.

@inc0 inc0 force-pushed the kubeflow_deploy branch from 2354d89 to e6a9284 Compare April 25, 2018 10:04
@inc0
Copy link
Author

inc0 commented Apr 25, 2018

@jlewi a compromise please;) This is statefulset so workflow will look like this:

kubectl create -f examples/kubeflow_toolbox.yaml
kubectl exec -it -n kubeflow kubeflow-toolbox-0 bash
ks params set blah blah
source ./deploy.sh

That's temporary, because when #720 happens (blocked on ksonnet issue) we will inject bootstrapper into this workflow.

Does this sound better?

@inc0
Copy link
Author

inc0 commented Apr 25, 2018

Also I didn't add PVC here since it doesn't make sense without bootstrapper - I'd need to pre-populate PVC with existing ksonnet app. This seems like overkill for what is temporary solution anyway

@jlewi
Copy link
Contributor

jlewi commented Apr 25, 2018

I can live with that but can we still use PVC. Even without the use of the bootstrapper you still have a ksonnet app that is worth preserving. So you could just copy the ksonnet app from the container filesystem to the PVC if it doesn't exist.

I'm fine submitting just that and not using the bootstraper in this PR

but I wonder if the following would work

  1. Initialize the ksonnet app @ image build time
  2. When script runs copy the skeleton ksonnet app to the desired location
  3. run ks generate for the components
  4. Run bootstrapper
    • Since the components already exist bootstrapper won't rerun the generate command.

@inc0 inc0 force-pushed the kubeflow_deploy branch from 8baddec to 02b7d5a Compare April 25, 2018 16:41
@kunmingg
Copy link
Contributor

@inc0 updated ksonnet version in bootstrap docker (#725), #720 should work now.

@jlewi
Copy link
Contributor

jlewi commented Apr 25, 2018

Reviewed 1 of 7 files at r1, 1 of 7 files at r4.
Review status: 1 of 8 files reviewed at latest revision, 4 unresolved discussions.


deploy.sh, line 1 at r5 (raw file):

ks apply default --token=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)

Why do we have deploy.sh & bootstraph.sh?


bootstrap/Dockerfile, line 45 at r1 (raw file):

Previously, inc0 (Michał Jastrzębski) wrote…

I think we should pin in release - like in kubeless (example below) - either way works for me

What do you mean in release?


bootstrap/Dockerfile, line 47 at r1 (raw file):

Previously, inc0 (Michał Jastrzębski) wrote…

@willingc I agree and that was basis for my environment proposal - kubeflow/community#62 - Operator creates environments for students from common template and just gives namespace for them

@jlewi I agree we shouldn't do ks param set like this. It's a quick PoC and we should solve it asap. Look at my comment above for one potential solution

Resolving this thread because I think we've reached consensus in other channels.


bootstrap/Dockerfile, line 52 at r5 (raw file):

# Ksonnet needs to have kubernetes env setup even for non-k8s actions.
COPY kubeconfig_fake.yml /root/.kube/config

Lets add a TODO to come up with a better solution #722


bootstrap/Dockerfile, line 61 at r5 (raw file):

    ks pkg install kubeflow/core && \
    ks pkg install kubeflow/tf-serving && \
    ks pkg install kubeflow/tf-job && unset GITHUB_TOKEN

Since Kunming updated the ksonnet version does file repository work?


bootstrap/start.sh, line 13 at r5 (raw file):

if [ -v "DEPLOY_JOB" ]; then
    cd /kubeflow
    # Remove fake kubeconfig and start k8s proxy

Add a TODO for issue #722 to come up with a valid solution for creating a .kube/config file in cluster


bootstrap/start.sh, line 15 at r5 (raw file):

    # Remove fake kubeconfig and start k8s proxy
    rm ~/.kube/config
    kubectl proxy --port=8111 &

Why do we need a proxy?


bootstrap/start.sh, line 19 at r5 (raw file):

    # Recreate env since we have proper k8s creds
    ks env rm default
    ks env add default --server=http://127.0.0.1:8111

Why do we need to use the PROXY? Can't we use the downard API to set an environment variable with the master ip? (I think TF job operator does this)


Comments from Reviewable

@inc0
Copy link
Author

inc0 commented Apr 26, 2018

Reviewable causes me issues, so I'll reply here:)

What do you mean in release?

publishing containers with baked in releases - bootstrapper:0.1.0 for example

Since Kunming updated the ksonnet version does file repository work?

I'll check, but regardless we still need #722 to be fixed.

Add a TODO for issue #722 to come up with a valid solution for creating a .kube/config file in cluster

I think proper solution would be to use this: https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#accessing-the-api-from-a-pod - there is golang binding that would do this, but we need to handle this in bootstrapper or ksonnet. I'd suggest to move everything to InClusterConfig if .kube/config is not present.

Why do we need to use the PROXY? Can't we use the downard API to set an environment variable
with the master ip? (I think TF job operator does this)

I'll check, proxy was straightforward that's why I did that. It's easy if we add mechanism to bootstrapper code tho

@jlewi
Copy link
Contributor

jlewi commented Apr 26, 2018

A couple minor nits but I think this is largely ready to go.

clusterIP: "None"

---
apiVersion: v1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inc0 i ran into problems of a missing PV. I removed PersistentVolumeClaim and changed StatefulSet below. See https://cloud.google.com/kubernetes-engine/docs/how-to/stateful-apps

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PV is there for resiliency of app - this is meant to be long running pod that should survive restart of node - hence need for PV

storage: 5Gi

---
apiVersion: apps/v1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inc0 I used the following def (ignore the image path which uses an AIPG project)

---
apiVersion: apps/v1beta2
kind: StatefulSet
metadata:
  name: kubeflow-toolbox
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      app: kubeflow-toolbox
  serviceName: kubeflow-toolbox
  template:
    metadata:
      name: kubeflow-toolbox
      labels:
        app: kubeflow-toolbox
    spec:
      containers:
      - name: kubeflow-toolbox
        image: gcr.io/constant-cubist-173123/bootstrapper:v20180426-v0.1.1-10-g02b7d5a-e3b0c4
        env:
        - name: NAMESPACE
          value: "kubeflow"
        - name: DEPLOY_JOB
          value: "1"
        volumeMounts:
        - name: kubeflow-toolbox-pvc
          mountPath: /opt/kubeflow/app
  volumeClaimTemplates:
  - metadata:
      name: kubeflow-toolbox-pvc
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi

deploy.sh Outdated
@@ -0,0 +1 @@
ks apply default --token=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have deploy.sh & bootstrap/deploy.sh?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do either of these scripts actually get used anywhere?

@@ -0,0 +1 @@
ks apply default --token=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a shebang line and a description of what this is for.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well reason I created deploy.sh is just this ks apply is pretty long and non-trivial to figure out. it's a shortcut. it's easy to remember ./deploy.sh :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw I'll remove bootstrap/deploy one, I think it's leftover. but good point about shebang

cp -R /kubeflow /opt/kubeflow/app
fi
cd /opt/kubeflow/app/kubeflow
sleep 1000000000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tail -f /dev/null is better I think for run for ever

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah can do, this runs for years so well...tomato tomato

- name: NAMESPACE
value: "kubeflow"
- name: DEPLOY_JOB
value: "1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 1? Why not true/false?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it has to be string with whatever in it. Can be true

metadata:
name: kubeflow

# Headless service because StatefulSet requires it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we only needed a service for a stateful set if we wanted to associate stable DNS names for them.

Copy link
Author

@inc0 inc0 Apr 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To deploy statefulset you need a service. So I had to create this mock service. Look at servicename field in statefulset definition, it won't run without it.

# TODO(inc0): Create rbac role bindings
---
apiVersion: v1
kind: Namespace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think Namespace should be inside the YAML file. This is too dangerous IMO.
If you do kubeflow delete -f kubeflow_toolbox.yaml I don't think we should automatically delete the namespace and everything in the namespace. I only think we should delete the statefulset.

I think users should manually create the namespace and manually delete it if they want to tear all resources down.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well point is to be able to quickly pave whole thing. This definition doesn't allow parameters so it will be either kubeflow namespace or default namespace.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's beneficial to keep it until we make it fully parametrized (for example via bootstrapper).

Copy link
Contributor

@jlewi jlewi Apr 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can still deploy it into namespace kubeflow and just tell the users to create the namespace kubeflow manually.

kubectl create namespace kubeflow
kubectl create -f https://.../toolbox.yaml

Consider the case where the user had a preexisting namespace kubeflow in their cluster and now they do kubectl delete -f yourmanifest

The idea that this could delete a preexisting namespace in my cluster is not all expected.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I'll deploy toolbox in default ns and show how to define own ns

@jlewi
Copy link
Contributor

jlewi commented Apr 27, 2018

I think its much better to deploy in the non-default namespace. This way its easy for users to clean up. Plus if you specify namespace: kubeflow users can do a search replace to switch to other namespace if required.

I really don't think telling users to create the namespace manually is that big a deal.

I'll leave it to you to decide.

/lgtm
/approve
/hold

Cancel the hold if you want to submit as is without using the non-default namespace.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@inc0
Copy link
Author

inc0 commented Apr 28, 2018

/hold cancel

@k8s-ci-robot k8s-ci-robot merged commit 108313a into kubeflow:master Apr 28, 2018
saffaalvi pushed a commit to StatCan/kubeflow that referenced this pull request Feb 11, 2021
* Create deploy job

* move job to examples and remove options

* make job a statefulset

* move app to pvc

* add TODOs

* address review
yanniszark pushed a commit to arrikto/kubeflow that referenced this pull request Feb 15, 2021
* modify some dockerfile to support power

* update more dockerfiles for v2

* modify wrong manager-rest name
surajkota pushed a commit to surajkota/kubeflow that referenced this pull request Jun 13, 2022
* move metadata-db from base to separate overlay
use parameter files for db configuration

* kfdef files use db overlay for metadata component

* update metadata tests

* disable namesuffixhash for generated resources

* add prefix to generated resources

* update unit test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants