Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for labels with @kubernetes #1235

Closed
dhpollack opened this issue Jan 16, 2023 · 7 comments
Closed

Support for labels with @kubernetes #1235

dhpollack opened this issue Jan 16, 2023 · 7 comments

Comments

@dhpollack
Copy link
Contributor

It would be nice if additional labels were supported. This can be done pretty easily since the code already supports launching pods with labels except the labels are always empty.

@shrinandj
Copy link
Contributor

The labels feature in Kubernetes is definitely very useful and powerful for filtering, querying, attribution, etc.

  • Metaflow also has the notion of tags which allow users to attach arbitrary metadata to runs so as to get similar functionality.
  • Metaflow adds annotations for common metadata: run-id, task-id, etc. to pods that are run in the Kubernetes cluster.

Can you elaborate a bit about the use-cases where tags may not be sufficient?

Also, could having both labels and tags be confusing?

@dhpollack
Copy link
Contributor Author

The labels feature in Kubernetes is definitely very useful and powerful for filtering, querying, attribution, etc.

* Metaflow also has the notion of `tags` which allow users to attach arbitrary metadata to runs so as to get similar functionality.

* Metaflow adds annotations for common metadata: run-id, task-id, etc. to pods that are run in the Kubernetes cluster.

Can you elaborate a bit about the use-cases where tags may not be sufficient?

Also, could having both labels and tags be confusing?

I'll answer your second question first. I believe having both metaflow tags and kubernetes labels would not be confusing. They basically have two different audiences. Metaflow tags are for data scientists and users of metaflow. You guys do a great job in your docs explaining how those can be used. Kubernetes labels are used by infrastructure teams to do things that don't really concern data scientists.

Importantly, many tools that utilize kubernetes labels are not metaflow-aware and often times are not even python-based. I would say the most well known would be using prometheus metrics and grafana dashboards to monitor all pods with a certain kubernetes labels. These kubernetes labels are automatically scraped from kubernetes labels. Importantly, these metrics are part of everything that runs in kubernetes, regardless of if it's a flow in metaflow, a microservice written in Go, or a build run as part of a CI/CD pipeline. For these things it would be difficult or impossible to get any metadata from metaflow.

One other thing that I want to mention is that you already support adding kubernetes labels and add a few labels to jobs launched by metaflow, which is great. But it's currently impossible to add custom labels, although not very difficult to implement because most of the work has already been done since kubernetes labels are already being used.

@dhpollack
Copy link
Contributor Author

One other thing that I didn't mention is that kubernetes labels are a bit more restrictive than metaflow tags with regards to valid characters and length. So you probably don't want to restrict yourselves to conforming any possible metaflow tag to a kubernetes label. However, using key-value in metaflow tags would be useful for advanced searches, but I think that's outside of the scope of this issue.

@shrinandj
Copy link
Contributor

Thanks a lot for clarifying this. I agree that use-cases such as adding prometheus specific labels or labels for similar use-cases would require having K8s specific labels.

Metaflow tags are for data scientists and users of metaflow...Kubernetes labels are used by infrastructure teams to do things that don't really concern data scientists.

But, the person running the flow is typically a single person. Usually the data-scientist, correct? Is the expectation that the data-scientists are told which labels they should add upfront and they would add it in the code?

Consider the case where a user is running a flow in helloworld.py.

The user could run the flow and add a tag in the following manner:
python helloworld.py run --tag crazy_test

If the same user, wanted to have specific labels for a step, the user will need to do something like the following:

from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):
    """
    A flow where Metaflow prints 'Hi'.

    Run this flow to validate that Metaflow is installed correctly.

    """
    @step
    def start(self):
        pass

    @step
    @kubernetes(labels="prometheus.io/interval:30s")
    def hello(self):
        """
        A step for metaflow to introduce itself.

        """
        print("Metaflow says: Hi!")
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    HelloFlow()

Is that correct?

For prometheus like use-cases which actually might apply to all flows from all users, would be it beneficial to have the labels option in a global config option. Metaflow can use a global config file, typically at ~/.metaflowconfig/config. If it had a config option called METAFLOW_KUBERNETES_POD_LABELS="prometheus.io/interval:30s,mylife=myrules", these labels could be added to all pods.

This would avoid adding the labels option in every step.

WDYT?

@dhpollack
Copy link
Contributor Author

Yea, in my PR, I added "KUBERNETES_LABELS" as a config option which does get applied to every pod when you run "--with kubernetes" or with argo workflows. I agree it's less useful in the decorator itself. However since the other kubernetes options are all in the decorator, i think it makes sense to have that as an option.

Out of scope for this PR, i think it would be nice to have a "tags/labels" decorator that could add such things to specific steps or entire flows without relying on a specific decorator.

If you prefer "pod_labels" rather than "labels" i would be ok with that. The one issue is that your code already expects "labels" in kwargs in some places so it kind of makes sense to keep the name. I think either would work but I didn't want to change the name since something was already there. I think in the PR that i have you can do something like '--with kubernetes:labels="label1=val1,label2=val2"' or 'METAFLOW_KUBERNETES_LABELS="label1=val1" python hello world.py run --with kubernetes'

@dhpollack
Copy link
Contributor Author

An example of where you already use the key labels is in the kubernetes_job.py and the argo_workflows.py files.

@saikonen
Copy link
Collaborator

Available since 2.9.3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants