This is an experiment to create a metrics API for Kubernetes that can be run directly from the Flux leader broker pod. We made this after creating prometheus-flux and wanting a more minimalist design. I'm not even sure it will work, but it's worth a try!
You can install from pypi or from source:
$ python -m venv env
$ source env/bin/activate
$ pip install flux-metrics-api
# or
$ git clone https://github.com/converged-computing/flux-metrics-api
$ cd flux-metrics-api
$ pip install .
# you can also do "pip install -e ."
This will install the executable to your path, which might be your local user bin:
$ which flux-metric-api
/home/vscode/.local/bin/flux-metrics-api
Note that the provided .devcontainer includes an environment for VSCode where you have Flux and can install this and use ready to go!
You'll want to be running in a Flux instance, as we need to connect to the broker handle.
$ flux start --test-size=4
And then start the server. This will use a default port and host (0.0.0.0:8443) that you can customize if desired.
$ flux-metrics-api start
# customize the port or host
$ flux-metrics-api start --port 9000 --host 127.0.0.1
If you want ssl (port 443) you can provide the path to a certificate and keyfile:
$ flux-metrics-api start --ssl-certfile /etc/certs/tls.crt --ssl-keyfile /etc/certs/tls.key
An example of a full command we might run from within a pod:
$ flux-metrics-api start --port 8443 --ssl-certfile /etc/certs/tls.crt --ssl-keyfile /etc/certs/tls.key --namespace flux-operator --service-name custom-metrics-apiserver
If you want to provide custom metrics, you can write a function in an external file that we will read it and add to the server. As a general rule:
- The name of the function will be the name of the custom metric
- You can expect the only argument to be the flux handle
- You'll need to do imports within your function to get them in scope
This likely can be improved upon, but is a start for now! We provide an example file. As an example:
$ flux-metrics-api start --custom-metric ./example/custom-metrics.py
And then test it:
$ curl -s http://localhost:8443/apis/custom.metrics.k8s.io/v1beta2/namespaces/flux-operator/metrics/my_custom_metric_name | jq
{
"items": [
{
"metric": {
"name": "my_custom_metric_name"
},
"value": 4,
"timestamp": "2023-06-01T01:39:08+00:00",
"windowSeconds": 0,
"describedObject": {
"kind": "Service",
"namespace": "flux-operator",
"name": "custom-metrics-apiserver",
"apiVersion": "v1beta2"
}
}
],
"apiVersion": "custom.metrics.k8s.io/v1beta2",
"kind": "MetricValueList",
"metadata": {
"selfLink": "/apis/custom.metrics.k8s.io/v1beta2"
}
}
See --help
to see other options available.
GET /apis/custom.metrics.k8s.io/v1beta2/namespaces//metrics/<metric_name>
Here is an example to get the "node_up_count" metric:
curl -s http://localhost:8443/apis/custom.metrics.k8s.io/v1beta2/namespaces/flux-operator/metrics/node_up_count | jq
{
"items": [
{
"metric": {
"name": "node_up_count"
},
"value": 2,
"timestamp": "2023-05-31T04:44:57+00:00",
"windowSeconds": 0,
"describedObject": {
"kind": "Service",
"namespace": "flux-operator",
"name": "custom-metrics-apiserver",
"apiVersion": "v1beta2"
}
}
],
"apiVersion": "custom.metrics.k8s.io/v1beta2",
"kind": "MetricValueList",
"metadata": {
"selfLink": "/apis/custom.metrics.k8s.io/v1beta2"
}
}
The following metrics are supported:
- node_up_count: number of nodes up in the MiniCluster
- node_free_count: number of nodes free in the MiniCluster
- node_cores_free_count: number of node cores free in the MiniCluster
- node_cores_up_count: number of node cores up in the MiniCluster
- job_queue_state_new_count: number of new jobs in the queue
- job_queue_state_depend_count: number of jobs in the queue in state "depend"
- job_queue_state_priority_count: number of jobs in the queue in state "priority"
- job_queue_state_sched_count: number of jobs in the queue in state "sched"
- job_queue_state_run_count: number of jobs in the queue in state "run"
- job_queue_state_cleanup_count: number of jobs in the queue in state "cleanup"
- job_queue_state_inactive_count: number of jobs in the queue in state "inactive"
We have a docker container, which you can customize for your use case, but it's more intended to be a demo. You can either build it yourself, or use our build.
$ docker build -t flux_metrics_api .
$ docker run -it -p 8443:8443 flux_metrics_api
or
$ docker run -it -p 8443:8443 ghcr.io/converged-computing/flux-metrics-api
Note that this is implemented in Python, but (I found this after) we could also use Go. Specifically, I found this repository useful to see the spec format.
You can then open up the browser at http://localhost:8443/metrics/ to see the metrics!
We use the all-contributors tool to generate a contributors graphic below.
Vanessasaurus 💻 |
HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.
See LICENSE, COPYRIGHT, and NOTICE for details.
SPDX-License-Identifier: (MIT)
LLNL-CODE- 842614