
Add metrics to Elastic Package Registry #827

Merged
merged 31 commits from add_metrics_prometheus into elastic:main on Jul 7, 2022

Conversation

@mrodm (Contributor) commented Jul 1, 2022

Relates #797

Add metrics to the Elastic Package Registry. These metrics are sent to the Elastic Stack through the APM agent, and they are also exposed by the application itself at the "/metrics" endpoint.

These metrics are instrumented using the Prometheus client. The metrics exposed in this PR are:

  • epr_service_info (Gauge)
  • epr_in_flight_requests (Gauge)
  • epr_number_indexed_packages (Gauge)
  • epr_http_requests_total (Counter)
  • epr_http_request_duration_seconds_{bucket,sum,count} (Histogram)
  • epr_http_request_size_bytes_{bucket,sum,count} (Histogram)
  • epr_http_response_size_bytes_{bucket,sum,count} (Histogram)
  • epr_storage_indexer_updates_index_success_total (Counter)
  • epr_storage_indexer_updates_index_error_total (Counter)
  • epr_storage_requests_total (Counter)
  • epr_storage_indexer_update_index_duration_seconds_{bucket,sum,count} (Histogram)
  • epr_storage_indexer_get_duration_seconds_{bucket,sum,count} (Histogram)

Some examples of the metrics that will be exposed (bucket boundaries still to be reviewed):

# HELP epr_service_info Version information about this binary
# TYPE epr_service_info gauge
epr_service_info{instance="ec4bd25798d7",version="1.9.1"} 1

# HELP epr_in_flight_requests A gauge of requests currently being served by the http server.
# TYPE epr_in_flight_requests gauge
epr_in_flight_requests 0

# HELP epr_http_requests_total A counter for requests to the http server.
# TYPE epr_http_requests_total counter
epr_http_requests_total{code="200",method="get",path="/categories"} 10
epr_http_requests_total{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip"} 30
epr_http_requests_total{code="200",method="get",path="/package/{packageName:[a-z0-9_]+}/{packageVersion}/"} 30
epr_http_requests_total{code="200",method="get",path="/search"} 40
epr_http_requests_total{code="301",method="get",path="/package/{packageName:[a-z0-9_]+}/{packageVersion}/"} 30
epr_http_requests_total{code="400",method="get",path="/package/{packageName:[a-z0-9_]+}/{packageVersion}/"} 1
epr_http_requests_total{code="404",method="get",path="/package/{packageName:[a-z0-9_]+}/{packageVersion}/"} 2


# HELP epr_http_response_size_bytes A histogram of response sizes for requests to the http server.
# TYPE epr_http_response_size_bytes histogram
epr_http_response_size_bytes_bucket{code="200",method="get",path="/search",le="16"} 0
epr_http_response_size_bytes_bucket{code="200",method="get",path="/search",le="32"} 0
epr_http_response_size_bytes_bucket{code="200",method="get",path="/search",le="64"} 0
epr_http_response_size_bytes_bucket{code="200",method="get",path="/search",le="128"} 0
epr_http_response_size_bytes_bucket{code="200",method="get",path="/search",le="256"} 0
epr_http_response_size_bytes_bucket{code="200",method="get",path="/search",le="512"} 0
epr_http_response_size_bytes_bucket{code="200",method="get",path="/search",le="1024"} 21
epr_http_response_size_bytes_bucket{code="200",method="get",path="/search",le="65536"} 31
epr_http_response_size_bytes_bucket{code="200",method="get",path="/search",le="262144"} 63
epr_http_response_size_bytes_bucket{code="200",method="get",path="/search",le="+Inf"} 63
epr_http_response_size_bytes_sum{code="200",method="get",path="/search"} 4.192611e+06
epr_http_response_size_bytes_count{code="200",method="get",path="/search"} 63

# HELP epr_http_request_duration_seconds A histogram of latencies for requests to the http server.
# TYPE epr_http_request_duration_seconds histogram
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/search",le="0.005"} 57
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/search",le="0.01"} 63
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/search",le="0.025"} 63
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/search",le="0.05"} 63
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/search",le="0.1"} 63
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/search",le="0.25"} 63
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/search",le="0.5"} 63
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/search",le="1"} 63
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/search",le="2.5"} 63
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/search",le="5"} 63
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/search",le="10"} 63
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/search",le="+Inf"} 63
epr_http_request_duration_seconds_sum{code="200",method="get",path="/search"} 0.13667412999999992
epr_http_request_duration_seconds_count{code="200",method="get",path="/search"} 63

# HELP epr_http_request_size_bytes A histogram of sizes of requests to the http server.
# TYPE epr_http_request_size_bytes histogram
epr_http_request_size_bytes_bucket{code="200",method="get",path="/search",le="16"} 0
epr_http_request_size_bytes_bucket{code="200",method="get",path="/search",le="32"} 0
epr_http_request_size_bytes_bucket{code="200",method="get",path="/search",le="64"} 31
epr_http_request_size_bytes_bucket{code="200",method="get",path="/search",le="128"} 63
epr_http_request_size_bytes_bucket{code="200",method="get",path="/search",le="256"} 63
epr_http_request_size_bytes_bucket{code="200",method="get",path="/search",le="512"} 63
epr_http_request_size_bytes_bucket{code="200",method="get",path="/search",le="1024"} 63
epr_http_request_size_bytes_bucket{code="200",method="get",path="/search",le="65536"} 63
epr_http_request_size_bytes_bucket{code="200",method="get",path="/search",le="262144"} 63
epr_http_request_size_bytes_bucket{code="200",method="get",path="/search",le="+Inf"} 63
epr_http_request_size_bytes_sum{code="200",method="get",path="/search"} 4348
epr_http_request_size_bytes_count{code="200",method="get",path="/search"} 63

@mrodm mrodm force-pushed the add_metrics_prometheus branch from 31cbca6 to 941d546 on July 1, 2022 21:36
@elasticmachine commented Jul 1, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-07-07T08:05:30.580+0000

  • Duration: 6 min 25 sec

Test stats 🧪

Test Results: Failed 0, Passed 207, Skipped 0, Total 207

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@mrodm mrodm force-pushed the add_metrics_prometheus branch 2 times, most recently from 7d8206b to 840b435 on July 1, 2022 22:15
@jlind23 jlind23 added the Team:Ecosystem Label for the Packages Ecosystem team label Jul 4, 2022
Add metrics for in-flight requests, response size, request size, and
duration for every request, labeled with code, method, and path.
The application uptime can be obtained from the
"process_start_time_seconds" metric.
@mrodm mrodm force-pushed the add_metrics_prometheus branch from 840b435 to 352170c on July 4, 2022 14:02
@mrodm (Contributor, Author) commented Jul 4, 2022

A new middleware has been added to collect these metrics for every request, and they are exposed at /metrics following the Prometheus exposition format (see the sketch after the list below).

These first metrics are:

  • Total requests.
  • In-flight requests.
  • Request size histogram (bytes).
  • Response size histogram (bytes).
  • Request duration histogram (seconds).
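
For illustration, this is roughly how such a middleware can be wired up with the Prometheus client and gorilla/mux. It is a minimal, self-contained sketch, not the code in this PR; the "/search" route and its handler are placeholders:

package main

import (
	"net/http"

	"github.com/gorilla/mux"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	inFlightRequests = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "epr_in_flight_requests",
		Help: "A gauge of requests currently being served by the http server.",
	})
	httpRequestsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "epr_http_requests_total",
		Help: "A counter for requests to the http server.",
	}, []string{"code", "method", "path"})
	httpRequestDurationSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "epr_http_request_duration_seconds",
		Help:    "A histogram of latencies for requests to the http server.",
		Buckets: prometheus.DefBuckets,
	}, []string{"code", "method", "path"})
)

// instrument wraps a handler with the metrics above, currying the route path
// template into the "path" label so promhttp only fills code and method.
func instrument(path string, next http.Handler) http.Handler {
	labels := prometheus.Labels{"path": path}
	handler := promhttp.InstrumentHandlerCounter(httpRequestsTotal.MustCurryWith(labels),
		promhttp.InstrumentHandlerDuration(httpRequestDurationSeconds.MustCurryWith(labels), next))
	return promhttp.InstrumentHandlerInFlight(inFlightRequests, handler)
}

func main() {
	prometheus.MustRegister(inFlightRequests, httpRequestsTotal, httpRequestDurationSeconds)

	router := mux.NewRouter()
	router.Handle("/search", instrument("/search", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("[]")) // placeholder handler
	})))
	router.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", router)
}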

The Prometheus client also offers a set of default metrics. From these, at least the uptime can be derived from the "process_start_time_seconds" metric:

# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.65694381591e+09

All the metrics except "in flight requests" include the labels code, method, and path. That allows us to inspect the values in more detail, e.g. for specific endpoints.

As an example of these:

epr_http_request_duration_seconds_bucket{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip",le="0.005"} 0
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip",le="0.01"} 0
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip",le="0.025"} 0
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip",le="0.05"} 0
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip",le="0.1"} 4
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip",le="0.25"} 19
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip",le="0.5"} 30
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip",le="1"} 30
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip",le="2.5"} 30
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip",le="5"} 30
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip",le="10"} 30
epr_http_request_duration_seconds_bucket{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip",le="+Inf"} 30
epr_http_request_duration_seconds_sum{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip"} 6.300512742
epr_http_request_duration_seconds_count{code="200",method="get",path="/epr/{packageName}/{packageName:[a-z0-9_]+}-{packageVersion}.zip"} 30

@elastic/ecosystem I was wondering if you could take a look at this PR before I add more metrics (I still need to look into metrics related to storage). I wanted to check whether you see any issue with the middleware or with how the metrics are being added.

main.go Outdated
@@ -186,6 +189,7 @@ func initAPMTracer(logger *zap.Logger) *apm.Tracer {
if err != nil {
logger.Fatal("Failed to initialize APM agent", zap.Error(err))
}
tracer.RegisterMetricsGatherer(apmprometheus.Wrap(prometheus.DefaultGatherer))
Contributor Author:

I was able to send the metrics to the Elastic Stack using the APM integration. However, metrics that have labels are stored as plain JSON strings by default, and new dashboards would need to be created (I'm still not sure how to achieve this with the APM integration).

As an alternative, I've added a new endpoint (/metrics) that exposes all the metrics in the Prometheus format. Based on that, we could use Metricbeat to create our own integration.

WDYT? Should we discard the APM integration for the time being in favor of the Prometheus endpoint and a new integration?

Member:

I think we can focus on the Prometheus endpoint for now. It is clearer how to build an integration around that. We can add the APM metrics gatherer later, especially since it is basically a one-liner.

For APM users, some of the request-related information can in any case be obtained in more detail from the APM data.

@mrodm mrodm marked this pull request as ready for review July 4, 2022 14:46
@jsoriano (Member) left a comment:

This is looking good, I added some comments for now.

main.go Outdated
@@ -281,7 +285,9 @@ func getRouter(logger *zap.Logger, config *Config, indexer Indexer) (*mux.Router
router.HandleFunc(signaturesRouterPath, signaturesHandler)
router.HandleFunc(packageIndexRouterPath, packageIndexHandler)
router.HandleFunc(staticRouterPath, staticHandler)
router.Handle("/metrics", promhttp.Handler())
Member:

For production environments we probably want to expose this endpoint on a different host/port, so the metrics are not publicly available. We may need to add a flag for this.

Contributor Author:

That's true, I'll add those flags and run the metrics endpoint in a separate server (e.g. its own http.Server); see the sketch below.

Doing it this way would also let me remove the filtering done in the logging middleware to avoid logging requests to /metrics.
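
A minimal sketch of that idea (the flag name and structure are assumptions, not the final code; the default address follows the README excerpt further below):

package main

import (
	"flag"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.uber.org/zap"
)

func main() {
	metricsAddress := flag.String("metrics-address", "localhost:9000", "Address where Prometheus metrics are exposed.")
	flag.Parse()

	logger, _ := zap.NewProduction()

	// Serve /metrics on its own listener, separate from the public API server.
	go func() {
		mux := http.NewServeMux()
		mux.Handle("/metrics", promhttp.Handler())
		if err := http.ListenAndServe(*metricsAddress, mux); err != nil && err != http.ErrServerClosed {
			logger.Fatal("metrics server failed", zap.Error(err))
		}
	}()

	// ... the main API server would be started here on its own address.
	select {}
}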

util/metrics.go Outdated
}
labels := prometheus.Labels{"path": path}
handler = promhttp.InstrumentHandlerCounter(httpRequestsTotal.MustCurryWith(labels), handler)
handler = promhttp.InstrumentHandlerDuration(httpRequestDurationSeconds.MustCurryWith(labels), handler)
Member:

APM also collects this kind of duration information. Though I think it is ok to have this also as a metric.

util/metrics.go Outdated
512 * 1024, /* 512KiB */
1024 * 1024, /* 1MiB */
64 * 1024 * 1024, /* 64MiB */
512 * 1024 * 1024, /* 512MiB */
Member:

Is there any reasoning behind the selection of these buckets? Maybe we can use the APM data we already collect to select values according to our use case.

Contributor Author:

No reason for those buckets, I set them more or less arbitrarily and was checking how they fit by running some curl requests.

If APM exposes that data so the actual values can be checked, I'll look at it and update this list.

Contributor:

With the v2 changes, we don't expect any responses greater than a few KB. If Fleet wants to download a ZIP file, it will be redirected (HTTP 301) to https://package-storage.elastic.co/.

Contributor Author:

So, should I plan the buckets now with storage v2 in mind? Buckets with lower values would be needed for sure.

Contributor Author:

Regarding the http_request_duration_seconds metric, which uses the default bucket definition (see footnote 1):

var (
	DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
)

I've been checking the data collected by APM, and it looks like they could fit well in our scenarios.

Footnotes

  1. https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#section-readme

mrodm and others added 3 commits July 4, 2022 17:09
@mrodm mrodm force-pushed the add_metrics_prometheus branch from 717bfdd to 5e02260 on July 4, 2022 16:40
@mtojek (Contributor) left a comment:

It would be great if you post a sample /metrics response in the issue description.

main.go Outdated
Comment on lines 137 to 140
// If -dry-run=true is set, service stops here after validation
if dryRun {
os.Exit(0)
}
Contributor:

Hm.. I wouldn't hide this condition in the initMetricsServer function as it will be surprising to debug. Shall we move it to the main routine?

Contributor Author:

This function was based on the initServer one, and I must review the usage of dryRun here.

I think the call to os.Exit(0) in initServer could also be moved to main(). I'll check that too.

Contributor:

Sounds great!

Comment on lines -100 to -106
-	// Do not log requests to the health endpoint
-	if r.RequestURI == "/health" {
+	switch r.RequestURI {
+	case "/health":
+		// Do not log requests to these endpoints
 		next.ServeHTTP(w, r)
 		return
+	default:
+		logRequest(logger, next, w, r)
 	}
-
-	logRequest(logger, next, w, r)
Contributor:

Frankly speaking, I would log all requests. At least we would know that some script/bot/logic hits the healthcheck. WDYT?

Member:

I got tempted by this when migrating to the ECS logger, but it flooded the logs quite a bit. Let's leave it for a separate PR if we do it.

Contributor Author:

For context, I only changed the code here because I wanted to filter out logs from /metrics too. As a result of one of the comments above, /metrics is now served on another host:port (http.ListenAndServe()), so in the end this was not needed.

No strong opinion here. I see that filtering the /health endpoint could be advantageous when inspecting logs later, since it could flood Elastic with too many GET requests to /health (they could also be filtered out in a KQL query). However, it hides the actual requests (in the logs) to that endpoint.

I could remove the filter and log all requests, and check whether it causes problems when we try to diagnose or check logs. And if it does, we could add the filter back.
WDYT?

Contributor Author:

I got tempted by this when migrating to the ECS logger, but it flooded the logs quite a bit. Let's leave it for a separate PR if we do it.

Ok, I'll keep the same logic and let's try that in a separate PR.

@mtojek (Contributor) commented Jul 5, 2022

My fault, I haven't read the whole thread. I can see metrics here.

A few ideas to consider:

  1. Is it possible to collect also EPR version (dimension)? It would be nice to correlate metrics with rollouts.
  2. Do you plan to cover also indexers (GetPackages, number of indexed packages, errors, search time, etc.)?

@jsoriano (Member) commented Jul 5, 2022

Is it possible to collect also EPR version (dimension)? It would be nice to correlate metrics with rollouts.

This kind of "global" metadata uses to be added in Prometheus as labels of a dummy metric to avoid increasing a lot the size of responses. For example in Kubernetes you find "metrics" like the following ones, instead of adding this metadata to all the metrics of a service or a node. But assigning this info to the proper events in Metricbeat is a bit tricky.

kube_service_info{namespace="kube-system",service="kube-dns",cluster_ip="10.96.0.10",external_name="",load_balancer_ip=""} 1
kube_node_info{node="kind-worker2",kernel_version="5.10.25-linuxkit",os_image="Ubuntu 21.04",container_runtime_version="containerd://1.5.2",kubelet_version="v1.16.15",kubeproxy_version="v1.16.15",provider_id="kind://docker/kind/kind-worker2",pod_cidr="10.244.1.0/24",internal_ip="172.20.0.3"} 1

In Istio you can also find some versions this way: https://github.com/elastic/beats/blob/949d7ccd1f866ddbcd73edb8ba0cd7b8af7f918d/x-pack/metricbeat/module/istio/istiod/_meta/testdata/istiod.v1.7.1.plain#L357

The problem with this trick for Metricbeat is that the generic Metricbeat collector groups Prometheus metrics by labels, so it may be difficult to put the version in all documents unless we modify the Metricbeat collectors.
Kubernetes metricsets need Go code exactly for that, like here: https://github.com/elastic/beats/blob/949d7ccd1f866ddbcd73edb8ba0cd7b8af7f918d/metricbeat/module/kubernetes/state_service/state_service.go#L77-L87
But this doesn't seem to be available in the generic collector used in integrations.
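
As an illustration of that pattern applied to this service (a sketch only; the metric and label names follow the epr_service_info sample in the PR description):

package metrics

import "github.com/prometheus/client_golang/prometheus"

// serviceInfo is always set to 1; the interesting data travels in its labels.
var serviceInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "epr_service_info",
	Help: "Version information about this binary",
}, []string{"version", "instance"})

func init() {
	prometheus.MustRegister(serviceInfo)
}

// setServiceInfo records the version and instance labels once at startup.
func setServiceInfo(version, instance string) {
	serviceInfo.With(prometheus.Labels{"version": version, "instance": instance}).Set(1)
}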

@mrodm (Contributor, Author) commented Jul 5, 2022

It would be great if you post a sample /metrics response in the issue description.

Sure! I added some example metrics to the description.

Is it possible to collect also EPR version (dimension)? It would be nice to correlate metrics with rollouts.

This kind of "global" metadata uses to be added in Prometheus as labels of a dummy metric to avoid increasing a lot the size of responses. For example in Kubernetes you find "metrics" like the following ones, instead of adding this metadata to all the metrics of a service or a node. But assigning this info to the proper events in Metricbeat is a bit tricky.

I could try to create a metric with that info, at least to be exposed in /metrics. I also wanted to add a node_name or instance label to the metrics.

2. Do you plan to cover also indexers (GetPackages, number of indexed packages, errors, search time, etc.)?

I wanted to at least cover the number of indexed packages in this PR. I will take a look at the other metrics you mention to check how they can be added.

@mrodm mrodm force-pushed the add_metrics_prometheus branch from 812be9a to 73cca37 on July 5, 2022 09:47
main.go Outdated
Comment on lines 157 to 161
// If -dry-run=true is set, service stops here after validation
if dryRun {
os.Exit(0)
}

Contributor Author:

Moved to main().
It looks safe to execute the following lines:

	router := mustLoadRouter(logger, config, combinedIndexer)
	apmgorilla.Instrument(router, apmgorilla.WithTracer(apmTracer))

	return &http.Server{Addr: address, Handler: router}

and just do the exit call in main.go.
Do you think there could be any issue?

Contributor:

The idea for dry-run is to run the package validation, fail in case of an error, or os.Exit(0) when it's safe. To be 100% sure, you can try running it against a path with packages and see if it doesn't skip validation.

Contributor Author:

Tested in a container, and it seems to work as expected. Here is the output:

root@7f7596816c73:/package-registry# cat config.yml
package_paths:
  - /packages/development
  - /packages/production
  - /packages/staging
  - /packages/snapshot
root@7f7596816c73:/package-registry# ./package-registry --dry-run
{"log.level":"info","@timestamp":"2022-07-05T11:54:55.958Z","log.origin":{"file.name":"package-registry/main.go","file.line":95},"message":"Package registry started","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:54:55.961Z","log.origin":{"file.name":"package-registry/main.go","file.line":247},"message":"Packages paths: /packages/development, /packages/production, /packages/staging, /packages/snapshot","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:54:55.961Z","log.origin":{"file.name":"package-registry/main.go","file.line":249},"message":"Cache time for /search: 10m0s","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:54:55.961Z","log.origin":{"file.name":"package-registry/main.go","file.line":250},"message":"Cache time for /categories: 10m0s","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:54:55.961Z","log.origin":{"file.name":"package-registry/main.go","file.line":251},"message":"Cache time for all others: 10m0s","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:54:55.962Z","log.origin":{"file.name":"packages/packages.go","file.line":191},"message":"Searching packages in /packages/development","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:54:55.966Z","log.origin":{"file.name":"packages/packages.go","file.line":191},"message":"Searching packages in /packages/production","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:55:01.087Z","log.origin":{"file.name":"packages/packages.go","file.line":191},"message":"Searching packages in /packages/staging","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:55:01.309Z","log.origin":{"file.name":"packages/packages.go","file.line":191},"message":"Searching packages in /packages/snapshot","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:55:02.318Z","log.origin":{"file.name":"packages/packages.go","file.line":191},"message":"Searching packages in /packages/development","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:55:02.807Z","log.origin":{"file.name":"packages/packages.go","file.line":191},"message":"Searching packages in /packages/production","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:55:02.828Z","log.origin":{"file.name":"packages/packages.go","file.line":191},"message":"Searching packages in /packages/staging","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:55:02.932Z","log.origin":{"file.name":"packages/packages.go","file.line":191},"message":"Searching packages in /packages/snapshot","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-07-05T11:55:02.932Z","log.origin":{"file.name":"package-registry/main.go","file.line":269},"message":"1584 package manifests loaded","ecs.version":"1.6.0"}
root@7f7596816c73:/package-registry#

Member:

I find it confusing to now need so many if dryRun { return } checks.

Could we refactor this so it is more explicit what is and isn't run when dryRun is enabled?

If the idea is to run only the package validation, maybe we can move the initial package loading to its own function, and in main(), right after initializing the logger, just do:

if dryRun {
    initIndexers(logger)
    os.Exit(0)
}

(initIndexers would contain what initServer does before the current if dryRun; initServer would also call initIndexers.)

@mtojek (Contributor) commented Jul 5, 2022

I could try to create a metric with that info, at least to be exposed in /metrics. I also wanted to add a node_name or instance label to the metrics.

My intention was to fix the following use case: during the rollout of a new EPR deployment, we can easily determine whether errors are coming from the freshly deployed app.

@mrodm mrodm requested review from jsoriano and mtojek July 5, 2022 17:29
@mtojek (Contributor) left a comment:

I left a few minor comments, but in general, it's close to getting merged. We can add more metrics in follow-ups.

One thing that is missing is a CHANGELOG entry.

main.go Outdated
@@ -143,6 +152,11 @@ func initMetricsServer(logger *zap.Logger) {
return
}
logger.Info("Starting http metrics in " + metricsAddress)
hostname, found := os.LookupEnv("HOSTNAME")
Contributor:

I think that there is a specific function called os.Hostname.

Contributor Author:

Thanks! Updated to use that function.

util/metrics.go Outdated
@@ -104,6 +110,8 @@ var (
// MetricsMiddleware is a middleware used to measure every request received
func MetricsMiddleware() mux.MiddlewareFunc {
// Register all metrics
prometheus.MustRegister(ServiceInfo)
Contributor:

Should ServiceInfo be exported, or can it be private?

util/metrics.go Outdated
@@ -0,0 +1,140 @@
// Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
Contributor:

I wouldn't put more stuff in util as it will become a trash can eventually. What about another package called metrics?

Contributor Author:

Sure! I will put this code in a new metrics package.
While working on the last changes I was thinking that a name like utils.ServiceInfo was not very meaningful, so it is better in its own package.

main.go Outdated

hostname, found := os.LookupEnv("HOSTNAME")
if !found {
hostname = defaultInstanceName
Member:

You can also try with os.Hostname().

Contributor Author:

Sure, changed

Member:

I meant that you could use both methods 🙂 the HOSTNAME environment variable, and os.Hostname().
Consider creating a function that obtains the hostname from one of these two methods, with localhost as the final fallback; a sketch follows.
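
A sketch of such a helper (the function name is just an assumption):

package main

import "os"

// getHostname tries the HOSTNAME environment variable first, then
// os.Hostname(), and finally falls back to "localhost".
func getHostname() string {
	if hostname, found := os.LookupEnv("HOSTNAME"); found && hostname != "" {
		return hostname
	}
	if hostname, err := os.Hostname(); err == nil && hostname != "" {
		return hostname
	}
	return "localhost"
}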

search.go Outdated
packages, err := indexer.Get(r.Context(), &opts)
if err != nil {
notFoundError(w, errors.Wrapf(err, "fetching package failed"))
return
}
util.SearchProcessDurationSeconds.Observe(time.Since(start).Seconds())
Member:

We are overlapping a bit with APM here. APM is intended for this kind of monitoring, and provides more contextual information, such as request data or information about parent and child spans.
Without some context this metric doesn't say much, but if we start adding context to it, such as the parameters used, we can end up with a metrics explosion.
I would remove it for now since this is covered by APM, and maybe we can reconsider it in the future.

util/metrics.go Outdated

const metricsNamespace = "epr"

// info metric
Member:

Use godoc style in this kind of comment, starting with the name of what is being described.

Suggested change
// info metric
// ServiceInfo is a metric used to report information about the service. Labels are used for that, its value is not relevant.

Contributor Author (@mrodm, Jul 6, 2022):

I added those comments just for some kind of grouping. I'll delete them; since every metric variable already has a help message, I don't think an extra comment is needed.

main.go Outdated
if !found {
hostname = defaultInstanceName
}
util.ServiceInfo.WithLabelValues(version, hostname).Set(1)
Member:

Nit. I don't find WithLabelValues very intuitive, I had to double-check where the label names are defined (I see it is in the definition of ServiceInfo).
Maybe we could use With with explicit label names.

Suggested change
util.ServiceInfo.WithLabelValues(version, hostname).Set(1)
util.ServiceInfo.With(prometheus.Labels{"version": version, "hostname": hostname}).Set(1)

Or given the particularity of this "metric", consider adding a method for this, so here we can do something like:

Suggested change
util.ServiceInfo.WithLabelValues(version, hostname).Set(1)
util.SetServiceInfo(version, hostname)

Contributor Author:

Agreed, it's not very intuitive.
As new metrics using labels have been added (e.g. StorageRequestsTotal), I would keep the first approach you proposed (using With()), and set/increment all the metrics in the same way.

@@ -160,6 +160,7 @@ func (i *Indexer) updateIndex(ctx context.Context) error {
defer i.m.Unlock()
i.cursor = storageCursor.Current
Member:

If this is a string, we could collect it with something like the service info.

Something else that we could measure here is the count of updates; this may be an interesting metric to see how frequently we are requesting an update. A sketch of such counters follows.

(Btw, we could instrument the storage indexer more with APM, but let's leave that for other PRs.)
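
For illustration, counters for the index updates could look like this (a sketch using the metric names from the PR description; the help texts are assumptions):

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	storageIndexerUpdateIndexSuccessTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "epr_storage_indexer_updates_index_success_total",
		Help: "Counter of successful updates of the storage index.",
	})
	storageIndexerUpdateIndexErrorTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "epr_storage_indexer_updates_index_error_total",
		Help: "Counter of failed updates of the storage index.",
	})
)

func init() {
	prometheus.MustRegister(storageIndexerUpdateIndexSuccessTotal, storageIndexerUpdateIndexErrorTotal)
}

// Inside updateIndex, one of the counters would be incremented depending on
// whether the update succeeded or returned an error.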

* `/epr/{name}/{name}-{version}.tar.gz`: Download a package
* `/epr/{name}/{name}-{version}.zip`: Download a package
Contributor Author (@mrodm, Jul 6, 2022):

According to the artifact handler, only the zip extension is allowed in the regex.

Contributor:

Yes, good catch :)

mrodm added 4 commits July 6, 2022 12:43
Removed epr_search_process_duration_seconds metric in favor of creating
specific metrics for storage indexer get and update functions.
@mrodm mrodm requested review from mtojek and jsoriano July 6, 2022 12:39
README.md Outdated
@@ -236,6 +236,19 @@ It will be listening in the given address.

You can read more about this profiler and the available endpoints in the [pprof documentation](https://pkg.go.dev/net/http/pprof).

## Metrics
Package registry is instrumented to expose Prometheus metrics. These metrics are exposed under the `/metrics` endpoint.
By default this endpoint listens on `localhost:9000`. These metrics can be scraped like:
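
For example, an illustrative scrape against the default address (the exact snippet from the README is not shown in this diff excerpt) would be:

curl -s http://localhost:9000/metrics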
Member:

So far APM and pprof are disabled by default, should we do the same with metrics? This would also help to avoid backwards compatibility issues.

Contributor Author:

Ok, I'll keep the same approach to avoid any possible backwards compatibility issue.

Comment on lines 38 to 39
path, _ := route.GetPathTemplate()
labels := prometheus.Labels{"path": path}
Member:

Should we handle the error of route.GetPathTemplate(), and set the path label only when available?

Suggested change
path, _ := route.GetPathTemplate()
labels := prometheus.Labels{"path": path}
path, err := route.GetPathTemplate()
if err == nil {
labels := prometheus.Labels{"path": path}
}

Contributor Author:

Good catch, I missed handling this error.

As the prometheus.Labels value must include all the labels defined in the metrics (path should have some value), I think I'll go with this:

			path, err := route.GetPathTemplate()
			if err != nil {
				path = "unknown"
			}
			labels := prometheus.Labels{"path": path}

When I try to query URLs that are not defined, like https://localhost:8080/notfound, the router.NotFoundHandler is executed and there is no metric about that in the metrics endpoint. That's why I think that error was never raised in my tests.

@mrodm mrodm force-pushed the add_metrics_prometheus branch from b560dc6 to 8492d33 on July 6, 2022 18:01
@mtojek (Contributor) left a comment:

It seems to be complete, all my comments are addressed. Well done!

Once you get a "green light" from Jaime, feel free to merge this PR.

@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

* Update Go version and base Ubuntu image. [#821](https://github.com/elastic/package-registry/pull/821)
* Add support for "threat_intel" category. [#841](https://github.com/elastic/package-registry/pull/841)
* Instrument package registry with Prometheus metrics. [#827](https://github.com/elastic/package-registry/pull/827)
Contributor Author (@mrodm, Jul 6, 2022):

Should I mark this as experimental here and in the README? I was thinking of the case where some metrics (e.g. names or buckets) need to be updated later.

Contributor:

Nah, I would say that we can go without special labels like experimental or beta. It's just yet another feature. If it starts failing, we will bugfix it.

@jsoriano (Member) left a comment:

👍

@mrodm mrodm merged commit 5090808 into elastic:main Jul 7, 2022
Labels: Team:Ecosystem (Label for the Packages Ecosystem team)
5 participants