High cardinality labels #91
Comments
A thought: An alternative to high cardinality labels could be to introduce a complement to labels called metadata, that is allowed to be high cardinality. Having two different things could allow us to impose other restrictions on the metadata type of values. If grep is quick enough, maybe it would work to not index the metadata key/value pairs. But still allow them to be filtered (at "grep speed"). That would allow you to give these high-cardinality key-value pairs blessed UI, for easy filtering, whilst still avoiding the cost of indexing high cardinality fields. |
The metadata we index for each stream has the same restrictions as it does for Prometheus labels: I don't want to say never, but I'm not sure if this is something we're ever likely to support.
Does the support for regexp filtering allow you to filter by a given user-id or request-id and achieve the same end as a second set of non-indexed labels? I'd agree it's cumbersome, so perhaps adding some jq style language here to make this more natural for json logs would be better? |
@tomwilkie It's all about speed, really. For our current ELK-stack — which we would like to replace with Loki — here are some common tasks:
Could these be supported by regexp filtering, grep style? Perhaps, but it would depend on how quick that filtering/lookup would be. Some stats:
Since our log volumes are so small, maybe it'll be easy to grep through? Regarding your "jq style language" suggestion, I think that's a great idea! Even better would be UI support for key-value filtering on keys in the top-level of that json document. Usually people will have the log message as one key (for example Logging JSON lines seems to be common enough that it's worth building some UI/support around it:
I've copied parts of this comment into the top-level issue, for clarity. Still kept here too, for context/history. |
Would this make sense in regard to storing/searching traces and spans, too? I think I saw that Grafana had recently announced the "LGTM" stack, where T stands for Trace/Tracing. My impression from the announcement was that you may be going to use Loki with optional indices on, say, "trace id" and "span id", to make traces stored in Loki searchable, so that it can be an alternative datastore for Zipkin or Jaeger. I don't have a strong opinion on whether this should be in Loki or not. I just wanted to discuss it and get an idea of where we should head in relation to the LGTM stack 😃 |
Never mind my previous comment. I think we don't need to force Loki to blur its scope and break its project goals here. We already have other options like Elasticsearch (obviously), Badger, RocksDB, BoltDB, and so on when it comes to trace/span storage, as seen in Jaeger jaegertracing/jaeger#760, for example. Regarding distributed logging, nothing prevents us from implementing our own solution with promtail + badger for example, or using Loki in combination with another distributed logging solution with richer indexing. I'd just use Loki as a lightweight, short-term, cluster-local distributed logging solution. I use Prometheus for a similar purpose for metrics. Loki remains super useful even without high cardinality labels support or secondary indices. Just my two cents 😃 |
I've made a suggestion in point (2) here with a simple way of dealing with high cardinality values. In short: keep the high-cardinality fields out of the label set, but parse them as pseudo-labels which are regenerated on demand when reading the logs. |
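To make the "pseudo-labels regenerated on demand" idea concrete, here is a minimal sketch in Python. It assumes log lines are JSON objects and that the stream has already been narrowed down by its real (indexed) labels; the function names `pseudo_labels` and `query` are hypothetical and are not part of Loki.

```python
import json

def pseudo_labels(line):
    """Return the top-level scalar JSON fields of a log line as pseudo-labels.

    Lines that are not JSON objects yield no pseudo-labels.
    """
    try:
        record = json.loads(line)
    except ValueError:
        return {}
    if not isinstance(record, dict):
        return {}
    return {k: str(v) for k, v in record.items()
            if isinstance(v, (str, int, float, bool))}

def query(stream_lines, **wanted):
    """Yield lines whose pseudo-labels match every key/value in `wanted`.

    Nothing is indexed: the pseudo-labels are re-derived for each line at read time.
    """
    for line in stream_lines:
        labels = pseudo_labels(line)
        if all(labels.get(k) == str(v) for k, v in wanted.items()):
            yield line

lines = [
    '{"msg": "login ok", "user_id": 42, "request_id": "abc"}',
    '{"msg": "login failed", "user_id": 7, "request_id": "def"}',
    'plain text line',
]
matches = list(query(lines, user_id=42))
```

The cost is a linear scan over the selected stream, which is the trade-off discussed throughout this thread.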
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. |
Stale bot: I don't think it makes sense to close this issue. Although I understand that high cardinality labels may not be on the immediate roadmap, it's a common feature of log aggregation systems and by keeping it open there will at least be a place where this + workarounds can be discussed. |
This approach is very similar to influxdb, which has tags and fields. Influxdb tags are like loki labels: each set of unique tags defines a time series. Fields are just stored values. Even linear searching would be "better than grep" speed, since if the query contains any labels, loki would already filter down to the relevant chunks.
Idea 1
Store arbitrary metadata alongside the line. In order not to complicate the query language, I suggest that these stored-but-unindexed metadata fields look like labels with a special format: e.g. they start with a special symbol (e.g.
In terms of the API, submitting records could be done the same way. Unfortunately, it would defeat the grouping of records with same label set, making the API more verbose and making it harder for loki to group together records belonging to the same chunk. Example:
So in the API it might be better to include the metadata inline in each record:
In the grafana UI you could have one switch for displaying labels, and one for displaying metadata.
Wild idea 2
The same benefits as the above could be achieved if the "line" were itself a JSON record: stored in native JSON, and parsed and queried dynamically like You can do this today, if you just write serialized JSON into loki as a string. Note: this sits very well with the ELK/Beats way of doing things.
The only thing you can't do today is the server-side filtering of queries. Therefore, you'd need to extend the query language to allow querying on fields of the JSON "line", in the main query part outside the label set. I think that using As an incremental improvement, you could then extend the API to natively accept a JSON object, handling the serialization internally:
On reading records back out via the API, it could be optional whether loki returns you the raw line inside a string, or (for valid JSON) gives you the object. There are certainly cases where you'd want to retain the string version, e.g. for logcli. Storing a 1-bit flag against each line, saying whether it's valid JSON or not, could be a useful optimisation. The big benefits of this approach are (a) no changes to underlying data model; (b) you gain much more power dealing with log data which is already natively JSON; (c) you gain the ability to filter on numeric fields, with numeric comparison operators like Aside: for querying IP address ranges, regexps aren't great. One option would be to enhance the query language with IP-address specific operators. Another one is to convert IP addresses to fixed-width hexadecimal (e.g. 10.0.0.1 = "0a000001") before storing them in metadata fields; prefix operations can be converted to regexp mechanically. e.g. |
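The fixed-width hexadecimal trick for IP addresses can be sketched as follows. This is an illustration only: the helper names `ip_to_hex` and `cidr_to_regex` are hypothetical, and the sketch only handles IPv4 prefixes whose length is a multiple of 4 bits, since those map onto whole hex digits.

```python
import ipaddress
import re

def ip_to_hex(ip):
    """Encode an IPv4 address as 8 lowercase hex digits, e.g. 10.0.0.1 -> 0a000001."""
    return format(int(ipaddress.IPv4Address(ip)), "08x")

def cidr_to_regex(cidr):
    """Build a regex that matches hex-encoded addresses inside an IPv4 network.

    Only prefix lengths that are multiples of 4 are accepted in this sketch.
    """
    net = ipaddress.IPv4Network(cidr)
    if net.prefixlen % 4 != 0:
        raise ValueError("prefix length must be a multiple of 4 in this sketch")
    digits = net.prefixlen // 4
    prefix = ip_to_hex(str(net.network_address))[:digits]
    return "^" + prefix + "[0-9a-f]{" + str(8 - digits) + "}$"

pattern = re.compile(cidr_to_regex("10.0.0.0/8"))
inside = bool(pattern.match(ip_to_hex("10.1.2.3")))
outside = bool(pattern.match(ip_to_hex("11.0.0.1")))
```

Prefix lengths that do not fall on a hex-digit boundary would need a character class for the partial digit, which is left out here.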
I like both ideas, especially the second one. For the query language I think JMESPath would make more sense than the jq language, as it is already supported for log processing in scrape config: I thought about extracting some info there and setting it as a label. What would actually happen when creating high cardinality labels that way? "Just" a very big index, or would I run into real problems? |
You could end up with Loki trying to create, for example, 4 billion different timeseries, each containing one log line. Since a certain amount of metadata is required for each timeseries, and (I believe) each timeseries has its own chunks, it is likely to explode. Also, trying to read back your logs will be incredibly inefficient, since it will have to merge those 4 billion timeseries back together for almost every query. |
@kiney: I hadn't come across JMESPath before, and I agree it would make sense to use the engine which is already being used. We need something that can be used as a predicate to filter a stream of JSON records, and ideally with a CLI for testing. It looks like JMESPath's jp is the JMESPath equivalent to jq. (NOTE: this is different to the JSON Plotter jp which is what "brew install jp" gives you) jp doesn't seem to be able to filter a stream of JSON objects like jq does. It just takes the first JSON item from stdin and ignores everything else. As a workaround, loki could read batches of lines into JSON lists
It needs to work in batches to avoid reading the whole stream into RAM: therefore I think the most common use case for logs is to filter them and return the entire record for each one selected. It would be interesting to make a list of the most common predicates and see how they are written in both jq and jp. e.g.
(*) Note: regular expressions are supported by jq but not, as far as I can tell, by JMESPath. Just taking the last example:
Compare:
Note the non-obvious requirement for backticks around literals. If you omit them, the expression fails:
String literals can either be written in single quotes, or wrapped in backticks and double-quotes. I find it pretty ugly, but then again, so are complex jq queries. Aside: If you decide to give jq a set of records as a JSON list like we just did with jp, then it's a bit more awkward:
But you probably wouldn't do that anyway with jq. |
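The batching workaround described in this comment (read a bounded number of lines, parse them, apply a predicate, return whole records) can be sketched in Python. This is not how Loki or jp work internally; `batches` and `filter_stream` are hypothetical names, and the predicate is a plain Python callable standing in for a jq or JMESPath expression.

```python
import json
from itertools import islice

def batches(items, size):
    """Yield successive lists of at most `size` items, without loading the whole input."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def filter_stream(lines, predicate, batch_size=1000):
    """Parse lines batch by batch and yield each JSON object the predicate selects."""
    for batch in batches(lines, batch_size):
        for line in batch:
            try:
                record = json.loads(line)
            except ValueError:
                continue  # skip lines that are not valid JSON
            if isinstance(record, dict) and predicate(record):
                yield record

lines = ['{"status": 200}', 'not json', '{"status": 503}', '{"status": 500}']
errors = list(filter_stream(lines, lambda r: r.get("status", 0) >= 500, batch_size=2))
```

Memory use is bounded by `batch_size` rather than by the length of the stream.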
There's another way jq could be used: just for predicate testing.
If loki sees the value 'false' it could drop the record, and if it sees the value 'true' it passes the original record through unchanged. Any other value would be passed through. This would allow the most common queries to be simplified. |
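A small sketch of these predicate semantics, under one stated assumption: "any other value would be passed through" is read here as "the expression's own result is emitted in place of the record". The names `DROP`, `apply_expression` and `run` are hypothetical, and a Python callable stands in for the jq expression.

```python
DROP = object()  # sentinel, so that a legitimate None (JSON null) result is not mistaken for a drop

def apply_expression(record, expression):
    """Apply the expression and interpret its result as described above."""
    result = expression(record)
    if result is False:
        return DROP      # predicate said no: drop the record
    if result is True:
        return record    # predicate said yes: original record, unchanged
    return result        # anything else: emit the computed value

def run(records, expression):
    """Yield the outputs for every record that is not dropped."""
    for record in records:
        out = apply_expression(record, expression)
        if out is not DROP:
            yield out

records = [{"status": 200, "path": "/"}, {"status": 503, "path": "/api"}]
failed = list(run(records, lambda r: r["status"] >= 500))
paths = list(run(records, lambda r: r["path"]))
```

With this convention a plain comparison acts as a filter, while a projection still behaves like an ordinary transformation.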
@candlerb I also consider jq a slightly more pleasant language but JMESPath is already used in Loki and also the standard in other cloud tooling (aws cli, azure cli, ansible...). |
Updates
EDIT: copied to top comment |
This also affects handling of OpenTelemetry attributes and TraceContext especially if the set of attributes is not trivial. If you have thousands of service instances in your Loki and you want to do queries like "give me all lines that have a certain action attribute set and failed" you cannot restrict that query on instances. Restricting it by time is also an issue because you might miss some interesting events. Not having it is an issue in detecting root causes of sporadic defects. |
Are there any further updates on this issue? I see some solutions being built for But is there a plan for supporting user-defined high-cardinality labels over a large time span? |
I think you can do what you need using the LogQL json filter. This extracts "labels" from the JSON body, but they are not logstream labels, so they don't affect cardinality. These are not indexed, so it will perform a linear search across the whole timespan of interest. If you scale up Loki then it might be able to parallelize this across chunks - I'm not sure. But if you want fully indexed searching, then I believe Loki isn't for you. You can either:
|
For others interested in this, this is the most recent "statement" I've seen from the Loki team: |
Yeah, But are we considering some middle ground
I wanted to gauge what we would be willing to extend in Loki in this case to support this (without going against its core design principles) |
Loki can implement support for high-cardinality labels in a way similar to VictoriaLogs - it just doesn't associate these labels with log streams. Instead, it stores them inside the stream - see storage architecture details. This allows resolving the following issues:
The high-cardinality labels can then be queried with |
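The distinction between stream labels and fields stored inside the stream can be illustrated with a toy in-memory store. This is a sketch of the general idea only, not the VictoriaLogs or Loki storage format; the `Store` class and its methods are hypothetical.

```python
from collections import defaultdict

class Store:
    """Toy log store: streams are keyed by low-cardinality labels only."""

    def __init__(self):
        self.streams = defaultdict(list)

    def push(self, stream_labels, timestamp, line, fields=None):
        """Append an entry; high-cardinality `fields` travel with the entry, not the stream key."""
        key = tuple(sorted(stream_labels.items()))
        self.streams[key].append((timestamp, line, dict(fields or {})))

    def select(self, stream_labels, **fields):
        """Pick streams by label, then scan their entries for matching fields."""
        wanted = set(stream_labels.items())
        for key, entries in self.streams.items():
            if not wanted <= set(key):
                continue
            for timestamp, line, entry_fields in entries:
                if all(entry_fields.get(k) == v for k, v in fields.items()):
                    yield timestamp, line

store = Store()
store.push({"app": "api"}, 1, "first", {"request_id": "r1"})
store.push({"app": "api"}, 2, "second", {"request_id": "r2"})
store.push({"app": "api"}, 3, "third", {"request_id": "r3"})
hits = list(store.select({"app": "api"}, request_id="r2"))
```

Three distinct request ids still produce a single stream, so the number of streams does not grow with the cardinality of the fields.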
Time flies! Issue number 91 might finally have a nice solution! 🎉
@tomwilkie I'm happy it wasn't never 😄 Six years later, there is a solution on the horizon:
(not GA just yet though) |
Hi @sandstrom, please inform us and your support organization, once this is GA. Kind regards and thanks for the great contribution |
A quick update for people following this. Loki 3.0 has shipped with experimental bloom filters, which address exactly this issue. This is awesome news! 🎉 https://grafana.com/blog/2024/04/09/grafana-loki-3.0-release-all-the-new-features/ I'll keep this open until it's no longer experimental, which is probably in 3.1 I'd guess, unless a team member of Loki wants to close this out already, since it has shipped (albeit in an experimental state) -- I'm fine with either. |
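For readers unfamiliar with the technique, here is a generic sketch of how a per-chunk bloom filter can let a "needle in a haystack" query skip chunks that cannot contain the needle. It is not Loki's implementation, whose design may differ substantially; the class and function names, the whitespace tokenization, and the filter sizes are all assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def build_chunk(lines):
    """Pair a list of lines with a bloom filter over their whitespace-separated tokens."""
    bloom = BloomFilter()
    for line in lines:
        for token in line.split():
            bloom.add(token)
    return bloom, lines

def search(chunks, needle):
    """Scan only the chunks whose filter says the token might be present."""
    for bloom, lines in chunks:
        if not bloom.might_contain(needle):
            continue  # definitely absent: skip the whole chunk
        for line in lines:
            if needle in line.split():
                yield line

chunks = [build_chunk(["req=aaa ok"]), build_chunk(["req=bbb fail"])]
found = list(search(chunks, "req=bbb"))
```

A false positive only costs an unnecessary chunk scan; the final per-line check keeps the results exact.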
Sounds good to me 😄 |
For many logging scenarios it's useful to look up/query the log data on high cardinality labels such as request-ids, user-ids, ip-addresses, request-url and so on.
While I realize that high cardinality will affect indices, I think it's worth discussing whether this is something that Loki can support in the future.
There are a lot of logging use-cases where people can easily ditch full-text search that ELK and others provide. But not having easy lookup on log metadata, such as user-id or request-id, is a problem.
I should note that this doesn't necessarily need to be "make labels high cardinality", it could also be the introduction of some type of log-line metadata or properties, that are allowed to be high cardinality and can be queried for.
Example
Our services (nginx, application servers, etc) emit JSON lines[1] like this:
We ingest these into our log storage (which we would like to replace with Loki), and here are some common types of tasks that we currently do:
Bring up logs for a particular user. Usually to troubleshoot some bug they are experiencing. Mostly we know the rough timeframe (for example that it occurred during the past 2 weeks). Such a filter will usually bring up 5-200 entries. If there are more than a few entries we'll usually filter a bit more, on a stricter time interval or based on other properties (type of request, etc).
Find the logs for a particular request, based on its request id. Again, we'd usually know the rough timeframe, say +/- a few days.
Looking at all requests that hit a particular endpoint, basically filtering on 2-3 log entry properties.
All of these, which I guess are fairly common for a logging system, require high cardinality labels (or some type of metadata/properties that are high cardinality and can be queried).
[1] http://jsonlines.org/
Updates
A few updates, since this issue was originally opened.
This design proposal doesn't directly address high-cardinality labels, but solves some of the underlying problems.
With LogQL you can grep over large amounts of data fairly easily, and find those needles. It assumes that the queries can run fast enough on large log corpora.
There is some discussion in this thread on making it easier to find unique IDs, such as trace-id and request-id.