
thanos-query segv panic on labels regexp #7676

Closed
gberche-orange opened this issue Aug 29, 2024 · 9 comments · Fixed by #7903

@gberche-orange

Thanos, Prometheus and Golang version used:

  • image: docker.io/bitnami/thanos:0.36.1-debian-12-r0
  • image: quay.io/prometheus/prometheus:v2.54.0

Object Storage Provider:
Scality

What happened:

thanos-query panicked with a SIGSEGV, apparently in labels regexp matching according to the stack trace below.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

I am not yet able to identify the query that triggers this behavior.

Full logs to relevant components:

Logs

panic: runtime error: invalid memory address or nil pointer dereference                                                                                                                                            
[signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x941726]                                                                                                                                            
                                                                                                                                                                                                                   
goroutine 10921 [running]:                                                                                                                                                                                         
github.com/prometheus/prometheus/model/labels.(*FastRegexMatcher).MatchString(...)                                                                                                                                 
        /bitnami/blacksmith-sandox/thanos-0.36.1/pkg/mod/github.com/prometheus/prometheus@v0.52.2-0.20240614130246-4c1e71fa0b3d/model/labels/regexp.go:306
github.com/prometheus/prometheus/model/labels.(*Matcher).Matches(0x25b92a0?, {0x7ffc0c37c45c?, 0xc00114e6c8?})                         
        /bitnami/blacksmith-sandox/thanos-0.36.1/pkg/mod/github.com/prometheus/prometheus@v0.52.2-0.20240614130246-4c1e71fa0b3d/model/labels/matcher.go:115 +0xa6
github.com/thanos-io/thanos/pkg/store.LabelSetsMatch({0xc000f26118, 0x1, 0x407d86?}, {0xc000ae1748?, 0x1, 0x7f4cabf9e3e8?})            
        /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/store/proxy.go:668 +0x16d                         
github.com/thanos-io/thanos/pkg/store.storeMatchDebugMetadata({0x38f2670?, 0xc000d8cb00?}, {0xc0017cd560?, 0x2, 0x2})                  
        /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/store/proxy.go:651 +0x17a                         
github.com/thanos-io/thanos/pkg/store.storeMatches({0x38db378?, 0xc0017cd860?}, {0x38f2670, 0xc000d8cb00}, 0x1919ea62bc0, 0x1919edd1a40, {0xc000f26150, 0x1, 0x1})
        /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/store/proxy.go:627 +0x1fa                         
github.com/thanos-io/thanos/pkg/store.(*ProxyStore).matchingStores(0xc0009f7cc0, {0x38db378, 0xc0017cd860}, 0x2?, 0x0?, {0xc000f26150, 0x1, 0x1})
        /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/store/proxy.go:595 +0x16e                         
github.com/thanos-io/thanos/pkg/store.(*ProxyStore).Series(0xc0009f7cc0, 0xc000bc4960, {0x38e7ef0, 0xc0016a4720})                      
        /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/store/proxy.go:320 +0x5f8                       
github.com/thanos-io/thanos/pkg/query.(*querier).selectFn(0xc000bc48c0, {0x38db378, 0xc0017cd740}, 0xc000a52380, {0xc000f260f8?, 0x0?, 0x0?})
        /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/query/querier.go:346 +0x328                       
github.com/thanos-io/thanos/pkg/query.(*querier).Select.func1()                                                                                                                                                    
        /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/query/querier.go:293 +0x21d                       
created by github.com/thanos-io/thanos/pkg/query.(*querier).Select in goroutine 10108                                                                                                                              
        /bitnami/blacksmith-sandox/thanos-0.36.1/src/github.com/thanos-io/thanos/pkg/query/querier.go:277 +0x5a9      

Anything else we need to know:


dosubot bot commented Aug 29, 2024

Hey there, @gberche-orange! I'm here to help you with any bugs, questions, or contributions you might have. Let's see if we can tackle this issue together!

I found a similar closed issue, receive: Query failure on Seg fault, last updated on January 27, 2023, which was referenced by several other discussions and pull requests.

The suggested solution from that issue's discussion points to a possible culprit related to gogo/protobuf#651; the recommended fix is to move away from the unmaintained gogo/protobuf library [1].

Additionally, the LabelSetsMatch function in proxy.go is involved in the stack trace you provided. Here is its implementation:

// LabelSetsMatch returns false if all label-set do not match the matchers (aka: OR is between all label-sets).
func LabelSetsMatch(matchers []*labels.Matcher, lset ...labels.Labels) bool {
	if len(lset) == 0 {
		return true
	}

	for _, ls := range lset {
		notMatched := false
		for _, m := range matchers {
			if lv := ls.Get(m.Name); ls.Has(m.Name) && !m.Matches(lv) {
				notMatched = true
				break
			}
		}
		if !notMatched {
			return true
		}
	}
	return false
}

This function checks if any of the provided label sets match the given matchers [2].
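To make the OR semantics concrete, here is a minimal, hypothetical usage sketch (the label values and store addresses are made up for illustration and are not from this issue):

package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/thanos-io/thanos/pkg/store"
)

func main() {
	// A regexp matcher built via labels.MustNewMatcher, which compiles the
	// pattern into a FastRegexMatcher up front.
	m := labels.MustNewMatcher(labels.MatchRegexp, "__address__", "shared-services-thanos-sidecar.internal.*")

	// Hypothetical external label sets advertised by two stores.
	sidecar := labels.FromStrings("__address__", "shared-services-thanos-sidecar.internal:10901")
	other := labels.FromStrings("__address__", "some-other-store:10901")

	fmt.Println(store.LabelSetsMatch([]*labels.Matcher{m}, sidecar)) // true: this label set matches
	fmt.Println(store.LabelSetsMatch([]*labels.Matcher{m}, other))   // false: no label set matches
}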

To address the panic, you might want to investigate the FastRegexMatcher and Matcher implementations in the Prometheus labels package, as the issue seems to be related to a nil pointer dereference in the FastRegexMatcher.
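As an illustration of that failure mode (a sketch, not the actual Thanos code path), a labels.Matcher built as a bare struct literal never compiles its pattern, so its internal FastRegexMatcher stays nil and Matches panics with the same kind of nil pointer dereference shown in the stack trace:

package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

func main() {
	// Constructor path: the regexp is compiled up front, so Matches works.
	good, err := labels.NewMatcher(labels.MatchRegexp, "__address__", "sidecar.*")
	if err != nil {
		panic(err)
	}
	fmt.Println(good.Matches("sidecar-1:10901")) // true

	// Struct-literal path: the unexported regexp field is never populated,
	// so Matches dereferences a nil *FastRegexMatcher and panics.
	bad := &labels.Matcher{Type: labels.MatchRegexp, Name: "__address__", Value: "sidecar.*"}
	fmt.Println(bad.Matches("sidecar-1:10901")) // panic: nil pointer dereference
}

If a regexp matcher ever reached LabelSetsMatch without going through that compilation step, this is the panic that would result.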


@GiedriusS
Member

Do you have any components that talk to the Query component directly through the StoreAPI?

@MichaHoffmann
Contributor

There was a bug in the distributed query engine where we didn't propagate matchers correctly. I think that should be fixed in this version, but just to make sure: is this the distributed engine? If not, does it happen with the Prometheus engine too?

@gberche-orange
Author

Thanks for your responses

Do you have any components that talk to the Query component directly through the StoreAPI?

No, our thanos instance is only queried by Grafana through the query front-end AFAIK

Is this the distributed engine? If not does it happen for the Prometheus engine too?

No. We don't use the distributed engine (the --query.engine=distributed parameter is not set on the query pod), since we were having issues with it as mentioned in #7328.

@chris-barbour-as

Possible that this is related to the problem I'm experiencing in #7844?

Does reverting to Thanos 0.35.1 make the problem go away?

@gberche-orange
Author

Possible that this is related to the problem I'm experiencing in #7844?

Does reverting to Thanos 0.35.1 make the problem go away?

Thanks @chris-barbour-as for the heads up! Yes, reverting to Thanos 0.35.1 (through bitnami helm chart version https://artifacthub.io/packages/helm/bitnami/thanos/15.7.15) resolved the issue for me.

@yeya24
Contributor

yeya24 commented Nov 12, 2024

@gberche-orange I am looking at the code and it seems related to the store matchers.
Can you share a query that can be used to reproduce this issue? Did you use any store matchers in the UI?

@gberche-orange
Author

Thanks @yeya24 for looking into this issue.

Yes, we're using store matchers in the datasource configuration.

Here is a sample storeMatch that we configure in Grafana (sent with the POST method when querying Thanos):
storeMatch[]={__address__=~"shared-services-thanos-sidecar.internal.*"}
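For illustration, such a request can also be reproduced outside Grafana; a minimal Go sketch, assuming a hypothetical thanos-query address, would look roughly like this:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	form := url.Values{}
	form.Set("query", "up")
	// The same storeMatch[] selector the Grafana datasource is configured with.
	form.Add("storeMatch[]", `{__address__=~"shared-services-thanos-sidecar.internal.*"}`)

	// Hypothetical thanos-query endpoint; replace with your own address.
	resp, err := http.PostForm("http://thanos-query:9090/api/v1/query", form)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}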

One of the following Prometheus queries from this dashboard seemed to trigger the issue (although it's hard to diagnose which one with our currently configured logging level):
https://github.com/fluxcd/flux2-monitoring-example/blob/9a5473371e210001817ae14a5caf0f86ff66c668/monitoring/configs/dashboards/cluster.json#L98
https://github.com/fluxcd/flux2-monitoring-example/blob/9a5473371e210001817ae14a5caf0f86ff66c668/monitoring/configs/dashboards/cluster.json#L164
https://github.com/fluxcd/flux2-monitoring-example/blob/9a5473371e210001817ae14a5caf0f86ff66c668/monitoring/configs/dashboards/cluster.json#L234
https://github.com/fluxcd/flux2-monitoring-example/blob/9a5473371e210001817ae14a5caf0f86ff66c668/monitoring/configs/dashboards/cluster.json#L300
https://github.com/fluxcd/flux2-monitoring-example/blob/9a5473371e210001817ae14a5caf0f86ff66c668/monitoring/configs/dashboards/cluster.json#L365
https://github.com/fluxcd/flux2-monitoring-example/blob/9a5473371e210001817ae14a5caf0f86ff66c668/monitoring/configs/dashboards/cluster.json#L428
https://github.com/fluxcd/flux2-monitoring-example/blob/9a5473371e210001817ae14a5caf0f86ff66c668/monitoring/configs/dashboards/cluster.json#L546
https://github.com/fluxcd/flux2-monitoring-example/blob/9a5473371e210001817ae14a5caf0f86ff66c668/monitoring/configs/dashboards/cluster.json#L744
https://github.com/fluxcd/flux2-monitoring-example/blob/9a5473371e210001817ae14a5caf0f86ff66c668/monitoring/configs/dashboards/cluster.json#L907
https://github.com/fluxcd/flux2-monitoring-example/blob/9a5473371e210001817ae14a5caf0f86ff66c668/monitoring/configs/dashboards/cluster.json#L1121
https://github.com/fluxcd/flux2-monitoring-example/blob/9a5473371e210001817ae14a5caf0f86ff66c668/monitoring/configs/dashboards/cluster.json#L1253

@yeya24
Contributor

yeya24 commented Nov 13, 2024

@gberche-orange This should be fixed by #7903
