
[BUG] Requests to /v1/endpoints return 500 and cause the cluster to enter 'updating' state when monitoring is installed #43030

Closed
mantis-toboggan-md opened this issue Oct 2, 2023 · 9 comments
Labels: area/monitoring, feature/charts-monitoring-v2, kind/bug, status/release-blocker, team/area3

@mantis-toboggan-md
Member

Rancher Server Setup

  • Rancher version: v2.8-head efc48ac
  • Installation option (Docker install/Helm Chart): Helm chart, on k3s 1.26.6+k3s1
  • Proxy/Cert Details: self-signed

Information about the Cluster

  • Kubernetes version: seen on both v1.27.6+k3s1 and v1.27.6+rke2r1
  • Cluster Type (Local/Downstream): Downstream cluster provisioned on Digital Ocean

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    Project member with read-only access and monitoring-ui-view cluster role

Describe the bug
GET requests to <rancher url>/k8s/clusters/<cluster id>/v1/endpoints return 500, the /v1 websocket disconnects, and the cluster status changes to "updating" for a couple minutes. The endpoints schema indicates that the user should be able to list endpoints (<endpoint schema>.links.collection is defined). I was only able to reproduce this bug once monitoring was installed.

To Reproduce
  1. As admin, create a downstream cluster and enable Monitoring v2
  2. As admin, create a local user user-1 and assign it as a project member with read-only access to a project p1 in the cluster
  3. As admin, create the clusterRoleBinding (monitoring-ui-view, user-1)
  4. Log in as user-1 and go to the cluster explorer UI -> Monitoring tab

Result
500 error; cluster is 'updating' for a few minutes

Expected Result
GET requests to <rancher url>/k8s/clusters/<cluster id>/v1/endpoints should return a list of endpoints.

Screenshots

[Screenshot: Screen Shot 2023-10-02 at 12 44 16 PM]

Additional context

This was seen while investigating rancher/dashboard#4466 and blocks that issue.

@geethub97
Contributor

Per the docs, this is the expected behavior. A read only user should not have access to links directly from the monitoring panel, only externally. We have raised this issue with product and SURE-7045 has been filed to review the permissions for a read-only user within the monitoring UI.

"A User bound to the View Monitoring Rancher Role only has permissions to access external Monitoring UIs if provided links to those UIs."
(https://ranchermanager.docs.rancher.com/integrations-in-rancher/monitoring-and-alerting/rbac-for-monitoring#users-with-rancher-based-permissions)

If the UI team wants to show an error message that says the read-only user does not have access to the monitoring panel links, I think that would be a viable option while product is determining the exact access rights and limitations for the role.

SURE-7044 has also been filed to add clarifications to the official Rancher documentation for read-only role permissions.

cc: @prachidamle @MKlimuszka

@mantis-toboggan-md
Member Author

mantis-toboggan-md commented Oct 10, 2023

The problem here isn't that the user can't access the links on the monitoring panel; it's that requesting a resource they should be able to list returns a 500 response and temporarily renders the cluster unreachable. Even if the UI erroneously requests resources a user can't access, the expected response is a 403 error, with no impact on cluster health.

With the permission set described above, the user would have access to some subset of endpoints, though not the monitoring ones, so a GET request to <rancher url>/k8s/clusters/<cluster id>/v1/endpoints should return 200 and include whatever endpoints they can see. The UI checks for the monitoring endpoints and disables the links if they aren't found, which sounds like expected behavior with the view-monitoring role as it currently is.

@cbron
Contributor

cbron commented Oct 13, 2023

Tentatively assigned as blocker for now, as this is causing a panic in the agent.

@geethub97
Contributor

Cross posting my updates here for visibility.

My initial comment was incorrect (I thought this was related to a different error with the monitoring panel links, that's my bad).

The 500 error while getting the endpoints has been fixed by rancher/steve#132, and that also fixes the problem where the cluster hangs. The endpoints are not returned (per the documentation) because the read-only user, as described in this bug, is not supposed to have access to the internal monitoring panel.

cc: @MbolotSuse @prachidamle

@MbolotSuse
Contributor

Validation Template

Root Cause

In certain cases (when a user has cluster-wide permissions to a resource type limited by one or more resourceNames), an internal steve function would return nil for the list of items the user can see, along with a nil error. This caused a consuming function to panic when it attempted to use the nil result.

In the monitoring use case, this would result in a 500 error and a tunnel disconnected error message when accessing /v1/endpoints as a user who was read-only on one project and had a clusterrolebinding to monitoring-ui-view. This would also cause the agent to restart, which caused the cluster to enter into an updating state.
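
For illustration, here is a minimal Go sketch of that failure mode, under hypothetical names (this is not steve's actual internals): a lister that returns a nil list together with a nil error when resourceNames-limited RBAC leaves nothing visible, and a consumer that calls a method on the result unconditionally, reproducing the nil pointer dereference seen in the agent panic trace further down this thread.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// listVisible stands in for the internal steve function: when the user's
// cluster-wide access to a type is limited by resourceNames and nothing
// matches, it returned a nil list and a nil error.
func listVisible(nothingVisible bool) (*unstructured.UnstructuredList, error) {
	if nothingVisible {
		return nil, nil // the bug: nil result without an error
	}
	return &unstructured.UnstructuredList{}, nil
}

func main() {
	list, err := listVisible(true)
	if err != nil {
		panic(err)
	}
	// The consumer assumed a non-nil list. GetResourceVersion reads
	// list.Object, so calling it through a nil pointer panics, matching the
	// (*UnstructuredList).GetResourceVersion frame in the stack trace.
	fmt.Println(list.GetResourceVersion())
}
```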

What was fixed, or what changes have occurred

Steve now returns an empty &unstructured.UnstructuredList{} value rather than a nil value in the above case. In the monitoring use case, this results in /v1/endpoints returning an empty list (and a valid status code).
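
A sketch of the shape of the fix, reusing the hypothetical names from the sketch above (this is not the literal rancher/steve#132 diff):

```go
package main

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// Fixed variant: return an empty, non-nil list so callers can safely call
// methods such as GetResourceVersion on the result.
func listVisible(nothingVisible bool) (*unstructured.UnstructuredList, error) {
	if nothingVisible {
		return &unstructured.UnstructuredList{}, nil // was: return nil, nil
	}
	// ... unchanged path that performs the actual list ...
	return &unstructured.UnstructuredList{}, nil
}

func main() {
	list, _ := listVisible(true)
	// Prints an empty string instead of panicking; at the API layer,
	// /v1/endpoints serializes to an empty collection with a 200 status code.
	println(list.GetResourceVersion())
}
```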

Areas or cases that should be tested

  1. The original use case, which is listed below. This should now return an empty list, with a 200 status code, and no panic should be visible in the agent.
  2. Other use cases which use resourceName-limited permissions inside of a namespace. These should continue to return values as before. See the example below:
  • Install Rancher
  • Create a user
  • Create a downstream cluster
  • On the downstream cluster, create a new project
  • In the project that was just created, create a new namespace
  • Assign the new user read-only on this project
  • Install monitoring from Cluster Tools
  • Create a new project
  • Move the cattle-monitoring-system namespace to the project that you created in the previous step
  • In this project, assign the user View Monitoring permissions
  • Login as the user.
  • Verify that the links to the monitoring tools (e.g. Grafana) are visible, and that users can click on them.
  • Verify that there are 3 entries in https://$RANCHER/k8s/clusters/$CLUSTER/v1/endpoints (a scripted check is sketched after this list).
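
For that last verification step, a rough Go check (assumptions, not confirmed by this thread: a Rancher API bearer token in RANCHER_TOKEN, RANCHER and CLUSTER environment variables, and that steve wraps collection results in a top-level data array; TLS verification is skipped only because this setup uses a self-signed cert):

```go
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// e.g. RANCHER=https://rancher.example.com CLUSTER=c-m-abc123 (hypothetical values)
	url := os.Getenv("RANCHER") + "/k8s/clusters/" + os.Getenv("CLUSTER") + "/v1/endpoints"
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("RANCHER_TOKEN"))

	// Self-signed cert in the repro environment; do not do this in production.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var body struct {
		Data []json.RawMessage `json:"data"` // assumed steve collection shape
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		panic(err)
	}
	// Expect a 200 status and, for this scenario, 3 entries.
	fmt.Printf("status=%d entries=%d\n", resp.StatusCode, len(body.Data))
}
```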

What areas could experience regressions

Steve use cases with permissions using resourceNames.
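
For context, the affected permission shape looks roughly like the following Go sketch (illustrative rule contents only, not the exact monitoring-ui-view definition): cluster-wide access to a resource type that is narrowed by resourceNames.

```go
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
)

// exampleRole illustrates the pattern that exercised the bug: a ClusterRole
// granting access to a type, restricted to specific named objects.
var exampleRole = rbacv1.ClusterRole{
	Rules: []rbacv1.PolicyRule{{
		APIGroups:     []string{""},
		Resources:     []string{"endpoints", "services"},
		Verbs:         []string{"get"},
		ResourceNames: []string{"rancher-monitoring-grafana"}, // illustrative name
	}},
}

func main() {
	fmt.Printf("%+v\n", exampleRole.Rules)
}
```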

Are the repro steps accurate/minimal?

They are accurate, though they are focused on monitoring rather than a minimal steve use case.

  1. Install Rancher
  2. Create a user
  3. Create a downstream cluster
  4. On the downstream cluster, create a new project
  5. In the project that was just created, create a new namespace
  6. Assign the new user read-only on this project
  7. Install monitoring from Cluster Tools
  8. Using kubectl (through the dashboard pod or using the cluster's kubeconfig), assign the user the monitoring-ui-view role. See the docs for an example command (simple example: kubectl create clusterrolebinding my-binding --clusterrole=monitoring-ui-view --user=u-l4npx)
  9. Login as the user.
  10. Go to https://$RANCHER/k8s/clusters/$CLUSTER/v1/endpoints in your browser. On the old (unfixed) side, observe that the error tunnel disconnected is returned and that the cluster agent restarts; on the new (fixed) side, the request should return a list with 0 entries, a 200 status code, and no restart of the agent.

Notes

cc: @geethub97

This does not resolve the general monitoring permissions issue (see this comment for more information). These users will still need a workaround for the monitoring links to become visible.

@prachidamle
Member

@mantis-toboggan-md @gaktive Can we remove "status/ui-blocked" from this ticket now?

@anupama2501
Contributor

Reproduced the issue on v2.8.0-alpha2:

  1. Created a rancher server on v2.8.0-alpha2
  2. Created a downstream rke2 node driver cluster
  3. Installed monitoring v2 chart in it.
  4. Created a standard user - user1
  5. Added user1 to a project with read-only permissions.
  6. Created a new CRB for the user using kubectl create clusterrolebinding my-binding --clusterrole=monitoring-ui-view --user=u-l4npx
  7. Logged in as the user and navigated to the endpoint https://$RANCHER/k8s/clusters/$CLUSTER/v1/endpoints
  8. Verified the user gets a tunnel disconnected error.
  9. Logged in as admin in another browser and verified the cluster goes into updating state
  10. Noticed the panic in the cluster-agent, with a few restarts:
```
2023-10-18T11:50:20.792748350Z time="2023-10-18T11:50:20Z" level=info msg="Watching metadata for /v1, Kind=ResourceQuota"
2023-10-18T11:50:21.186414761Z I1018 11:50:21.186311      52 trace.go:236] Trace[201384611]: "DeltaFIFO Pop Process" ID:endpoint-controller,Depth:11,Reason:slow event handlers blocking the queue (18-Oct-2023 11:50:20.906) (total time: 279ms):
2023-10-18T11:50:21.186440231Z Trace[201384611]: [279.727178ms] [279.727178ms] END
2023-10-18T11:56:02.035280778Z panic: runtime error: invalid memory address or nil pointer dereference
2023-10-18T11:56:02.035336398Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x23d83ce]
2023-10-18T11:56:02.035343549Z
2023-10-18T11:56:02.035348449Z goroutine 6804 [running]:
2023-10-18T11:56:02.035602112Z k8s.io/apimachinery/pkg/apis/meta/v1/unstructured.(*UnstructuredList).GetResourceVersion(...)
2023-10-18T11:56:02.035615232Z 	/go/pkg/mod/k8s.io/apimachinery@v0.27.4/pkg/apis/meta/v1/unstructured/unstructured_list.go:156
2023-10-18T11:56:02.036530655Z github.com/rancher/steve/pkg/stores/partition.(*ParallelPartitionLister).feeder.func2()
2023-10-18T11:56:02.036549625Z 	/go/pkg/mod/github.com/rancher/steve@v0.0.0-20230901044548-5df31b9c15cc/pkg/stores/partition/parallel.go:190 +0x30e
2023-10-18T11:56:02.036658096Z golang.org/x/sync/errgroup.(*Group).Go.func1()
2023-10-18T11:56:02.036668857Z 	/go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75 +0x64
2023-10-18T11:56:02.036674107Z created by golang.org/x/sync/errgroup.(*Group).Go
2023-10-18T11:56:02.036807908Z 	/go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup +0xa5
```

@anupama2501
Contributor

Verified fresh install v2.8-head 2e6895d

Test case 1

  1. Install Rancher on v2.8-head
  2. Create a standard user - user1
  3. Create a downstream cluster
  4. On the downstream cluster, create a new project
  5. In the project that was just created, create a new namespace
  6. Assign the new user1 read-only on this project
  7. Install monitoring from Cluster Tools
  8. In the system project, add user1 with custom View Monitoring permissions.
  9. Login as the user1.
  10. Go to https://$RANCHER/k8s/clusters/$CLUSTER/v1/endpoints in your browser. Observe that the endpoint responds without a tunnel disconnected error and that the cluster agent is not restarted.

Test case 2

  1. From the above test case, after step 7, create a new project and assign user1 View Monitoring permissions in it
  2. Move the cattle-monitoring-system namespace to this project
  3. Log in as user1 and verify the monitoring links are accessible and no errors are seen.

Test case 3

  1. Repeat steps 1 through 7, creating a new user - user2 - in step 2
  2. Using kubectl, create a clusterrolebinding to monitoring-ui-view for user2 on the downstream cluster.
  3. Log in as user2 and verify the links for monitoring are not clickable (see rancher/dashboard#4466#issuecomment-1766495804, "[Monitoring v2] Links in Dashboard are un-clickable even the user has monitoring-ui-view permission")
  4. Navigate to https://$RANCHER/k8s/clusters/$CLUSTER/v1/endpoints and verify that no errors are seen and the cluster-agent is not restarted.

Test case 4

Upgrade use case from v2.7.8 >> v2.8-head

  1. Repeated the steps from test case 1
  2. Upgraded the Rancher server and verified the links are clickable and no errors are seen when navigating to https://$RANCHER/k8s/clusters/$CLUSTER/v1/endpoints

@martyav
Contributor

martyav commented Oct 30, 2023

rancher/dashboard#4466 has a release-note label and comments in that thread reference this issue. Since rancher/dashboard#4466 seems to be about users frustrated by intended Rancher behavior, should I instead be release noting the bug fix recorded here?
