
[BUG] Requests to /v1/endpoints return 500 and cause the cluster to enter 'updating' state when monitoring is installed #43030

Closed
mantis-toboggan-md opened this issue Oct 2, 2023 · 9 comments
Labels: area/monitoring, feature/charts-monitoring-v2, kind/bug, status/release-blocker, team/area3

@mantis-toboggan-md
Member

Rancher Server Setup

  • Rancher version: v2.8-head efc48ac
  • Installation option (Docker install/Helm Chart): Helm chart, on k3s 1.26.6+k3s1
  • Proxy/Cert Details: self-signed

Information about the Cluster

  • Kubernetes version: seen on both v1.27.6+k3s1 and v1.27.6+rke2r1
  • Cluster Type (Local/Downstream): Downstream cluster provisioned on Digital Ocean

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    Project member with read-only access and monitoring-ui-view cluster role

Describe the bug
GET requests to <rancher url>/k8s/clusters/<cluster id>/v1/endpoints return 500, the /v1 websocket disconnects, and the cluster status changes to "updating" for a couple minutes. The endpoints schema indicates that the user should be able to list endpoints (<endpoint schema>.links.collection is defined). I was only able to reproduce this bug once monitoring was installed.

To Reproduce
  1. As admin, create a downstream cluster and enable Monitoring v2
  2. As admin, create a local user user-1 and assign it as a project member with read-only access to a project p1 in the cluster
  3. As admin, create the clusterRoleBinding (monitoring-ui-view, user-1)
  4. Log in as user-1 and go to the cluster explorer UI -> Monitoring tab

Result
500 error; cluster is 'updating' for a few minutes

Expected Result
GET requests to <rancher url>/k8s/clusters/<cluster id>/v1/endpoints should return a list of endpoints.

Screenshots

[Screenshot: Screen Shot 2023-10-02 at 12 44 16 PM]

Additional context

This was seen while investigating rancher/dashboard#4466 and blocks that issue.

@geethub97
Contributor

Per the docs, this is the expected behavior. A read only user should not have access to links directly from the monitoring panel, only externally. We have raised this issue with product and SURE-7045 has been filed to review the permissions for a read-only user within the monitoring UI.

"A User bound to the View Monitoring Rancher Role only has permissions to access external Monitoring UIs if provided links to those UIs."
(https://ranchermanager.docs.rancher.com/integrations-in-rancher/monitoring-and-alerting/rbac-for-monitoring#users-with-rancher-based-permissions)

If the UI team wants to show an error message that says the read-only user does not have access to the monitoring panel links, I think that would be a viable option while product is determining the exact access rights and limitations for the role.

SURE-7044 has also been filed to add clarifications to the official Rancher documentation for read-only role permissions.

cc: @prachidamle @MKlimuszka

@mantis-toboggan-md
Member Author

mantis-toboggan-md commented Oct 10, 2023

The problem here isn't that the user can't access the links on the monitoring panel; it's that requesting a resource they should be able to list returns a 500 response and temporarily renders the cluster unreachable. Even if the UI erroneously requests resources a user can't access, the expected response is a 403 error, with no impact on cluster health.

With the permission set described above, the user would have access to some subset of endpoints, though not the monitoring ones, so a GET request to <rancher url>/k8s/clusters/<cluster id>/v1/endpoints should return 200 and include whatever endpoints they can see. The UI checks for the monitoring endpoints and disables the links if they aren't found, which sounds like expected behavior with the view-monitoring role as it currently is.

@cbron
Contributor

cbron commented Oct 13, 2023

Tentatively assigned as blocker for now, as this is causing a panic in the agent.

@geethub97
Contributor

Cross posting my updates here for visibility.

My initial comment was incorrect (I thought this was related to a different error with the monitoring panel links, that's my bad).

The 500 error while getting the endpoints has been fixed by rancher/steve#132, and that also fixes the problem where the cluster hangs. The endpoints are not returned (per the documentation) because the read-only user, as described in this bug, is not supposed to have access to the internal monitoring panel.

cc: @MbolotSuse @prachidamle

@MbolotSuse
Contributor

Validation Template

Root Cause

In certain cases (when a user has cluster-wide permissions to a resource type limited by one or more resourceNames), an internal steve function would return nil for the list of items the user can see, along with a nil error. This caused a consuming function to panic when it attempted to use the nil result.

In the monitoring use case, this would result in a 500 error and a tunnel disconnected error message when accessing /v1/endpoints as a user who was read-only on one project and had a clusterrolebinding to monitoring-ui-view. This would also cause the agent to restart, which caused the cluster to enter into an updating state.
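
For illustration, here is a minimal Go sketch of that failure mode, under hypothetical names (this is not steve's actual internals): a lister that returns a nil list together with a nil error when resourceNames-limited RBAC leaves nothing visible, and a consumer that calls a method on the result unconditionally, reproducing the nil pointer dereference seen in the agent panic trace further down this thread.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// listVisible stands in for the internal steve function: when the user's
// cluster-wide access to a type is limited by resourceNames and nothing
// matches, it returned a nil list and a nil error.
func listVisible(nothingVisible bool) (*unstructured.UnstructuredList, error) {
	if nothingVisible {
		return nil, nil // the bug: nil result without an error
	}
	return &unstructured.UnstructuredList{}, nil
}

func main() {
	list, err := listVisible(true)
	if err != nil {
		panic(err)
	}
	// The consumer assumed a non-nil list. GetResourceVersion reads
	// list.Object, so calling it through a nil pointer panics, matching the
	// (*UnstructuredList).GetResourceVersion frame in the stack trace.
	fmt.Println(list.GetResourceVersion())
}
```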

What was fixed, or what changes have occurred

Steve now returns an empty &unstructured.UnstructuredList{} value rather than a nil value in the above case. In the monitoring use case, this results in /v1/endpoints returning an empty list (and a valid status code).
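
A sketch of the shape of the fix, reusing the hypothetical names from the sketch above (this is not the literal rancher/steve#132 diff):

```go
package main

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// Fixed variant: return an empty, non-nil list so callers can safely call
// methods such as GetResourceVersion on the result.
func listVisible(nothingVisible bool) (*unstructured.UnstructuredList, error) {
	if nothingVisible {
		return &unstructured.UnstructuredList{}, nil // was: return nil, nil
	}
	// ... unchanged path that performs the actual list ...
	return &unstructured.UnstructuredList{}, nil
}

func main() {
	list, _ := listVisible(true)
	// Prints an empty string instead of panicking; at the API layer,
	// /v1/endpoints serializes to an empty collection with a 200 status code.
	println(list.GetResourceVersion())
}
```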

Areas or cases that should be tested

  1. The original use case, which is listed below. This should now return an empty list, with a 200 status code, and no panic should be visible in the agent.
  2. Other use cases which use resourceName-limited permissions inside of a namespace. These should continue to return values as before. See the example below:
  • Install Rancher
  • Create a user
  • Create a downstream cluster
  • On the downstream cluster, create a new project
  • In the project that was just created, create a new namespace
  • Assign the new user read-only on this project
  • Install monitoring from Cluster Tools
  • Create a new project
  • Move the cattle-monitoring-system namespace to the project that you created in the previous step
  • In this project, assign the user View Monitoring permissions
  • Login as the user.
  • Verify that the links to the monitoring tools (e.g. Grafana) are visible, and that users can click on them.
  • Verify that there are 3 entries in https://$RANCHER/k8s/clusters/$CLUSTER/v1/endpoints (a scripted check is sketched after this list).
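
For that last verification step, a rough Go check (assumptions, not confirmed by this thread: a Rancher API bearer token in RANCHER_TOKEN, RANCHER and CLUSTER environment variables, and that steve wraps collection results in a top-level data array; TLS verification is skipped only because this setup uses a self-signed cert):

```go
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// e.g. RANCHER=https://rancher.example.com CLUSTER=c-m-abc123 (hypothetical values)
	url := os.Getenv("RANCHER") + "/k8s/clusters/" + os.Getenv("CLUSTER") + "/v1/endpoints"
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("RANCHER_TOKEN"))

	// Self-signed cert in the repro environment; do not do this in production.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var body struct {
		Data []json.RawMessage `json:"data"` // assumed steve collection shape
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		panic(err)
	}
	// Expect a 200 status and, for this scenario, 3 entries.
	fmt.Printf("status=%d entries=%d\n", resp.StatusCode, len(body.Data))
}
```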

What areas could experience regressions

Steve use cases with permissions using resourceNames.
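
For context, the affected permission shape looks roughly like the following Go sketch (illustrative rule contents only, not the exact monitoring-ui-view definition): cluster-wide access to a resource type that is narrowed by resourceNames.

```go
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
)

// exampleRole illustrates the pattern that exercised the bug: a ClusterRole
// granting access to a type, restricted to specific named objects.
var exampleRole = rbacv1.ClusterRole{
	Rules: []rbacv1.PolicyRule{{
		APIGroups:     []string{""},
		Resources:     []string{"endpoints", "services"},
		Verbs:         []string{"get"},
		ResourceNames: []string{"rancher-monitoring-grafana"}, // illustrative name
	}},
}

func main() {
	fmt.Printf("%+v\n", exampleRole.Rules)
}
```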

Are the repro steps accurate/minimal?

They are accurate, though they are focused on monitoring rather than a minimal steve use case.

  1. Install Rancher
  2. Create a user
  3. Create a downstream cluster
  4. On the downstream cluster, create a new project
  5. In the project that was just created, create a new namespace
  6. Assign the new user read-only on this project
  7. Install monitoring from Cluster Tools
  8. Using kubectl (through the dashboard pod or using the cluster's kubeconfig), assign the user the monitoring-ui-view role. See the docs for an example command (simple example: kubectl create clusterrolebinding my-binding --clusterrole=monitoring-ui-view --user=u-l4npx)
  9. Login as the user.
  10. Go to https://$RANCHER/k8s/clusters/$CLUSTER/v1/endpoints in your browser. On the old (unfixed) side, observe that the error tunnel disconnected is returned and that the cluster agent restarts; on the new (fixed) side, the request should return a list with 0 entries, a 200 status code, and no restart of the agent.

Notes

cc: @geethub97

This does not resolve the general monitoring permissions issue (see this comment for more information). These users will still need a workaround for the monitoring links to become visible.

@prachidamle
Member

@mantis-toboggan-md @gaktive Can we remove "status/ui-blocked" from this ticket now?

@anupama2501
Contributor

Reproduced the issue on v2.8.0-alpha2:

  1. Created a rancher server on v2.8.0-alpha2
  2. Created a downstream rke2 node driver cluster
  3. Installed monitoring v2 chart in it.
  4. Created a standard user - user1
  5. Added user1 to a project with read-only permissions.
  6. Created a new CRB for the user using kubectl create clusterrolebinding my-binding --clusterrole=monitoring-ui-view --user=u-l4npx
  7. Logged in as the user and navigated to the endpoint https://$RANCHER/k8s/clusters/$CLUSTER/v1/endpoints
  8. Verified the user gets a tunnel disconnected error.
  9. Logged in as admin in another browser and verified the cluster goes into updating state
  10. Noticed the panic in the cluster-agent, with a few restarts:
```
2023-10-18T11:50:20.792748350Z time="2023-10-18T11:50:20Z" level=info msg="Watching metadata for /v1, Kind=ResourceQuota"
2023-10-18T11:50:21.186414761Z I1018 11:50:21.186311      52 trace.go:236] Trace[201384611]: "DeltaFIFO Pop Process" ID:endpoint-controller,Depth:11,Reason:slow event handlers blocking the queue (18-Oct-2023 11:50:20.906) (total time: 279ms):
2023-10-18T11:50:21.186440231Z Trace[201384611]: [279.727178ms] [279.727178ms] END
2023-10-18T11:56:02.035280778Z panic: runtime error: invalid memory address or nil pointer dereference
2023-10-18T11:56:02.035336398Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x23d83ce]
2023-10-18T11:56:02.035343549Z
2023-10-18T11:56:02.035348449Z goroutine 6804 [running]:
2023-10-18T11:56:02.035602112Z k8s.io/apimachinery/pkg/apis/meta/v1/unstructured.(*UnstructuredList).GetResourceVersion(...)
2023-10-18T11:56:02.035615232Z 	/go/pkg/mod/k8s.io/apimachinery@v0.27.4/pkg/apis/meta/v1/unstructured/unstructured_list.go:156
2023-10-18T11:56:02.036530655Z github.com/rancher/steve/pkg/stores/partition.(*ParallelPartitionLister).feeder.func2()
2023-10-18T11:56:02.036549625Z 	/go/pkg/mod/github.com/rancher/steve@v0.0.0-20230901044548-5df31b9c15cc/pkg/stores/partition/parallel.go:190 +0x30e
2023-10-18T11:56:02.036658096Z golang.org/x/sync/errgroup.(*Group).Go.func1()
2023-10-18T11:56:02.036668857Z 	/go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75 +0x64
2023-10-18T11:56:02.036674107Z created by golang.org/x/sync/errgroup.(*Group).Go
2023-10-18T11:56:02.036807908Z 	/go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup +0xa5
```

@anupama2501
Contributor

Verified fresh install v2.8-head 2e6895d

Test case 1

  1. Install Rancher on v2.8-head
  2. Create a standard user - user1
  3. Create a downstream cluster
  4. On the downstream cluster, create a new project
  5. In the project that was just created, create a new namespace
  6. Assign the new user1 read-only on this project
  7. Install monitoring from Cluster Tools
  8. In the system project, add user1 with custom View Monitoring permissions.
  9. Login as the user1.
  10. Go to https://$RANCHER/k8s/clusters/$CLUSTER/v1/endpoints in your browser. Observe that the endpoint responds without a tunnel disconnected error and that the cluster agent is not restarted.

Test case 2

  1. From the above test case, after step 7, create a new project and assign user1 View Monitoring permissions in it
  2. Move the cattle-monitoring-system namespace to this project
  3. Log in as user1 and verify the monitoring links are accessible and no errors are seen.

Test case 3

  1. Repeat steps 1 through 7, creating a new user - user2 - in step 2
  2. Using kubectl, create a clusterrolebinding to monitoring-ui-view for user2 on the downstream cluster.
  3. Log in as user2 and verify the links for monitoring are not clickable (see rancher/dashboard#4466#issuecomment-1766495804, "[Monitoring v2] Links in Dashboard are un-clickable even the user has monitoring-ui-view permission")
  4. Navigate to https://$RANCHER/k8s/clusters/$CLUSTER/v1/endpoints and verify that no errors are seen and the cluster-agent is not restarted.

Test case 4

Upgrade use case from v2.7.8 >> v2.8-head

  1. Repeated the steps from test case 1
  2. Upgraded the Rancher server and verified the links are clickable and no errors are seen when navigating to https://$RANCHER/k8s/clusters/$CLUSTER/v1/endpoints

@martyav
Contributor

martyav commented Oct 30, 2023

rancher/dashboard#4466 has a release-note label and comments in that thread reference this issue. Since rancher/dashboard#4466 seems to be about users frustrated by intended Rancher behavior, should I instead be release noting the bug fix recorded here?
