msg="GRPC error: rpc error: code = NotFound desc = node XXXXX was not found" #328

titansmc · 2020-01-10T09:06:22Z

Describe the bug
Following the basic example in the documentation, the volume fails to attach to the Pod.

Environment
Provide accurate information about the environment to help us reproduce the issue.

[root@k3n trident-installer]# ./tridentctl -n trident get backend
+----------------------+----------------+--------------------------------------+--------+---------+
|         NAME         | STORAGE DRIVER |                 UUID                 | STATE  | VOLUMES |
+----------------------+----------------+--------------------------------------+--------+---------+
| ontapnas_10.11.5.186 | ontap-nas      | 57a270cb-051a-4107-8146-1111111e7a5 | online |       2 |
+----------------------+----------------+--------------------------------------+--------+---------+


[root@k3n trident-installer]# ./tridentctl -n trident  version
+----------------+----------------+
| SERVER VERSION | CLIENT VERSION |
+----------------+----------------+
| 19.10.0        | 19.10.0        |
+----------------+----------------+

Docker

Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.39 (downgraded from 1.40)
 Go version:        go1.12.12
 Git commit:        633a0ea
 Built:             Wed Nov 13 07:25:41 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.7
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       2d0083d
  Built:            Thu Jun 27 17:26:28 2019
  OS/Arch:          linux/amd64
  Experimental:     false

k8s version

[root@k3n trident-installer]# kubectl  version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:07:57Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
[root@k3n trident-installer]# kubectl get nodes
NAME          STATUS   ROLES    AGE   VERSION
k1m.domain.com   Ready    master   28d   v1.15.5
k3n.domain.com   Ready    <none>   28d   v1.15.5
k4n.domain.com   Ready    <none>   28d   v1.15.5

To Reproduce
Follow the basic example

Expected behavior
The created volume is attached to the Pod.

Additional context
I also see iSCSI-related errors in the logs, although I believe we are not using iSCSI.

time="2019-12-12T09:47:06Z" level=warning msg="Couldn't retrieve volume transaction logs: Unable to find key"
time="2019-12-12T09:47:06Z" level=info msg="Trident bootstrapped successfully."
time="2019-12-12T09:47:06Z" level=info msg="Activating plain CSI helper frontend."
time="2019-12-12T09:47:06Z" level=info msg="Activating CSI frontend."
time="2019-12-12T09:47:06Z" level=info msg="Listening for GRPC connections." name=/plugin/csi.sock net=unix
time="2019-12-12T09:47:06Z" level=error msg="Error gathering initiator names."
time="2019-12-12T09:47:06Z" level=error msg="Could not get iSCSI initiator name." error="exit status 1"


kmwm3 · 2020-01-14T07:18:50Z

I am having the same issue.

Docker version 18.06.2-ce
K8s version 1.16.3
Trident version 19.10
Storage driver - ontap-nas

tridentctl logs

"Node info not found." node=<node_name>
"GRPC error: rpc error: code = NotFound desc = node <node_name> was not found"

kubectl describe pod that's requesting the pvc
AttachVolume.Attach failed for volume "pvc-245d157b-f450-4fed-8e0b-29affcb6d53b" : rpc error: code = NotFound desc = node <node_name> was not found

I think this may be related to an old install that was not cleaned up properly. How can we completely remove Trident to try again? I have tried clearing out the Trident entries in /var/lib/kubelet and in /var/lib/trident, but to no avail so far.

balaramesh · 2020-01-14T16:04:04Z

@titansmc and @kmwm3 can you share some more info on your k8s environment? Are you running vanilla k8s? What's the underlying OS on your underlying nodes?

titansmc · 2020-01-14T17:12:28Z

I am using CentOS 7 deployed through kubespray. Cheers.


kmwm3 · 2020-01-15T14:20:44Z

> @titansmc and @kmwm3 can you share some more info on your k8s environment? Are you running vanilla k8s? What's the underlying OS on your underlying nodes?

I am running vanilla k8s on RHEL 7.7.

teramucho · 2020-02-04T09:35:22Z

I have the same issue.

According to the logs, Trident did not pick up all of my cluster nodes: only some of them were added. As a result, PVCs can only be mounted on those specific nodes, and mounts fail on all the others.

time="2020-02-04T09:18:10Z" level=debug msg="Authenticated by HTTPS REST frontend." peerCert=trident-node
time="2020-02-04T09:18:10Z" level=debug msg="REST API call received." duration="1.523µs" method=PUT requestID=bosjdknr0f3d5tg4cl0g route=AddOrUpdateNode uri=/trident/v1/node/ddp-deveco-master02
time="2020-02-04T09:18:10Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=ddp-deveco-master02
time="2020-02-04T09:18:10Z" level=debug msg="REST API call complete." duration=6.158862ms method=PUT requestID=bosjdknr0f3d5tg4cl0g route=AddOrUpdateNode uri=/trident/v1/node/ddp-deveco-master02
time="2020-02-04T09:18:17Z" level=debug msg="REST API call received." duration="2.491µs" method=GET requestID=bosjdmfr0f3d5tg4cl10 route=GetVersion uri=/trident/v1/version
time="2020-02-04T09:18:17Z" level=debug msg="REST API call complete." duration="161.897µs" method=GET requestID=bosjdmfr0f3d5tg4cl10 route=GetVersion uri=/trident/v1/version
time="2020-02-04T09:18:34Z" level=debug msg="Authenticated by HTTPS REST frontend." peerCert=trident-node
time="2020-02-04T09:18:34Z" level=debug msg="REST API call received." duration="1.538µs" method=PUT requestID=bosjdqnr0f3d5tg4cl1g route=AddOrUpdateNode uri=/trident/v1/node/ddp-deveco-master03
time="2020-02-04T09:18:34Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=ddp-deveco-master03
time="2020-02-04T09:18:34Z" level=debug msg="REST API call complete." duration=5.725727ms method=PUT requestID=bosjdqnr0f3d5tg4cl1g route=AddOrUpdateNode uri=/trident/v1/node/ddp-deveco-master03
time="2020-02-04T09:18:58Z" level=debug msg="Storage class updated in cache." name=nfs-client parameters="map[backendType:ontap-nas snapshots:true]" provisioner=csi.trident.netapp.io
time="2020-02-04T09:19:08Z" level=debug msg="REST API call received." duration="3.05µs" method=POST requestID=bosje37r0f3d5tg4cl20 route=AddBackend uri=/trident/v1/backend
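
As a rough aid for reading logs like the ones above, the sketch below collects the node names from the controller's "Added a new node." lines and reports which expected nodes never show up. The function names and the node name ddp-deveco-master01 are hypothetical and exist only for illustration; the list of expected nodes would normally come from `kubectl get nodes`.

```python
import re

# Matches the controller's "Added a new node." info lines quoted in this thread.
ADDED_NODE = re.compile(r'msg="Added a new node\."\s.*\bnode=(\S+)')


def registered_nodes(log_text):
    """Return the set of node names the controller logged as added."""
    matches = (ADDED_NODE.search(line) for line in log_text.splitlines())
    return {m.group(1) for m in matches if m}


def unregistered(expected_nodes, log_text):
    """Return the expected nodes that have no 'Added a new node.' line, sorted."""
    return sorted(set(expected_nodes) - registered_nodes(log_text))


# Two log lines copied from the excerpt above.
SAMPLE_LOG = """
time="2020-02-04T09:18:10Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=ddp-deveco-master02
time="2020-02-04T09:18:34Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=ddp-deveco-master03
"""

# ddp-deveco-master01 is a made-up name standing in for a node that never registered.
EXPECTED = ["ddp-deveco-master01", "ddp-deveco-master02", "ddp-deveco-master03"]

print(unregistered(EXPECTED, SAMPLE_LOG))  # prints ['ddp-deveco-master01']
```

A node that is reported here has not necessarily failed; the log excerpt may simply not cover the time when it registered.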

gnarl · 2020-02-04T14:36:57Z

@teramucho, Kubernetes calls Trident's API to add the node once it is successfully registered. If a node in the cluster isn't added to Trident then that node may not have properly registered. Check the Trident node and driver registrar sidecar logs for errors. Also, check the kubelet logs. If this doesn't resolve your issue please contact NetApp Support.

gnarl · 2020-02-05T21:38:02Z

All, a fix was just merged to address a situation where K8S DNS is not configured properly which can lead to the error as reported in this issue. Trident patches that contain the fix will be released in the near future. Thanks for your patience.

Closes: #328

gnarl · 2020-02-28T16:14:20Z

This issue was fixed with the Trident 20.01.1 release.

presidenten · 2020-04-25T14:37:22Z

@gnarl Still got the issue on one of our clusters:

 $ tridentctl -n trident get backend
+------------------+----------------+--------------------------------------+--------+---------+
|       NAME       | STORAGE DRIVER |                 UUID                 | STATE  | VOLUMES |
+------------------+----------------+--------------------------------------+--------+---------+
| <redacted>       | ontap-nas      | <redacted>                           | online |       1 |
+------------------+----------------+--------------------------------------+--------+---------+
$
$
$ tridentctl -n trident version
+----------------+----------------+
| SERVER VERSION | CLIENT VERSION |
+----------------+----------------+
| 20.01.1        | 20.01.0        |
+----------------+----------------+

Trident can't find a few of the nodes in the cluster:

time="2020-04-25T14:28:49Z" level=error msg="Node info not found." node=node020
time="2020-04-25T14:28:49Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node020 was not found"
time="2020-04-25T14:28:49Z" level=error msg="Node info not found." node=node020
time="2020-04-25T14:28:49Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node020 was not found"
time="2020-04-25T14:28:50Z" level=error msg="Node info not found." node=node018
time="2020-04-25T14:28:50Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node018 was not found"
time="2020-04-25T14:28:50Z" level=error msg="Node info not found." node=node018
time="2020-04-25T14:28:50Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node018 was not found"

Any ideas what to try to get them up and running?

These machines were correctly connected before. We have now reinstalled the cluster (as training for new ops), and the nodes don't get added anymore.
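
To see at a glance which nodes the controller cannot find, a log excerpt like the one above can be tallied per node. This is a minimal sketch that assumes the error text keeps the form `desc = node <name> was not found`; the function name is made up for this example.

```python
import re
from collections import Counter

# Matches the GRPC NotFound errors quoted in this thread.
NOT_FOUND = re.compile(r"desc = node (\S+) was not found")


def missing_node_counts(log_text):
    """Count GRPC 'was not found' errors per node name."""
    return Counter(m.group(1) for m in NOT_FOUND.finditer(log_text))


# Six log lines copied from the excerpt above.
SAMPLE_ERRORS = """
time="2020-04-25T14:28:49Z" level=error msg="Node info not found." node=node020
time="2020-04-25T14:28:49Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node020 was not found"
time="2020-04-25T14:28:49Z" level=error msg="Node info not found." node=node020
time="2020-04-25T14:28:49Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node020 was not found"
time="2020-04-25T14:28:50Z" level=error msg="Node info not found." node=node018
time="2020-04-25T14:28:50Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node018 was not found"
"""

for node, count in missing_node_counts(SAMPLE_ERRORS).most_common():
    print(node, count)
```

Only the GRPC error lines are counted, so the "Node info not found." lines that precede them are not double counted.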

ramancde · 2020-05-31T14:01:28Z

Is there any update on this issue?
Do we have a fix?

bigg01 · 2020-06-29T15:42:03Z

I have this problem too, running OCP4: 80% of the nodes are working and the other 20% fail.

./tridentctl version
+----------------+----------------+
| SERVER VERSION | CLIENT VERSION |
+----------------+----------------+
| 20.04.0        | 20.04.0        |
+----------------+----------------+

Server Version: 4.4.9
Kubernetes Version: v1.17.1+912792b

The node is missing because its Trident node object was not created, which can be checked with "oc get tridentnode".

gnarl · 2020-07-22T20:04:42Z

Hi @presidenten, @ramancde, and @bigg01 we've investigated the issue and have not been able to reproduce it. If you see the issue again please contact NetApp support and provide Trident logs so that we can determine what is causing the issue.

torirevilla · 2020-08-06T21:05:47Z

There are two likely scenarios why Trident does not find a Kubernetes node. It can be because of a networking issue within Kubernetes or a DNS issue. The Trident node daemonset that runs on each Kubernetes node must be able to communicate with the Trident controller to register the node with Trident. If networking changes occurred after Trident was installed this problem may only be observed with new Kubernetes nodes that are added to the cluster.
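
One way to narrow down which nodes failed to register is to compare the cluster's node list with the tridentnode objects mentioned earlier in this thread. The sketch below is only an illustration under several assumptions: that both lists were captured with `-o name` (for example `kubectl get nodes -o name` and `kubectl -n trident get tridentnode -o name`), that each output line has the form `<kind>/<name>`, and that a tridentnode object carries the same name as its Kubernetes node. The function names and the sample node names are hypothetical.

```python
def names_from_output(output):
    """Strip the '<kind>/' prefix from each non-empty line of `-o name` output."""
    return {
        line.strip().rsplit("/", 1)[-1]
        for line in output.splitlines()
        if line.strip()
    }


def nodes_without_tridentnode(nodes_output, tridentnodes_output):
    """Return node names that have no tridentnode object of the same name, sorted."""
    return sorted(
        names_from_output(nodes_output) - names_from_output(tridentnodes_output)
    )


# Made-up sample outputs; the node names are placeholders, not from a real cluster.
NODES = "node/worker-a\nnode/worker-b\nnode/worker-c\n"
TRIDENT_NODES = (
    "tridentnode.trident.netapp.io/worker-a\n"
    "tridentnode.trident.netapp.io/worker-c\n"
)

print(nodes_without_tridentnode(NODES, TRIDENT_NODES))  # prints ['worker-b']
```

Any node in the result would be a candidate for checking networking and DNS from the Trident pod running on it.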

khatrig · 2020-09-03T14:32:06Z

> There are two likely scenarios why Trident does not find a Kubernetes node. It can be because of a networking issue within Kubernetes or a DNS issue. The Trident node daemonset that runs on each Kubernetes node must be able to communicate with the Trident controller to register the node with Trident. If networking changes occurred after Trident was installed this problem may only be observed with new Kubernetes nodes that are added to the cluster.

This matches the issue I am facing: only newly added nodes fail to register with Trident. I tried restarting the Trident pods and removing and re-adding the impacted nodes, but nothing helps. There have been no networking changes on the cluster, and I don't see any networking or DNS issues on it.

Any pointers on how I can investigate this further?

oleimann · 2020-09-03T14:59:23Z

The same error about not finding the node (not registered with the Trident controller) seems to happen with K8s 1.17 and Trident 20.07 when the Kubernetes autoscaler adds a node to schedule a pod: the PV for the pod doesn't get added as a consequence, and the Pod stays Pending.
Do nodes in the "free pool" need to be prepared with Trident somehow, so that the daemon is available when the node starts up and can register?

gnarl · 2020-09-03T18:01:47Z

@khatrig and @oleimann,

As indicated above we haven't been able to reproduce this issue yet. Please open a case with NetApp support so that we can collect additional information.

To open a case with NetApp, please go to https://mysupport.netapp.com/site/.

At the bottom left, click 'Contact Support'.
Find the appropriate number for your region to call in, or log in.
Note: Trident is not listed on the page, but NetApp supports it based on a supported NetApp storage SN.
Open the case on the NetApp storage SN, and provide a description of the problem.
Be sure to mention that the product is Trident on Kubernetes, and provide the details. Mention this GitHub issue.
The case will be directed to Trident support engineers for response.

khatrig · 2020-09-15T07:18:55Z

In my case, it turned out to be a DNS issue on some nodes: the trident-csi pod running on those nodes could not resolve the trident-csi.trident service and hence could not register the node.
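
A quick check along these lines can be sketched with Python's standard `socket` module. This is an assumption-laden illustration, not a Trident tool: it would have to run somewhere that uses the same DNS configuration as the affected pod (for instance a debug pod scheduled on that node), and the function name is made up. The resolver is passed in as a parameter so that the logic can be exercised without a cluster.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves, False if the resolver reports a lookup error."""
    try:
        resolver(hostname, None)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Service name taken from the comment above; the result depends on where this runs.
    name = "trident-csi.trident"
    print(name, "resolves" if can_resolve(name) else "does NOT resolve")
```

Running the same check on a working node and on a failing node would show whether the difference really is name resolution.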

gnarl · 2020-09-15T12:13:19Z

@khatrig thanks for updating this issue.

gnarl · 2020-10-07T13:04:05Z

For everyone that encountered this reported issue it was determined that either a DNS or a networking issue kept the Trident node DaemonSet from registering with the Trident controller. Commit 8e51987 improves the Info log message to help the Trident user resolve this registration issue.

clintonk added bug tracked labels Feb 5, 2020

netapp-ci closed this as completed in 6aba514 Feb 5, 2020

netapp-ci pushed a commit that referenced this issue Feb 26, 2020

Use downward API instead of relying on K8S DNS to find Trident service

0301668

Closes: #328

gnarl reopened this Jun 30, 2020

netapp-ci closed this as completed in 8e51987 Oct 7, 2020
