Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sleeping node is marked as dead #1578

Closed
1 of 10 tasks
snagytx opened this issue Feb 3, 2021 · 30 comments · Fixed by #1714
Closed
1 of 10 tasks

Sleeping node is marked as dead #1578

snagytx opened this issue Feb 3, 2021 · 30 comments · Fixed by #1714
Labels
bug Something isn't working

Comments

@snagytx
Copy link

snagytx commented Feb 3, 2021

Describe the bug

In 6.1.1(zwavejs2mqtt v1.0.4) AEON Labs - Multisensor 6 (Device ID: 134-100-258 (0x0086-0x0064-0x0102) when docker container would start and the device would be sleeping it would be loaded from cache and marked as RestartFromCache. In the latest it seems that the device is marked as dead until the device sends something.

21:07:35.825 CNTRLR   [Node 019] trying to load device config
21:07:35.850 CNTRLR   [Node 019] device config loaded
...
21:07:36.055 CNTRLR   [Node 019] Beginning interview - last completed stage: RestartFromCache
21:07:36.056 CNTRLR » [Node 019] pinging the node...
...
21:07:36.421 SERIAL » 0x010800131301002506d5                                              (10 bytes)
21:07:36.422 DRIVER » [Node 019] [REQ] [SendData]
                      │ transmit options: 0x25
                      │ callback id:      6
                      └─[NoOperationCC]
...
21:07:43.419 CNTRLR   [Node 019] Node 19 did not respond after 1 attempts, it is presumed dead
21:07:43.425 CNTRLR   [Node 019] The node is dead.
..
21:07:43.445 CNTRLR   [Node 019] ping failed: Failed to send the command after 1 attempts (Status No
                      Ack)
21:07:43.454 CNTRLR   [Node 019] ManufacturerSpecificCC: doing a partial interview...
21:07:43.455 CNTRLR   [Node 019] [~] [Manufacturer Specific] interviewComple [Endpoint 0] [internal]
                      te: true => true
...
21:07:43.482 CNTRLR   [Node 019] VersionCC: doing a partial interview...
21:07:43.483 CNTRLR   [Node 019] [~] [Version] interviewComplete: true => tr [Endpoint 0] [internal]
                      ue
21:07:43.484 CNTRLR   [Node 019] trying to load device config
21:07:43.503 CNTRLR   [Node 019] device config loaded
21:07:43.504 CNTRLR   [Node 019] ZWavePlusCC: doing a partial interview...
21:07:43.504 CNTRLR   [Node 019] [~] [Z-Wave Plus Info] interviewComplete: t [Endpoint 0] [internal]
                      rue => true
21:07:43.506 CNTRLR   [Node 019] BatteryCC: doing a partial interview...
21:07:43.511 CNTRLR » [Node 019] querying battery status...
21:07:43.512 CNTRLR » [Node 019] pinging the node...
...
21:07:46.262 DRIVER » [Node 019] [REQ] [SendData]
                      │ transmit options: 0x25
                      │ callback id:      9
                      └─[NoOperationCC]
...
21:07:53.185 CNTRLR   [Node 019] Node 19 did not respond after 1 attempts, it is presumed dead
...
21:07:53.205 CNTRLR   [Node 019] ping failed: Failed to send the command after 1 attempts (Status No
                      Ack)
21:07:53.206 CNTRLR   [Node 019] Interview attempt (1/5) failed, node is dead.
...

21:15:58.798 SERIAL « 0x010e00040013063105012a0316cd0027                                  (16 bytes)
21:15:58.805 CNTRLR   [Node 019] [Multilevel Sensor] Air temperature: metadata updated  [Endpoint 0]
21:15:58.809 CNTRLR   [Node 019] [~] [Multilevel Sensor] Air temperature: 79.3 => 79    [Endpoint 0]
21:15:58.811 SERIAL » [ACK]                                                                   (0x06)
21:15:58.817 DRIVER « [Node 019] [REQ] [ApplicationCommand]
                      └─[MultilevelSensorCCReport]
                          type:  Air temperature
                          scale: Fahrenheit
                          value: 79
21:15:58.829 CNTRLR   [Node 019] The node is now alive.
21:15:58.830 CNTRLR   [Node 019] Beginning interview - last completed stage: RestartFromCache
21:15:58.831 CNTRLR » [Node 019] pinging the node...
21:15:59.008 CNTRLR   [Node 019] The node is ready to be used

Device information

Which device(s) is/are affected (make/model)?
What are the node IDs?Node 019

Last Known Working Configuration

  • New device

  • Previously working device (node-zwave-js)

    • Which library version/docker image/adapter version?
    • Have you made any recent configuration changes to the device? Describe.
  • Previously working device (other platform)

    • Which platform?
    • Have you made any recent configuration changes to the device? Describe.

Installation information
How did you install node-zwave-js?

  • zwavejs2mqtt (latest) docker image
  • zwavejs2mqtt (dev) docker image
  • ioBroker.zwave2 adapter
  • Pkg
  • Manual Docker build
    • node-zwave-js branch: fix-root-to-endpoint-mapping
    • zwavejs2mqtt branch: HEAD detached at v1.0.4
  • Manually built (as described in the docs)
  • Other:
@snagytx snagytx added the bug Something isn't working label Feb 3, 2021
@AlCalzone
Copy link
Member

Before we're hunting ghosts, please try 6.1.2 too, although I guess the relevant change is in there too.

How is the multisensor powered? Battery or USB? Did you switch that since inclusion?

@snagytx
Copy link
Author

snagytx commented Feb 3, 2021

I built a new image:

node-zwave-js
HEAD detached at v6.1.2
nothing to commit, working tree clean

zwavejs2mqtt:
HEAD detached at v1.0.4
nothing to commit, working tree clean

The issue persists - the sleeping node starts as dead.

@AlCalzone
Copy link
Member

How is the multisensor powered? Battery or USB? Did you switch that since inclusion?

@snagytx
Copy link
Author

snagytx commented Feb 3, 2021

Forgot to mention that, it's battery and always was battery.

Anything in the cache I should be looking for, for this device?

@snagytx
Copy link
Author

snagytx commented Feb 3, 2021

16:05:01.065 CNTRLR « [Node 019] received response for protocol info:
basic device class: Routing Slave
generic device class: Multilevel Sensor
specific device class: Routing Multilevel Sensor
is a listening device: true
is frequent listening: false
is a routing device: true
is a secure device: unknown
is a beaming device: true
maximum baud rate: 40000 kbps
version: 4

@AlCalzone
Copy link
Member

AlCalzone commented Feb 3, 2021

is a listening device: true

That looks different. If the node was included with battery power, it should not say that it is listening here.
Unfortunately I have no battery to test this on mine.

Please try excluding it, then reset it and include it while definitely not on USB.

@snagytx
Copy link
Author

snagytx commented Feb 3, 2021

I know this is not something "supported" but I manually updated the cache json file and marked the node as "isListening: false" and restarted the container. This is my production environment and I have automations configured bases on the motion sensor of this device. It would be really impactful to exclude it and re-include it - I'll have to change all my automations to use the new node ID.

All the fields in "Control Panel" are empty, except "Status" which is "Unknown", "Interview stage" which is "Complete".

And remains like that until the end of the scan.

2021-02-03 23:47:43.196 INFO ZWAVE: Node 19 ready: AEON Labs - ZW100 (Multisensor 6)
2021-02-03 23:47:43.197 INFO ZWAVE: Controller status: Scan completed
2021-02-03 23:47:43.198 INFO ZWAVE: Network scan complete. Found: 9 nodes

I have attached the zwavejs log file
zwavejs_1.log

@snagytx
Copy link
Author

snagytx commented Feb 4, 2021

I excluded and re-included the device, as node 36, same behavior as above, all the fields in "Control Panel" are empty, except "Status" which is "Unknown" and "Interview stage" which is "Complete".

For some reason it seems that the node is marked asleep at the very end of the scanning.

I have attached the log.

zwavejs_1.log

@AlCalzone
Copy link
Member

I find this in both logs:

01:12:40.639 DRIVER » [Node 036] [REQ] [SendData]
                      │ transmit options: 0x25
                      │ callback id:      124
                      └─[BatteryCCGet]
...
01:12:47.544 DRIVER « [REQ] [SendData]
                        callback id:     124
                        transmit status: NoAck

which means that the node did not acknowledge the receipt of the message we sent it. If the node is mains-powered, that often means there is some connectivity issue or that the node is dead.

If it is battery-powered, this is perfectly normal happens while it is asleep.

Unfortunately, both your logs don't include the inclusion and the initial interview, so I have to assume that the device is still operating in USB mode. Did you factory-reset it after exclusion?

@AlCalzone AlCalzone added waiting for info ⏳ and removed bug Something isn't working labels Feb 7, 2021
@snagytx
Copy link
Author

snagytx commented Feb 7, 2021

Yes, I did a factory reset after excluding the node. I did NOT use any USB cables for this inclusion. I just unscrew it from the wall mount, brought it closer to the computer, from where I can access zwavejs control panel, did the exclusion, did a factory reset, did the inclusion.

@AlCalzone
Copy link
Member

Okay, can you re-interview the device (refreshInfo) and send me a complete log of that?

@snagytx
Copy link
Author

snagytx commented Feb 7, 2021

I just dod a node re-interview, see attached.
Node036.log

@AlCalzone
Copy link
Member

There's a bit missing before the start of the file :(

@snagytx
Copy link
Author

snagytx commented Feb 7, 2021

is this better?

Node036.log

@AlCalzone
Copy link
Member

Interestingly, now it shows that it is not a listening device (i.e. sleeps) and it was properly detected:

18:51:33.906 CNTRLR « [Node 036] received response for protocol info:
                      basic device class:    Routing Slave
                      generic device class:  Multilevel Sensor
                      specific device class: Routing Multilevel Sensor
                      is a listening device: false <== HERE!
                      is frequent listening: false
                      is a routing device:   true
                      is a secure device:    unknown
                      is a beaming device:   true
                      maximum baud rate:     40000 kbps
                      version:               4
18:51:33.909 CNTRLR   [Node 036] The node is asleep.

I have the same device USB powered and it reads true in the same line.

About half an hour later the device actually wakes up and the interview goes through:

19:26:58.675 DRIVER « [Node 036] [REQ] [ApplicationCommand]
                      └─[WakeUpCCWakeUpNotification]
19:26:58.677 CNTRLR « [Node 036] received wakeup notification
19:26:58.683 CNTRLR   [Node 036] The node is now awake.

So... all good?

@snagytx
Copy link
Author

snagytx commented Feb 7, 2021

Yes, the main issue here is that on zwavejs2mqtt container reset the behavior of the device changes in the last releases. It used to me marked at sleeping and restored from cache, right at the beginning of the start sequence. Now it’s marked and unknown, or something similar - as mentioned a few posts back, until the end the the start sequence. In large networks it can take 10-15 minutes to complete the start sequence. This means that for 10-15 minutes the device might not be usable. Since it’s not initialized form cache until the end.

@AlCalzone
Copy link
Member

AlCalzone commented Feb 7, 2021

Please check that again. I'm guessing that was some weird interaction with the device being asleep but the controller thinking that it cannot sleep. Especially since the listening flag has changed from your first log to the last one.

Since 6.1.2, sleeping devices are almost immediately marked as asleep (and ready if they have been interviewed before).

@snagytx
Copy link
Author

snagytx commented Feb 8, 2021

Pease see attached the screenshot of the Control Panel, you can see how it starts.

Node 21 and 33 take the most to complete, 21 completed first, then 33. The moment 33 completed then 36 switched to Sleep and RestartFromCache.

Screen Shot 2021-02-08 at 8 21 43 AM
Screen Shot 2021-02-08 at 8 22 10 AM
Screen Shot 2021-02-08 at 8 23 38 AM
Screen Shot 2021-02-08 at 8 30 41 AM
zwavejs_1.log

@snagytx
Copy link
Author

snagytx commented Feb 8, 2021

Forgot to mention that I'm using the latest zwavejs2mqtt:dev, revision b9d066b9f069ae81b4fab37d5bdb95fde8f0e7bd

zwave-js 6.1.3

@AlCalzone
Copy link
Member

Did you intentionally restart the driver at 14:20:54.151?

@snagytx
Copy link
Author

snagytx commented Feb 8, 2021

Yes, I did because of:

14:20:45.246 DRIVER Restoring the network from cache failed: SyntaxError: Unexpected end of JSON input

Sometime when I stop the docker container the cache file, f4c9e959.json, get's cleared - size 0b, so I have a process to backup the json file before I stop the container, and restore the cache if f4c9e959.json is empty before I start again - sometime I forget to check.

@AlCalzone
Copy link
Member

Whoa that shouldn't happen. @robertsLando any idea how we could avoid stopping the container before zwave-js has shut down and saved the cache?

That could also explain why the interview first starts with stage "None" and then suddenly after the restart it is at RestartFromCache. Your cache files (our info about the node) get out of sync and confuse the hell out of zwave-js.

I wouldn't be surprised if the f4c9e959.json says the interview was completed, but f4c9e959.values.jsonl doesn't have some of the necessary info it should have at that point.

@AlCalzone
Copy link
Member

@snagytx about the delay: While sleeping nodes no longer delay the interview of non-sleeping nodes, it seems that their status is not correctly propagated to applications after a restart (which causes zwavejs2mqtt to not display the correct info). I'll raise an issue to track that part.

@snagytx
Copy link
Author

snagytx commented Feb 8, 2021

This is what I backup:

cp -p store/f4c9e959.* ${BK_FINAL}
cp -p store/customDevices.json ${BK_FINAL}
cp -p store/settings.json ${BK_FINAL}
cp -p store/nodes.json ${BK_FINAL}

I do the backup, before I stop the container, but I don't have to restore the files every time, even when I don't need to restore the file the outcome in the UI is the same - this is why I didn't mention it before.

@AlCalzone
Copy link
Member

cp -p store/f4c9e959.* ${BK_FINAL}

not sure - does cp overwrite by default?

Because I am seeing this in the log:

before restart:
14:20:45.671 CNTRLR   [Node 036] Interview stage completed: ProtocolInfo

after restart:
14:21:17.462 CNTRLR   [Node 036] Beginning interview - last completed stage: RestartFromCache

Which means the stage in the cache file was "Completed", not "ProtocolInfo"

@snagytx
Copy link
Author

snagytx commented Feb 8, 2021

I did a restore from backup between 14:20:45.671 and 14:21:17.46

@AlCalzone
Copy link
Member

Alright, then some of what I'm seeing starts to make sense. Thinking out loud here:

After the restore, the status of the node was Complete. zwave-js starts and resets that to RestartFromCache, so it can do the queries it does on startup.

14:21:17.462 CNTRLR   [Node 036] Beginning interview - last completed stage: RestartFromCache

You don't see this change, since it is not propagated to the UI. @robertsLando when/how do you update the node's interview stage in the UI?

The device is not immediately marked as ready although it should be (#1657), so zwavejs2mqtt does not show the information we have so far.

A few minutes later, the node is finally marked as asleep, causing it to also be marked as ready.

14:30:32.534 CNTRLR   [Node 036] The node is asleep.

You now see the RestartFromCache stage (which is correct) until it wakes up and can be queried for its current values.

@robertsLando
Copy link
Member

Whoa that shouldn't happen. @robertsLando any idea how we could avoid stopping the container before zwave-js has shut down and saved the cache?

We should listen for the container kill command with process.on and try to gracefully close it there

@AlCalzone
Copy link
Member

Should I open an issue in zjs2m?

@robertsLando
Copy link
Member

Ok! Not 100% sure it wil work but we could try

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants