Devices stop communicating randomly #19747

Open
Alfy1080 opened this issue Nov 17, 2023 · 202 comments
Labels
problem Something isn't working

Comments

@Alfy1080

What happened?

Hello. I am not a Zigbee expert, so apologies if I provide incomplete information. I will try my best to mention everything.

Two days ago, my Z2M instance in Home Assistant started acting up randomly.

At random times, one or more Zigbee devices stop executing commands.
ex1: An Aqara TRV does not change its target temperature when I try to change it from either Home Assistant or the Z2M interface, and its last seen value keeps increasing as if there were no communication between the TRV and the coordinator.
ex2: A Philips Hue Lightstrip does not switch on or off when I try to toggle it from either Home Assistant or Z2M.

I have set the logging level to debug, and when I try to send a command to the stuck device, the command shows up in the logs without any error whatsoever.

Power cycling the stuck device or pressing the pairing button (where applicable) does nothing.
The only thing that seems to get my Zigbee network back up and running temporarily is restarting Z2M, which makes me think that this is caused by something in Z2M and not by the devices that fail. Right after the Z2M restart, all stuck devices start communicating again for a while, until at some point either the same devices or others show the same behaviour as before.

My setup:
Zigbee dongle: Home Assistant SkyConnect flashed with the latest firmware available through the web flasher here https://skyconnect.home-assistant.io/firmware-update/
Zigbee2Mqtt: Latest addon version available for Home Assistant (1.33.2-1)
Home Assistant Core version: 2023.11.2
Home Assistant Supervisor version: 2023.11.3
OS: Debian 12
Server: Dell OptiPlex 9020 Micro, Core i7-4790t 3.90GHz, 16GB DDR3, SSD

What did you expect to happen?

I expected that no device would get into a frozen state where I cannot issue commands to it or receive state changes from it. At least not as often as once every few minutes/hours.

How to reproduce it (minimal and precise)

There are no reproduction steps that I can think of. This issue can happen even when no Zigbee device receives any command at all. I even left home for a few hours and, after leaving, restarted zigbee2mqtt to make sure it was all in working order and would not get any commands from anyone, since nobody was home. When I returned home, one of the Aqara TRVs was stuck.

Zigbee2MQTT version

1.33.2

Adapter firmware version

7.2.2.0 build 190

Adapter

Home Assistant SkyConnect

Debug log

log.txt

@Alfy1080 Alfy1080 added the problem Something isn't working label Nov 17, 2023
@Alfy1080
Author

Here you can see I have updated the temperature on 3 Aqara TRVs:

Living_Room_TRV: 21.5
Kids_Room_TRV: 17.5
Bedroom_TRV: 22.5

debug 17-11-2023 18:24:29: Received MQTT message on 'zigbee2mqtt/Living_Room_TRV/set/occupied_heating_setpoint' with data '21.5'
debug 17-11-2023 18:24:29: Publishing 'set' 'occupied_heating_setpoint' to 'Living_Room_TRV'
debug 17-11-2023 18:24:29: Received MQTT message on 'zigbee2mqtt/Kids_Room_TRV/set/occupied_heating_setpoint' with data '17.5'
debug 17-11-2023 18:24:29: Publishing 'set' 'occupied_heating_setpoint' to 'Kids_Room_TRV'
debug 17-11-2023 18:24:29: Received MQTT message on 'zigbee2mqtt/Bedroom_TRV/set/occupied_heating_setpoint' with data '22.5'
debug 17-11-2023 18:24:29: Publishing 'set' 'occupied_heating_setpoint' to 'Bedroom_TRV'
info 17-11-2023 18:24:29: MQTT publish: topic 'zigbee2mqtt/Bedroom_TRV', payload '{"away_preset_temperature":null,"battery":100,"calibrate":null,"calibrated":null,"child_lock":"UNLOCK","device_temperature":27,"internal_heating_setpoint":30,"last_seen":"2023-11-17T18:24:29+02:00","linkquality":164,"local_temperature":23.5,"occupied_heating_setpoint":23,"power_outage_count":0,"preset":"manual","schedule":null,"schedule_settings":null,"sensor":"external","setup":false,"system_mode":"heat","update":{"installed_version":2590,"latest_version":2590,"state":"idle"},"update_available":null,"valve_alarm":false,"valve_detection":"ON","voltage":3300,"window_detection":"OFF","window_open":null}'
info 17-11-2023 18:24:29: MQTT publish: topic 'zigbee2mqtt/Bedroom_TRV', payload '{"away_preset_temperature":null,"battery":100,"calibrate":null,"calibrated":null,"child_lock":"UNLOCK","device_temperature":27,"internal_heating_setpoint":30,"last_seen":"2023-11-17T18:24:29+02:00","linkquality":164,"local_temperature":23.5,"occupied_heating_setpoint":22.5,"power_outage_count":0,"preset":"manual","schedule":null,"schedule_settings":null,"sensor":"external","setup":false,"system_mode":"heat","update":{"installed_version":2590,"latest_version":2590,"state":"idle"},"update_available":null,"valve_alarm":false,"valve_detection":"ON","voltage":3300,"window_detection":"OFF","window_open":null}'
debug 17-11-2023 18:24:30: Received Zigbee message from 'Bedroom_TRV', type 'attributeReport', cluster 'hvacThermostat', data '{"occupiedHeatingSetpoint":2250}' from endpoint 1 with groupID 0
info 17-11-2023 18:24:30: MQTT publish: topic 'zigbee2mqtt/Bedroom_TRV', payload '{"away_preset_temperature":null,"battery":100,"calibrate":null,"calibrated":null,"child_lock":"UNLOCK","device_temperature":27,"internal_heating_setpoint":30,"last_seen":"2023-11-17T18:24:30+02:00","linkquality":168,"local_temperature":23.5,"occupied_heating_setpoint":22.5,"power_outage_count":0,"preset":"manual","schedule":null,"schedule_settings":null,"sensor":"external","setup":false,"system_mode":"heat","update":{"installed_version":2590,"latest_version":2590,"state":"idle"},"update_available":null,"valve_alarm":false,"valve_detection":"ON","voltage":3300,"window_detection":"OFF","window_open":null}'

Out of these 3 TRVs, only the Bedroom_TRV actually executed the command. The other two completely ignored it, and their last seen status did not update when I changed the target temperature:

image
Attached another full log after changing the target temperatures:
log.txt
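
For anyone who wants to check a debug log for the same pattern, here is a minimal Python sketch (not part of Z2M). It lists devices that were sent a 'set' but from which no Zigbee message was logged afterwards. It assumes the debug lines are worded like the ones quoted above. `silent_devices` and the abbreviated sample lines are illustrative only, and a device name containing a single quote would not be matched.

```python
import re

# Patterns copied from the wording of the debug lines quoted in this issue.
PUBLISH_RE = re.compile(r"Publishing 'set' '[^']+' to '([^']+)'")
RECEIVED_RE = re.compile(r"Received Zigbee message from '([^']+)'")


def silent_devices(log_lines):
    """Return devices that were sent a 'set' and never logged a Zigbee message afterwards."""
    pending = set()
    for line in log_lines:
        sent = PUBLISH_RE.search(line)
        if sent:
            pending.add(sent.group(1))
            continue
        heard = RECEIVED_RE.search(line)
        if heard:
            # Any later message from the device counts as a sign of life.
            pending.discard(heard.group(1))
    return sorted(pending)


# Abbreviated versions of the log lines from this comment.
SAMPLE = [
    "debug 17-11-2023 18:24:29: Publishing 'set' 'occupied_heating_setpoint' to 'Living_Room_TRV'",
    "debug 17-11-2023 18:24:29: Publishing 'set' 'occupied_heating_setpoint' to 'Kids_Room_TRV'",
    "debug 17-11-2023 18:24:29: Publishing 'set' 'occupied_heating_setpoint' to 'Bedroom_TRV'",
    "debug 17-11-2023 18:24:30: Received Zigbee message from 'Bedroom_TRV', type 'attributeReport'",
]

print(silent_devices(SAMPLE))  # ['Kids_Room_TRV', 'Living_Room_TRV']
```

A battery device that simply has not reported yet would also show up here, so the result is only a hint about where to look, not proof of a stuck device.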

@Ra72xx

Ra72xx commented Nov 18, 2023

I have a similar problem with the Aqara thermostats (#19342). Sometimes the communication seems to fail, however Z2M does not notice that and pretends everything is right.

@Alfy1080
Author

Forgot to mention that I have fully erased zigbee2mqtt from my system, reinstalled and reconfigured Z2M from scratch, and re-paired all my devices to my coordinator, and the exact same issue is still happening randomly. As a workaround, I have set up an automation in Home Assistant to restart the Z2M addon every 30 minutes, just to make sure my TRVs don't stay stuck for too long and cause my heating to run forever. I have also ordered a Sonoff Dongle-P to rule out the possibility that the SkyConnect is broken, but that will arrive in one or two days, so I'm still waiting to test that.
I also replaced my USB extension cable to make sure I'm not using a faulty cable, but that didn't improve the situation either.

@Alfy1080
Author

I have reflashed my skyconnect dongle and re-paired everything from scratch, just to rule out the possibility of the last firmware update flashing improperly and corrupting my dongle, but the issue is still there. I have attached the latest debug log here.
z2mlog.txt

@PeterKawa

PeterKawa commented Nov 20, 2023

EDIT3: 03/24/2024:
still stable! 👍


EDIT2 03/15/2024:
Seems to be getting solved. I'm now testing with z2m 1.36.0-dev commit: 56feb77 'edge', and I updated the Sonoff dongle-E's firmware to revision: 7.4.1.0, results are here.
With the EZSP driver it's pretty stable now for the first day.


EDIT:
This seems to go unnoticed?
I couldn't find a 'normal' way to downgrade..
So, I just restored HA from before Oct. 1st, but I had days of work to re-pair all zigbee devices, most of them needed to be re-paired multiple times.

It's now running Z2M v1.33.0 again, and as I expected, without any issue.


Initial issue post:

What happened?
Similar issue here, but with "Sonoff Dongle Plus E"
I'm aware of its experimental state, but it has run just fine, without any issue, for the last six months...

It started when the add-on was updated to v.1.33.2, but it took a while before I noticed, and realized this v.1.33.2 update can contain bugs for my setup.
I checked my wifi & zigbee channels, wifi 1 and zigbee 20 should be fine.
Can't discover useful info in the warnings / errors in the logs.
I updated to Edge, but it results in exactly the same behaviour.

Sometimes restarting z2m resolves the unresponsiveness, sometimes one device stays unresponsive.
And quite often 5 to 8 devices suddenly have an Offline status, but when I turn each of them on/off in the frontend, the status is suddenly online. I can't operate them as entities in the normal HA interface, only by using the switches in the Z2M frontend, until the status is online again.

I hope my herdsman log will reveal something.

What did you expect to happen?
I expected my zigbee devices to respond as they should

How to reproduce it (minimal and precise)
Install update v1.33.2 should do the trick

Zigbee2MQTT version
v.1.33.2
and since nov. 18th: 1.33.2-dev commit: ad4bed8

Adapter firmware version
6.10.3.0 build 297

Adapter
Sonoff Zigbee 3.0 USB Dongle Plus ZBDongle-E (with 50cm extension cable, USB2)

Debug log
herdsman-log.txt

Device log
error-log.txt

Devices
TS011F: 12
TS0501A: 4
lumi.sensor_wleak.aq1: 4
lumi.sensor_magnet.aq2: 4
lumi.weather: 4
lumi.sensor_motion.aq2: 3
TRADFRIbulbE14WWclear250lm: 3
01MINIZB: 1
TS0201: 1
TS0601: 1
TRADFRI control outlet: 1
(Router devices: 21, End-devices: 17)

@Ra72xx

Ra72xx commented Nov 21, 2023

I only recently switched from Conbee II / Deconz to Skyconnect / Z2M, so I can't judge whether the problem appeared only with recent versions of Z2M. The setup was more stable with Deconz (which I didn't expect)!
I have also improved my Zigbee situation by putting the stick on a USB 2.0 hub with a 2m cable, away from the rest of the setup. Nevertheless, I have quite a lot of such occurrences (devices getting "out of sync" with the state Z2M thinks they are in, devices going offline or silently dropping out of the network). Sometimes operating such devices from the frontend makes them available again, but this is not always recognized by Home Assistant. Sometimes they have to be re-paired.
Some types of devices seem to be prone to this problem, e.g. the Aqara thermostats (or the problem is more obvious with a heating schedule in autumn than with random lights or sensors).
If this problem was, as you said, not present in previous versions, I hope that it is not a fundamental problem...

@Alfy1080
Author

UPDATE: My new Sonoff Zigbee Dongle-P arrived yesterday and I have replaced my Skyconnect with it and re-paired everything to it. So far since yesterday I had no devices that got stuck the same way as they did on the Skyconnect dongle. I only had an issue with some roller shade motors which was fixed by flashing the latest firmware on the sonoff dongle and re-pairing the motors in z2m.

So it seems to me that, at least in my case, the issue is caused by a combination of the Skyconnect and the zigbee2mqtt version, since, as I've said previously, this issue started happening recently even though I have been running the same setup in terms of dongle, routers and client devices for over a year now.

@Ra72xx

Ra72xx commented Nov 21, 2023

That would be extremely annoying, as I switched 70 devices from Deconz to Z2M/Skyconnect only recently to get rid of intermittent instability, and seemingly I have now traded that in for persistent instability. (Yes, I know, Skyconnect is not yet officially supported...).
EDIT: Something similar here: #19648

@Aleborg

Aleborg commented Nov 21, 2023

I have the same issue, using Sonoff Zigbee Dongle-E

Devices affected:
TS110E_1gang_1
TS110E_2gang_1
TS0001_power
TS0001_switch_module

They randomly stop responding to commands but still seem to report state.
A restart of zigbee2mqtt temporarily resolves it.

@Ra72xx

Ra72xx commented Nov 22, 2023

I tried to change QoS, but this doesn't help (however, as QoS is only between MQTT and Z2M, I admit I did not really expect anything from this setting).

@Ra72xx

Ra72xx commented Nov 22, 2023

Maybe related discussion: #19763

@JonasSL

JonasSL commented Nov 23, 2023

I experience the exact same problem after updating to the same Z2M version. I also use a Skyconnect as coordinator, and I am also having troubles with my Aqara TRV.

A restart of Z2M fixes the issue.

@helgek

helgek commented Nov 23, 2023

Having a similar issue, randomly one of my devices would not respond to commands coming from Z2M, results in a timeout error: #13993 (comment) - coordinator is the highly praised SLZB-0 (based on stable chipset CC2652P recommended by Z2M, not the experimental EFR32MG21 chipset).

@lennartgrunau

Hey everyone! I have the same issue. The devices (in my case only IKEA LED2005R5 (3x)) randomly disconnect and show up as unavailable - and offline in z2m.
However those devices still react to commands to a zigbee group that they are placed in. So, while I cannot control the devices directly, I can control them via a zigbee group, which is super weird to me.

Can anyone recreate this?

Thanks & Cheers

Zigbee2MQTT version
1.33.2 commit: unknown
Coordinator type
EZSP v9
Coordinator revision
7.1.1.0 build 273
Frontend version
0.6.142
Zigbee-herdsman-converters version
15.106.0
Zigbee-herdsman version
0.21.0

Home Assistant Sky Connect as a coordinator.

@Ra72xx

Ra72xx commented Nov 27, 2023

Is there any way to push issues like this? The random communication loss is something which is a complete deal breaker for me, as I can no longer rely on my Zigbee network for lights, radiators and switches of all kind. Seemingly every day I have to count my devices and check the thermostats if they are really operating etc.

However this and quite a few other bug reports concerning similar problems I've been monitoring on this list for the last few days don't seem to get any developer attention and are probably simply getting lost because of newer issues.

IMHO a network problem which results in losing communication with network members without even notifying the user, while pretending everything is fine, is one of the worst things which can happen.

@digitalkaoz

I experienced the same with my setup (Aqara thermostats & Sonoff Dongle Plus) running on 1.33.2.

So a downgrade to 1.33.0 fixes the random unresponsiveness to commands? I'm mostly a "fail forward" person, so is there anything I can do except downgrading? Looking into the logs, I see some warnings; I don't know if they are related:

warn  2023-11-27 08:56:15: zigbee-herdsman-converters:aqara_trv: Unknown key 641 =4��44
warn  2023-11-27 08:56:15: zigbee-herdsman-converters:aqara_trv: Unknown key 643 = 0
warn  2023-11-27 08:56:15: zigbee-herdsman-converters:aqara_trv: Unknown key 644 = 0

@Ra72xx

Ra72xx commented Nov 27, 2023

I also see those messages with the strange keys some times, but not regularly.

@trackhacs

I am in on this too. I just moved from separate Docker HA / Z2M to HAOS with the Z2M addon under Proxmox.

Also switched from ConbeeII to ZBDongle (EZSP v12) with 7.3.1.0 build 176.

An Ikea lamp which worked fine for at least one year now randomly becomes unavailable. Toggling its state in the Z2M GUI revives it, though. Sometimes it also comes back on its own.

@Ra72xx

Ra72xx commented Dec 1, 2023

Unfortunately, nobody's interested in the problem :-(

@PeterKawa

PeterKawa commented Dec 1, 2023

@digitalkaoz wrote:

So a downgrade to 1.33.0 fixes the random unresponsiveness to commands?

In my case it did, Robert. It works flawlessly, as it did before, with no issues at all, running v1.33.0.

@Ra72xx Hmmm... The update's log v1.34.0-1 of today Dec. 1st, shows nothing about this issue or a fix for it (yet). I'm not updating for sure.

@trackhacs

I just updated. I'll give feedback on whether the behavior improved.

@Aleborg

Aleborg commented Dec 1, 2023

I updated earlier today, no difference, would say that it got a little worse...

@xEcho1

xEcho1 commented Dec 2, 2023

Had the same problem when I updated to v1.34.0-1 from 1.33.0-1. My Aqara TRVs would not respond to anything, while my Ikea Tradfri lights would still work. I'm using the Sonoff Dongle Plus V2. Restoring my backup of v1.33.0-1 makes them respond again.

@trackhacs

Yes, I can unfortunately confirm: it's better with the update, but it still happens and is still a problem.

Zigbee2MQTT version
[1.34.0](https://github.com/Koenkk/zigbee2mqtt/releases/tag/1.34.0) commit: [unknown](https://github.com/Koenkk/zigbee2mqtt/commit/unknown)
Coordinator type
EZSP v12
Coordinator revision
7.3.1.0 build 176
Coordinator IEEE Address
0xe0798dfffe741678
Frontend version
0.6.147
Zigbee-herdsman-converters version
15.130.1
Zigbee-herdsman version
0.25.0

@Ra72xx

Ra72xx commented Dec 4, 2023

Has anybody tried simply downgrading to the Docker container 1.33.0 and keeping the database? I have no backup to restore...

@digitalkaoz

@Ra72xx that didn't work for me, I had to re-pair all devices.

@Ra72xx

Ra72xx commented Dec 4, 2023

So we have to wait until somebody takes note of this problem (which seemingly doesn't happen) and fixes it going forward.
If I have to re-pair everything, there is some temptation to give ZHA a chance... I started using Z2M only a few weeks ago instead of Deconz, and this problem is the famous first impression :-(.

@Nerivec
Collaborator

Nerivec commented Mar 14, 2024

Spamming devices. Maybe this is a reason of huge log file and the large size of my database (HA recorder);

Change your Z2M log level to warn (unless you are troubleshooting something). ember will still log important stuff, even with warn. With spammy devices, info level creates a lot of MQTT publish lines. For example, your 16MB 12h log file is reduced to 35KB by removing these. That should make your hardware happy.
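
To get a feel for how much of a log is taken up by these lines, a rough Python sketch like the one below can compare the log size with and without them. It assumes the publish lines contain the text `MQTT publish`, as in the info lines quoted earlier in this issue. `publish_share` and the sample lines are made up for illustration, and the exact wording may differ between Z2M versions.

```python
def publish_share(log_lines, marker="MQTT publish"):
    """Return (total_bytes, bytes_left_after_dropping_lines_containing_marker)."""
    total_bytes = 0
    kept_bytes = 0
    for line in log_lines:
        size = len(line.encode("utf-8"))
        total_bytes += size
        if marker not in line:
            kept_bytes += size
    return total_bytes, kept_bytes


# Two illustrative lines: one publish line and one warning.
SAMPLE = [
    "info 17-11-2023 18:24:29: MQTT publish: topic 'zigbee2mqtt/Bedroom_TRV', payload '{}'\n",
    "warn  2023-11-27 08:56:15: zigbee-herdsman-converters:aqara_trv: Unknown key 643 = 0\n",
]

total, kept = publish_share(SAMPLE)
print(f"{total} bytes in total, {kept} bytes without publish lines")
```

For a real file, the same function can be given an open file handle, since iterating over a text file yields its lines one by one.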

As a consequence, sometimes there is a delay in the response (from devices) and execution of commands;

You are on release version, The slight delay is likely due to the config for message processing interval. This has been lowered and should no longer be perceptible after April release.

We will need more time to establish "acceptable values" for counters, but in the meantime, here are a few more details (a bit technical though) on the counters that seem relevant according to your logs (remember, these are cleared right after they are logged, meaning each log accounts for the past hour):

  • MAC_TX_UNICAST_RETRY: The MAC retried a unicast Data or Command frame after initial Tx attempt.
  • MAC_TX_UNICAST_FAILED: The MAC unsuccessfully transmitted a unicast Data or Command frame.
  • PHY_CCA_FAIL_COUNT: The number of times the PHY layer was unable to transmit due to a failed CCA (Clear Channel Assessment) attempt.
  • NWK_FRAME_COUNTER_FAILURE: A message was dropped at the Network layer because the NWK frame counter was not higher than the last message seen from that source.
  • NEIGHBOR_STALE: A neighbor table entry became stale because it had not been heard from.

Overall, I'd say you likely have another 2.4GHz network (WiFi, Zigbee) nearby that is creating trouble (too many failed CCA attempts), and you probably have at least one bad router (too much traffic is being retried/lost). I'm seeing a few too many route errors related to 0x6A86, I'd start investigating there.

@RomchikL

RomchikL commented Mar 14, 2024

Nerivec, I'm grateful for your help!

You are on release version, The slight delay is likely due to the config for message processing interval. This has been lowered and should no longer be perceptible after April release.

In the discussion you said that the normal delay (for the stable version of Z2M) is about 1 sec.
But sometimes I see a delay of about 5-6 sec.

Overall, I'd say you likely have another 2.4GHz network (WiFi, Zigbee) nearby that is creating trouble (too many failed CCA attempts)

Yeah, you are right. My WiFi router is near the stick (~40cm), but I separated the channels to minimize problems: Z2M is on channel 11 and the WiFi network is on channel 3.

you probably have at least one bad router (too much traffic is being retried/lost). I'm seeing a few too many route errors related to 0x6A86, I'd start investigating there.

Hmm, the map shows that I have 5 types of routers (15 devices in total):

My map

image

I'd start investigating there

How to understand which router is bad?

@Nerivec
Collaborator

Nerivec commented Mar 14, 2024

Yeah, you are right. My WiFi router is near the stick (~40cm), but I separated the channels to minimize problems: Z2M is on channel 11 and the WiFi network is on channel 3.

Use a 2.4GHz WiFi scan app with your phone, one that tells which channels are most used around you. That will tell you about any WiFi in range in your area, not just yours. Note: changing the channel will currently require re-pairing all devices.

but in a metal box

Depending on the metal, that definitely could cause signal problems.

How to understand which router is bad?

When looking at routing errors popping up in your logs (Received network/route error...), identify which devices are causing the most. Usually there is a problem when too many of these show up. Note that routing errors do not necessarily mean a bad router; they can also be caused by nearby interference (or both...).
You can find information on the zigbee2mqtt.io page of the device. Usually, if the device is known to be a bad router (or just bad), it will be specified.
You can also add one or more dedicated routers at strategic positions (you need a socket and an old USB phone charger, or something similar, to plug it in). It's not very expensive, and with the web flasher, the procedure is now a lot easier (plug-flash-pair). It can help a lot to cover an area where routers are having trouble providing good service (which is often the case with Tuya routers).
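
One way to do the counting suggested above is a small Python sketch like this one. It tallies the 16-bit network addresses (such as 0x6A86) that appear on lines containing the text `network/route error`. The full wording of those log lines is not shown in this thread, so the sample lines below are invented placeholders, and `route_error_counts` is a hypothetical helper, not something Z2M provides.

```python
import re
from collections import Counter

# Matches a 16-bit network address such as 0x6A86, but not a 64-bit IEEE address.
ADDRESS_RE = re.compile(r"0x[0-9A-Fa-f]{4}\b")


def route_error_counts(log_lines, marker="network/route error"):
    """Count how often each network address appears on route error lines."""
    counts = Counter()
    for line in log_lines:
        if marker in line:
            for address in ADDRESS_RE.findall(line):
                # Normalise the hex digits so 0x6a86 and 0x6A86 are counted together.
                counts["0x" + address[2:].upper()] += 1
    return counts.most_common()


# Placeholder lines: only the 'network/route error' text comes from this thread.
SAMPLE = [
    "Received network/route error (placeholder detail) for 0x6A86",
    "Received network/route error (placeholder detail) for 0x1B2C",
    "Received network/route error (placeholder detail) for 0x6a86",
    "Some unrelated line mentioning 0x6A86",
]

for address, count in route_error_counts(SAMPLE):
    print(address, count)
```

The addresses at the top of the list are the ones worth looking up on the Z2M map first, keeping in mind that interference can produce the same errors.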

@PeterKawa

PeterKawa commented Mar 14, 2024

@Nerivec

Very good! Let me know how it goes over the next few days regarding your previous troubles.

Yeah thanks! Didn't work out for me with the ember driver for now, it did run very fine for about an hour, and then pretty much all devices got sluggish and unresponsive. I will report with logs in the right place, conversation #21462

BUT, I have great news: I got the idea to 'just' switch to the EZSP driver, after reading this

If you had been using v13 for previous ezsp tests (no backup available), and you keep your network settings in configuration.yaml the same (meaning they will match that of the adapter the last time it ran), after swapping for ember, it should simply take over where you left off

, and it worked out very well.
I have a day with almost no 'numb' devices woohoo! And I disabled "disable_tuya_default_response" again, after all devices were paired again.

Do you still have the error you had with the Sky? I can give you an explanation/solution for future reference.

Sorry but not anymore, I'm a bit of an unorganized person, but I'll test it later on with a cloned VM. I was now working with my main VM, and was a bit in a hurry to get it running again :P.
I will continue testing the Sky with the clone VM one of these days!

Did you end up looking into the content of that huge configuration.yaml file before you made your changes? I'm very curious as to what could have messed with it and how.

Also sorry but no. Idk why, but I deleted it (I'm not a 'trashcan' user, which, for me, is a very good invention ;-) )
But who knows, I'll also test the ember driver+sonoff-E on my test VM

Change your Z2M log level to warn

That's a good idea, I changed it as well

Again, thanks a mil!

@RomchikL

Use a 2.4GHz WiFi scan app with your phone, one that tells which channels are most used around you. That will tell you about any WiFi in range in your area, not just yours.

I used my WiFi router's scan feature; the result is below. It is clear that the situation is not static: neighboring routers regularly change channels depending on interference. But the average number of networks per channel is ~10-15 (the max value is 20-25). Yes, it's a lot, but I can't control the neighboring networks.

Wi-fi scan

image

Depending on the metal, that definitely could cause signal problems.

The steel box, yes, it can cause problems, but it's located pretty close to the stick. And, according to the Koenkk answer, I can't turn off the router property, just add another one next to it.

Usually, if the device is known to be a bad router (or just bad), it will be specified.

My router devices do not have such marks on their pages. By default, they are not bad. Probably, a lot of networks on channel 11 cause such problems. I keep checking.

@Nerivec
Collaborator

Nerivec commented Mar 15, 2024

According to your scan, your best bet would probably be to switch to channel 20 or 25 to move further away from WiFi range. Channel 11 is definitely crowded already.
I usually recommend using an app on the phone because you can test in various corners of the house. A neighbor WiFi might not show up "as interfering as much" from the static location of your WiFi router/access point, but be causing trouble with a specific router in one corner, because that corner is much closer to it.
Obviously changing the Zigbee channel is a bit of a pain if you have lots of devices, but when you have some time to re-pair everything, I'd advise it in your situation. Hopefully we can make that transition smoother in the future. 😉
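
The channel advice above can be checked with a little arithmetic. The Python sketch below uses the standard 2.4 GHz center frequencies (Zigbee channel k at 2405 + 5(k - 11) MHz, WiFi channel n at 2407 + 5n MHz for channels 1 to 13) and treats a WiFi channel as 20 MHz wide and a Zigbee channel as 2 MHz wide. Those nominal widths are assumptions, real transmitters leak beyond them, and the function names are made up for this example.

```python
def zigbee_center_mhz(channel):
    """Center frequency of a 2.4 GHz Zigbee (IEEE 802.15.4) channel, 11 to 26."""
    if not 11 <= channel <= 26:
        raise ValueError("2.4 GHz Zigbee channels run from 11 to 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel):
    """Center frequency of a 2.4 GHz WiFi channel, 1 to 13."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers WiFi channels 1 to 13")
    return 2407 + 5 * channel


def overlaps(zigbee_channel, wifi_channel, wifi_width_mhz=20, zigbee_width_mhz=2):
    """True if the two nominal bands overlap (the widths are assumptions)."""
    distance = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return distance < (wifi_width_mhz + zigbee_width_mhz) / 2


def wifi_channels_overlapping(zigbee_channel):
    """List the WiFi channels (1 to 13) whose nominal band covers this Zigbee channel."""
    return [n for n in range(1, 14) if overlaps(zigbee_channel, n)]


for zigbee_channel in (11, 20, 25):
    print(zigbee_channel, zigbee_center_mhz(zigbee_channel), wifi_channels_overlapping(zigbee_channel))
```

Under these assumptions, Zigbee channel 11 (2405 MHz) sits inside WiFi channel 1 but not WiFi channel 3, and channel 25 (2475 MHz) only collides with WiFi channels 12 and 13. That fits the suggestion to move to a higher Zigbee channel when the low WiFi channels are crowded.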

@SPGWhistler

1.36.0-dev com

THANK YOU! I was pulling my hair out. I don't know if I had an old version or something but for me, this just started happening a few days ago - random devices would just stop working - and the only fix was restarting z2m. Hopefully this commit fixes the issue.

@Freestylerrr
Copy link

For me the problems appeared in version 1.36.0 too. In the last few days, random devices have stopped accepting Z2M commands.

It started with all of the Tuya smart plugs after the update to 1.36.0: they stopped responding to any commands but still worked locally (when I pressed the button). I unpaired and re-paired all of them and they started to work.

A few days ago, two Danfoss Ally thermostats stopped responding to all commands.

My hw is RPi4 with external SSD and Sonoff Dongle Plus E

Yesterday and today, it was one Danfoss Ally thermostat and two Tuya smart plugs.

The symptoms are always the same, but the affected devices differ:

  • the device shows as available in Z2M
  • the device reports LQI
  • the last seen is updating

If I try, for example, to turn off the plug, the switch in HA moves but instantly goes back, and nothing happens. The same happens if I try to send an update for the Danfoss external temp sensor or change the temperature.

Sometimes restarting Z2M helps; sometimes I need to reboot the whole device (RPi4). Or maybe I'm too impatient, and restarting Z2M and waiting a few minutes would be enough. I don't see any errors in the Z2M logs.

The worst thing is that I can't tell which devices have stopped working, because in Z2M everything looks alright. I only find out when an automation does nothing or when I want to turn a plug on/off or something else.

Is there a way to at least know that some device is in this state?

What would you guys recommend? I don't blame the devices, because I don't need to reboot them or do anything with them; restarting Z2M or the RPi fixes it (sadly just for hours or days), so it should not be a faulty router or something.

@Nerivec
Collaborator

Nerivec commented Mar 20, 2024

@Freestylerrr Can you give the following details:

  • driver ezsp or ember
  • MQTT broker & version
  • HA version

First, make sure the PI is (still) getting enough power; this is a very common problem. You'll find lots of resources on this with a quick web search.

Also make sure the commands going from HA go all the way to Z2M, that MQTT isn't causing trouble in the middle (check MQTT logs for errors, can also check HA logs for errors while you're at it -after activating one of the problem switches-).

Then you can try this: Shut down Z2M. Unplug the coordinator. Wait for a minute. Plug the coordinator back in. Restart Z2M. Then see if the troubles re-appear in the next few days.

PS: Forcing the OS to restart when Z2M/adapter is starting is definitely not ideal. You should avoid that in the future unless there is a good reason.

@Freestylerrr

@Nerivec

driver: ezsp
Mosquitto broker version: 6.4.0

HA version:

Core: 2024.3.1
Supervisor: 2024.03.0
Operating System: 12.1
Frontend: 20240307.0

I checked the MQTT broker log when the problem was present, but I didn't see any errors. But actually I don't know exactly how the broker works. There is not much output in the log, and when I perform an action, it is not logged there (maybe it's just disabled).

This is part of the broker log. I clicked a button at 16:59, but nothing was logged (the action was performed, though).

2024-03-20 16:52:29: New connection from 172.30.32.2:39180 on port 1883.
2024-03-20 16:52:29: Client closed its connection.
2024-03-20 16:54:29: New connection from 172.30.32.2:39412 on port 1883.
2024-03-20 16:54:29: Client closed its connection.
2024-03-20 16:54:53: Saving in-memory database to /data//mosquitto.db.
2024-03-20 16:56:29: New connection from 172.30.32.2:48626 on port 1883.
2024-03-20 16:56:29: Client closed its connection.
2024-03-20 16:58:29: New connection from 172.30.32.2:33364 on port 1883.
2024-03-20 16:58:29: Client closed its connection.

In HA log I see some errors, but nothing related to Z2M (there is google API which throws error, because it's garbage, some errors for Helium integration etc.)

In the Z2M log I have only this (I changed the log level from info to warn, because there was a lot of spam from MQTT publish).

This is since last reboot:

[13:31:59] INFO: Preparing to start...
[13:31:59] INFO: Socat not enabled
[13:32:01] INFO: Starting Zigbee2MQTT...
Zigbee2MQTT:error 2024-03-20 13:32:19: Entity 'homeassistant/sensor' is unknown
Zigbee2MQTT:error 2024-03-20 13:32:19: Entity 'homeassistant/sensor' is unknown
Zigbee2MQTT:error 2024-03-20 13:32:20: Entity 'homeassistant/sensor' is unknown
Zigbee2MQTT:error 2024-03-20 13:32:20: Entity 'homeassistant/sensor' is unknown

@albsch

albsch commented Mar 20, 2024

PS: Forcing the OS to restart when Z2M/adapter is starting is definitely not ideal. You should avoid that in the future unless there is a good reason.

What is the reason for that? Is Z2M not robust against restarts at any time?

@Freestylerrr

@Nerivec
So it lasted 4 days. Today these errors appeared in the log and the TRV stopped responding to commands. Maybe some other devices too, but I found it at night, so I didn't have time to look at what works and what doesn't. A simple restart of Z2M and it's working again.

error 2024-03-24 15:03:26: Publish 'set' 'state' to 'Smart Plug - Living Room at window' failed: 'Error: Command 0xa4c13833d082358c/1 genOnOff.off({}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Timeout - 19566 - 1 - 153 - 6 - 11 after 10000ms)'
error 2024-03-24 19:43:48: Publish 'set' 'external_measured_room_sensor' to 'Bedroom Thermostat - Danfoss Ally' failed: 'Error: Write 0x5cc7c1fffed49467/1 hvacThermostat({"danfossExternalMeasuredRoomSensor":2187}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":4678,"transactionSequenceNumber":null,"writeUndiv":false}) failed (sendZclFrameToEndpointInternal error)'

@Ra72xx

Ra72xx commented Mar 25, 2024

Same here for me: Zigbee seems to stop working every few days, but a restart fixes it. This seems a bit different from the original problem, where devices dropped out randomly.
Like you, I did no debugging but simply restarted, as I hope that the new ember driver will help (I still use the old driver).

@Nerivec
Collaborator

Nerivec commented Mar 25, 2024

Very hard to say without herdsman debug (new logging coming soon, should make it easier to switch it on when needed). But the timeout followed by sendZclFrameToEndpointInternal error is making me think something is going wrong with the connection to the adapter. You may want to double-check everything involved there. Cables still ok? Power to the USB still ok? Nothing overheating? etc...
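To make those timeout errors a bit easier to read, here is a minimal Python sketch that splits the `Timeout - ... after ...ms` part of a log line into named fields. The field order (network address, endpoint, sequence number, cluster ID, command ID) is an assumption inferred from the log lines in this thread (for example, `6` lines up with genOnOff and `0` with genBasic); it is not taken from Z2M documentation. `TimeoutInfo` and `parse_timeout` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class TimeoutInfo(NamedTuple):
    # Field meanings are an assumption based on the logs in this thread.
    address: int      # 16-bit network address, decimal in the log
    endpoint: int
    sequence: int     # transaction sequence number
    cluster_id: int   # e.g. 6 for genOnOff, 0 for genBasic
    command_id: int
    timeout_ms: int


TIMEOUT_RE = re.compile(
    r"Timeout - (\d+) - (\d+) - (\d+) - (\d+) - (\d+) after (\d+)ms"
)


def parse_timeout(line: str) -> Optional[TimeoutInfo]:
    """Return the parsed timeout fields, or None if the line has none."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return TimeoutInfo(*(int(group) for group in match.groups()))


# Tail of the genOnOff.off error quoted earlier in the thread.
sample = "failed (Timeout - 19566 - 1 - 153 - 6 - 11 after 10000ms)'"
info = parse_timeout(sample)
if info is not None:
    print(f"address=0x{info.address:04X} endpoint={info.endpoint} "
          f"cluster=0x{info.cluster_id:04X}")
```

Printing the address in hex makes it easier to match against the network map, which shows addresses in that form.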

@Freestylerrr

Freestylerrr commented Mar 26, 2024

@Nerivec

Below are two graphs for CPU Temp and CPU Usage. Sadly I don't see any entity for Power in System Monitor, but the RPi4 and the power adapter are half a year old, so I doubt that this is the problem. Everything (WiFi and Bluetooth devices) except Z2M works without any problems.

The dongle is in a USB 2.0 port (previously it was in a 3.0 port) and is plugged in directly (without an extension cable). I discovered that the problem is with the Tuya smart plugs: they are the devices which stop responding to commands (and of course so does everything they act as a router for). But I hardly think that the problem is in the hardware, because a simple Z2M add-on reboot fixes everything, and I'm not the only one, as this topic shows.

RPi4 CPU Temp

image

RPi4 CPU Usage

image

EDIT: I also discovered that if any device is stuck, the map in Z2M stops loading (I can leave it for 20 minutes and no map is shown).

@Nerivec
Collaborator

Nerivec commented Mar 26, 2024

But I hardly think that the problem is in the hardware, because a simple Z2M add-on reboot fixes everything, and I'm not the only one, as this topic shows.

Problem is, this topic has had about half a dozen different causes behind the "looks like the same issue" reports... As such, the archive of posts here is not really helpful in debugging the actual problems of one specific user. It's just the generic problem that arises when something on the network is not doing what it is supposed to do, which can range from interference to failing hardware.

Rebooting Z2M closes and re-opens the port, and reboots the adapter, so it does not exclude something affecting the hardware (or the hardware's software).

The Pi is well known for causing all kinds of interference with dongles, so the first thing I'd do in your case is add a USB extension cable; it may not be the solution, but it sure can't hurt.
A USB port can also be missing a few watts of power without affecting anything but the stability of the dongle. If you have a powered USB hub (with its own power adapter), you can give that a try...

That said, since you seem to be saying that only specific devices are affected in your case, it's less likely to be a problem on the PI's side. One thing is for sure, if the map is stuck for 20 minutes and did not error out, something went very wrong. Each request done when loading the map is on a 10sec timeout (it can do quite a few, depending on the size of your network), so if a device times out and crashes the loading of the map for some reason, you should see it pretty quickly, and the error should show up in the logs.

@Freestylerrr

Each request done when loading the map is on a 10sec timeout (it can do quite a few, depending on the size of your network), so if a device times out and crashes the loading of the map for some reason, you should see it pretty quickly, and the error should show up in the logs.

Alright, which log should I check? For now I have Z2M set to the warn log level.

It's hard to debug. I'm not turning the plugs on and off every 5 minutes, so sometimes I only find out after hours or days that something is not working, because in Z2M everything looks good (the device shows as available, the last seen time is updating, etc.). Is there any automation I can run to find out that there is a problem?

@Nerivec
Collaborator

Nerivec commented Mar 26, 2024

Z2M logs, with HA, there are several places you can get them:

  • Z2M UI, Logs tab
  • HA Settings > System > Logs > top-right corner pick Z2M (or Z2M edge)
  • HA Settings > Add-ons > Z2M (or Z2M edge) > Log tab
  • Older logs in /config/zigbee2mqtt/log (if you haven't changed the data path config in Z2M)
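For the last item, a small Python sketch like the one below can list the most recently written files under the log folder, which is a quick way to check whether Z2M is still writing there. It only assumes the `/config/zigbee2mqtt/log` path mentioned above; the layout inside that folder is not assumed (it searches recursively), and `newest_files` is a hypothetical helper name.

```python
from datetime import datetime
from pathlib import Path


def newest_files(log_dir: str, limit: int = 5) -> list[Path]:
    """Return up to `limit` files under log_dir, most recently modified first."""
    root = Path(log_dir)
    if not root.is_dir():
        return []
    files = [path for path in root.rglob("*") if path.is_file()]
    files.sort(key=lambda path: path.stat().st_mtime, reverse=True)
    return files[:limit]


# Adjust the path if the Z2M data path was changed in the config.
for path in newest_files("/config/zigbee2mqtt/log"):
    stamp = datetime.fromtimestamp(path.stat().st_mtime)
    print(f"{stamp:%Y-%m-%d %H:%M:%S}  {path}")
```

If nothing is printed, the folder does not exist at that path or contains no files.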

If an error is thrown during the loading of the map, you should also see a red notification appear, as usual.

Have you confirmed 100% that other devices (not tuya plugs) are working properly when one of the tuya plugs isn't? Because your last logs show a plug, and a thermostat. And that really looked like a transaction problem with the adapter (as in, Z2M couldn't reach the adapter).


This HA automation can help you find devices that become unavailable (in HA) for a certain duration, but since you say it's not showing as unavailable, not sure it will help:

platform: state
entity_id:
  - switch.my_switch
  - switch.my_switch2
to: unavailable
from:
  - "on"
  - "off"
for:
  hours: 0
  minutes: 5
  seconds: 0
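Since the affected plugs stay "available", a complementary check is to scan the Z2M log for failed `set` publishes and see which devices they hit. The Python sketch below does that with a regular expression modelled on the error lines posted earlier in this thread; the exact log format may differ between Z2M versions, and `failed_publishes` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern modelled on the "Publish 'set' ... failed" lines in this thread.
FAILED_RE = re.compile(r"Publish 'set' '([^']+)' to '([^']+)' failed")


def failed_publishes(lines: Iterable[str]) -> Counter:
    """Count failed 'set' publishes per device friendly name."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Shortened copies of two error lines from this thread.
sample = [
    "error 2024-03-24 15:03:26: Publish 'set' 'state' to "
    "'Smart Plug - Living Room at window' failed: 'Error: ...'",
    "error 2024-03-24 19:43:48: Publish 'set' 'external_measured_room_sensor' "
    "to 'Bedroom Thermostat - Danfoss Ally' failed: 'Error: ...'",
]
for device, count in failed_publishes(sample).most_common():
    print(f"{count:3d}  {device}")
```

To run it on a real log, pass an open file instead of `sample`, for example `failed_publishes(open(path, encoding="utf-8"))`. It only catches commands that were actually sent and failed, so it does not replace toggling a device to test it.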

@Freestylerrr

I don't get an error while loading the map; it just keeps loading forever, and even after 20 minutes it is not loaded (this happens only if some device is not responding to commands, otherwise it loads in 10-15 seconds).

It depends; the devices which stop working change. Sometimes it's only one plug, sometimes two and other times three (always Tuya plugs). But if one plug stops working, then all devices that the plug acts as a router for stop working too.

The TRV is "behind" a smart plug (the plug acts as its router according to the map).

I ordered a 0.5m extension cable, a 12W power supply and a USB 2.0 hub with an external power source. We'll see if it helps, but I'm a little bit sceptical :/.

@Nerivec
Collaborator

Nerivec commented Mar 26, 2024

Tuya is known to have some pretty bad routers, but it seems to be a bit random here which doesn't work when. Are they all the same model?
That map behavior is a bit strange, it should error out. You would have to enable herdsman debug to know more about this though... We're working on making logging easier, should be available soon.

I'm guessing the TRV is also a router, so cutting off its power with a smart plug would definitely be bad for the mesh. I'd advise to plug it in directly, to ensure it's never cut off power completely. I wouldn't advise to plug a router on a router in general; even if you don't turn it on/off (often/at all), you still run the risk that the bottom one will have an issue, and that will kill the top one, effectively killing two routers that may be needed for several devices around that location.

I ordered a 0.5m extension cable, a 12W power supply and a USB 2.0 hub with an external power source.

Worst case scenario, it does not help with that particular problem, but you'll have less interference and less strain on the Pi, so your network will thank you for it anyway 😉

I can't remember if you mentioned the firmware version you had with the Dongle-E, if it's 6.10.3, you should try updating to 7.3.1 (or 7.4.1 directly if you want to be able to try the ember driver).

@helgek

helgek commented Mar 27, 2024

Z2M logs, with HA, there are several places you can get them:

  • Z2M UI, Logs tab
  • HA Settings > System > Logs > top-right corner pick Z2M (or Z2M edge)
  • HA Settings > Add-ons > Z2M (or Z2M edge) > Log tab
  • Older logs in /config/zigbee2mqtt/log (if you haven't changed the data path config in Z2M)

If an error is thrown during the loading of the map, you should also see a red notification appear, as usual.

Have you confirmed 100% that other devices (not tuya plugs) are working properly when one of the tuya plugs isn't? Because your last logs show a plug, and a thermostat. And that really looked like a transaction problem with the adapter (as in, Z2M couldn't reach the adapter).

This HA automation can help you find devices that become unavailable (in HA) for a certain duration, but since you say it's not showing as unavailable, not sure it will help:

platform: state
entity_id:
  - switch.my_switch
  - switch.my_switch2
to: unavailable
from:
  - "on"
  - "off"
for:
  hours: 0
  minutes: 5
  seconds: 0

@Nerivec Thank you for this summary. @Koenkk Has it been considered to make the older logs available through the UI? Reading them through the shell is just a nightmare (I'm running Z2M independently from HA).

@Freestylerrr

Tuya is known to have some pretty bad routers, but it seems to be a bit random here which doesn't work when. Are they all the same model?

Yes, they are the same model with the same firmware, installed through OTA.

I'm guessing the TRV is also a router, so cutting off its power with a smart plug would definitely be bad for the mesh. I'd advise to plug it in directly, to ensure it's never cut off power completely. I wouldn't advise to plug a router on a router in general; even if you don't turn it on/off (often/at all), you still run the risk that the bottom one will have an issue, and that will kill the top one, effectively killing two routers that may be needed for several devices around that location.

The TRV is an EndDevice, see the screenshot below. I'm not "plugging" anything in; the network decides on its own which device acts as router for what, right?

image

I can't remember if you mentioned the firmware version you had with the Dongle-E, if it's 6.10.3, you should try updating to 7.3.1 (or 7.4.1 directly if you want to be able to try the ember driver).

Yes, the coordinator is 6.10.3.0 build 297

@Nerivec
Collaborator

Nerivec commented Mar 27, 2024

Yes they are the same model

Then did you check the device's page in Z2M docs to make sure known issues weren't reported on there?

the network decides on its own which device acts as router for what, right?

Each device is responsible for finding the best router around itself (as long as it has a proper firmware...).

What I meant by plugging a router on a router, is physically plugging a smart device on a smart socket, when both are routers; so not your case here since it's a battery-powered TRV.

@Freestylerrr

Freestylerrr commented Mar 28, 2024

@Nerivec

Yes, the only note there is that the device can shut down sometimes, but that's not my case.

Yesterday I installed the hub and extension cable, and in less than 24 hours the problem is back. 6 smart plugs do not respond to commands; all of them are online and their last seen time is refreshing. The error in the log is for a different plug which works just fine; there are no other errors for these plugs...

[20:24:00] INFO: Preparing to start...
[20:24:01] INFO: Socat not enabled
[20:24:04] INFO: Starting Zigbee2MQTT...
Zigbee2MQTT:error 2024-03-27 20:24:21: Entity 'homeassistant/sensor' is unknown
Zigbee2MQTT:error 2024-03-27 20:24:21: Entity 'homeassistant/sensor' is unknown
Zigbee2MQTT:error 2024-03-27 20:24:22: Entity 'homeassistant/sensor' is unknown
Zigbee2MQTT:error 2024-03-27 20:24:23: Entity 'homeassistant/sensor' is unknown
Zigbee2MQTT:warn 2024-03-27 20:24:31: Failed to ping ' Smart Plug - Bed' (attempt 1/1, Read 0x70b3d52b6003d668/1 genBasic(["zclVersion"], {"timeout":10000,"disableResponse":false,"disableRecovery":true,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Timeout - 63699 - 1 - 4 - 0 - 1 after 10000ms))
Zigbee2MQTT:error 2024-03-28 16:32:10: Publish 'set' 'external_measured_room_sensor' to 'Living Room Thermostat - Danfoss Ally' failed: 'Error: Write 0xa46dd4fffe37ec42/1 hvacThermostat({"danfossExternalMeasuredRoomSensor":2208}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":4678,"transactionSequenceNumber":null,"writeUndiv":false}) failed (sendZclFrameToEndpointInternal error)'

If I physically turn one of the plugs on, I see it as on in HA, and if I turn it off I see it as off, so Z2M gets data from it, but I can't control it.

The map has now been trying to load for about 10 minutes without an error, but it looks like it's stuck in a loop.

EDIT: I flashed the coordinator firmware to 7.4.1 [GA], renamed coordinator_backup.json and switched to ember. I'm surprised that no re-pairing of devices was needed; is that normal, or did I do something wrong?

Now I'm getting route errors. Is that a problem?

Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_MANY_TO_ONE_ROUTE_FAILURE for "19566".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".
Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".

@Nerivec
Collaborator

Nerivec commented Mar 28, 2024

I'm surprised that no re-pairing of devices was needed; is that normal, or did I do something wrong?

As long as both firmwares are properly built, and no breaking changes are present, the update should not require re-pairing. All good 👍

Route errors every now and then are fine (it's just the mesh "living"). You can also get several just after starting Z2M; that's fine, it will adjust automatically. But after that, if you have many, especially from one device (or from devices in the same physical area), then that means there may be trouble there.

If you see more route errors from that device with address 39804 (0x9B7C in hex), check it out, and check which router is its parent (use the network map). See if one of the two has known issues.
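To see at a glance which address the route errors pile up on, a short Python sketch like this can tally them per address and print each address in hex as well, so it can be matched against the network map. The pattern is based only on the route error lines posted above; `route_error_counts` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern based on the "Received network/route error" lines in this thread.
ROUTE_RE = re.compile(r'Received network/route error (\w+) for "(\d+)"')


def route_error_counts(lines: Iterable[str]) -> Counter:
    """Count route errors per (network address, error type)."""
    counts: Counter = Counter()
    for line in lines:
        match = ROUTE_RE.search(line)
        if match:
            counts[(int(match.group(2)), match.group(1))] += 1
    return counts


sample = [
    'Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".',
    'Received network/route error ROUTE_ERROR_MANY_TO_ONE_ROUTE_FAILURE for "19566".',
    'Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804".',
]
for (address, kind), count in route_error_counts(sample).most_common():
    # The log gives the address in decimal; the map shows it in hex.
    print(f"0x{address:04X} ({address}): {count} x {kind}")
```

A handful of errors spread over many addresses matches the "mesh living" case; one address dominating the tally is the case worth checking.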

@vdiogo

vdiogo commented Mar 28, 2024

I too experience the same issue (devices stop communicating randomly), and after re-pairing my ~30 devices 3-4 times already, I decided NOT to generate the network map and, because of this or something else, no devices have disconnected since my last "interview" (~1 week now).

@Freestylerrr

The router for this device is one of the plugs (well, I don't have any routers other than plugs, because I don't have any bulbs). But there are dozens of these route errors. According to the map, this plug acts as a router for two other devices, but there are no errors for them, only for the TRV (0x9B7C).

Did the logging folder change? I don't see any new files in zigbee2mqtt/log; the last file is two hours old.

@BlackRockSoul

I've got a similar problem.
I'm using ZBMINI for lighting and also as a router. It randomly goes "Offline", but when I press image or change state it goes "Online" again! Wow!
But while it's offline it doesn't respond to HA actions, and it also seems that other devices get disconnected (or lose connection), so even after it's "Online" again, switches or temperature sensors that were connected to this router are not available and need to be re-paired.

Sometimes ZBMINI stops responding completely.
Let's imagine a situation where I have a chain connection between routers from Controller -> ZB1 -> ZB2 -> ZB3.
And when ZB3 doesn't respond, most likely ZB1 and ZB2 are also "Offline". Usually, if I update ZB1 and ZB2 (so it's online again), ZB3 will also start working.

In logs I don't see anything but Failed to ping 'Some device' (attempt 1/2, ...

It's been happening much, much more often lately. Updating the controller firmware, changing the adapter to ember and completely re-pairing the entire network has not helped. It's driving me crazy.

Zigbee2MQTT version: 1.36.1
Controller: ZBDongle-E 7.4.2 [GA]
Adapter: ezsp/ember (just switched to ember - nothing has changed or got worse)

@FabriceReynolds

Similar problem to @BlackRockSoul but with Candeo Dimmer modules.

I have one or two of the Candeo HK-DIM-A dimmable light switches that go offline every few hours, always bringing the same two ZBMini down with them. I re-pair one or both of the dimmers and that fixes it for a while.

My problems started a few weeks ago but I can't say what triggered the issue. I have a few more of these dimmers and those have been pretty solid so I'm wondering if the two that fail are bad and it's only since upgrading Z2M over time that the problem now surfaces. I recently upgraded the dongle and set ember to see if that would fix it but no such luck.

I get a lot of these in my logs: ROUTE_ERROR_MANY_TO_ONE_ROUTE_FAILURE. Not sure if that is related.

Zigbee2MQTT version: 1.37.0
Controller: ZBDongle-E 7.4.2 [GA]
Adapter: ezsp/ember

@papperone

I have many Zigbee devices (multiple brands and types, both battery and AC powered) with no issues, but...
I recently added 5 "TS0001_switch_module" devices to control some lights, and just these 5 randomly stop accepting commands, with the error "z2m: Publish 'set' 'state' to 'Luce Terrazzo 5' failed:". If I use the physical button connected to them, I can see the real status of the light changing accordingly in Z2M, but if I toggle the switch in Z2M I get the error above again. The behaviour changes randomly throughout the day: sometimes everything works fine (rarely), but most often some of them randomly become unresponsive to commands. I'm sure it's not a mesh issue or poor signal (those 5 are installed in different places, some near the coordinator, some far, but all show this random problem). Is there any solution, or do I simply need to trash them and move to another brand/type?

PS: any other device I have works flawlessly!

I use Z2M as an add-on for HA. Coordinator details below:

Zigbee2MQTT version: [1.40.1]
Coordinator type: EZSP v8
Coordinator revision: 6.7.9.0 build 405
