Devices stop communicating randomly #19747
Comments
Here you can see I have updated the temperature on 3 Aqara TRVs: Living_Room_TRV: 21.5
Out of these 3 TRVs, only the Bedroom_TRV actually executed the command. The other two completely ignored it, and their last seen status did not update when I changed the target temperature:
|
I have a similar problem with the Aqara thermostats (#19342). Sometimes the communication seems to fail, however Z2M does not notice that and pretends everything is right. |
Forgot to mention that I have fully erased zigbee2mqtt from my system, reinstalled and reconfigured z2m from scratch, and re-paired all my devices to my coordinator, and the exact same issue is still happening randomly. As a workaround, I have set up an automation in Home Assistant to restart the z2m addon every 30 minutes, just to make sure my TRVs don't stay stuck for too long and cause my heating to run forever. I have also ordered a Sonoff Dongle-P to rule out the possibility that the SkyConnect is broken, but that will arrive in one or two days so I'm still waiting to test that. |
I have reflashed my skyconnect dongle and re-paired everything from scratch, just to rule out the possibility of the last firmware update flashing improperly and corrupting my dongle, but the issue is still there. I have attached the latest debug log here. |
EDIT3: 03/24/2024:
EDIT2: 03/15/2024:
EDIT: It's now running Z2M v1.33.0 again, and as I expected, without any issue.
Initial issue post:
What happened?
It started when the add-on was updated to v.1.33.2, but it took a while before I noticed, and realized this v.1.33.2 update can contain bugs for my setup. Sometimes restarting z2m resolves the unresponsiveness, sometimes one device stays unresponsive. I hope my herdsman log will reveal anything.
What did you expect to happen? How to reproduce it (minimal and precise) Zigbee2MQTT version Adapter firmware version Adapter Debug log Device log Devices |
I only recently switched from Conbee II / Deconz to Skyconnect / Z2M and so I can't judge if the problem appeared only with recent versions of Z2M. The setup was more stable with Deconz (which I didn't expect)! |
UPDATE: My new Sonoff Zigbee Dongle-P arrived yesterday, and I have replaced my Skyconnect with it and re-paired everything to it. Since yesterday, no devices have got stuck the way they did on the Skyconnect dongle. I only had an issue with some roller shade motors, which was fixed by flashing the latest firmware on the Sonoff dongle and re-pairing the motors in z2m. So it seems to me that, at least in my case, the issue is caused by a combination of Skyconnect and zigbee2mqtt version since, as I've said previously, this issue started happening recently even though I have been running the same setup in terms of dongle, routers and client devices for over a year now. |
That would be extremely annoying, as I switched 70 devices from Deconz to Z2M/Skyconnect only recently to get rid of intermittent instability, and seemingly I have now traded that in for persistent instability. (Yes, I know, Skyconnect is not yet officially supported...). |
I have the same issue, using Sonoff Zigbee Dongle-E Devices affected: Randomly stops responding to commands but seems to be reporting state. |
I tried to change QoS, but this doesn't help (however, as QoS is only between MQTT and Z2M, I admit I did not really expect anything from this setting). |
Maybe related discussion: #19763 |
I experience the exact same problem after updating to the same Z2M version. I also use a Skyconnect as coordinator, and I am also having troubles with my Aqara TRV. A restart of Z2M fixes the issue. |
Having a similar issue, randomly one of my devices would not respond to commands coming from Z2M, results in a timeout error: #13993 (comment) - coordinator is the highly praised SLZB-0 (based on stable chipset CC2652P recommended by Z2M, not the experimental EFR32MG21 chipset). |
Hey everyone! I have the same issue. The devices (in my case only IKEA LED2005R5 (3x)) randomly disconnect and show up as unavailable - and offline in z2m. Can anyone recreate this? Thanks & Cheers Zigbee2MQTT version Home Assistant Sky Connect as a coordinator. |
Is there any way to push issues like this? The random communication loss is a complete deal breaker for me, as I can no longer rely on my Zigbee network for lights, radiators and switches of all kinds. Seemingly every day I have to count my devices and check whether the thermostats are really operating. However, this and quite a few other bug reports concerning similar problems that I've been monitoring on this list for the last few days don't seem to get any developer attention and are probably simply getting lost because of newer issues. IMHO a network problem which results in losing communication with network members without even notifying the user, while pretending everything is fine, is one of the worst things that can happen. |
I experienced the same with my setup (Aqara thermostats & Sonoff Dongle Plus) running on 1.33.2. So a downgrade to 1.33.0 fixes the random unresponsiveness to commands? I'm mostly a "fail forward" person, so is there anything I can do except downgrading? Looking into the logs, I see some warnings; I don't know if they are related:
|
I also see those messages with the strange keys some times, but not regularly. |
I am in on this too. I just moved from separate Docker HA / Z2M to HAOS with the Z2M addon under Proxmox. I also switched from ConbeeII to ZBDongle (EZSP v12) with 7.3.1.0 build 176. An Ikea lamp which worked fine for at least one year now randomly becomes unavailable. Toggling its state in the Z2M GUI revives it, though. Sometimes it also comes back on its own. |
Unfortunately, nobody's interested in the problem :-( |
@digitalkaoz wrote:
In my case it did, Robert, it works flawless as it did before, no issues at all, running v1.33.0. @Ra72xx Hmmm... The update's log v1.34.0-1 of today Dec. 1st, shows nothing about this issue or a fix for it (yet). I'm not updating for sure. |
I just updated. I'll give feedback on whether the behavior improved. |
I updated earlier today, no difference, would say that it got a little worse... |
Had the same problem when I updated to v1.34.0-1 from 1.33.0-1. My Aqara TRVs would not respond to anything, while my Ikea Tradfri lights would still work. I'm using the Sonoff Dongle Plus V2. Restoring my backup of v1.33.0-1 makes them respond again. |
yes - can unfortunately confirm - its better with - but happens and is a problem.
|
Has anybody tried simply downgrading to the Docker container 1.33.0 and keeping the database? I have no backup to restore... |
@Ra72xx that didn't work for me, I had to re-pair all devices |
So we have to wait until somebody takes note of this problem (which seemingly doesn't happen) and fixes it going forward. |
Change your Z2M log level to
You are on release version, The slight delay is likely due to the config for message processing interval. This has been lowered and should no longer be perceptible after April release. We will need more time to establish "acceptable values" for counters, but in the meantime, here are a few more details (a bit technical though) on the counters that seem relevant according to your logs (remember, these are cleared right after they are logged, meaning each log accounts for the past hour):
Overall, I'd say you likely have another 2.4GHz network (WiFi, Zigbee) nearby that is creating trouble (too many failed CCA attempts), and you probably have at least one bad router (too much traffic is being retried/lost). I'm seeing a few too many route errors related to |
Nerivec, I'm grateful for your help!
In the discussion you said that the normal delay (for the stable version of z2m) is about 1 sec.
Yeah, you are right. My Wi-Fi router is near the stick (~40cm), but I split the channels to minimize problems: Z2M is on channel 11 and the Wi-Fi network is on channel 3.
Hmm, the map shows that I have 5 types of routers (15 devices in total):
How can I tell which router is bad? |
Use a 2.4GHz WiFi scan app with your phone, one that tells which channels are most used around you. That will tell you about any WiFi in range in your area, not just yours. Note: changing the channel will currently require re-pairing all devices.
Depending on the metal, that definitely could cause signal problems.
When looking at routing errors popping up in your logs ( |
Yeah thanks! Didn't work out for me with the ember driver for now, it did run very fine for about an hour, and then pretty much all devices got sluggish and unresponsive. I will report with logs in the right place, conversation #21462 BUT, I have great news: I got the idea to 'just' switch to the EZSP driver, after reading this
, and it worked out very well.
Sorry but not anymore, I'm a bit of an unorganized person, but I'll test it later on with a cloned VM. I was now working with my main VM, and was a bit in a hurry to get it running again :P.
Also sorry but no. Idk why, but I deleted it (I'm not a 'trashcan' user, which, for me, is a very good invention ;-) )
That's a good idea, I changed it as well Again, thanks a mil! |
I used my Wi-Fi router's scan feature; the result is below. It is clear that the situation is not static: neighboring routers regularly change channels depending on interference. But the average number of networks per channel is ~10-15 (the max value is 20-25). Yes, it's a lot, but I can't control the neighboring networks.
The steel box, yes, it can cause problems, but it's located pretty close to the stick. And, according to the Koenkk answer, I can't turn off the router property, just add another one next to it.
My router devices do not have such marks on their pages. By default, they are not bad. Probably, a lot of networks on channel 11 cause such problems. I keep checking. |
According to your scan, your best bet would probably be to switch to channel 20 or 25 to move further away from WiFi range. Channel 11 is definitely crowded already. |
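For readers weighing the channel advice above, here is a minimal Python sketch of which 2.4GHz WiFi channels overlap a given Zigbee channel. The centre-frequency formulas are the standard ones; the 22 MHz WiFi channel width and the function names are assumptions made for illustration (40 MHz-wide WiFi would overlap more).

```python
# Sketch only: widths below are simplifying assumptions, not measurements.
ZIGBEE_HALF_WIDTH_MHZ = 1.0   # a Zigbee channel occupies roughly 2 MHz
WIFI_HALF_WIDTH_MHZ = 11.0    # assumption: classic 22 MHz-wide WiFi channel


def zigbee_center_mhz(channel: int) -> int:
    """Centre frequency of Zigbee (802.15.4) channels 11-26."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4GHz channels are 11-26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Centre frequency of 2.4GHz WiFi channels 1-13."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch covers WiFi channels 1-13")
    return 2407 + 5 * channel


def overlaps(zigbee_channel: int, wifi_channel: int) -> bool:
    """True if the two channels' frequency ranges intersect."""
    distance = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return distance < ZIGBEE_HALF_WIDTH_MHZ + WIFI_HALF_WIDTH_MHZ


def overlapping_wifi_channels(zigbee_channel: int) -> list[int]:
    """All WiFi channels (1-13) that overlap the given Zigbee channel."""
    return [w for w in range(1, 14) if overlaps(zigbee_channel, w)]


if __name__ == "__main__":
    for zc in (11, 20, 25):
        print(zc, overlapping_wifi_channels(zc))
```

Under these assumptions, Zigbee channel 11 overlaps only WiFi channel 1 (so the poster's own WiFi on channel 3 is clear of it, but neighbours on channel 1 are not), channel 20 overlaps WiFi 7-10, and channel 25 overlaps only WiFi 12-13.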
THANK YOU! I was pulling my hair out. I don't know if I had an old version or something but for me, this just started happening a few days ago - random devices would just stop working - and the only fix was restarting z2m. Hopefully this commit fixes the issue. |
For me the problems appeared in version 1.3.6.0 too. In the last few days, random devices have stopped accepting Z2M commands. It started with all of the Tuya smart plugs: after the update to 1.3.6.0 they stopped responding to any commands, but still worked locally (when I pressed the button). I unpaired and re-paired all of them and they started to work. A few days ago, two Danfoss Ally thermostats stopped responding to all commands. My hardware is an RPi4 with an external SSD and a Sonoff Dongle Plus E. Yesterday and today it was one Danfoss Ally thermostat and two Tuya smart plugs. The symptoms are always the same, but the affected devices differ:
If I try, for example, to turn off the plug, the switch in HA moves but instantly flips back and nothing happens. The same happens if I try to send an update for the Danfoss external temperature sensor or change the temperature. Sometimes restarting Z2M helps; sometimes I need to reboot the whole device (RPi4), or maybe I'm just too impatient and restarting Z2M and waiting a few minutes would be enough. I don't see any errors in the Z2M logs. The worst thing is that I can't tell which devices have stopped working, because in Z2M everything looks alright. I only find out when an automation does nothing or when I want to turn a plug on/off. Is there a way to at least know that some device is in this state? What would you guys recommend? I don't blame the devices, because I don't need to reboot them or do anything with them; restarting Z2M or the RPi fixes it (sadly just for hours or days), so it should not be a faulty router or something similar. |
@Freestylerrr Can you give the following details:
First, make sure the PI is (still) getting enough power; this is a very common problem. You'll find lots of resources on this with a quick web search. Also make sure the commands going from HA go all the way to Z2M, that MQTT isn't causing trouble in the middle (check MQTT logs for errors, can also check HA logs for errors while you're at it -after activating one of the problem switches-). Then you can try this: Shut down Z2M. Unplug the coordinator. Wait for a minute. Plug the coordinator back in. Restart Z2M. Then see if the troubles re-appear in the next few days. PS: Forcing the OS to restart when Z2M/adapter is starting is definitely not ideal. You should avoid that in the future unless there is a good reason. |
driver: ezsp HA version: Core: 2024.3.1 I checked MQTT Broker log, when the problem was present, but I didn't see any errors. But actually I don't exactly know how the broker works. There is not much output in log and when I do some action, it does not log there (maybe it's just disabled). This is part of the broker log, I clicked on button on 16:59, but nothing was logged (the action was performed though). 2024-03-20 16:52:29: New connection from 172.30.32.2:39180 on port 1883. In HA log I see some errors, but nothing related to Z2M (there is google API which throws error, because it's garbage, some errors for Helium integration etc.) In Z2M log I have only this (i changed the log from info to warn, because there was a lot of spam from MQTT: publish). This is since last reboot: [13:31:59] INFO: Preparing to start... |
What is the reason for that? Is Z2M not robust against restarts at any time? |
@Nerivec error 2024-03-24 15:03:26: Publish 'set' 'state' to 'Smart Plug - Living Room at window' failed: 'Error: Command 0xa4c13833d082358c/1 genOnOff.off({}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Timeout - 19566 - 1 - 153 - 6 - 11 after 10000ms)' |
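When comparing many such timeout errors, it can help to pull the numbers out of the message. The Python sketch below parses the `Timeout - ... after ...ms` part of the log line above. The field names are an assumption about what each number means (network address, endpoint, transaction sequence number, cluster ID, command ID), inferred from the surrounding message rather than confirmed in this thread; `parse_timeout` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed meaning of the five numbers, in order of appearance.
TIMEOUT_RE = re.compile(
    r"Timeout - (?P<network_address>\d+) - (?P<endpoint>\d+) - "
    r"(?P<sequence>\d+) - (?P<cluster_id>\d+) - (?P<command_id>\d+) "
    r"after (?P<timeout_ms>\d+)ms"
)


def parse_timeout(message: str) -> Optional[dict]:
    """Return the numeric fields of a timeout message, or None if absent."""
    match = TIMEOUT_RE.search(message)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    line = "failed (Timeout - 19566 - 1 - 153 - 6 - 11 after 10000ms)"
    fields = parse_timeout(line)
    print(fields)
    # Network addresses are often shown in hex elsewhere in the UI.
    print(hex(fields["network_address"]))
```

If the assumed field order is right, cluster 6 is the on/off cluster (matching `genOnOff.off` in the message) and command 11 is the ZCL default response, i.e. the device never acknowledged the command within the 10-second timeout.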
Same here for me: Zigbee seems to stop working every few days, but a restart fixes it. This seems a bit different from the original problem, where devices dropped out randomly. |
Very hard to say without herdsman debug (new logging coming soon, should make it easier to switch it on when needed). But the timeout followed by |
Below are two graphs for CPU temperature and CPU usage. Sadly, I don't see any entity for power in System Monitor, but the RPi4 and the power adapter are half a year old, so I doubt that this is the problem. Everything (WiFi and Bluetooth devices) except Z2M works without any problems. The dongle is in a USB 2.0 port (previously it was in a 3.0 port) and is plugged in directly (without an extension cable). I discovered that the problem is with the Tuya smart plugs: they are the devices which stop responding to commands (and, of course, so does everything they act as a router for). But I hardly think that the problem is in the hardware, because a simple restart of the Z2M addon fixes everything, and I'm not the only one, as this topic shows. RPi4 CPU Temp RPi4 CPU Usage EDIT: I also discovered that if any device is stuck, the map in Z2M stops loading (I can leave it for 20 minutes but no map is shown). |
Problem is, this topic has had about half a dozen different reasons for the "looks like the same issue"... As such, the archive of posts here is not really helpful in debugging the actual problems of one specific user. It's just the generic problem that arises when something on the network is not doing what it is supposed to do, which can range from interferences to failing hardware. Rebooting Z2M closes and re-opens the port, and reboots the adapter, so it does not exclude something affecting the hardware (or the hardware's software). The PI is well-known for causing all kinds of interferences with dongles, so first thing I'd do is put an extension cord in your case; it may not be the solution, but it sure can't hurt. That said, since you seem to be saying that only specific devices are affected in your case, it's less likely to be a problem on the PI's side. One thing is for sure, if the map is stuck for 20 minutes and did not error out, something went very wrong. Each request done when loading the map is on a 10sec timeout (it can do quite a few, depending on the size of your network), so if a device times out and crashes the loading of the map for some reason, you should see it pretty quickly, and the error should show up in the logs. |
Alright, which log should I check, for now I have Z2M set to warn log level. It's hard to debug, I'm not turning on and off the plugs every 5 mins, so sometimes I will find out after hours or days that something is not working, because in Z2M everything looks good (device shows as available, last seen time is updating etc.). Is there any automation which I can run to find out that there is a problem? |
Z2M logs, with HA, there are several places you can get them:
If an error is thrown during the loading of the map, you should also see a red notification appear, as usual. Have you confirmed 100% that other devices (not tuya plugs) are working properly when one of the tuya plugs isn't? Because your last logs show a plug, and a thermostat. And that really looked like a transaction problem with the adapter (as in, Z2M couldn't reach the adapter). This HA automation can help you find devices that become unavailable (in HA) for a certain duration, but since you say it's not showing as unavailable, not sure it will help:
platform: state
entity_id:
  - switch.my_switch
  - switch.my_switch2
to: unavailable
from:
  - "on"
  - "off"
for:
  hours: 0
  minutes: 5
  seconds: 0 |
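Since the stuck devices in this case still show as available, one alternative idea is to check whether a device ever reports the state it was asked to take. The Python sketch below is only an illustration of that idea on already-collected data; the `Command` and `Report` records, the `unconfirmed` function and the 30-second window are all hypothetical and not part of Z2M or Home Assistant.

```python
from dataclasses import dataclass


@dataclass
class Command:
    device: str
    wanted_state: str
    sent_at: float  # seconds, any monotonic clock


@dataclass
class Report:
    device: str
    state: str
    received_at: float  # seconds, same clock as Command.sent_at


def unconfirmed(commands: list[Command], reports: list[Report],
                window_s: float = 30.0) -> list[Command]:
    """Return commands whose wanted state was never reported in time."""
    stuck = []
    for cmd in commands:
        confirmed = any(
            rep.device == cmd.device
            and rep.state == cmd.wanted_state
            and cmd.sent_at <= rep.received_at <= cmd.sent_at + window_s
            for rep in reports
        )
        if not confirmed:
            stuck.append(cmd)
    return stuck


if __name__ == "__main__":
    commands = [Command("plug_a", "OFF", 0.0), Command("plug_b", "ON", 5.0)]
    reports = [Report("plug_a", "OFF", 2.0), Report("plug_b", "OFF", 6.0)]
    for cmd in unconfirmed(commands, reports):
        print(f"{cmd.device} never confirmed {cmd.wanted_state}")
```

How the commands and reports are collected (for example from MQTT traffic) is left open here; the sketch only shows the comparison step.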
I don't get an error while loading the map; it just loads forever and even after 20 minutes it is not loaded (this happens only if some device is not responding to commands; otherwise it loads in 10-15 sec). It depends; the devices which don't work keep changing. Sometimes it's only one plug, sometimes two, and other times three (always Tuya plugs). But if one plug stops working, then all devices that the plug acts as a router for stop working too. The TRV is "behind" a smart plug (the plug acts as a router for it according to the map). I ordered a 0.5m extension cable, a 12W power supply and a USB 2.0 hub with an external power source. We'll see if it helps, but I'm a little bit sceptical :/. |
Tuya is known to have some pretty bad routers, but it seems to be a bit random here which doesn't work when. Are they all the same model? I'm guessing the TRV is also a router, so cutting off its power with a smart plug would definitely be bad for the mesh. I'd advise to plug it in directly, to ensure it's never cut off power completely. I wouldn't advise to plug a router on a router in general; even if you don't turn it on/off (often/at all), you still run the risk that the bottom one will have an issue, and that will kill the top one, effectively killing two routers that may be needed for several devices around that location.
Worst-case scenario, it does not help for that particular problem, but you'll have less interference and less strain on the PI, so your network will thank you for it anyway 😉 I can't remember if you mentioned the firmware version you had with the Dongle-E; if it's 6.10.3, you should try updating to 7.3.1 (or 7.4.1 directly if you want to be able to try the |
@Nerivec Thank you for this summary. @Koenkk Has it been considered to make the older logs available through UI? The current situation through shell is just a nightmare (I'm running Z2M independently from HA). |
Yes they are the same model with same firmware through OTA.
The TRV is an EndDevice, see the screenshot below. I'm not "plugging" anything; the network decides on its own which device acts as a router for which, right?
Yes, the coordinator is 6.10.3.0 build 297 |
Then did you check the device's page in Z2M docs to make sure known issues weren't reported on there?
Each device is responsible for finding the best router around itself (as long as it has a proper firmware...). What I meant by plugging a router on a router, is physically plugging a smart device on a smart socket, when both are routers; so not your case here since it's a battery-powered TRV. |
Yes, there is only a note that the device can shut down sometimes, but that's not my case. Yesterday I installed the hub and extension cable, and in less than 24 hours the problem is back. 6 smart plugs do not respond to commands; all of them are online and their last seen time is refreshing. The error in the log is for a different plug which works just fine; there are no other errors for these plugs... [20:24:00] INFO: Preparing to start... If I turn one of the plugs on, I see it as on in HA; if I turn it off, I see it as off, so HA gets data from it, but I can't control them. The map has now been trying to load for about 10 minutes without an error, but it looks stuck in a loop. EDIT: I flashed the coordinator firmware to 7.4.1 [GA], renamed coordinator_backup.json and started ember. I'm surprised that no re-pairing of devices was needed; is that normal or did I do something wrong? Now I'm getting route errors; is that a problem? Received network/route error ROUTE_ERROR_SOURCE_ROUTE_FAILURE for "39804". |
As long as both firmware are properly built, and no breaking changes are present, the update should not require re-pairing. All good 👍 Route errors every now and then is fine (it's just the mesh "living"). You can also get several just after starting Z2M, that's fine, it will adjust automatically. But after that, if you have many, especially from one device (or from devices in the same physical area), then that means there may be troubles there. If you see more route errors from that device with address 39804 ( |
I too experience the same issue (devices stop communicating randomly), and after re-pairing my ~30 devices 3-4 times already, I decided NOT TO generate the network map and, because of this or something else, no devices have disconnected since my last "interview" (~1 week now). |
The router for this device is one of the plugs (well, I don't have any routers other than plugs, because I don't have any bulbs). But there are dozens of these route errors. According to the map, this plug acts as a router for two other devices, but there are no errors for them, only for the TRV (0x9B7C). Did the logging folder change? I don't see any new files in zigbee2mqtt/log; the last file is two hours old. |
Similar problem to @BlackRockSoul but with Candeo dimmer modules. I have one or two of the Candeo HK-DIM-A dimmable light switches that go offline every few hours, always bringing the same two ZBMini down with them. I re-pair one or both of the dimmers and that fixes it for a while. My problems started a few weeks ago, but I can't say what triggered the issue. I have a few more of these dimmers and those have been pretty solid, so I'm wondering if the two that fail are bad and it's only since upgrading Z2M over time that the problem now surfaces. I recently upgraded the dongle and set ember to see if that would fix it, but no such luck. I get these a lot in my logs: Zigbee2MQTT version: 1.37.0 |
I've many zigbee device (multiple brands and type, either battery or AC powered) and no issue but... PS: any other device I have works flawlessly! I use Z2M as Addon for HA, below the coordinator details:
|
What happened?
Hello. I am not a Zigbee expert, so apologies if I provide incomplete information. I will try my best to mention everything.
Two days ago, my z2m instance in Home Assistant started acting up randomly.
At random times, one or multiple Zigbee devices stop executing commands.
ex1: Aqara TRV does not change the target temperature when I try to change it from either the Home Assistant or z2m interface; the last seen value keeps increasing, as if there is no communication between the TRV and the coordinator.
ex2: Philips Hue Lightstrip does not switch on or off whenever I attempt to toggle it from either Home Assistant or z2m.
I have set the logging level to debug, and when I try to send a command to the stuck device, the command shows up in the logs without any error whatsoever.
Power cycling the stuck device or pressing the pairing button (where applicable) does nothing.
The only thing that seems to get my zigbee network back up and running temporarily is restarting z2m which makes me think that this is caused by something in z2m and not the devices that fail. Right after the z2m restart, all stuck devices start communicating again for a while until at some point either the same devices or others present the same behaviour as before.
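For the symptom described above, where a stuck TRV's last seen time stops updating, a simple staleness check can flag candidates. The Python sketch below assumes the last-seen timestamps have already been collected into a dictionary; `stale_devices` and the one-hour threshold are hypothetical choices, and the device names are taken from this report. It will not catch devices that keep reporting but ignore commands.

```python
from datetime import datetime, timedelta, timezone


def stale_devices(last_seen: dict[str, datetime], now: datetime,
                  max_age: timedelta) -> list[str]:
    """Names of devices whose last-seen timestamp is older than max_age."""
    return sorted(
        name for name, seen in last_seen.items() if now - seen > max_age
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_seen = {
        "Living_Room_TRV": now - timedelta(hours=3),   # silent for hours
        "Bedroom_TRV": now - timedelta(minutes=5),     # reported recently
    }
    print(stale_devices(last_seen, now, timedelta(hours=1)))
```

Battery devices may legitimately stay quiet for a while, so the threshold would need tuning per device type.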
My setup:
Zigbee dongle: Home Assistant SkyConnect flashed with the latest firmware available through the web flasher here https://skyconnect.home-assistant.io/firmware-update/
Zigbee2Mqtt: Latest addon version available for Home Assistant (1.33.2-1)
Home Assistant Core version: 2023.11.2
Home Assistant Supervisor version: 2023.11.3
OS: Debian 12
Server: Dell OptiPlex 9020 Micro, Core i7-4790t 3.90GHz, 16GB DDR3, SSD
What did you expect to happen?
I expected that no device would get into a frozen state where I cannot issue commands to it or receive state changes from it. At least not as often as once every few minutes/hours.
How to reproduce it (minimal and precise)
There are no reproduction steps that I can think of. This issue can happen even when no Zigbee device receives any command at all. I even left home for a few hours, and after I left I restarted zigbee2mqtt to make sure it was all in working order and would not get any commands from anyone, since nobody was home. When I returned home, one of the Aqara TRVs was stuck.
Zigbee2MQTT version
1.33.2
Adapter firmware version
7.2.2.0 build 190
Adapter
Home Assistant SkyConnect
Debug log
log.txt