apcsmart
lost communication with UPS results in intense syslog flood
#704
Comments
Hitting the same behaviour now, any progress on this? |
I am not aware of anyone addressing this specifically, so probably fair to say it is a bug, and probably it is still present. Tested PRs for throttling the message emission (maybe slower backoff to retry connecting?) are welcome. |
Just had the same happen to me on a Raspberry Pi. Filled my 250GB SSD which subsequently made the Home Assistant database get corrupted. No way of catching it that quickly since it happened while I was sleeping. I'm not happy about this at all. Any solution or workaround to this? I've just disabled nut for now. |
Was that also with the apcsmart driver? A solution in NUT could probably be to
throttle its emission of the error message (or add a config toggle to that
effect - e.g. send the disconnect info only once, or once every N minutes).
With HA involved, the practical solution would also depend on getting
modern NUT running there instead of the older package (see wiki for
contributed article about custom-building a container).
Another vector could be to configure your syslog daemon log rotation and/or
throttling of same messages (would help storage at least, if not cpu
stress).
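For the syslog-side mitigation suggested above, a fragment along these lines could cap the damage (illustrative rsyslog directives; exact option names and placement vary by rsyslog version and distribution):

```
# /etc/rsyslog.conf (illustrative): collapse runs of identical messages
# and rate-limit the local log socket so a flooding driver cannot fill
# the disk. Tune interval/burst to taste.
$RepeatedMsgReduction on
$SystemLogRateLimitInterval 5
$SystemLogRateLimitBurst 500
```

Pairing this with logrotate size-based rotation on /var/log/syslog would bound storage use even if the rate limit is exceeded.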
Finally, try to figure out the nature of disconnects and how to cause a
reconnect or driver restart - PRs welcome. This would be an actual fix :)
Jim
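The "once every N minutes" throttling idea above could be sketched roughly like this. This is a minimal illustration of the shape of the idea, not NUT's actual logging API; `throttle_may_emit` is a hypothetical helper:

```c
#include <stdbool.h>
#include <time.h>

/* Timestamp of the last emitted "lost communication" warning.
 * In a real driver this would live in the driver's state. */
static time_t last_emit = 0;

/* Return true if a warning may be emitted now, i.e. at least
 * interval_sec seconds have passed since the last emission. */
bool throttle_may_emit(time_t now, time_t interval_sec)
{
    if (last_emit == 0 || now - last_emit >= interval_sec) {
        last_emit = now;
        return true;
    }
    return false;
}
```

The caller would pass `time(NULL)` as `now`; taking it as a parameter here keeps the logic trivially testable.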
|
Hello! |
Not that I'm aware of - nobody has posted related PRs or investigation notes. In-driver throttling could be relatively easy; there are a few precedents in other NUT programs. Restarting the driver or the connection if the situation persists (and if it does mean the running copy of the driver becomes useless for data collection) could build on the same timestamp tracking involved in throttling. Getting to the root of it (faulty hardware or firmware? any mitigations?) would be harder. |
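The "restart the connection if the situation persists" idea could reuse the same bookkeeping. A toy sketch of the counter logic (names and the threshold of 60 are illustrative, not the driver's actual code):

```c
/* Count consecutive failed polls and flag a full reconnect after
 * every RETRIES_BEFORE_RECONNECT failures in a row. */
#define RETRIES_BEFORE_RECONNECT 60

static unsigned failed_polls = 0;

/* Record one poll result; returns 1 when a reconnect attempt is due. */
int note_poll_result(int poll_ok)
{
    if (poll_ok) {
        failed_polls = 0;   /* any success resets the streak */
        return 0;
    }
    failed_polls++;
    return (failed_polls % RETRIES_BEFORE_RECONNECT == 0) ? 1 : 0;
}
```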
baaaad :(
|
Just in case, would you care to contribute your script and a README about it to NUT, for others to use? Wiki page https://github.com/networkupstools/nut/wiki/Troubleshooting-eventual-disconnections-(Data-stale) refers to a couple of such scripted and documented know-how remedies, one more would be good... At least, better have a bad way forward for the time being, than none at all. |
Finally got to dig around the code and try to guess the issue.
So, my guess is that at some point the UPS controller pushes out some message into the buffer, so the FD is sort of ready to be read. But something fails while reading it, maybe the […]. Supposedly, at this point the buffer is not empty, and […]. Further attempts to […]. After a few quiet retries, the driver bombards the controller with […]. Some ideas to try would be: […]
UPDATE: One more thing of note: in […] |
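One concrete form of the "flush the incoming data buffer" idea from the commits referencing this issue is the standard POSIX termios call; the wrapper and its name here are illustrative:

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdlib.h>
#include <termios.h>
#include <unistd.h>

/* Discard any stale, unread bytes pending on the serial port so a
 * desynchronised reply does not poison the next protocol exchange.
 * Returns 0 on success, -1 (with errno set) on failure. */
int flush_stale_input(int fd)
{
    return tcflush(fd, TCIFLUSH);
}
```

A driver would call this after a failed read/parse, just before re-sending its query.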
…owly [networkupstools#704] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…lush the incoming data buffer [networkupstools#704] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…econnect every 60 retry attempts [networkupstools#704] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…e fix it [networkupstools#704] Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
Posted a PR based on the findings and guesses above. Testing would be very much welcome. |
Did anyone manage to try testing the proposed code? Instructions are in the PR #2564 |
Bump? I'm eager to merge that change, and head towards a new NUT release, but would love confirmations first that it does not actually break the driver into a worse state than now :) |
Compiled 2.8.2; the service starts OK, but does not accept connections at start. After some minutes it connects OK. Log:
root@ha:/etc/nut# service nut-driver start
Aug 20 21:33:37 ha upsdrvctl[6662]: [d] unrecognized
Aug 20 21:29:54 ha nut-server[6415]: Running as foreground process, not saving a PID file
root@ha:/etc/nut# upsc -L
but.... after 5 minutes...
root@ha:/var/log# upsdrvctl start |
Just to clarify: there are several programs (wrapped into services) connecting not to "itself" but to each other, e.g. the driver (e.g. […]). Logs seem to indicate that the driver did not start, and so […]. You were starting it by […] |
In any case, thanks for the report - also in the area that […]. I also thought the two copies of the driver with the modern protocol model would find the older running copy and arrange... something. Not sure what […] |
after a while (~20h) driver kills syslog with flood write :)
13773 root 20 0 440556 86688 5812 R 148.3 4.3 25:49.76 rsyslogd
service nut-server stop
logs:
kernel.log:
messages:
syslog: |
Whoa, wait. It seems you've built from commit 575d423, which is the current tip of the master branch. The PR is still open - its changes are not in your build. Can you please retry with its actual source branch? :) e.g. […]
|
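One possible way to fetch and build the PR's source branch (the PR number, 2564, comes from this thread; the clone path and configure options are illustrative and depend on your platform):

```shell
# Clone NUT and check out the pull request's head as a local branch
git clone https://github.com/networkupstools/nut.git
cd nut
git fetch origin pull/2564/head:pr-2564
git checkout pr-2564

# Standard autotools build; enable serial drivers for apcsmart
./autogen.sh
./configure --with-serial
make
```

Afterwards the freshly built apcsmart driver binary can be pointed at the device for testing without installing over the packaged version.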
of course, wait a moment |
Cheers, any news, by chance? :) |
I'm sorry for the delay... )
root@ha:/var/log.hdd# service nut-server status
Warning: journal has been rotated since unit was started, output may be incomplete. |
Super! And no more log-floods? |
In case of a hiccup that would have flooded before, it should now eventually log "recovered the connection". Off-topic: […] |
Also, nowadays […] |
As a result I think I have a bad Chinese USB-to-RS232 converter that is not fully transparent for the UPS.
=== autorun failed
=== after manual upsdrvctl start |
reconnect log: |
Looks great now, thanks! |
PR merged, keeping issue open for HCL/DDL update reminder. |
Hello! Tell me, is your patch already in some release or do you have to apply it yourself? |
Currently it is only on master branch, should be easy to build though (see wiki footer links). Will be part of release 2.8.3 and later. UPDATE: link: https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests |
Thank you very much for the work done! After all, it was a time bomb: if the USB-to-serial converter dropped off, the disk got clogged up quickly! But now it doesn't. It even tries to reconnect, although I didn't have time to plug the converter back in before the driver closed. I compiled it on Gentoo as version 9999. If you need any tests on this topic, write to me. My UPSes: two Smart-UPS SC620 units.
|
@vadegdadeg : I suppose, as a data point - a DDL dump with modern driver builds (to know what they report, what not, how correctly) would be great. Can be directly as a PR to https://github.com/networkupstools/nut-ddl/ |
Hi,
I am getting this issue for the second time: NUT lost communication with the UPS (via a USB/serial cable), and the NUT tools plus syslog start eating all 4 cores (the CPU quickly reaches 78C). It produces a huge log file (my poor SD card...), at a rate of about 4500 lines per second!
Entries in the log look like:
That USB/serial link is only a temporary solution (later the UPS will be connected directly to the onboard UART), but this is an insane amount and rate of error messages.
Is this a bug, or is there an option to limit these error messages?
/Tomi