-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handling of lockfile on crash/restart #7
Comments
Those storms must have been quite something! There's quite a bit to unpack in that log:
The lockfile needs some thinking about though - basically there's no way to guarantee that a crash won't leave a lockfile behind so the recovery needs to happen on service start. There are a couple of options I can think of right off the bat:
Let me give it some thought.. |
There is a lot more in the log, if you give me a few tidbits of keywords to look for I can try to ferret out more gotchas, but we'd all like the ultimate goal of this beast to be unattended and robust. First and foremost, I need it to forward alarms and fortunately I'm using Pushover for that, which does a nice job of sending an event summary, a snap of the door, etc. to my phone/desktop/whatever so it does not depend on Tunnel. But who's in the building, door control, siren silence, and all that rubbish is dependent on Tunnel and I don't want to have to futz with trying to manage a wayward process from my phone (or write a watchdog script if I can help it). |
Thanks - think there's probably enough for me to start looking at it, but may ask you for more info once I get deeper into it. Or ... if there's nothing sensitive in the log file maybe attach it as a zip? |
Hi, Are you running tunnel v0.8.2+ on the machine that crashed? I'm curious because:
(in the actual code |
That would be strange, because v0.8.1 does not run as a daemon. Now you got me wondering if I managed some sort of frankeninstall, double-checking: uhppoted-tunnel installed in /etc/uhppoted/uhpoted-tunnel and referenced as such in /etc/systemd/system/uhpoted-tunnel.service going over to the releases dir cloned from git as uhppoted_v0.8.2/uhppoted-tunnel, the MD5 is bcca16c461b9f2b721b0f3630ca74659, so I'm pretty sure it is what it is... Is the go toolchain that messed up on ARM? |
Ummmmmmmmmm .. I get very different checksums to yours for the v0.8.2 binaries (downloaded from the releases page):
Did you build it locally? Oh, wait, of course you did .. ARM64:
|
No, I'm much too lazy, lol... (NOTE THE RELEASES DIR!) $ sha1sum uhppoted-tunnel |
Ok, I get the same checksum now ... which is bizarre. I really hope it doesn't mean all those top level binaries are wrong - but you might want to switch to the uhppoted-tunnel releases while I figure out what is going on. |
FYI, turns out the checksums are different because the top-level release builds were not being built with Now to figure why the logs and the source code do not match.. |
Ok, at this point in time I am reasonably convinced that when your machine crashed it was running tunnel v0.8.1:
I'm afraid I have absolutely no idea how that could have happened - but here are some things you could look at if you wanted to:
I'm going to look at the |
Going back in time to find when this run was launched before this epic storm... Went forward in the logs and pulled the part where it crashed when it found the stale pid file... This is the RPi-4/Debian-11 (Bullseye) build, pulled as bin off git, as lazy as possible. |
Ok, am going to admit to being completely baffled by the concurrent map access issue:
As far as I can see that can only occur in v0.8.1. Am going to put that on the back burner for now (pending another storm and/or inspiration :-)) and focus on the lockfile problem which is at least tangible. |
Had another wind storm, and power was glitchy, this time the terminus (end-node with the broadcast) just caught a simple reboot, which left the lockfile behind. Of course it was a major pain because the ISP rolled the IP assignment and I spent an hour divining a way to figure that out so I can SSH into the thing, but the node lacks any form of orderly shutdown on power failure so there's nothing in the log to confirm what happened other than to assume that the .pid file was simply left behind. BTW, after profiling system usage, I'm about to migrate this function to a Pi Zero-W with a camera, so it will be moving from a 4core 64bit to 1core 32bit. There will be a python app that will be snapping pics and short video segments when events are captured from the access controller, pics will be posted to the pivot node near real-time while vids forward on a rate-limited basis for record keeping. Since all this seems to fit the Zero-W, I can return the Pi-4 to the shop for another project since they're pretty hard to get right now. |
That's quite an area you live in! Have you considered using something like e.g. Tailscale for the SSH? re. lockfile. Am working on it - turns out there's a lot more to lock files than one would think (the official Go implementation has been stuck in limbo for like 5 years). However - as an interim fix you could set the lock file to live in the /tmp folder which is (supposedly) cleared on reboot e.g.
re. camera. Huh! Interesting idea!! And yeah, a Pi 4 is way overkill for most of this stuff :-). |
Oh, snap... Why didn't I think of that! I'm editing the unit spec rn... Thanks Thanks for the tip on Tailscale, I'll give that a shot. VNC has been a go-to, but begrudgingly since they eliminated the free tier for anything but 30-day trials (and spawning trial credentials is even more annoying than dealing with finding the new IP) EDIT: Wow, Tailscale is cool: My very own SDN! Thank you so much...now I can eliminate all these firewall holes! |
Tailscale is neat and does seem to have pretty much got it right - unlike all the other hacky and marginal solutions that peope have resorted to over the years. Although I see they do have a CVE today. |
Just read through that, since my install was today, I have the latest, but #7 is nasty, like log4j nasty. Not running http through it, but ran the curl tests to be sure there are no gotchas. |
Ref. #7. Eeesh, security researchers do not think in straight lines - looks like Tailscale is more or less on top of it though. You managed to figure out logging to /dev/null ? By default tunnel logs to stdout and stderr which means the service typically logs to syslog (depending on how your system is configured of course). Editing the command-line in the unit file to redirect stdout to /dev/null should do it. |
logging: I saw the log level directive in the description, set it to warn in the Service description. Hey, here's a weird inconsistency; In my typical laziness I pull the 0.8.2 arm linux binary to load up on the pi-zero-w. It crashes out with Illegal Instruction, so I go, oh, built with 64bit. No big deal, install golang and build it from source. Pi-zero, so go make dinner, it'll be a while... come back, stick the binary where I typically do (as /etc/uhppoted/uhppoted-tunnel) only this time when I go to deamonize it, it complains about not finding the .toml file. That's funny, the 64bit binary downloaded from git does not do that on the pi4. Made a uhppoted-tunnel.toml with some defaults I liked, including cert and key file locations, etc. Well, it ignores those directives, but it stopped complaining. TL;DR: The .toml madness hints there is something out of sync with the pre-built binaries on git and the source. |
Just checked the code again and actually it defaults to /var/log/uhppoted/uhppoted-tunnel.log - something must have been weird with my config yesterday when I was checking it - so the log level thing is the way to go. re. TOML. The repo had a partially completed implementation for Allow tunnel configuration from a config file which wasn't quite finished so you built against an incomplete development version (I'd merged the branch a bit early for other reasons). Am working on the remaining bits and pieces for Linux - should be done today/tomorrow. FYI, there's a sample TOML file in the examples. The Makefile has some examples, but basically:
or (for the default TOML file):
|
OMG, you're going to love this, and I feel like an idiot: I have no idea how this happened, but there were both versions of uhppoted-tunnel running concurrently on this system, and looking at /etc/systemd/system I don't see how it is even possible. It was hidden because I am running different client sessions to different hosts, so I expect to see a few out there when running a ps, but until I stopped all the services and saw one still hanging around, and pointing to /home/pi which is not even an active user directory(!) I got suspicious. This Pi is definitely haunted... mentioned before this was a Buster->Bullseye upgrade, which is contraindicated due to hits on the networking stack, well, methinks there's other weirdness with apt and make. This one gets wiped and a do-over. The pi-zero is now up and running and is a clean build from 0.8.2. |
Ha! I knew it :-) :-) :-) - but basically 'Whew!' - am really, really happy to have that very puzzling anomaly explained. Oh, BTW - figured out the hack for routing service logs to /dev/null - if you run the service with the |
Quick update on this - tunnel is pretty much good to go:
I need to migrate the lockfile and unit file changes to the other modules so a release is still a couple of weeks away - but if you need the updates they're in the development version in the repo. |
Thank you, all of these should be quite helpful! Per your prior recommendations a few weeks ago, updated the unit file, and the combination of net online and lockfile in /tmp appear to have solved the disorderly shutdown/restart issue on the Pi-4 (which I now call FrankenBullseye due to it's many, many issues). Power's been weird as well, and in the interest in making sure the solution is robust I've avoided putting these on the UPS and everything seems to recover as I'd hope...that's a huge win! I've not noticed any spurious crashes of Tunnel on either Bullseye machine (Franken or otherwise; FrankenBullseye the machine who when through the now 'discouraged' version upgrade from Buster, which left it's networking somewhat unstable). I'll wait for the next release and push those out over the following weekend when available. |
Oh, that sounds great!!! Really glad to hear that, and thanks for the update! The |
v0.8.3 is available for whenever you have time to try it out (and a convenient storm!). I'll leave this issue open for a bit.. |
'k, closing this - feel free to reopen if you need to. |
I've spent the period since the last release beating uhppoted-tunnel in some pretty terrible ways; the vendor's app in monitor mode (5s updates) constantly, there's a python script running as a daemon to pull status and events parsing them into its own db (5s updates), there's another python daemon running as a UDP CRUD service on the same port that monitors the vendor's app MS Access database and if it sees any changes in the group and username tables it pushes them out to the remote site db using the same secure tunnel service (just uses a 'magic' device serial number, opcodes, and special data hash.)
Topology:
Win10-udp---udp-RPi-tls--------tls-VPS-tls------tls-RPi-udp---udp-UHPPOTE
db daemon db daemon
udp db client udp db server
udb db client
This entire mess has been working pretty well all this time with the new release, and a few random network hits, the system recovers after about 10minutes or so. And I've been doing development of the local and remote db applications, which means I've been nuking the event tables and re-syncing thousands of events several times a day over the tunnel all while the app is up and running.
Today, however, with several power hits hitting pretty much every node (and sub-node) in every permutation imaginable, the remote tunnel node got lost in such a way that it left the lockfile behind. Looking through syslog, there's a ton of lines, starting off after a reboot:
Nov 4 15:32:14 netmon systemd[1]: session-5.scope: Succeeded.
Nov 4 15:55:11 netmon rc.local[841]: fatal error: concurrent map read and map write
Nov 4 15:55:11 netmon rc.local[841]: goroutine 5329 [running]:
Nov 4 15:55:11 netmon rc.local[841]: runtime.throw({0x3c4576, 0x21})
Nov 4 15:55:11 netmon rc.local[841]: #011runtime/panic.go:992 +0x5c fp=0x2856ee0 sp=0x2856ecc pc=0x47bf8
Nov 4 15:55:11 netmon rc.local[841]: runtime.mapaccess2_fast32(0x378680, 0x2800480, 0x306ee)
Nov 4 15:55:11 netmon rc.local[841]: #011runtime/map_fast32.go:62 +0x194 fp=0x2856ef0 sp=0x2856ee0 pc=0x202ac
Nov 4 15:55:11 netmon rc.local[841]: github.com/uhppoted/uhppoted-tunnel/router.(*Router).get(0x625478, 0x306ee)
Nov 4 15:55:11 netmon rc.local[841]: #011github.com/uhppoted/uhppoted-tunnel/router/router.go:92 +0x3c fp=0x2856f50 sp=0x2856ef0 pc=0x2b74c8
Nov 4 15:55:11 netmon rc.local[841]: github.com/uhppoted/uhppoted-tunnel/router.(*Switch).Received(0x280e988, 0x306ee, {0x2a73006, 0x1c, 0x7fa}, 0x2a700f0)
Nov 4 15:55:11 netmon rc.local[841]: #011github.com/uhppoted/uhppoted-tunnel/router/router.go:62 +0x24 fp=0x2856f64 sp=0x2856f50 pc=0x2b7108
Nov 4 15:55:11 netmon rc.local[841]: github.com/uhppoted/uhppoted-tunnel/tunnel/tls.(*tlsClient).received(0x28a00a0, {0x2a73000, 0x22, 0x800}, 0x280e988, {0x4415b4, 0x29a2000})
Nov 4 15:55:11 netmon rc.local[841]: #011github.com/uhppoted/uhppoted-tunnel/tunnel/tls/tls_client.go:162 +0x128 fp=0x2856fcc sp=0x2856f64 pc=0x339c24
Nov 4 15:55:11 netmon rc.local[841]: github.com/uhppoted/uhppoted-tunnel/tunnel/tls.(*tlsClient).listen.func1()
Nov 4 15:55:11 netmon rc.local[841]: #011github.com/uhppoted/uhppoted-tunnel/tunnel/tls/tls_client.go:150 +0x54 fp=0x2856fec sp=0x2856fcc pc=0x339a88
Nov 4 15:55:11 netmon rc.local[841]: runtime.goexit()
Nov 4 15:55:11 netmon rc.local[841]: #011runtime/asm_arm.s:824 +0x4 fp=0x2856fec sp=0x2856fec pc=0x7a140
Nov 4 15:55:11 netmon rc.local[841]: created by github.com/uhppoted/uhppoted-tunnel/tunnel/tls.(*tlsClient).listen
Nov 4 15:55:11 netmon rc.local[841]: #011github.com/uhppoted/uhppoted-tunnel/tunnel/tls/tls_client.go:149 +0xfc
Nov 4 15:55:11 netmon rc.local[841]: goroutine 1 [chan receive, 77 minutes]:
Nov 4 15:55:11 netmon rc.local[841]: github.com/uhppoted/uhppoted-tunnel/commands.(*Run).run(0x625900, 0x28a80d8, {0x440ad8, 0x284e180}, 0x280e6b8, 0x289c140)
Nov 4 15:55:11 netmon rc.local[841]: #011github.com/uhppoted/uhppoted-tunnel/commands/run.go:259 +0x228
Nov 4 15:55:11 netmon rc.local[841]: github.com/uhppoted/uhppoted-tunnel/commands.(*Run).exec(0x625900, 0x28a80d8, {0x440ad8, 0x284e180}, 0x280e6b8)
Nov 4 15:55:11 netmon rc.local[841]: #011github.com/uhppoted/uhppoted-tunnel/commands/run_linux.go:66 +0x27c
Nov 4 15:55:11 netmon rc.local[841]: github.com/uhppoted/uhppoted-tunnel/commands.(*Run).Execute.func1(0x28a80d8, {0x440ad8, 0x284e180}, 0x280e6b8)
Nov 4 15:55:11 netmon rc.local[841]: #011github.com/uhppoted/uhppoted-tunnel/commands/run_linux.go:33 +0x3c
.
then ends up
.
Nov 4 15:55:12 netmon rc.local[841]: #011github.com/uhppoted/uhppoted-tunnel/tunnel/tunnel.go:60 +0x24c
Nov 4 15:55:12 netmon rc.local[841]: goroutine 33 [select, 77 minutes]:
Nov 4 15:55:12 netmon rc.local[841]: github.com/uhppoted/uhppoted-tunnel/tunnel/tls.(*tlsClient).connect.func1()
Nov 4 15:55:12 netmon rc.local[841]: #011github.com/uhppoted/uhppoted-tunnel/tunnel/tls/tls_client.go:109 +0x154
Nov 4 15:55:12 netmon rc.local[841]: created by github.com/uhppoted/uhppoted-tunnel/tunnel/tls.(*tlsClient).connect
Nov 4 15:55:12 netmon systemd[1]: rc-local.service: Control process exited, code=exited, status=2/INVALIDARGUMENT
All of this seems to happen before the system has really booted. Maybe the unit file needs to be modified to wait for full multi-user or maybe network? On this RPi4 running Bullseye, I have about a 1min network wait to deal with the dumpster fire that is Debian-11/RPi's handling of the WiFi drivers.
In any event, there seems to be something missing in clean-up; if it crashes, it should take the lockfile with it, or the unit script should look for the pid and 86 both it and the file and start anew.
The text was updated successfully, but these errors were encountered: