stuck in futex #11749
2.0.1 had #11463 but that was fixed (according to all the reporters...) in 2.0.2 and newer, so if you're sure you're still reproducing on 2.0.3, that means it's not that...probably. You could follow what was done in #11463 (comment) and see what it says (assuming it says anything). An alternate test you could run to exclude an issue like #11463 would be to try running a newer kernel (e.g. the kernel-ml package from EPEL would work), since that issue was specific to older versions of the kernel.
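The linked comment in #11463 isn't reproduced here, but the kernel stack dumps that appear later in this thread are the kind of output the magic-sysrq interface produces. A minimal sketch of the usual approach, assuming a sysrq-enabled kernel and root access:

echo 1 | sudo tee /proc/sys/kernel/sysrq     # enable all sysrq functions
echo w | sudo tee /proc/sysrq-trigger        # dump stacks of tasks in uninterruptible (D) sleep
echo t | sudo tee /proc/sysrq-trigger        # dump stacks of all tasks (very verbose)
sudo dmesg | tail -n 300                     # the dumps land in the kernel log

Note that 'w' only covers uninterruptible sleeps; a process parked in futex_wait is in interruptible sleep, so 't' is the one that will show it.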
On Mon, Mar 15, 2021 at 08:04:21PM -0700, Rich Ercolani wrote:
> 2.0.1 had #11463 but that was fixed (according to all the reporters...) in 2.0.2 and newer, so if you're sure you're still reproducing on 2.0.3, that means it's not that...probably.
It's occurring right now for a customer running zfs 2.0.3 under linux RHEL
3.10.0-1160.15.2.el7.x86_64.
Note, #11463 talks about 100% sys CPU, and that's not the case here.
In the cases I've seen, postgres autovacuum/autoanalyze is stuck in futex, and
fails to stop, but I *am* able to kill -9 it.
The stuck process is like this:
***@***.*** ~]$ sudo strace -p 17087
strace: Process 17087 attached
futex(0x7fdbf9e42760, FUTEX_WAIT_PRIVATE, 2, NULL
> You could follow what was done in #11463 (comment) and see what it says (assuming it says anything).
I got just:
[474919.233221] Loglevel set to 1
I tried some more sysrqs and got this (it looks like my strace is visible here
as ptrace, even though I've cancelled the strace??).
[475154.098057] postmaster S ffff8ae7bfc5acc0 0 17087 24559 0x00000080
[475154.098059] Call Trace:
[475154.098062] [<ffffffff98787dd9>] schedule+0x29/0x70
[475154.098064] [<ffffffff98112436>] futex_wait_queue_me+0xc6/0x130
[475154.098066] [<ffffffff981131db>] futex_wait+0x17b/0x280
[475154.098068] [<ffffffff980b049b>] ? recalc_sigpending+0x1b/0x70
[475154.098071] [<ffffffff980b05e6>] ? dequeue_signal+0x86/0x180
[475154.098073] [<ffffffff98114f46>] do_futex+0x106/0x5a0
[475154.098075] [<ffffffff980b303b>] ? ptrace_notify+0x5b/0x90
[475154.098078] [<ffffffff9803bdf5>] ? tracehook_report_syscall_exit+0x45/0xe0
[475154.098080] [<ffffffff98115460>] SyS_futex+0x80/0x190
[475154.098083] [<ffffffff9813e8e6>] ? __audit_syscall_exit+0x1f6/0x2b0
[475154.098086] [<ffffffff98794f92>] system_call_fastpath+0x25/0x2a
[475154.098087] postmaster S ffff8ae42a8c8640 0 21830 24559 0x00000080
It's probably not useful to you, but I was surprised that I was able to get a
user backtrace. This involves: autovacuum, extended statistics, postgis
extension, and apparently a signal.
(gdb) bt
#0 0x00007fdbf9b875cc in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007fdbf9b03b12 in _L_lock_16654 () from /lib64/libc.so.6
#2 0x00007fdbf9b00753 in malloc () from /lib64/libc.so.6
#3 0x00007fdbfc21e018 in tls_get_addr_tail () from /lib64/ld-linux-x86-64.so.2
#4 0x00007fdbfc2239e8 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#5 0x00007fdbecf503e0 in lwgeom_wagyu_interruptRequest () from /usr/pgsql-13/lib/postgis-3.so
#6 0x00007fdbecf0d7f1 in handleInterrupt () from /usr/pgsql-13/lib/postgis-3.so
#7 <signal handler called>
#8 0x00007fdbf9b701ec in brk () from /lib64/libc.so.6
#9 0x00007fdbf9b70286 in sbrk () from /lib64/libc.so.6
#10 0x00007fdbf9b03e49 in __default_morecore () from /lib64/libc.so.6
#11 0x00007fdbf9afa49a in systrim.isra.2 () from /lib64/libc.so.6
#12 0x00007fdbf9afc15a in _int_free () from /lib64/libc.so.6
#13 0x00000000008cbc90 in AllocSetReset (context=0x2c70ec0) at aset.c:607
#14 0x00000000008d1aa1 in MemoryContextResetOnly ***@***.***=0x2c70ec0) at mcxt.c:174
#15 0x00000000008d1b1b in MemoryContextResetOnly ***@***.***=0x2c70ec0) at mcxt.c:160
#16 0x00000000008cbc20 in AllocSetDelete (context=0x2c70ec0) at aset.c:653
#17 0x00000000007588ce in BuildRelationExtStatistics ***@***.***=0x31f32b8, totalrows=83498, ***@***.***=30000, ***@***.***=0x3294448, ***@***.***=933, ***@***.***=0x328bce8)
at extended_stats.c:197
#18 0x00000000005aa91d in do_analyze_rel ***@***.***=0x31f32b8, ***@***.***=0x0, acquirefunc=<optimized out>, relpages=39622, ***@***.***=false, ***@***.***=false,
***@***.***=13, params=0x3012c7c, params=0x3012c7c) at analyze.c:606
#19 0x00000000005aacd1 in analyze_rel (relid=<optimized out>, relation=<optimized out>, ***@***.***=0x3012c7c, va_cols=0x0, in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:263
#20 0x000000000061dde3 in vacuum (relations=0x3292568, ***@***.***=0x3012c7c, bstrategy=<optimized out>, ***@***.***=0x3014ec8, ***@***.***=true) at vacuum.c:466
#21 0x00000000007093a0 in autovacuum_do_vac_analyze (bstrategy=0x3014ec8, tab=0x3012c78) at autovacuum.c:3137
#22 do_autovacuum () at autovacuum.c:2467
#23 0x00000000007097d6 in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#24 0x0000000000709929 in StartAutoVacWorker () at autovacuum.c:1488
#25 0x0000000000718118 in StartAutovacuumWorker () at postmaster.c:5593
#26 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5300
#27 <signal handler called>
#28 0x00007fdbf9b70983 in __select_nocancel () from /lib64/libc.so.6
#29 0x0000000000487370 in ServerLoop () at postmaster.c:1703
#30 0x0000000000719108 in PostmasterMain ***@***.***=3, ***@***.***=0x2b45280) at postmaster.c:1412
#31 0x0000000000488bbd in main (argc=3, argv=0x2b45280) at main.c:210
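A backtrace like the one above can also be captured non-interactively from a stuck process. A hedged sketch, assuming gdb and the matching debuginfo packages are installed; the PID is the one from this report:

sudo gdb -p 17087 -batch -ex 'bt'                    # backtrace of the attached (main) thread
sudo gdb -p 17087 -batch -ex 'thread apply all bt'   # backtraces of every thread, if there are any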
--
Justin
Okay, so that's one idea eliminated, now for...everything else.

You might find strace -f more informative, since one thread waiting on a futex() call basically means "I'm waiting for another thread to finish", and I don't believe strace will show you the other threads of a process without -f. So it's probably another thread, not the "main" thread of 17087, that's stuck somewhere interesting. (Which one? Dunno. Possibly whichever one is also waiting on some call. Possibly one that's stuck in an endless loop and not making any progress. Story unclear.)

In particular, I'd really like to find a thread that's "stuck" on something in OpenZFS, otherwise it's something of a hard sell that OpenZFS is to blame for the problem. (Other evidence like "we're running this setup without OpenZFS on a number of systems and this never happens" would also be reasonably compelling.)

You suggested in your comment that at least one of the systems reproducing this right now is a production system - do you have any non-production systems that you can reproduce this on, that we might be able to try more exciting things with down the line?

Were these systems running fine for a long time and then this problem suddenly cropped up, or is this the initial setup period of the systems and you have no long-running history to judge whether this would have happened on earlier software versions?

Do other reads/writes to the relevant zpool still work normally while this is going on?
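A hedged sketch of what that would look like against the PID from this report; /proc/<pid>/task is an easy way to see whether there are any other threads at all:

sudo strace -f -tt -p 17087    # -f follows all threads and children, -tt adds microsecond timestamps
ls /proc/17087/task/           # one entry per thread; a single entry means the process is single-threaded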
On Mon, Mar 15, 2021 at 10:06:38PM -0700, Rich Ercolani wrote:
> It's occurring right now for a customer running zfs 2.0.3 under linux RHEL 3.10.0-1160.15.2.el7.x86_64. Note, #11463 talks about 100% sys CPU, and that's not the case here. In the cases I've seen, postgres autovacuum/autoanalyze is stuck in futex, and fails to stop, but I *am* able to kill -9 it. The stuck process is like this: ***@***.*** ~]$ sudo strace -p 17087 strace: Process 17087 attached futex(0x7fdbf9e42760, FUTEX_WAIT_PRIVATE, 2, NULL
> Okay, so that's one idea eliminated, now for...everything else.
> You might find strace -f more informative, since one thread waiting on a futex() call basically means "I'm waiting for another thread to finish", and I don't believe strace will show you the other threads of a process without -f. So it's probably another thread, not the "main" thread of 17087, that's stuck somewhere interesting. (Which one? Dunno. Possibly whichever one is also waiting on some call. Possibly one that's stuck in an endless loop and not making any progress. Story unclear.)
postgres doesn't use threads.
$ ps -Lf 17087
UID PID PPID LWP C NLWP STIME TTY STAT TIME CMD
postgres 17087 24559 17087 0 1 Mar10 ? Ss 1:03 postgres: autovacuum worker ts
> In particular, I'd really like to find a thread that's "stuck" on something in OpenZFS, otherwise it's something of a hard sell that OpenZFS is to blame for the problem. (Other evidence like "we're running this setup without OpenZFS on a number of systems and this never happens" would also be reasonably compelling.)
> You suggested in your comment that at least one of the systems reproducing this right now is a production system - do you have any non-production systems that you can reproduce this on, that we might be able to try more exciting things with down the line?
Yes, I've seen this several times on production systems since upgrading to
openzfs 2.0, beginning in January.
I don't have a reproduction recipe, otherwise I could use a test VM.
Depending on how exciting, I can try some things on the production system, in
particular if it's already stuck (like now).
> Were these systems running fine for a long time and then this problem suddenly cropped up, or is this the initial setup period of the systems and you have no long-running history to judge whether this would have happened on earlier software versions?
> Do other reads/writes to the relevant zpool still work normally while this is going on?
Yes, everything else worked fine, except in cases where the postgres process had
previously locked a table, in which case anything else that tries to use it
will be stuck waiting.
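If another session does end up waiting on a lock held by the stuck worker, something like the following should identify it. A hedged sketch using pg_blocking_pids(), available in PostgreSQL 9.6 and later; the database name ts is taken from the ps output earlier in the thread:

psql -X -d ts -c "SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, wait_event, state FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;"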
--
Justin
Okay, great, then it's waiting on another process. Time to go look at the different Postgres processes and see whether any of them are stuck in an exciting way.
Yeah, but one of my questions was going to be "does this happen with, say, MD+XFS", which is understandably not something you can try on a production system (though it sounds like it didn't happen pre-OpenZFS 2.0, so it's probably reasonable to assume for now it wouldn't).

If you need these systems working Now(tm), it might be reasonable to downgrade them to 0.8.6 and try to reproduce this on a testbed even if you don't have a perfect reproduction recipe premade. (I believe it's still the case that the main "zfs" CentOS/RHEL repo ships 0.8.X RPMs and you need "zfs-testing" to get 2.0.X, so this should be pretty straightforward as long as you didn't already run "zpool upgrade".)
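A quick way to survey every postgres process for a suspicious kernel wait, as a hedged sketch; wchan is the kernel function a sleeping process is parked in, so anything sitting in a zfs/spl symbol would stand out:

# list pid, state, start time, kernel wait channel and command line for all postgres processes
ps -o pid,stat,lstart,wchan:32,cmd -p "$(pgrep -d, -f 'postgres: |postmaster')"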
On Mon, Mar 15, 2021 at 10:37:18PM -0700, Rich Ercolani wrote:
> postgres doesn't use threads.
> Okay, great, then it's waiting on another process. Time to go look at the different Postgres processes and see whether any of them are stuck in an exciting way.
I don't think so. The stuck autoanalyze is the earliest-started process
that's still running, except for postgres' own processes - if they were
stuck, I think the whole system would be unusable.
postgres=# SELECT datname, backend_start, wait_event, state, backend_type, query FROM pg_stat_activity ORDER BY backend_start LIMIT 11;
 datname |         backend_start         |     wait_event      | state  |         backend_type         | query
---------+-------------------------------+---------------------+--------+------------------------------+------------------------------------------------------
         | 2021-03-10 10:01:55.623695-06 | CheckpointerMain    |        | checkpointer                 |
         | 2021-03-10 10:01:55.623971-06 | BgWriterHibernate   |        | background writer            |
         | 2021-03-10 10:01:55.624275-06 | WalWriterMain       |        | walwriter                    |
         | 2021-03-10 10:01:55.624573-06 | AutoVacuumMain      |        | autovacuum launcher          |
         | 2021-03-10 10:01:55.625218-06 | LogicalLauncherMain |        | logical replication launcher |
 ts      | 2021-03-10 14:23:20.093884-06 |                     | active | autovacuum worker            | autovacuum: ANALYZE child.eric_enodeb_cell_20210310
I'm not 100% sure it's an issue with zfs.

I found a bunch of similar bug reports regarding issues in malloc/free and
signal handlers, which would fit the backtrace above: free() was interrupted by
a signal handler whose path re-enters malloc, and malloc is not async-signal-safe.
I'm planning to upgrade the libc6 (glibc) and postgis packages.

Thanks for your suggestions so far.
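To confirm which glibc and postgis builds a given server is actually running, a hedged sketch for CentOS 7; the psql call assumes the postgis extension is installed in the ts database:

rpm -q glibc                    # glibc build currently installed
rpm -qa 'postgis*'              # installed postgis packages, if any were installed via RPM
psql -X -d ts -c "SELECT version(), postgis_full_version();"    # versions as seen by the running server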
--
Justin
Did you look, or just conclude that it's not possible?

I'm not particularly convinced it's a ZFS issue either, since nothing has pointed to ZFS.

Could you share some of those bug report links, just for reference by anyone reading this thread in the future?
On Tue, Mar 16, 2021 at 04:00:20PM -0700, Rich Ercolani wrote:
> On Mon, Mar 15, 2021 at 10:37:18PM -0700, Rich Ercolani wrote: > postgres doesn't use threads. Okay, great, then it's waiting on another process. Time to go look at the different Postgres processes and see whether any of them are stuck in an exciting way.
> I don't think so. The stuck autoanalyze is the earliest-started process started that's still running, except for postgres' own processes - if they were stuck, I think the whole system would be unusable.
> Did you look, or just conclude that it's not possible?
I didn't understand what you mean here?
In response to your question, I had straced those postgres processes and saw
none of them were stuck.
I tried to shut down postgres, and 5 processes were left running - the parent
"postmaster", 3 helper processes, and the "stuck" process 17087.
> I'm not particularly convinced it's a ZFS issue either, since nothing has pointed to ZFS.
> Could you share some of those bug report links, just for reference by anyone reading this thread in the future?
Searching for "__lll_lock_wait_private" I found various bug reports and patches
for glibc. Most of our servers are running centos7, and frequently not the
latest updates, so these are plausibly related.
Other servers hit this issue, and there's not much that's changed other than
upgrading to zfs-2.0. I upgraded most people to postgres 13 late last year, so
the only other change is upgrading to postgres 13.2, which is essentially
limited to bugfixes. I don't see any commits related to signal handling. But
it's possible that code changes caused the code to be compiled differently and
exposed some other issue.
~~So far, no server hit this issue after upgrading glibc.~~ I'm not planning on
digging into it further than that, on the hypothesis that it's a bug fixed
years ago for which we hadn't applied the update.
Edit: servers *did* hit this after upgrading glibc, but it did not recur after upgrading postgis, which in some cases had been running a 3.1alpha pre-release.
You could close this issue and I'll come back if there's some more interesting
development. Sorry for the noise and pointing the finger here.
--
Justin
I haven't heard back about how to debug the previous issue (11641), so here goes anyway.
Postgres autovacuum worker process stuck in futex. I have seen this
~~twice~~ 3 times before on another server. The process does not die with SIGINT, and I imagine if I SIGKILL it, it won't die and I'll need to reboot the server.
[pryzbyj@ts-db-new ~]$ ps -O lstart,wchan=wwwwwwwwwwwwwwwwwwww 12583
PID STARTED wwwwwwwwwwwwwwwwwwww S TTY TIME COMMAND
12583 Wed Mar 3 03:41:23 2021 futex_wait_queue_me S ? 00:00:59 postgres: autovacuum worker ....
| Type | Version/Name |
| --- | --- |
| Distribution Name | CentOS |
| Distribution Version | 7.8 |
| Linux Kernel | kernel-3.10.0-1127.18.2.el7 and 3.10.0-1160.15.2.el7 |
| Architecture | x86_64 |
| ZFS Version | 2.0.1-1 and 2.0.3-1 |
| SPL Version | 2.0.1-1 |