
Revert "feature: Eager watch mode for exec (#6507)" #6867

Merged: 1 commit merged into ocaml:main from gridbugs:revert-exec-watch on Jan 12, 2023

Conversation

gridbugs
Collaborator

This reverts commit 5ff9a4f.

This was causing occasional segfaults on macOS when running `dune exec`, so reverting this until we figure out what's causing that.

Signed-off-by: Stephen Sherratt <stephen@sherra.tt>
@rgrinberg
Member

How about you just disable the feature for now, i.e. just remove the flag that enables it? I don't mind reverting it, but I think you're just going to create more work for yourself long term if someone edits this file and creates conflicts with this change.

@gridbugs
Collaborator Author

We see the segfault on macOS even when running `dune exec` outside watch mode.

@rgrinberg rgrinberg merged commit 1848093 into ocaml:main Jan 12, 2023
@gridbugs
Collaborator Author

I can semi-reliably reproduce this on a mac by running:

```
while make clean && dune build @test/dependency-external; do date; done
```

Usually it fails with a segfault within a few iterations.

As an experiment I tried replacing the call to `Scheduler.go_with_rpc_server_and_console_status_reporting` with `Scheduler.go`, and after a few hours I haven't got a segfault. I'm letting it run overnight, and if there is still no segfault I'll investigate why this change makes the problem go away.

@rgrinberg
Member

Btw, can you upload the stacktrace from the core dump?

@gridbugs
Collaborator Author

```
$ lldb --core /cores/core.2223
(lldb) target create --core "/cores/core.2223"
Core file '/cores/core.2223' (x86_64) was loaded.
(lldb) bt
* thread #1
  * frame #0: 0x00007ff806f3b3ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff806f75a6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x00000001038da262 main.exe`caml_thread_leave_blocking_section [inlined] st_masterlock_acquire(m=<unavailable>) at st_posix.h:137:5 [opt]
    frame #3: 0x00000001038da224 main.exe`caml_thread_leave_blocking_section at st_stubs.c:244:3 [opt]
    frame #4: 0x00000001038e5d63 main.exe`caml_leave_blocking_section at signals.c:175:3 [opt]
    frame #5: 0x00000001038e2d23 main.exe`unix_unlink(path=<unavailable>) at unlink.c:33:3 [opt]
    frame #6: 0x000000010385d2c4 main.exe`camlStdune__Fpath__unlink_no_err_758 + 36
    frame #7: 0x0000000103881afb main.exe`camlStdlib__new_exit_481 + 59
    frame #8: 0x0000000103881b73 main.exe`camlStdlib__exit_488 + 35
    frame #9: 0x00000001035cef85 main.exe`camlDune__exe__Main__entry + 1781
    frame #10: 0x00000001035c7789 main.exe`caml_program + 8265
    frame #11: 0x000000010390cf55 main.exe`caml_start_program + 73
    frame #12: 0x00000001038e43a6 main.exe`caml_startup_common(argv=0x00007ff7bc93e1a0, pooling=<unavailable>) at startup_nat.c:160:9 [opt]
    frame #13: 0x00000001038e441b main.exe`caml_main [inlined] caml_startup_exn(argv=<unavailable>) at startup_nat.c:167:10 [opt]
    frame #14: 0x00000001038e4414 main.exe`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:172:15 [opt]
    frame #15: 0x00000001038e4414 main.exe`caml_main(argv=<unavailable>) at startup_nat.c:179:3 [opt]
    frame #16: 0x00000001038e448c main.exe`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
    frame #17: 0x0000000105e2952e dyld`start + 462
```
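As an editor's note: frames 5 through 9 show `Stdlib.exit` running an `at_exit` handler that unlinks a file through a C stub, with the crash sitting in the re-acquisition of the runtime lock after the blocking section. A minimal model of that code path (not dune's actual code; `register_cleanup` is a hypothetical name) looks like:

```ocaml
(* Minimal model of the code path in the backtrace, not dune's real
   code: Stdlib.exit runs at_exit handlers; here one of them removes a
   file, which in the runtime goes through a C stub (unix_unlink) that
   releases the OCaml runtime lock around the unlink(2) syscall and
   re-acquires it afterwards -- the frame where this core dump sits. *)
let register_cleanup path =
  at_exit (fun () -> try Sys.remove path with Sys_error _ -> ())
```
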

@rgrinberg
Member

Looks mysterious. @nojb any chance you might know?

@gridbugs
Collaborator Author

Out of curiosity I tried instrumenting `Fpath.unlink` to print out the file it's unlinking (though I now doubt that's related), and now I don't get a segfault.
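A sketch of that kind of instrumentation, for readers following along: the real change was inside dune's `Fpath` module, so this standalone version is an assumption about its shape, not the actual patch.

```ocaml
(* Hypothetical standalone version of the instrumented unlink: print the
   path being removed before deleting it, swallowing errors as a
   "_no_err" variant would. The %! flushes stderr immediately, which
   matters when hunting a crash that kills the process. *)
let unlink_no_err path =
  Printf.eprintf "unlinking %s\n%!" path;
  try Sys.remove path with Sys_error _ -> ()
```

Adding I/O like this changes thread timing, which is one plausible reason a nondeterministic crash could vanish under instrumentation.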

@nojb
Collaborator

nojb commented Jan 24, 2023

> Looks mysterious. @nojb any chance you might know?

No idea. Is this with OCaml 5? Can it be reproduced with 4.14?

@gridbugs
Collaborator Author

This was with 4.14.

gridbugs added a commit to gridbugs/dune that referenced this pull request Jan 30, 2023
This is the second attempt at adding this feature. The problem with the
first attempt was that replacing the call to `Scheduler.go` with
`Scheduler.go_with_rpc_server_and_console_status_reporting` caused
occasional non-deterministic segfaults on macOS. There's some info
about the problem on the PR which reverts the previous attempt at adding
this feature: ocaml#6867. At the time of
writing we don't know the cause of this problem.

The difference this time around is that we keep the call to
`Scheduler.go` when running `dune exec` outside watch mode, and only
invoke `Scheduler.go_with_rpc_server_and_console_status_reporting` when
running in watch mode. Some additional refactoring makes this split more
ergonomic. We may still see segfaults on macOS when running exec in
watch mode, but at least we won't introduce the potential for segfaults
into the existing use case of running exec outside watch mode
(assuming, of course, that there is a causal link between
`go_with_rpc_server_and_console_status_reporting` and the segfault on
macOS, which is not necessarily the case).

Signed-off-by: Stephen Sherratt <stephen@sherra.tt>
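The split the commit message describes can be sketched as follows. The `Scheduler` module here is a toy stand-in for dune's real scheduler, kept only to show the shape of the dispatch; `run_exec` is a hypothetical name, not dune's API.

```ocaml
(* Toy stand-in for dune's Scheduler: both entry points share a
   signature, so the caller can pick one based on watch mode. In real
   dune the second variant additionally starts an RPC server and a
   console status reporter. *)
module Scheduler = struct
  let go f = f ()
  let go_with_rpc_server_and_console_status_reporting f = f ()
end

(* Only watch mode pays for the RPC-server entry point; a plain
   `dune exec` keeps the original, known-good code path. *)
let run_exec ~watch f =
  if watch then Scheduler.go_with_rpc_server_and_console_status_reporting f
  else Scheduler.go f
```
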
rgrinberg pushed a commit that referenced this pull request Feb 16, 2023
moyodiallo pushed a commit to moyodiallo/dune that referenced this pull request Apr 3, 2023
@gridbugs gridbugs deleted the revert-exec-watch branch October 11, 2023 03:47