launcher: replace parent process on supported platforms. #569

Merged
merged 5 commits into from
Jan 11, 2024

Conversation

maleadt
Member

@maleadt maleadt commented Jan 30, 2023

This simplifies use of juliaup with, e.g., debuggers. Database updates are executed in a forked process.

There are a couple of tricky aspects:

  • not everything is safe to fork; essentially julialauncher must not become multithreaded for this to work
  • because we replace the parent process without waiting for the child, there'll be a zombie julialauncher process until the parent terminates

I don't think these are dealbreakers, but it's probably worth discussing.
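
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `fork_background`, `julia_command`, and `launch` are hypothetical names, and `fork`/`_exit` are declared by hand so the sketch needs no external crate. It assumes a Unix target and a process that is still single-threaded when `fork` is called.

```rust
use std::io;
use std::os::unix::process::CommandExt; // provides Command::exec
use std::process::Command;

// Declared by hand to keep the sketch free of external crates
// (an assumption of this example, not how the PR does it).
extern "C" {
    fn fork() -> i32;
    fn _exit(status: i32) -> !;
}

/// Runs `work` in a forked child and returns the child's PID to the parent.
/// Only sound while the calling process is still single-threaded.
fn fork_background(work: impl FnOnce()) -> io::Result<i32> {
    match unsafe { fork() } {
        -1 => Err(io::Error::last_os_error()),
        0 => {
            // Child: do the background work, then leave immediately
            // without running the parent's exit handlers.
            work();
            unsafe { _exit(0) }
        }
        pid => Ok(pid),
    }
}

/// Hypothetical helper: builds the command that replaces the launcher.
fn julia_command(julia_path: &str, args: &[String]) -> Command {
    let mut cmd = Command::new(julia_path);
    cmd.args(args);
    cmd
}

/// Forks the background work, then replaces this process with Julia.
/// Returns only if the fork or the exec failed.
fn launch(julia_path: &str, args: &[String], update_db: impl FnOnce()) -> io::Error {
    if let Err(err) = fork_background(update_db) {
        return err;
    }
    // On success exec never returns; the child is deliberately not waited for.
    julia_command(julia_path, args).exec()
}
```

Because the parent never waits for the child, a finished child stays a zombie until the process that replaced the parent exits, which is the second caveat listed above.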

Untested on Windows as cross-compilation doesn't work for juliaup (due to the build.rs).
I'm also a Rust newbie so the code probably can be improved.

cc @vchuravy @staticfloat

Fixes #552, fixes #607, fixes #645

@maleadt maleadt added the enhancement and rust labels Jan 30, 2023
@maleadt maleadt marked this pull request as ready for review January 31, 2023 08:41
@maleadt
Member Author

maleadt commented Jan 31, 2023

This seems to work great. With NVIDIA's profilers it sadly still requires a --target-processes=all to follow the execve. gdb defaults to follow-fork-mode=parent, so that works great:

❯ gdb -nx --args ./target/debug/julialauncher -e 'ccall(:raise, Nothing, (Int,), 11)'
(gdb) run
Thread 1 "julia" received signal SIGSEGV, Segmentation fault.
0x00007ffff7e3e64c in ?? () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff7e3e64c in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff7dee938 in raise () from /usr/lib/libc.so.6
#2  0x00007fff980279be in ?? ()
#3  0x0000000000000000 in ?? ()

rr doesn't seem to like the execve though:

❯ rr record ./target/debug/julialauncher -e 'ccall(:raise, Nothing, (Int,), 11)'
❯ rr replay -o '-nx'
(rr) c
Continuing.
Program stopped.
0x0000000070000002 in syscall_traced ()
(rr) bt
#0  0x0000000070000002 in syscall_traced ()
#1  0x00007f5e21b53b18 in _raw_syscall () at /home/tim/.cache/yay/rr-git/src/rr/src/preload/raw_syscall.S:120
#2  0x00007f5e21b4ed19 in traced_raw_syscall (call=0x681fffa0) at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscallbuf.c:350
#3  0x00007f5e21b51684 in sys_statfs (call=<optimized out>) at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscallbuf.c:3557
#4  syscall_hook_internal (call=0x681fffa0) at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscallbuf.c:4103
#5  syscall_hook (call=0x681fffa0) at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscallbuf.c:4142
#6  syscall_hook (call=0x681fffa0) at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscallbuf.c:4126
#7  0x00007f5e21b4e263 in _syscall_hook_trampoline () at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscall_hook.S:308
#8  0x00007f5e21b4e2cd in __morestack () at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscall_hook.S:443
#9  0x00007f5e21b4e2d4 in _syscall_hook_trampoline_48_3d_01_f0_ff_ff () at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscall_hook.S:457
#10 0x00007f5e21911fc1 in execve () from /usr/lib/libc.so.6
#11 0x00007f5e21912735 in ?? () from /usr/lib/libc.so.6
#12 0x000055796510f0db in std::sys::unix::process::process_common::Command::do_exec () at library/std/src/sys/unix/process/process_unix.rs:377
#13 0x00005579651055e1 in std::sys::unix::process::process_common::Command::exec () at library/std/src/sys/unix/process/process_unix.rs:237
#14 std::os::unix::process::{impl#0}::exec () at library/std/src/os/unix/process.rs:212
#15 0x0000557964e88593 in julialauncher::run_app () at src/bin/julialauncher.rs:291
#16 0x0000557964e88775 in julialauncher::main () at src/bin/julialauncher.rs:361

... even though we can break on the segfault handling by using --goto:

❯ rr replay -o '-nx' --goto 3538
(rr) bt
#0  0x0000000070000002 in syscall_traced ()
#1  0x00007f279cb06b18 in _raw_syscall () at /home/tim/.cache/yay/rr-git/src/rr/src/preload/raw_syscall.S:120
#2  0x00007f279cb01d19 in traced_raw_syscall (call=0x681fffa0) at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscallbuf.c:350
#3  0x00007f279cb04684 in sys_statfs (call=<optimized out>) at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscallbuf.c:3557
#4  syscall_hook_internal (call=0x681fffa0) at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscallbuf.c:4103
#5  syscall_hook (call=0x681fffa0) at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscallbuf.c:4142
#6  syscall_hook (call=0x681fffa0) at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscallbuf.c:4126
#7  0x00007f279cb01263 in _syscall_hook_trampoline () at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscall_hook.S:308
#8  0x00007f279cb012cd in __morestack () at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscall_hook.S:443
#9  0x00007f279cb012e9 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at /home/tim/.cache/yay/rr-git/src/rr/src/preload/syscall_hook.S:462
#10 0x00007f279c9ec3b5 in write () from /usr/lib/libc.so.6
#11 0x00007f279bc7f642 in ijl_safe_printf (fmt=fmt@entry=0x7f279be3a6f5 "\n[%d] signal (%d.%d): %s\n") at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/jl_uv.c:628
#12 0x00007f279bca8889 in jl_critical_error (sig=sig@entry=11, si_code=<optimized out>, context=context@entry=0x7f27913fa480, ct=0x7f27917fc010) at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/signal-handling.c:466
#13 0x00007f279bca8d0e in sigdie_handler (sig=11, info=0x7f27913fa5b0, context=0x7f27913fa480) at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/signals-unix.c:60
#14 0x00007f279bca8f6e in segv_handler (context=0x7f27913fa480, info=0x7f27913fa5b0, sig=11) at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/signals-unix.c:349
#15 segv_handler (sig=11, info=0x7f27913fa5b0, context=0x7f27913fa480) at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/signals-unix.c:340
#16 <signal handler called>
#17 0x00007f279c97d64c in ?? () from /usr/lib/libc.so.6
#18 0x00007f279c92d938 in raise () from /usr/lib/libc.so.6
#19 0x00007f27377009be in ?? ()
#20 0x0000000000000000 in ?? ()

This isn't a multiprocess issue, as only one process in the trace actually execs:

❯ rr ps
PID	PPID	EXIT	CMD
21501	--	-11	./target/debug/julialauncher -e ccall(:raise, Nothing, (Int,), 11)
21503	21501	0	(forked without exec)
21504	21503	0	(forked without exec)
21505	21501	0	(forked without exec)

@Keno Any thoughts why rr behaves differently here? Oh, this is rr-debugger/rr#2381.
Any way to improve the out-of-the-box experience of running juliaup under rr?

@Keno
Member

Keno commented Jan 31, 2023

Oh, this is rr-debugger/rr#2381.

Yes

Any way to improve the out-of-the-box experience of running juliaup under rr?

Not really. If you're using BugReporting or whatever, we could somehow mark which process is the interesting one in the trace, but for vanilla rr, there isn't really a way to know. After all, maybe you want to debug the launcher.

@maleadt
Member Author

maleadt commented Jan 31, 2023

If you're using BugReporting or whatever, we could somehow mark which process is the interesting one in the trace, but for vanilla rr, there isn't really a way to know.

FWIW, BugReporting calls rr record on Base.julia_cmd(), which is the actual Julia binary, so we don't have this issue there.

@maleadt
Member Author

maleadt commented Feb 20, 2023

Bump, @davidanthoff.

@vchuravy
Member

Wholeheartedly endorse this. I use juliaup everywhere now, but every so often I need to use GDB and get confused by the fact that my breakpoints are not working.

@davidanthoff
Collaborator

davidanthoff commented May 6, 2023

While https://docs.rs/nix/0.26.2/nix/unistd/fn.fork.html#safety is not super clear, it has this sentence

Note that memory allocation may not be async-signal-safe and thus must be prevented.

which to me suggests that we can't do things like update the version db in the child process? The example just above that section also says that one can't call things like println! or unwrap in the child process, which would confirm that interpretation?

The way I am reading that section, it is not enough to not introduce multi-threading ourselves into julialauncher.

@maleadt
Member Author

maleadt commented May 8, 2023

Note that memory allocation may not be async-signal-safe and thus must be prevented.

which to me suggests that we can't do things like update the version db in the child process?

Only if the parent process is multithreaded. AFAIU, the example would be safe to use println! as it isn't multithreaded. Or does Rust default to spawning multiple processes like Julia does?

@staticfloat
Sponsor Member

Or does Rust default to spawning multiple processes like Julia does?

No, it does not.

@davidanthoff
Collaborator

But if say any library we used were to spawn a thread, then we would be in trouble, right? Say our https request library, or anything else? We would essentially have to audit all of those packages? And if we wanted for example to adopt something like tokio down the road, then that would presumably also not work?

I'm wondering whether another strategy would be to make this replace thing opt-in (either via a command line flag, an env var or a config file setting), and when that is the case, we just skip the entire background "check for new versions" step. We would then not have to fork at all, but could just do the replace step by itself?

@maleadt
Member Author

maleadt commented May 8, 2023

But if say any library we used were to spawn a thread, then we would be in trouble, right? Say our https request library, or anything else?

Only if those threads were spawned before the fork.

Not doing the fork would be fine by me too, but I'd rather we use some heuristic to detect gdb/rr/nsight/etc instead of using opt-in flags/env vars/settings.

@davidanthoff
Collaborator

I'd rather we use some heuristic to detect gdb/rr/nsight/etc instead of using opt-in flags/env vars/settings

Yes, that sounds good to me as well!
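
One possible shape for such a heuristic, as a hedged sketch only: on Linux, `/proc/self/status` carries a `TracerPid` field that is nonzero while another process is attached via ptrace, which is how gdb attaches. The thread does not say which heuristic, if any, was adopted, and this sketch makes no claim about profilers such as Nsight. `parse_tracer_pid` and `is_traced` are hypothetical names.

```rust
use std::fs;

/// Extracts the `TracerPid` field from the text of `/proc/<pid>/status`.
fn parse_tracer_pid(status: &str) -> Option<u32> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("TracerPid:"))
        .and_then(|value| value.trim().parse().ok())
}

/// Linux-only heuristic: true if some process is ptrace-attached to us.
/// Returns false when /proc is unavailable or the field is missing.
fn is_traced() -> bool {
    fs::read_to_string("/proc/self/status")
        .ok()
        .and_then(|status| parse_tracer_pid(&status))
        .map_or(false, |pid| pid != 0)
}
```

A launcher could consult `is_traced()` early and skip the fork entirely when it returns true, going straight to the exec.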

@staticfloat
Sponsor Member

Only if those threads were spawned before the fork.

IMO, the spirit of this PR is that the fork happens very early on, before we've started to do significant work. Forking is not a rare thing to do in the Linux world, so most Rust crates "should" avoid spinning up threads right at startup, I would imagine. Verifying this is easy: I launched julialauncher in lldb, put a breakpoint on fork, and then dumped the thread list. It turns out that ctrlc launches a background thread to handle signals when we set the handler, so if we set that up after the fork, we should be good to go.
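
The same check can also be made from inside the process instead of from a debugger. The following is a minimal sketch assuming Linux, where `/proc/self/task` holds one entry per thread; `thread_count` and `ensure_single_threaded` are hypothetical names, not part of juliaup.

```rust
use std::fs;
use std::io;

/// Counts the threads of the current process.
/// Linux-specific: /proc/self/task has one directory entry per thread.
fn thread_count() -> io::Result<usize> {
    Ok(fs::read_dir("/proc/self/task")?.count())
}

/// Returns an error unless the process is single-threaded, so a caller
/// can refuse to fork once some dependency has started a thread.
fn ensure_single_threaded() -> io::Result<()> {
    let n = thread_count()?;
    if n == 1 {
        Ok(())
    } else {
        Err(io::Error::new(
            io::ErrorKind::Other,
            format!("refusing to fork: {} threads are running", n),
        ))
    }
}
```

Calling `ensure_single_threaded()` right before the fork, for instance in debug builds, would flag a dependency that starts threads earlier than expected.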

@fingolfin
Contributor

fingolfin commented Jul 24, 2023

This would also likely resolve some issues, such as #607 :-)

I am confused about the concerns regarding multithreading and async IO here, but that's probably because I am completely unfamiliar with how juliaup is designed... but I just wanted to point out that of course both child and parent process can safely use multithreading if everything is done right (all processes on *nix are ultimately launched via fork+exec). As @staticfloat pointed out this is therefore no problem if the fork happens early enough.

If there are concerns about some library being badly designed and violating this rule by creating threads "too early" then I'd suggest to simplify the launcher to just this:

  1. minimal setup, avoiding fork and then ...
  2. the child process replaces itself via exec with a process that performs whatever complex background operations you need, such as updating databases
  3. the parent process replaces itself using exec with actual julia

Such a launcher would have minimal (ideally: no) external dependencies and therefore should be easy to audit against the problem of multi threading being introduced "accidentally". The two subprocesses of course can use multithreading arbitrarily.

Looking at https://github.com/JuliaLang/juliaup/blob/main/src/bin/julialauncher.rs it seems that a DB update already is handled by launching a separate process, so this all seems not far from what I described above.

So I am somewhat confused as to what the major concern here is?

@fredrikekre
Member

Bump. What's the status here? Is anything besides resolving the conflict needed before merging?

@davidanthoff
Collaborator

I have another question about this: does this work across 32-64 bit boundaries? I.e. say Juliaup is 64 bit (so julialauncher as well), and then a user wants to start a 32 bit version of Julia (with julia +release~x86). Would that work?

This simplifies use of juliaup with, e.g., debuggers.
Database updates are executed in a forked process.
@fingolfin
Contributor

@davidanthoff that should work. Otherwise a 64-bit bash/zsh/... could not launch 32-bit binaries and vice versa (they all do fork+exec with no special handling for 32 vs 64 bit).

Contributor

@fingolfin fingolfin left a comment


Thank you, this looks very good, with the caveat that I don't really speak Rust. But I can follow the general control flow, and this looks like what I would expect.

So I hope this can be merged soon and released :-)

@maleadt
Member Author

maleadt commented Jan 4, 2024

It turns out that ctrlc launches a background thread to handle signals when we set the handler, so if we set that up after the fork, we should be good to go.

I couldn't reproduce this, and I don't see any threads/processes being created (gdb + catch syscall clone) except for the one we initiate. Still, it seems wise to delay the CTRL-C handler registration, so I moved it.

@maleadt
Member Author

maleadt commented Jan 4, 2024

Rebased, addressed comments, and made sure there are no new GH:A warnings. I also did some extensive testing locally, and this seems to work fine. I didn't test Windows, but nothing changed there (as is obvious when inspecting the diff with whitespace changes removed). So this seems good to go?

@KristofferC
Sponsor Member

I'll merge this in a couple of days assuming no news here.

Collaborator

@davidanthoff davidanthoff left a comment


Has someone tested all the various background task code paths for this? Things like a) initial setup and b) an actual version db update (i.e. where the version db on the server has actually changed)? I think the self-update is something we'll only be able to test once we put this out to at least the dev channel, but we should try that right after this is merged as well. I think a) is partially covered by tests, but b) probably not.

And I think this here still means that we essentially rule out that we ever adopt something like tokio with threads down the road? Probably not even in juliaup, as there is a fair bit of code shared between the binaries...

@fingolfin
Contributor

And I think this here still means that we essentially rule out that we ever adopt something like tokio with threads down the road?

No, it does not mean that, as I explained above.

@maleadt
Member Author

maleadt commented Jan 5, 2024

Things like a) initial setup and b) an actual version db update

AFAICT initial set-up isn't affected by this change. And the background tasks still work; I don't see why they would not, since the forked child behaves identically to the parent.

❯ tail ~/.julia/juliaup/juliaup.json
  "LastVersionDbUpdate": "2020-01-05T11:45:45.781666653Z"

❯ ./target/release/julia -e 42

❯ tail ~/.julia/juliaup/juliaup.json
  "LastVersionDbUpdate": "2024-01-05T11:46:35.160895192Z"

@StefanKarpinski
Sponsor Member

StefanKarpinski commented Jan 10, 2024

No, it does not mean that, as I explained above.

Also, the double fork+exec for the update process is probably better anyway, since that's required to avoid a zombie child and is the standard way to start a daemon process on UNIX, which presumably is what we want the updater to be. Overall, this is just the correct design on UNIX systems. It cannot be done on Windows, of course, but better to do this correctly where we can.
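
The double-fork pattern mentioned here can be sketched as follows. It is a hedged illustration, not code from this PR: `spawn_detached` is a hypothetical name, the libc functions are declared by hand to avoid external crates, and the grandchild runs a closure where a real updater would more likely exec a separate binary. The closure itself is subject to the usual restrictions on what is safe after `fork` if the caller is multithreaded.

```rust
use std::io;

// Declared by hand so the sketch needs no external crate (an assumption
// of this example). pid_t is taken to be i32, as on Linux and macOS.
extern "C" {
    fn fork() -> i32;
    fn setsid() -> i32;
    fn waitpid(pid: i32, status: *mut i32, options: i32) -> i32;
    fn _exit(status: i32) -> !;
}

/// Runs `work` in a grandchild process. The intermediate child exits at
/// once and is reaped here, so the grandchild is re-parented away from
/// the caller and the caller is left with no zombie.
fn spawn_detached(work: impl FnOnce()) -> io::Result<()> {
    match unsafe { fork() } {
        -1 => Err(io::Error::last_os_error()),
        0 => unsafe {
            // Intermediate child: start a new session, fork again, exit.
            setsid();
            match fork() {
                0 => {
                    // Grandchild: the detached background process.
                    work();
                    _exit(0)
                }
                -1 => _exit(1),
                _ => _exit(0),
            }
        },
        pid => {
            // Reap the short-lived intermediate child right away.
            let mut status = 0;
            if unsafe { waitpid(pid, &mut status, 0) } == -1 {
                return Err(io::Error::last_os_error());
            }
            Ok(())
        }
    }
}
```

Compared with a single fork, the caller can exec into Julia afterwards without leaving a defunct child behind, at the cost of one extra short-lived process.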

@davidanthoff davidanthoff merged commit abaf9da into JuliaLang:main Jan 11, 2024
23 checks passed
@vtjnash
Sponsor Member

vtjnash commented Jan 11, 2024

The zombie problem doesn't happen on Windows, so it doesn't need the daemonize workaround.

@maleadt maleadt deleted the execve branch January 12, 2024 07:33
10 participants