Exceptions and stack trace overheads #45999

Open
kpamnany opened this issue Jul 11, 2022 · 8 comments

kpamnany (Contributor) commented Jul 11, 2022

We observe two significant performance issues with exception handling:

  • With our sysimage (~3X larger than the stock Julia sysimage), gathering a displayable stack trace takes ~4 seconds. Profiling shows that this time is mostly spent in LLVM (SectionRef::containsSymbol and ELFObjectFile::getSymbolAddress). Any suggestions/ideas for how to improve this would be very helpful!
  • throw can take very slow paths. Here's one:
jl_throw
record_backtrace
rec_backtrace
jl_unw_stepn
jl_unw_step
unw_step
dwarf_step
find_reg_state
fetch_proc_info
tdep_find_proc_info # def'd to dwarf_find_proc_info
sigprocmask
dl_iterate_phdr
sigprocmask

dl_iterate_phdr walks through the list of loaded shared objects; this can be a long list. We observe this call graph reasonably often; I'm not sure if that has anything to do with the environment (EC2 instance). Our large sysimage seems to exacerbate the performance problem here as well. I also found that exceptions are thrown not infrequently during type inference, which is another performance hit.
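
For illustration, here is a standalone C sketch of the pattern (not Julia runtime code): dl_iterate_phdr has no index, so finding the object that contains a given PC means running a callback once per loaded object, under the loader's internal lock, until one matches. With a large sysimage plus many shared libraries, every lookup that takes this path pays a linear scan.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>
#include <stdio.h>

struct lookup { void *pc; const char *name; };

static int contains_pc(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)size;
    struct lookup *q = data;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_LOAD)
            continue;
        ElfW(Addr) start = info->dlpi_addr + ph->p_vaddr;
        if ((ElfW(Addr))q->pc >= start && (ElfW(Addr))q->pc < start + ph->p_memsz) {
            q->name = info->dlpi_name;
            return 1;   /* found the containing object: stop the walk */
        }
    }
    return 0;           /* not here: keep scanning the (potentially long) list */
}

int main(void)
{
    struct lookup q = { (void *)&main, NULL };
    dl_iterate_phdr(contains_pc, &q);
    printf("PC %p lives in \"%s\"\n", q.pc,
           q.name && q.name[0] ? q.name : "(main executable)");
    return 0;
}
```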

I'm not sure what a good solution here would be. A couple of ideas:

  • At throw time, record the bare minimum and build out the backtrace only when/if it is needed. Implementing this might be gnarly (a rough sketch of the idea follows this list).
  • Introduce a throw_light or similar when using exceptions for control flow. This is kinda ugly.
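
As a very rough sketch of the first idea (hypothetical, not what the runtime does today): capture only raw return addresses with a cheap frame-pointer walk at throw time, and defer DWARF unwinding and symbolization until the trace is actually displayed. This assumes frame pointers are preserved (e.g. -fno-omit-frame-pointer) and a GCC/Clang toolchain.

```c
#include <stddef.h>

/* Hypothetical fast-path capture: raw PCs only, no DWARF, no dl_iterate_phdr.
 * Assumes the saved frame pointer and return address sit next to each other
 * on the stack, which only holds when frame pointers are not omitted. */
struct frame { struct frame *fp; void *ret; };

static size_t capture_raw_ips(void **ips, size_t max)
{
    struct frame *f = __builtin_frame_address(0);   /* GCC/Clang builtin */
    size_t n = 0;
    while (f && f->ret && n < max) {
        ips[n++] = f->ret;   /* record the bare minimum: the return address */
        f = f->fp;           /* walk to the caller's frame */
    }
    return n;                /* symbolize lazily, only if the trace is shown */
}
```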

Any other ideas? I do see references to caching certain information in libunwind code, but I'm not familiar enough with the codebase to understand if this is being done sufficiently.

Cc: @vchuravy, @JeffBezanson, @vtjnash

Edited to correct per the following two comments.

kpamnany added the performance (Must go faster) and error handling (Handling of exceptions by Julia or the user) labels on Jul 11, 2022

vchuravy (Member) commented:

dl_iterate_phdr is a libc call. Can you also post the screenshot you showed me, where ~60% of the time was in a syscall because we hit the slow path in libunwind?

vchuravy (Member) commented:

To be precise, the remote sleuthing I did pointed towards https://github.com/libunwind/libunwind/blob/3be832395426b72248969247a4a66e3c3623578d/src/dwarf/Gfind_proc_info-lsb.c#L806-L808

The profile had ~60% of the time spent in the syscall sigprocmask, and the function it was called from was find_proc_info, as in the chain Kiran showed there:

  1. find_proc_info
  2. fetch_proc_info
  3. find_reg_state
  4. dwarf_step

In particular, there is a different path in https://github.com/libunwind/libunwind/blob/1f79a05edbd5c06240f8a15187b106831076540e/src/dwarf/Gparser.c#L468 that we are not hitting, and that might be faster.

But it is kinda ludicrous that each dwarf_step involves a walk of potentially all the loaded object files; shouldn't there be some caching there?
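
(For reference, and hedged because I have not checked which lookups it actually covers: libunwind does expose a per-address-space caching policy, and the sketch below is the public knob an embedder could try. Whether it short-circuits the dl_iterate_phdr walk inside dwarf_find_proc_info is exactly the open question here.)

```c
/* Sketch of libunwind's public caching knob; UNW_CACHE_PER_THREAD is the
 * per-thread alternative to the shared UNW_CACHE_GLOBAL cache. */
#define UNW_LOCAL_ONLY
#include <libunwind.h>

void enable_unwind_caching(void)
{
    unw_set_caching_policy(unw_local_addr_space, UNW_CACHE_GLOBAL);
}
```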

vtjnash (Member) commented Jul 11, 2022

That syscall also seems unjustified: libunwind/libunwind@d3fad3a
We could disable it at build time to avoid paying the penalty for a feature we possibly don't need or want. However, I believe it is protecting us against the internal lock in dl_iterate_phdr causing deadlocks for the process (since libunwind promises to be async-signal-safe). On macOS we do something roughly similar by calling _dyld_atfork_prepare to take that same lock before pausing a task for profiling.
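
For readers following along, the pattern being described is roughly this (a simplified illustration, not the actual libunwind source): when libunwind is built to block signals around its locks, each proc-info lookup brackets the dl_iterate_phdr call with two sigprocmask syscalls, which is where the profile time above goes.

```c
#define _GNU_SOURCE
#include <link.h>
#include <signal.h>
#include <stddef.h>

static int noop_cb(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)info; (void)size; (void)data;
    return 0;   /* a real callback would match a PC against this object */
}

/* Simplified: block all signals so a signal handler that also unwinds cannot
 * re-enter while we hold the loader's internal lock, then restore the mask. */
static void guarded_proc_info_lookup(void)
{
    sigset_t all, saved;
    sigfillset(&all);
    sigprocmask(SIG_BLOCK, &all, &saved);     /* syscall #1 */
    dl_iterate_phdr(noop_cb, NULL);           /* takes the loader's lock */
    sigprocmask(SIG_SETMASK, &saved, NULL);   /* syscall #2 */
}
```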

kpamnany (Contributor, Author) commented Jul 11, 2022

[profile screenshot]

This profile? Had no dl_iterate_phdr... but yes, sigprocmask.

Edit: ah, it was in the code we were looking at.

kpamnany (Contributor, Author) commented:

I believe it is protecting us against the internal lock in dl_iterate_phdr from causing deadlocks for the process (since libunwind promises to be async-signal-safe).

That sounds unavoidable then?

@vchuravy's suggestion:

each dwarf_step involves a walk of potentially all the object files loaded and there should be some caching there?

Sounds reasonable to me. For a particular backtrace exploration, will the set of object files loaded be static? Can a thread dlclose a shared object and invalidate the cache?
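
One possible (hypothetical, glibc-specific) answer to the invalidation question: dl_iterate_phdr reports dlpi_adds/dlpi_subs counters that change whenever objects are loaded or unloaded, so a module-range cache could be revalidated by comparing a generation number instead of assuming the set of loaded objects is static.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>

/* Hypothetical cache-invalidation check using glibc's extension fields:
 * dlpi_adds/dlpi_subs count dlopen/dlclose events for the whole process. */
static int read_generation(struct dl_phdr_info *info, size_t size, void *data)
{
    unsigned long long *gen = data;
    if (size >= offsetof(struct dl_phdr_info, dlpi_subs) + sizeof(info->dlpi_subs))
        *gen = (unsigned long long)info->dlpi_adds + info->dlpi_subs;
    return 1;   /* the counters are process-wide, so stop after the first object */
}

static unsigned long long loader_generation(void)
{
    unsigned long long gen = 0;
    dl_iterate_phdr(read_generation, &gen);
    return gen;   /* a cache is stale iff this differs from the value at fill time */
}
```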

vtjnash (Member) commented Jul 11, 2022

It might, and that might get us wedged into a really bad state if it happened anyway. In practice: never use dlclose if you care about this.

kpamnany (Contributor, Author) commented:

On reflection, I don't think that could happen. If we're walking the stack for an exception and an address from a shared object is on the stack, it should be impossible for the reference count of the shared object to drop to 0.

vchuravy (Member) commented:

This profile? Had no dl_iterate_phdr... but yes, sigprocmask.

Yeah I think the frames were inlined.

kpamnany self-assigned this on Sep 20, 2022