Segfaults in C++ native extension #74
Comments
Any thoughts @CAMOBAP ? I am on macOS using the precompiled version. It only happened once to me...
@ronaldtse as far as I understand, we are talking about https://github.com/metanorma/annotated-express/pull/70#issue-594401153 right? If this happens randomly, I'm also a bit confused, because as @zakjan already mentioned, we aren't in a multi-threaded environment. @ronaldtse could you please confirm your steps to reproduce the issue (even if it happened only once), including platform and gem version?
I just ran Version info:
You can see the full trace in metanorma/annotated-express#70 .
@maxirmx this segfault is indeed challenging to our release pipeline, I just encountered it today. Would you have time to do this? Thanks!
@ronaldtse The issue is that I do not see where it happens. Possibly 'everywhere' and it is a fundamental design flaw.
@maxirmx the Expressir gem's parser code is generated from the ANTLR grammar using https://github.com/camertron/antlr4-native-rb , so the pointer reference issues are likely there. Could you investigate in that direction?
@maxirmx The segfault seems reliably reproduced in this build with Ruby 2.7: https://github.com/lutaml/expressir/runs/5721221032?check_suite_focus=true Maybe that is a good place to start investigation.
I made several changes that improved stability as follows:
However, there is another issue (or issues) out there. It happens very rarely and is a ri test failure, not a crash. Also, overall CI/testing does not look very robust, as I described in #113
The root causes
- finalized fixes for segfaults in C++ native extension (using antlr4-native-rb 2.0.0.1) (Segfaults in C++ native extension #74)
- added native extension sanity checks to CI scripts (Code review #113)
- added rubocop run to CI scripts, some of the ruby files are fixed to meet desired rubocop criteria (Code review #113)
- added verification of packaged gems in CI scripts (this is critical since development and release procedures use different toolchains) (Code review #113)
- cleaned obsolete files
- fixed build for x64-mingw-ucrt binary gem (Build x64-mingw-ucrt version to support Windows Ruby 3.1 #103)
Although the extension looks more stable, there is (are) issue(s) that cause abnormal terminations. It looks like the problem is related to compaction (a good explanation is here: https://alanwu.space/post/check-compaction/). Something very similar was researched and not fixed in Rice, as discussed here: ruby-rice/rice#159
Expressir implements a method to access the ANTLR token stream from Ruby.
Opening this issue to summarize the state of C++ native extension stability.
During the initial implementation of the C++ native parser, there used to be frequent segfaults. At that time, I reduced the segfault frequency to zero on my machine by reordering the Ruby code so that it always first extracts values from native classes into Ruby variables, and only then uses them for further processing. It's strange, but it helps, as if it introduced some sort of sync barrier (which would make sense in multi-threaded code, but we're not multi-threaded). It might be caused by Rice or Ruby GC, but I couldn't find the real cause yet.
expressir/lib/expressir/express_exp/visitor.rb
Lines 1916 to 1948 in 37015c9
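The referenced lines are not reproduced here. The following is only a minimal sketch of the extract-first ordering described above, not the actual visitor code. `FakeNativeCtx` and `visit_ctx` are hypothetical names; the class is a plain-Ruby stand-in for a native-backed parse context, so the sketch shows the ordering only and cannot reproduce the crash.

```ruby
# Hypothetical plain-Ruby stand-in for a native-backed parse context.
# In the real gem this would be an object wrapped by the C++ extension.
class FakeNativeCtx
  attr_reader :text, :children

  def initialize(text, children = [])
    @text = text
    @children = children
  end
end

# Extract-first ordering: copy what is needed out of the native object
# into Ruby-owned values, and only then do further processing.
def visit_ctx(ctx)
  # Step 1: pull values out of the (assumed) native wrapper.
  text = ctx.text.dup
  children = ctx.children.to_a

  # Step 2: process using only the Ruby variables extracted above.
  { text: text, children: children.map { |child| visit_ctx(child) } }
end

tree = FakeNativeCtx.new("root", [FakeNativeCtx.new("leaf")])
p visit_ctx(tree)
```

The point of the ordering is that the result is built from Ruby strings and arrays, so nothing in it keeps a reference back to a native object.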
antlr4-native-rb has a note saying to avoid retaining any references to ANTLR4 native classes, otherwise it leads to segfaults. I think that I'm following this correctly, but apparently there is more to it. In a discussion with the author, he's also not sure about the real cause.
Based on https://github.com/metanorma/annotated-express/pull/70#issuecomment-800842128 segfaults still occur. Rarely, but they do. This needs more investigation. For now, if a segfault occurs, just re-run the command.
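As a stopgap, the re-run could be automated in a script. This is a rough sketch under assumptions, not something that exists in Expressir: `run_with_retry` and `crashed?` are hypothetical helpers, and treating SIGABRT as a crash is an assumption (a crashing Ruby process usually prints its bug report and then aborts, so the child may end with SIGABRT rather than SIGSEGV).

```ruby
# Signals treated as "the native extension crashed" (assumption, POSIX only).
CRASH_SIGNALS = [Signal.list.fetch("SEGV"), Signal.list.fetch("ABRT")].freeze

# True if the process status says the child was killed by a crash signal.
def crashed?(status)
  status.signaled? && CRASH_SIGNALS.include?(status.termsig)
end

# Hypothetical helper: run cmd (an array of strings) and re-run it when it
# crashes, up to `attempts` times. `runner` is injectable so the retry logic
# can be exercised without a real crash.
def run_with_retry(cmd, attempts: 3, runner: nil)
  runner ||= lambda do |c|
    system(*c)
    Process.last_status
  end

  attempts.times do |i|
    status = runner.call(cmd)
    # Normal exits (success or ordinary failure) are returned untouched.
    return status unless crashed?(status)

    warn "attempt #{i + 1} crashed (signal #{status.termsig}), re-running"
  end
  raise "command kept crashing after #{attempts} attempts: #{cmd.join(' ')}"
end
```

Usage would look like `run_with_retry(["bundle", "exec", "rspec"])`. Ordinary test failures are not retried, only terminations by a crash signal, so the wrapper does not hide real failures.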