Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfaults in C++ native extension #74

Closed
zakjan opened this issue Mar 17, 2021 · 11 comments · Fixed by #116 or #121
Closed

Segfaults in C++ native extension #74

zakjan opened this issue Mar 17, 2021 · 11 comments · Fixed by #116 or #121
Assignees
Labels
bug Something isn't working

Comments

@zakjan
Copy link
Contributor

zakjan commented Mar 17, 2021

Opening this issue to summarize the state of C++ native extension stability.

During the initial implementation of C++ native parser, there used to be frequent segfaults. In that time, I decreased segfault frequency to none on my machine by reordering of Ruby code, so that it always first extracts values from native classes to Ruby variables, and only then uses them for further processing. It's strange, but it helps, as if it introduces some sort of a sync barrier (if we were in a multi-threaded code, but we're not). It might be caused by Rice or Ruby GC, but I couldn't find the real cause yet.

def visit_schema_decl(ctx)
ctx__schema_id = ctx.schema_id
ctx__schema_version_id = ctx.schema_version_id
ctx__schema_body = ctx.schema_body
ctx__schema_body__interface_specification = ctx__schema_body&.interface_specification
ctx__schema_body__constant_decl = ctx__schema_body&.constant_decl
ctx__schema_body__schema_body_declaration = ctx__schema_body&.schema_body_declaration
id = visit_if(ctx__schema_id)
version = visit_if(ctx__schema_version_id)
interfaces = visit_if_map(ctx__schema_body__interface_specification)
constants = visit_if(ctx__schema_body__constant_decl, [])
declarations = visit_if_map(ctx__schema_body__schema_body_declaration)
types = declarations.select{|x| x.is_a? Model::Type}
entities = declarations.select{|x| x.is_a? Model::Entity}
subtype_constraints = declarations.select{|x| x.is_a? Model::SubtypeConstraint}
functions = declarations.select{|x| x.is_a? Model::Function}
rules = declarations.select{|x| x.is_a? Model::Rule}
procedures = declarations.select{|x| x.is_a? Model::Procedure}
Model::Schema.new({
id: id,
version: version,
interfaces: interfaces,
constants: constants,
types: types,
entities: entities,
subtype_constraints: subtype_constraints,
functions: functions,
rules: rules,
procedures: procedures
})
end

There is a note to avoid retaining any references to ANTLR4 native classes in antlr4-native-rb, otherwise it leads to segfaults. I think that I'm following this correctly, but apparently there is more to that. In a discussion with the author, he's also not sure about the real cause.

Based on https://github.com/metanorma/annotated-express/pull/70#issuecomment-800842128 segfaults still occur. Rarely, but they do. This needs more investigation. If a segfault occurs, currently just re-run the command.

@zakjan zakjan added the bug Something isn't working label Mar 17, 2021
@ronaldtse
Copy link
Contributor

Any thoughts @CAMOBAP ? I am on macOS using the precompiled version. It only happened once to me...

@CAMOBAP
Copy link
Contributor

CAMOBAP commented Mar 18, 2021

@ronaldtse as far as I understood we are talking about https://github.com/metanorma/annotated-express/pull/70#issue-594401153 right?

If this happens randomly, I'm also a bit confused because as @zakjan already mentioned we aren't in multi-threaded environment.

@ronaldtse could you please confirm your steps to reproduce the issue (even if it happens only once) including platform and gem version

@ronaldtse
Copy link
Contributor

ronaldtse commented Mar 18, 2021

I just ran bundle exec metanorma site generate.

Version info:

/Users/me/.rbenv/versions/2.6.6/lib/ruby/gems/2.6.0/gems/expressir-0.2.21-x86_64-darwin/lib/expressir/express_exp/visitor.rb:432: [BUG] Segmentation fault at 0x0000000000000000
ruby 2.6.6p146 (2020-03-31 revision 67876) [x86_64-darwin19]

You can see the full trace in metanorma/annotated-express#70 .

Gemfile.lock.zip

@ronaldtse
Copy link
Contributor

@maxirmx this segfault is indeed challenging to our release pipeline, I just encountered it today. Would you have time to do this? Thanks!

@maxirmx
Copy link
Contributor

maxirmx commented Mar 26, 2022

@ronaldtse
From the discussions in this and other issues I got an impression that there are some references to ANTLR4 native variables that are used accross calls to Ruby code. Such instances are local variables (i.e. allocated on stack) and passed by pointer. Result should be unpredictable but it is an error, of course.

The issue is that I do not see where it happens. Possibly 'everywhere' and it is a fundamental design flaw.

@ronaldtse
Copy link
Contributor

@maxirmx the Expressir gem's parser code is generated from the ANTLR grammar using https://github.com/camertron/antlr4-native-rb , so the pointer reference issues are likely there. Could you investigate in that direction?

@ronaldtse
Copy link
Contributor

@maxirmx The segfault seems reliably reproduced in this build with Ruby 2.7: https://github.com/lutaml/expressir/runs/5721221032?check_suite_focus=true

Maybe that is a good place to start investigation.

@maxirmx
Copy link
Contributor

maxirmx commented May 4, 2022

I made several changes that improved stability as follows:

  • parse.rb -- self.from_file, self.from_files
    parse, syntax, visitor methods return complex tree structures created in netive (C++) extension
    visit method references nodes and leaves of this structures but it is totally untransparent for Ruby GarbageCllector
    so in this class we keep those C++ structure marked for GC so they are not freed

  • upgraded antlr to version 4.9.2 It shall resolve an issue with th order of static constructors (antlr/antlr4@0d0e7b4)

  • Changed ``ContextProxy``` implementation so that getParent, getChildren methods return 'detached' objects that can be safely freed by Ruby Garbage collector

However, there is other issue (or issues) out there. It happens very rarely and is a ri test failure, not crash. Also overal CI/testing dows not look very robust as I described in #113

@maxirmx
Copy link
Contributor

maxirmx commented May 5, 2022

@maxirmx maxirmx linked a pull request May 5, 2022 that will close this issue
maxirmx added a commit that referenced this issue May 12, 2022
- finalized fixes for segfaults in C++ native extension ( using antlr4-native-rb 2.0.0.1 ) (Segfaults in C++ native extension #74)
- added native extension sanity checks to CI scripts (Code review  #113)
- added rubocop run to CI scripts, some of ruby files are fixed to meet desired rubocop criteria (Code review  #113)
- added verification of pakaged gems in CI scripts (this is critical since development and release procedures use different toolchains) (Code review  #113)
- cleaned obsolete files
- fixed build for x64-mingw-ucrt binary gem (Build x64-mingw-ucrt version to support Windows Ruby 3.1 #103)
@maxirmx
Copy link
Contributor

maxirmx commented May 15, 2022

Although the extension looks more stable, there is (are) issue(s) that cause abnormal terminations. It looks like the problem is related to compaction (good explanation is here: https://alanwu.space/post/check-compaction/)
Running 'verify_compaction_references' results in weird issues so it is not easy to track either.

Something very similar was researched and not fixed in Rice as discussed here: ruby-rice/rice#159

@maxirmx maxirmx reopened this May 15, 2022
@maxirmx maxirmx linked a pull request May 25, 2022 that will close this issue
@maxirmx
Copy link
Contributor

maxirmx commented May 27, 2022

Expressir implements a method to access antlr token stream from Ruby.
This method (Parser.tokens in Ruby; ParserProxy::getTokens in C++) is generates on top of the source created by antlr4-native gem. getTokens returns Ruby array containing direct references to antlr objects and it is an issue.
When Ruby releases the array it also attempts to release all its members since there is no information of other references to said members.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
4 participants