Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error Handling for Nokogiri::XML::SAX::Document under jruby seems broken #1847

Closed
adjam opened this issue Dec 20, 2018 · 5 comments
Closed

Comments

@adjam
Copy link

adjam commented Dec 20, 2018

What problems are you experiencing?

Created a subclass of Nokogiri::XML::SAX::Document that raises an exception in the #error method. Tried to parse a non-well-formed XML document, expecting the error to be raised, but it wasn't.

What's the output from nokogiri -v?

    ---
    warnings: []
    nokogiri: 1.9.1
    ruby:
      version: 2.5.0
      platform: java
      description: jruby 9.2.5.0 (2.5.0) 2018-12-06 6d5a228 OpenJDK 64-Bit Server VM 25.191-b12
        on 1.8.0_191-b12 +jit [linux-x86_64]
      engine: jruby
      jruby: 9.2.5.0
    xerces: Xerces-J 2.12.0
    nekohtml: NekoHTML 1.9.21

Can you provide a self-contained script that reproduces what you're seeing?

#!/usr/bin/env ruby

require 'nokogiri'

class NoopDocument < Nokogiri::XML::SAX::Document
    def start_element(el, attrs)
        puts "#{el}"
    end

    def end_element(el)
        puts "/#{el}"
    end

    def error(msg)
        raise(StandardError, msg)
    end
end

data = %q{<xml><element/ <element att="^G"/></xml>}

doc = NoopDocument.new
p = Nokogiri::XML::SAX::Parser.new(doc)

begin
    p.parse(data)
    puts "completed without errors."
rescue StandardError => e
    puts e
    puts "An (expected) error was thrown"
end
  • outputs
xml
completed without errors.

When run under the above configuration. When I change rubies to 2.5.1 and now nokogiri -v outputs:

# Nokogiri (1.9.1)
    ---
    warnings: []
    nokogiri: 1.9.1
    ruby:
      version: 2.5.1
      platform: x86_64-linux
      description: ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/ajconsta/.gem/ruby/2.5.1/gems/nokogiri-1.9.1/ports/x86_64-pc-linux-gnu/libxml2/2.9.8"
      libxslt_path: "/home/ajconsta/.gem/ruby/2.5.1/gems/nokogiri-1.9.1/ports/x86_64-pc-linux-gnu/libxslt/1.1.32"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Fix-nullptr-deref-with-XPath-logic-ops.patch
      - 0003-Fix-infinite-loop-in-LZMA-decompression.patch
      libxslt_patches: []
      compiled: 2.9.8
      loaded: 2.9.8

I see the expected output:

xml
error parsing attribute name
An (expected) error was thrown

(for reference, in case GitHub 'cleans' the document, the attribute value in the included XML is \x07, which is not valid for the utf-8 encoding. I also made the document not well-formed in more obvious ways to crank up the error-throwing capabilities.)

@flavorjones
Copy link
Member

Hey, thank you so much for reporting this! I'm not at all confident that the JRuby implementation of the SAX parser has had many eyes on it over the years, and so it's great to hear feedback.

I'll make sure someone takes a look at this in the next few days.

@adjam
Copy link
Author

adjam commented Feb 8, 2019

Did some research of my own; my current thinking is that the culprit is

https://github.com/sparklemotion/nokogiri/blob/master/ext/java/nokogiri/internals/NokogiriHandler.java#L256

-- when the underlying (Java) SAX parser encounters the error, NokogiriHandler (which is registered as the SAX ErrorHandler here) in turn calls the #error method on the Nokogiri::XML::SAX::Document class (implemented in Ruby), and the catch block linked above effectively swallows the exception thrown in the ruby code.

My first reaction is to take out the catch block here, on the assumption that if the #error method throws an exception that's on the implementer of that method; this is how it works under MRI, as the test case shows. There might be something more subtle, e.g. if the RaiseError is a ruby exception, retthrow it, whereas if it's a Java exception, wrap and rethrow, but I am not familiar enough with the general intent. If you have a preference for one of these two, I can create a PR.

@adjam
Copy link
Author

adjam commented Feb 8, 2019

It looks like the XmlSAXPushParser test depends on the current behavior, viz. that unless NokogiriHandler.java populates addError, its tests will fail (you can push bad XML into it and no exception will be raised).

This is due to the thread juggling on which that implementation is based; the error is thrown by a method executing on a different thread than the one doing the parsing but IIUC it shares the NokogiriHandler with the PushParser, and this handler instance handles communication between the threads. If the handler's errors are not populated, the pushparser has no way of knowing an error has occurred.

A full fix will have to find a way to propagate the error properly in both cases; one possible way of proceeding is to create separate synchronized error handler that handles the job of communicating between the two threads and delegates the propagation to the Document class defined by the client.

@adjam
Copy link
Author

adjam commented Feb 9, 2019

Still working on a PR; I can make all the tests pass by keeping the try/catch behavior referenced above and calling addError in the catch block before rethrowing the RaiseException. This satisfies both the push and SAX parser tests. Not sure yet if this has weird side effects (what with threads being involved n'all).

@flavorjones
Copy link
Member

Closed by #1872

@flavorjones flavorjones added this to the v1.11.0 milestone Feb 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants