Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Huge input lookup when parsing (SAX) #2028

Closed
juskoljo opened this issue Apr 27, 2020 · 14 comments
Closed

Huge input lookup when parsing (SAX) #2028

juskoljo opened this issue Apr 27, 2020 · 14 comments

Comments

@juskoljo
Copy link

juskoljo commented Apr 27, 2020

Hi

Apologies if this has been submitted already. I have problem parsing a big XML file that has ~20mb base64 encoded file attached. It seems that when parsing from IO and content has "\r\n" as line separators causes "Huge input lookup" error.

Reproduced "successfully" this with 1.10.9 and older versions

Thanks,
Jussi

Example:

#! /usr/bin/env ruby

# encoding: utf-8

require 'nokogiri'
require 'stringio'

template = <<-XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <something>value</something>
  <more>value</more>
  <boom>%s</boom>
</root>
XML

class Handler < Nokogiri::XML::SAX::Document
  def error(message)
    raise message
  end
end

parser = Nokogiri::XML::SAX::Parser.new(Handler.new)

# OK
xml = template % "#{'X' * 77}\n" * 300_000
parser.parse(xml)

# OK
xml = template % "#{'X' * 77}\r\n" * 300_000
parser.parse(xml)

# OK
xml = StringIO.new(template %  "#{'X' * 77}\n" * 400_000)
parser.parse(xml)

# internal error: Huge input lookup (RuntimeError)
xml = StringIO.new(template % "#{'X' * 77}\r\n" * 300_000)
parser.parse(xml)

Environment

# Nokogiri (1.10.9)
    ---
    warnings: []
    nokogiri: 1.10.9
    ruby:
      version: 2.6.5
      platform: x86_64-darwin18
      description: ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin18]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/jussi/.rvm/gems/ruby-2.6.5/gems/nokogiri-1.10.9/ports/x86_64-apple-darwin18.7.0/libxml2/2.9.10"
      libxslt_path: "/Users/jussi/.rvm/gems/ruby-2.6.5/gems/nokogiri-1.10.9/ports/x86_64-apple-darwin18.7.0/libxslt/1.1.34"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      - 0004-libxml2.la-is-in-top_builddir.patch
      - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
      libxslt_patches: []
      compiled: 2.9.10
      loaded: 2.9.10
@flavorjones
Copy link
Member

@juskoljo Thanks for reporting, and sorry you're having a problem. I'll try to take a look later today.

@juskoljo
Copy link
Author

Hi, any updates on this? 👍

@flavorjones
Copy link
Member

@juskoljo Hi! Apologies for the slow response, it's been hard for me to find time to work on OSS recently.

I'm having difficulty reproducing what you're seeing, here's what I get using the script you provided:

Traceback (most recent call last):
	4: from /home/flavorjones/baz.rb:27:in `<main>'
	3: from /home/flavorjones/.rvm/gems/ruby-2.7.0/gems/nokogiri-1.10.9/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
	2: from /home/flavorjones/.rvm/gems/ruby-2.7.0/gems/nokogiri-1.10.9/lib/nokogiri/xml/sax/parser.rb:110:in `parse_memory'
	1: from /home/flavorjones/.rvm/gems/ruby-2.7.0/gems/nokogiri-1.10.9/lib/nokogiri/xml/sax/parser.rb:110:in `parse_with'
/home/flavorjones/baz.rb:19:in `error': XML declaration allowed only at the start of the document (RuntimeError)

This is because the generated XML has multiple XML declarations.

I took a few minutes to rewrite the script to avoid multiple decls (as well as multiple roots) and still can't reproduce. Here's what I did:

#! /usr/bin/env ruby

# encoding: utf-8

require "nokogiri"
require "stringio"

start_xml = <<-XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
XML

end_xml = <<-XML
</root>
XML

template = <<-XML
<node>
  <something>value</something>
  <more>value</more>
  <boom>%s</boom>
</node>
XML

class Handler < Nokogiri::XML::SAX::Document
  def error(message)
    raise message
  end
end

parser = Nokogiri::XML::SAX::Parser.new(Handler.new)

# OK
xml = start_xml + (template % "#{"X" * 77}\n" * 300_000) + end_xml
parser.parse(xml)
puts "ok"

# OK
xml = start_xml + (template % "#{"X" * 77}\r\n" * 300_000) + end_xml
parser.parse(xml)
puts "ok"

# OK
xml = StringIO.new(start_xml + (template % "#{"X" * 77}\n" * 400_000) + end_xml)
parser.parse(xml)
puts "ok"

# internal error: Huge input lookup (RuntimeError)
xml = StringIO.new(start_xml + (template % "#{"X" * 77}\r\n" * 300_000) + end_xml)
parser.parse(xml)
puts "ok"

which prints out four "ok"s.

Can you help me reproduce this? Or help me discover what I'm doing differently from you (or what's different about my system)?

My nokogiri is:

~ $ nokogiri -v
# Nokogiri (1.10.9)
    ---
    warnings: []
    nokogiri: 1.10.9
    ruby:
      version: 2.7.0
      platform: x86_64-linux
      description: ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/flavorjones/.rvm/gems/ruby-2.7.0/gems/nokogiri-1.10.9/ports/x86_64-pc-linux-gnu/libxml2/2.9.10"
      libxslt_path: "/home/flavorjones/.rvm/gems/ruby-2.7.0/gems/nokogiri-1.10.9/ports/x86_64-pc-linux-gnu/libxslt/1.1.34"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      - 0004-libxml2.la-is-in-top_builddir.patch
      - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
      libxslt_patches: []
      compiled: 2.9.10
      loaded: 2.9.10

@juskoljo
Copy link
Author

juskoljo commented May 31, 2020

Hi @flavorjones

No worries! :). You are right, the first script generated something that was not my intention... Note to myself: Never edit a script online while posting ;). Following script should reproduce the issue.

The script is simulating a scenario when there is a big XML node (base64 encoded file) with CR and/or LF after every 77 chars.

# encoding: utf-8

require 'nokogiri'
require 'stringio'

template = <<-XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <something>value</something>
  <more>value</more>
  <boom>%s</boom>
</root>
XML

class Handler < Nokogiri::XML::SAX::Document
  def error(message)
    raise message
  end
end

parser = Nokogiri::XML::SAX::Parser.new(Handler.new)

# OK
xml = template % ("#{'X' * 77}\n" * 300_000)
parser.parse(xml)

# OK
xml = template % ("#{'X' * 77}\r\n" * 300_000)
parser.parse(xml)

# OK
xml = StringIO.new(template % ("#{'X' * 77}\n" * 400_000))
parser.parse(xml)

# internal error: Huge input lookup (RuntimeError)
xml = StringIO.new(template % ("#{'X' * 77}\r\n" * 300_000))
parser.parse(xml)

Output:

xml.rb:16:in `error': internal error: Huge input lookup (RuntimeError)
	from /Users/x/.rvm/gems/ruby-2.3.8/gems/nokogiri-1.10.9/lib/nokogiri/xml/sax/parser.rb:93:in `parse_with'
	from /Users/x/.rvm/gems/ruby-2.3.8/gems/nokogiri-1.10.9/lib/nokogiri/xml/sax/parser.rb:93:in `parse_io'
	from /Users/x/.rvm/gems/ruby-2.3.8/gems/nokogiri-1.10.9/lib/nokogiri/xml/sax/parser.rb:81:in `parse'
	from xml.rb:36:in `<main>'

My nokogiri seem to be the same as yours except:

   ruby:
      version: 2.3.8
      platform: x86_64-darwin18
      description: ruby 2.3.8p459 (2018-10-18 revision 65136) [x86_64-darwin18]

@juskoljo
Copy link
Author

Hi Mike, any updates on this? The snippet in my previous post should fire the exception 👍

@flavorjones
Copy link
Member

@juskoljo Thanks for your patience, and apologies for for not replying sooner - your reply on May 31 fell through the cracks of my inbox (and I'm still struggling to spend time on OSS).

I'll take a look today, I have some time blocked out.

@flavorjones
Copy link
Member

OK, I've explored this a bit and found something interesting. This script:

#! /usr/bin/env ruby
# encoding: utf-8

require 'nokogiri'
require 'stringio'

TEMPLATE = <<-XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <something>value</something>
  <more>value</more>
  <boom>%s</boom>
</root>
XML

class Handler < Nokogiri::XML::SAX::Document
  def error(message)
    raise message
  end
end

def try_parse(number)
  xml = StringIO.new(TEMPLATE % ("#{'X' * number}\n" * 128_000))
  parser = Nokogiri::XML::SAX::Parser.new(Handler.new)
  begin
    parser.parse(xml)
  rescue Exception => e
    puts "#{number} raises #{e}"
    return
  end
  puts "#{number} is OK"
end

try_parse 77 # ok
try_parse 78 # error
try_parse 79 # ok

emits:

77 is OK
78 raises internal error: Huge input lookup
79 is OK

🤔

@flavorjones
Copy link
Member

flavorjones commented Oct 1, 2020

Here's the call stack when the error is raised:

#5  0x00007ffff331c0b7 in xmlGROW (ctxt=0x555555ae67d0) at parser.c:2106
#6  0x00007ffff3323a9a in xmlParseCharDataComplex (ctxt=0x555555ae67d0, cdata=0) at parser.c:4578
#7  0x00007ffff33236da in xmlParseCharData__internal_alias (ctxt=0x555555ae67d0, cdata=0) at parser.c:4518
#8  0x00007ffff3332d8a in xmlParseContent__internal_alias (ctxt=0x555555ae67d0) at parser.c:9885
#9  0x00007ffff3332ed4 in xmlParseElement__internal_alias (ctxt=0x555555ae67d0) at parser.c:9918
#10 0x00007ffff33356fb in xmlParseDocument__internal_alias (ctxt=0x555555ae67d0) at parser.c:10754
#11 0x00007ffff34b9dbd in parse_doc (ctxt_val=ctxt_val@entry=93824998074320) at ../../../../ext/nokogiri/xml_sax_parser_context.c:85
#12 0x00007ffff7ad3324 in rb_ensure (b_proc=b_proc@entry=0x7ffff34b9db0 <parse_doc>, data1=data1@entry=93824998074320, e_proc=e_proc@entry=0x7ffff34b9d80 <parse_doc_finalize>, 
    data2=data2@entry=93824998074320) at eval.c:1128
#13 0x00007ffff34b9e73 in parse_with (self=<optimized out>, sax_handler=93824999974800) at ../../../../ext/nokogiri/xml_sax_parser_context.c:126
#14 0x00007ffff7c84eaa in vm_call_cfunc_with_frame (empty_kw_splat=<optimized out>, cd=0x555555bd7cc0, calling=<optimized out>, reg_cfp=0x7ffff7017ef8, ec=0x555555758600)
    at vm_insnhelper.c:2514
#15 vm_call_cfunc (ec=0x555555758600, reg_cfp=0x7ffff7017ef8, calling=<optimized out>, cd=0x555555bd7cc0) at vm_insnhelper.c:2539
#16 0x00007ffff7c904b6 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>)
    at vm_insnhelper.c:4023
#17 vm_exec_core (ec=0x4d2, initial=140737488329608, initial@entry=0) at insns.def:801
#18 0x00007ffff7c96f3f in rb_vm_exec (ec=0x555555758600, mjit_enable_p=mjit_enable_p@entry=1) at vm.c:1929
#19 0x00007ffff7ca2310 in rb_iseq_eval_main (iseq=iseq@entry=0x5555557811a8) at vm.c:2179
#20 0x00007ffff7acf68a in rb_ec_exec_node (ec=ec@entry=0x555555758600, n=n@entry=0x5555557811a8) at eval.c:277
#21 0x00007ffff7ad6039 in ruby_run_node (n=0x5555557811a8) at eval.c:335
#22 0x000055555555496b in main (argc=<optimized out>, argv=<optimized out>) at ./main.c:50

@flavorjones
Copy link
Member

I've narrowed this down to what I think is a libxml2 edge case in parsing elements in xmlParseCharDataComplex. Will spend some more time on it this weekend.

@flavorjones
Copy link
Member

OK, I've found the problem and I think it's a bug in libxml2. I'll write a brief description here, but will submit a bug report upstream and will think about patching Nokogiri's vendored library in the meantime.

In brief: two things are happening simultaneously within libxml2/parser.c:

  1. the text node is big enough to exceed XML_MAX_TEXT_LENGTH (which defaults to 10,000,000 bytes)
  2. this bug triggers an edge case where the optimized path in xmlParseCharData passes control to xmlParseCharDataComplex which is limited by XML_MAX_TEXT_LENGTH

If only one or the other of these happens, nobody notices:

  1. if the text node is smaller than XML_MAX_TEXT_LENGTH then there's simply a performance penalty for triggering (2), but functionally it works fine (e.g., in your script if we repeat less than 128,000 times the text node comes in under this size)
  2. if the bug isn't hit, then the optimized path is followed (e.g., in your script if n is 77 or 79 then the bug isn't triggered and so the optimized path is followed and the memory limit isn't triggered)

The bug is related to SAX parsing: by default libxml2 will read the doc in chunks of size xmlIO.c's MINLEN (which is 4,000 bytes). When, after a read, the first byte is 0x0A (aka \n), then the xmlParseCharData function gives up and calls xmlParseCharDataComplex.

You can actually see this in action by making sure the first line of the node's text is:

node = ("x" * 3894) + "\n" # to always trigger the bug

where the importance of 3894 is 4000-106, where 106 is the number of bytes occurring in the document before the node's text begins. Any boom node in this document that is longer than 10,000,000 characters will fail if its 3895th character is a newline. CRAZYTOWN.

Here's the patch that fixes this problem (note that it looks like the same bug exists in xmlParseComment):

diff --git a/parser.c b/parser.c
index f779eb6..ed78d94 100644
--- a/parser.c
+++ b/parser.c
@@ -4506,7 +4506,7 @@ get_more:
             if (ctxt->instate == XML_PARSER_EOF)
 		return;
 	    in = ctxt->input->cur;
-	} while (((*in >= 0x20) && (*in <= 0x7F)) || (*in == 0x09));
+	} while (((*in >= 0x20) && (*in <= 0x7F)) || (*in == 0xD) || (*in == 0xA) || (*in == 0x9));
 	nbchar = 0;
     }
     ctxt->input->line = line;
@@ -4987,7 +4987,7 @@ get_more:
 	    ctxt->input->col++;
 	    goto get_more;
 	}
-    } while (((*in >= 0x20) && (*in <= 0x7F)) || (*in == 0x09));
+    } while (((*in >= 0x20) && (*in <= 0x7F)) || (*in == 0xD) || (*in == 0xA) || (*in == 0x9));
     xmlParseCommentComplex(ctxt, buf, len, size);
     ctxt->instate = state;
     return;

@flavorjones
Copy link
Member

I've opened an issue upstream (https://gitlab.gnome.org/GNOME/libxml2/-/issues/192) and have proposed a change (https://gitlab.gnome.org/GNOME/libxml2/-/merge_requests/86).

@flavorjones
Copy link
Member

Also, apparently we're fixing two bugs with this one! Good work. https://gitlab.gnome.org/nwellnhof/libxml2/-/commit/99bda1e1ee77783e43c9059af00cd326deee3372

@flavorjones
Copy link
Member

@juskoljo I'm going to close this issue, since upstream has acknowledged the problem and seems likely to fix. Please open a new issue if you feel this is urgent enough to warrant applying a patch within Nokogiri's vendored libxml2.

@juskoljo
Copy link
Author

Hi @flavorjones Super! Thank you very much for the update and patch! While waiting update for libxml2 I think I'll manage with the patch you provided! Thanks again :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants