Output of #to_xml munged beyond certain file size using UTF-16 declaration #752

Closed
Phrogz opened this issue Aug 28, 2012 · 5 comments

Phrogz commented Aug 28, 2012

For more details see http://stackoverflow.com/q/12162548/405017

Given a file on disk with UTF-16LE encoding and the contents:

<?xml version="1.0" encoding="UTF-16" ?>
<Foo>
  <Bar><![CDATA[ (...3906 characters...) ]]></Bar>
  <Jim>Oh! Hello there.</Jim>
</Foo>

Reading this file and calling to_xml produces broken output; everything after the text of <Jim> is munged:

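require 'nokogiri'
# Read the raw bytes and tag them as UTF-16
xml = File.open('Simplified.xml','rb:utf-16',&:read)
doc1 = Nokogiri.XML(xml,&:noblanks)
# Serialize in the document's declared encoding (UTF-16), then transcode for display
xml1 = doc1.to_xml.encode('utf-8')
p xml1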
#=> "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<Foo>\n  <Bar><![CDATA[ ... ]]></Bar>\n  <Jim>Oh! Hello there.\uFFFE\u3C00\u0000\u2F00\u0000\u4A00\u0000\u6900\u0000\u6D00\u0000\u3E00\u0000\u0A00\u0000\u3C00\u0000\u2F00\u0000\u4600\u0000\u6F00\u0000\u6F00\u0000\u3E00\u0000\u0A00\u0000"
  • If I delete some of the text from the <Bar> CDATA, the output is correct.

  • I can query and serialize elements that are munged in the output just fine:

    puts doc1.at('Jim').to_xml.encode('utf-8')
    #=> <Jim>Oh! Hello there.</Jim>
    
  • If I remove the XML declaration from the input before parsing the document, the output is correct:

    UTF_DEC = '<?xml version="1.0" encoding="UTF-16" ?>'.encode('UTF-16LE')
    doc2 = Nokogiri.XML(xml.sub(UTF_DEC,''),&:noblanks)
    puts doc2.to_xml.encode('utf-8')
    
    <?xml version="1.0"?>
    <Foo>
      <Bar><![CDATA[ Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam ac augue arcu, eget laoreet lorem. Quisque ac augue velit. Integer consectetur suscipit vehicula. Etiam et convallis enim. Etiam varius massa sit amet lacus rhoncus varius in non ante. Sed dictum, metus eu bibendum ornare, ligula dui commodo urna, ut dignissim felis dolor eget nisl. Proin sit amet nisi nunc. Vestibulum a urna sed dui dignissim blandit nec vel enim. Vivamus tincidunt nulla id dui hendrerit hendrerit. Aliquam neque orci, luctus sit amet fringilla eu, varius vitae diam. Suspendisse varius rutrum lorem eget malesuada. Sed dapibus dapibus nisl, in cursus ante lacinia non. Aenean id sagittis ipsum. Suspendisse elit nunc, porta sit amet blandit ut, laoreet sed est. Nunc eget sem vitae nisl elementum ullamcorper ut sit amet urna. Sed ligula quam, fringilla in facilisis tincidunt, vehicula in nisi. Maecenas a augue in augue semper scelerisque sit amet ut arcu. Praesent hendrerit, enim in elementum ornare, lorem nisi euismod dolor, sit amet ornare mi sem sodales lacus. Fusce et tempor mauris. In non quam nisl, non consequat diam. Duis sit amet massa ultrices massa cursus iaculis. Nunc ullamcorper malesuada sem dignissim semper. Fusce aliquet lacus quis nisi tincidunt sodales. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque posuere commodo aliquet. Aliquam blandit vestibulum facilisis. Sed pellentesque viverra dignissim. Etiam est lacus, mollis eu pretium vitae, lacinia eleifend augue. Mauris vitae quam nisl. In venenatis nunc ac eros elementum cursus. Sed a metus sit amet nunc euismod condimentum id non orci. Curabitur velit turpis, lacinia non eleifend sed, rhoncus id est. Fusce ut massa dolor, ut sodales odio. Donec aliquam convallis tellus, eu pharetra tortor iaculis non. Integer imperdiet feugiat ipsum a gravida. Mauris sapien ipsum, ultricies ac placerat ut, imperdiet eu justo. Quisque quis consectetur velit. 
Etiam facilisis sapien nec enim tincidunt pulvinar. Duis fermentum faucibus felis, sed consequat libero pretium at. Phasellus nibh purus, suscipit in vestibulum vel, blandit at leo. Suspendisse placerat elit sed enim bibendum vel hendrerit mauris pretium. Maecenas ut lacus eu nisi euismod pretium. Aliquam feugiat felis id massa aliquam pharetra sed non eros. Morbi interdum molestie iaculis. Curabitur varius ante ac dui dapibus non laoreet risus blandit. Nunc sit amet magna lacus. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Phasellus egestas nunc sed turpis imperdiet a rhoncus massa aliquam. Nulla facilisi. Phasellus sit amet neque felis, nec vestibulum massa. Donec luctus fringilla dolor et gravida. Phasellus euismod lectus eget elit hendrerit non vehicula tellus venenatis. Phasellus sit amet ligula et purus dignissim feugiat at vitae libero. Proin ut tortor eros, quis laoreet lectus. Quisque nec urna mattis ante gravida fermentum eu at nibh. Phasellus sapien elit, tincidunt quis laoreet id, lobortis sed magna. Aliquam pulvinar erat eu sapien pretium bibendum. Maecenas eleifend, leo quis sodales tincidunt, leo felis tristique dolor, vitae ultrices neque felis ut metus. Etiam dignissim egestas ipsum, eget tempor ipsum rutrum eu. Donec vehicula eleifend ullamcorper. Mauris justo nulla, varius a mattis a, cursus sit amet risus. Phasellus rutrum interdum blandit. Donec ut justo eros, ut auctor dolor. Suspendisse potenti. Cras ultricies, dui eget mattis bibendum, leo dui luctus purus, sit amet rhoncus libero metus eget purus. Pellentesque scelerisque ornare sapien faucibus tempor. Suspendisse potenti. Proin fermentum bibendum dapibus. Pellentesque facilisis aliquam. Nam egestas tellus non mauris scelerisque feugiat pellentesque lacus dignissim. Quisque id nulla felis. Mauris justo mauris, posuere sed facilisis in, venenatis nec risus. Mauris eu dui sed tellus laoreet tempor a in turpis volutpat. ]]></Bar>
      <Jim>Oh! Hello there.</Jim>
    </Foo>
    

Nokogiri 1.5.5 on Ruby 1.9.3p194 (2012-04-20) [i386-mingw32] on Windows 7
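
The munged tail in the first output is not random: U+3C00, U+2F00, U+4A00, U+6900, U+6D00 and U+3E00 are the code units for `<`, `/`, `J`, `i`, `m` and `>` with their two bytes swapped, and U+FFFE is a byte-swapped byte order mark (U+FEFF). The sketch below uses only core Ruby String methods (no Nokogiri) to show how such values arise when UTF-16 bytes are read with the wrong byte order. `misread_code_units` is a hypothetical helper named for this illustration. It does not reproduce the bug itself, and it does not model the interleaved U+0000 units seen in the issue.

```ruby
# Sketch: read big-endian UTF-16 bytes as little-endian 16-bit units.
# `misread_code_units` is a hypothetical helper, not part of Nokogiri.
def misread_code_units(text)
  text.encode("UTF-16BE")   # e.g. "<" becomes the bytes 00 3C
      .unpack("v*")         # "v" reads each byte pair as little-endian: 0x3C00
end

units = misread_code_units("\uFEFF</Jim>")
puts units.map { |u| format("U+%04X", u) }.join(" ")
# prints: U+FFFE U+3C00 U+2F00 U+4A00 U+6900 U+6D00 U+3E00
```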

Phrogz commented Sep 5, 2012

An alternative fix/workaround comes from the Stack Overflow question. Instead of:

xml1 = doc1.to_xml.encode('utf-8')

...use:

xml1 = doc1.to_xml(encoding:'utf-8')

This produces non-munged output.
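
To compare the two serialization paths without eyeballing the output, one option is to scan the resulting UTF-8 string for the markers that appear in the munged tail above: U+FFFE and U+0000. This is a minimal sketch; `munged_utf16_output?` is a hypothetical helper, and it only detects the symptoms shown in this issue, not every possible encoding error.

```ruby
# Hypothetical helper: true if a UTF-8 string contains the markers
# seen in the munged output (a byte-swapped BOM or NUL characters).
def munged_utf16_output?(utf8_xml)
  utf8_xml.include?("\uFFFE") || utf8_xml.include?("\u0000")
end

puts munged_utf16_output?("<Jim>Oh! Hello there.</Jim>")              # false
puts munged_utf16_output?("<Jim>Oh! Hello there.\uFFFE\u3C00\u0000")  # true
```

With Nokogiri loaded, the helper could be applied to both `doc1.to_xml.encode('utf-8')` and `doc1.to_xml(encoding:'utf-8')`.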

flavorjones (Member) commented

@Phrogz, thanks for opening this issue and apologies for the embarrassingly long time it's taken to respond.

This is likely a libxml2 parsing bug. It feels similar in nature to these:

and I'll try to fix and send a PR upstream ... might need a few days.

flavorjones (Member) commented

Phew, this was a tricky one to figure out, but it turns out that Nokogiri wasn't using the proper encoding after libxml2 flushed its internal buffer for the first time. As long as a UTF-16 document was longer than ~4000 code points, this bug would be triggered.

See 2e260f5 for the fix, and #2434 for the PR.
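
For anyone who wants a fixture on either side of the roughly 4000 code point threshold mentioned above, a document of a chosen size can be generated with core Ruby alone. This is a sketch under assumptions: the element names come from the original report, while the `x` filler and the name `build_utf16_fixture` are made up for illustration. No BOM is written, since the report does not say whether the original file had one, and whether this string triggers the bug on a given Nokogiri version is not verified here.

```ruby
# Hypothetical fixture builder; element names are taken from the issue,
# the "x" filler and the method name are assumptions for illustration.
def build_utf16_fixture(cdata_chars)
  [
    %(<?xml version="1.0" encoding="UTF-16" ?>),
    "<Foo>",
    "  <Bar><![CDATA[#{'x' * cdata_chars}]]></Bar>",
    "  <Jim>Oh! Hello there.</Jim>",
    "</Foo>",
    ""
  ].join("\n").encode("UTF-16LE")   # transcodes only; no BOM is added
end

fixture = build_utf16_fixture(3906)
puts fixture.encoding   # UTF-16LE
puts fixture.length     # length in code points; above 4000 for this input
```

Passing a smaller `cdata_chars` gives a document below the threshold, which matches the observation in the report that deleting CDATA text made the output correct.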

flavorjones added this to the v1.14.0 milestone Jan 23, 2022

flavorjones (Member) commented

Fixed by #2434; the fix will be in the next minor release of Nokogiri (v1.14.0).

flavorjones (Member) commented

Also see the related #2447.
