Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding of special characters is not consistent across ruby versions #41

Open
kashook opened this issue Jun 18, 2014 · 3 comments
Open

Comments

@kashook
Copy link

kashook commented Jun 18, 2014

Consider the following ruby script:

require 'gyoku'
puts "Ruby Version: #{RUBY_VERSION}"
puts Gyoku.xml(:special_chars => %q{&"<>'})

When executed on Ruby 1.9.3, it produces this output:

Ruby Version: 1.9.3
<specialChars>&amp;&quot;&lt;&gt;'</specialChars>

When executed on Ruby 2.0.0, it produces this output:

Ruby Version: 2.0.0
<specialChars>&amp;&quot;&lt;&gt;&#39;</specialChars>

The apostrophe character is not encoded at all on Ruby 1.9.3, but on ruby 2.0 it is converted to &#39;. The reason appears to be because of this line. The behavior of the CGI.escapeHTML function seems to have changed from Ruby 1.9.3 to Ruby 2.0. Here is an example.

This script:

require 'cgi'
string = %q{<>&"'}
puts "Ruby Version: #{RUBY_VERSION}"
puts "Original: #{string}"
puts "Encoded: #{CGI.escapeHTML(string)}"

produces this output on Ruby 1.9.3:

Ruby Version: 1.9.3
Original: <>&"'
Encoded: &lt;&gt;&amp;&quot;'

but this output on Ruby 2.0:

Ruby Version: 2.1.1
Original: <>&"'
Encoded: &lt;&gt;&amp;&quot;&#39;

It seems like the output should not change depending on the ruby version. In addition, if an apostrophe is going to be encoded, it seems like it would be more correct to encode it as &apos; instead of &#39;. In fact, encoding it as &#39; is directly causing me a problem here. :)

@tjarratt
Copy link
Contributor

Ouch, that's a painful bug. Thank you so much for investigating this issue so thoroughly and explaining it so succinctly, @keiths-osc !

I'd rather not drop support for ruby 1.9.3 yet (it's not close to End of Life yet), so is there a better workaround than switching on RUBY_VERSION in the code here? I haven't looked too exhaustively, but I couldn't find anything outside of CGI.escape_html in the standard lib.

@kashook
Copy link
Author

kashook commented Jun 19, 2014

@tjarrat, I've been giving your question (and this issue in general) some more thought. There is something else I noticed. When producing xml with attributes, I noticed that the special characters are handled consistently across ruby versions:

This script

require 'gyoku'
puts "Ruby Version: #{RUBY_VERSION}"
puts Gyoku.xml(:special_chars => %q{&"<>'})
puts Gyoku.xml({ :test => 'something', :attributes! => { :test => { :chars => %q{&"<>'} } } })

outputs this on 1.9.3

Ruby Version: 1.9.3
<specialChars>&amp;&quot;&lt;&gt;'</specialChars>
<test chars="&amp;&quot;&lt;&gt;'">something</test>

and this on 2.0

Ruby Version: 2.0.0
<specialChars>&amp;&quot;&lt;&gt;&#39;</specialChars>
<test chars="&amp;&quot;&lt;&gt;'">something</test>

In the chars attribute, the apostrophe is just left as an apostrophe. It appears that it is actually the Builder gem that is handling the encoding for the attribute values instead of any code in gyoku. The encoding appears to be done by a call to this method, which then ends up calling this method. The XChar thing that the latter is using is located here. I believe that the encoding of element values within Builder itself are handled by the same escape method. When builder encodes element values, here is how they look:

require 'builder'
puts "Ruby Version: #{RUBY_VERSION}"
xmlbuilder = Builder::XmlMarkup.new
puts xmlbuilder.test(%q{"<>&'})

I get this on 1.9.3:

Ruby Version: 1.9.3
<test>"&lt;&gt;&amp;'</test>

and this on 2.0:

Ruby Version: 2.0.0
<test>"&lt;&gt;&amp;'</test>

I wonder if the best bet would be for gyoku to encode the special characters for element values in the exact same way builder does (perhaps by directly using the XChar thing from Builder) instead of using CGI.escapeHTML?

The results are consistent across Ruby versions, and my gut feeling is that the Builder way of handling the encoding for XML is maybe more correct than what CGI.escapeHTML does. Of course, Builder doesn't encode the double quote character (where gyoku currently does), but technically it doesn't need to be if I understand XML correctly.

@tjarratt
Copy link
Contributor

Sorry I've been so quiet lately --- I'm going to need some time to think about this and explore the ramifications of that change. I think we'd need some stronger integration tests further upstream in Savon to verify that this change doesn't break existing use cases.

With ruby 1.9.3 reaching EOL in a few months I wonder how that affects this issue. We could keep the behavior from 1.9.3, but anyone relying on the new behavior would be very surprised (to say the least).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants