-
-
Notifications
You must be signed in to change notification settings - Fork 732
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for iso-8859-1, utf8 "sentinel" and numeric entities #268
Conversation
…so-8859-1, eg. ☺ => ☺. This is what browsers send when the user tries to submit characters that aren't available in the charset being used to submit the form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also add a utf8Sentinel
option to stringify
that can produce the utf8
param.
lib/parse.js
Outdated
} | ||
if (has.call(obj, key)) { | ||
obj[key] = [].concat(obj[key]).concat(val); | ||
if (key === 'utf8' && options.utf8Sentinel) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the utf-8
key would only matter with the utf8Sentinel
option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the feature is a bit opinionated, so I figured that we'd treat utf8=...
as a regular parameter when it's not turned on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the user be able to customize the name of this parameter or is this following some specific standard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not an actual standard, but utf8=✓
seemed to be a de facto standard back when I made this originally. Back when the internet was still in the process of switching to utf-8 and iso-8859-1 was more prevalent :)
- https://softwareengineering.stackexchange.com/questions/168751/is-the-use-of-utf8-preferable-to-utf8-true
- Ruby on rails added it to all forms at some point, I don't know if they still do
- I saw it in Google Search urls, that was how I became aware of the trick. They don't use it anymore, though, not even if I make a search in IE6 on Windows XP.
I don't think it'd be relevant to let the user customize the parameter name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. Can you add this to the readme somewhere? For example how will folks know how to use it, especially if it's not an actual standard they may have learned about elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to make true
mean "utf8"
and a check mark, but make an object able to override either or both? (we'd still want it in the readme)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will get back to this soon and add docs etc. Still not sure I see the value of customizing the parameter name etc. Feels like a "when in doubt, leave it out" scenario until someone actually requests it.
I just noticed that github is also using utf8=✓
. If you go to "Files changed" in this PR, then change the diff settings, you'll end up at a url like this:
https://github.com/ljharb/qs/pull/268/files?utf8=%E2%9C%93&diff=unified
... so RoR probably still does it by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that making it maximally configurable right out of the gate, but also pleasant for the common defaults, is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a bit of docs in 452008d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure I get why maximum configurability for the utf8Sentinel
parameter would be a good thing when we don't know that anyone will ever need it. That makes it easy to get the configuration parameters wrong, but at that point there'd be no going back. Also it'd add complexity because we'd need to dynamically compute the misinterpreted-as-iso-8859-1 versions of it.
lib/parse.js
Outdated
if (has.call(obj, key)) { | ||
obj[key] = [].concat(obj[key]).concat(val); | ||
if (key === 'utf8' && options.utf8Sentinel) { | ||
if (val === '✓' || val === '\xe2\x9c\x93') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is '\xe2\x9c\x93'
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a string representation of the 3 octets in the utf-8 encoding of the ✓
character.
decodeURIComponent('%e2%9c%93')
'✓'
Checking for it here makes sure that we will switch into utf-8 mode when the parameter was parsed as iso-8859-1 because of defaultCharset
. It's what makes this test pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a standard for using this character? Should it be customizable, especially if there lacks a standard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
lib/parse.js
Outdated
if (key === 'utf8' && options.utf8Sentinel) { | ||
if (val === '✓' || val === '\xe2\x9c\x93') { | ||
charset = 'utf-8'; | ||
} else if (val === '✓') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the "numeric encoding" of ✓
.
This is what browsers will submit in an application/x-www-form-urlencoded
body when the encoding of the page containing the form is iso-8859-1
, or when the submitted form has an accept-charset
attribute of iso-8859-1
. Presumably also with other charsets that does no contain the ✓
character, such as us-ascii
.
Thus we can use it as an indicator that the rest of the query string or application/x-www-form-urlencoded
is iso-8859-1
.
This behavior is covered by these two tests:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we store these kinds of values in named variables, so that they're self-documenting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, 7aeb51e
lib/utils.js
Outdated
var decode = function (str, decoder, charset) { | ||
var strWithoutPlus = str.replace(/\+/g, ' '); | ||
if (charset === 'iso-8859-1') { | ||
return unescape(strWithoutPlus); // Cannot throw |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you enumerate how unescape
differs from decodeURIComponent
with this charset? when would the latter throw and the former do the correct thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think unescape
will decode the format %uXXXX in addition to the standard %XX format. This seems like an undeserved behavior, and at best just weird that the %uXXXX format will work if you use latain 1 but not if you use utf8. It should probably accept %uXXXX in both or neither.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unescape
is a legacy global function that converts string => string, interpreting percent-encoded octets to be in iso-8859-1. Percent not followed by two hex chars are left as-is, and I have never succeeded in getting it to throw for any string value, so that's why it doesn't seem necessary to put it in a try...catch.
decodeURIComponent
is similar, but interprets the percent-encoded octets as utf-8. It'll throw a URIError
if percent is not followed by two hex chars, or if sequences of percent-encoded octets don't form a valid utf-8 character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Percent not followed by two hex chars are left as-is
This is not correct, see my comment above yours about %uXXXX . Can you make it not decode that format and only touch the %XX ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right. I had forgotten about that. Fixed in bd9bc06.
test/parse.js
Outdated
}); | ||
|
||
t.test('prefers an utf-8 charset specified by the utf8 sentinel to a default charset of iso-8859-1', function (st) { | ||
st.deepEqual(qs.parse('utf8=%E2%9C%93&%C3%B8=%C3%B8', { utf8Sentinel: true, charset: 'iso-8859-1' }), { ø: 'ø' }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it'd be great if we could use named variables to explain what these values are, and why they're significant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, attempted in papandreou@afb4a16
You're right, I didn't think of |
Do you think adding |
I never realized that Github used this, that is neat. I was just playing around and the only difference it seems from this implementation and Github is that this requires that utf8 be the first parameter but Github doesn't seem to have that restriction. |
But I didn't try very much yet besides some url tweaks in Chrome. |
Means that it doesn't have to be the first parameter to affect the parsing of the entire query string. ljharb#268 (comment)
Means that it doesn't have to be the first parameter to affect the parsing of the entire query string. ljharb#268 (comment)
@dougwilson, you're right, I also thought about that the other day. It's something that wasn't necessary for my original implementation as I had complete control of the Fixed in 603c49a. |
I implemented support for |
lib/parse.js
Outdated
} else if (parts[i] === isoSentinel) { | ||
charset = 'iso-8859-1'; | ||
} | ||
parts.splice(i, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oof, can splice be avoided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess we can store a skipIndex
in a variable and use it to ignore the sentinel argument in the second loop? Does that sound better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
splice mutates, is O(n), and is the singularly most complex API method in JavaScript, so if it's not an imposition, yes that would be great :-D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 4e65517.
lib/parse.js
Outdated
// the form is iso-8859-1, or when the submitted form has an accept-charset | ||
// attribute of iso-8859-1. Presumably also with other charsets that do not contain | ||
// the ✓ character, such as us-ascii. | ||
var isoSentinel = 'utf8=' + encodeURIComponent('✓'); // %26%2310003%3B |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this is fine, but i'd also be ok with the literal hardcoded value and the commented part being the code that would generate it at runtime)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 228bf8e.
lib/parse.js
Outdated
if (options.utf8Sentinel) { | ||
for (i = 0; i < parts.length; ++i) { | ||
if (parts[i].indexOf('utf8=') === 0) { | ||
if (parts[i] === utf8Sentinel) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logic seems like it could be if parts[i] === utf8Sentinel, else if parts[i] === isoSentinel, else if parts[i].indexOf('utf8=') === 0
as well (i might be prematurely optimizing, or maybe it would be cleaner, i'm not sure)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we'd need to introduce a foundSentinel
variable that gets set in each of the three cases so that we can break out of the loop? Or put the skipIndex=i; i=parts.length;
bit in all three cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... We could also define that a utf8=...
parameter we don't explicitly recognize as a checkmark in any character set should just be parsed as a regular parameter? I'm in doubt myself.
var parseValues = function parseQueryStringValues(str, options) { | ||
var obj = {}; | ||
var cleanStr = options.ignoreQueryPrefix ? str.replace(/^\?/, '') : str; | ||
var limit = options.parameterLimit === Infinity ? undefined : options.parameterLimit; | ||
var parts = cleanStr.split(options.delimiter, limit); | ||
var charset = options.charset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we throw an exception here if this value is anything but undefined, iso-8859-1, or utf-8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Fixed in 8deaab7.
lib/parse.js
Outdated
for (i = 0; i < parts.length; ++i) { | ||
if (parts[i].indexOf('utf8=') === 0) { | ||
if (parts[i] === utf8Sentinel) { | ||
charset = 'utf-8'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if the user passes in iso-8859-1 and this logic sets it to utf-8? should that be allowed, or should this logic be skipped when charset is passed, or should it throw when they're different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's exactly the intention -- that the character set inferred from the sentinel can should take precedence. When you specify both a charset
and utf8Sentinel:true
, the charset
acts as more of a default charset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm. when an explicit charset is provided, it seems to me that the user is saying "i don't care what you think it is, treat it like it's this" - if it's a default charset, i'd expect it to be called defaultCharset
when used along with the utf8 sentinel option. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yeah, I see the point, even though adding another option also contributes to the complexity.
Then qs.parse(..., { utf8Sentinel: true, charset: 'iso-8859-1' })
would throw because that combination of options don't make sense?
Another concern is the symmetry with qs.stringify
. Here qs.stringify(..., { utf8Sentinel: true, charset: 'iso-8859-1'})
does make sense. It'd be a bit weird to force defaultCharset
for parse
, but not stringify
.
On balance I think it'd be best to keep it as charset
in all cases, but very explicitly point out how the charset
+utf8Sentinel
combo is interpreted in the utf8Sentinel
documentation. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm. consistency between parse and stringify is important, for sure.
I think maybe parse should indeed throw on { utf8Sentinel: true, charset: 'anything that's not utf-8' }
, although that makes sense for stringify (where it should work) - and then we can keep the option called "charset".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would prohibit my original use case for these additions: A service compatible with the ancient FormMail.pl
script (originally written in the nineties). It has to assume iso-8859-1
because of all the ancient HTML snippets on iso-8859-1 pages pointed at it (with no sentinel).
However, more “modern” pages can explicitly switch to utf-8 mode by including the sentinel.
This requires the utf8Sentinel: true, charset: 'iso-8859-1'
combo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahhh that makes sense.
alrighty, thanks for bearing with me on the discussion. let's keep what you've got.
I've addressed or replied to all the feedback, looking forward to further discussions :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to rebase this down a bit and take another pass locally, but I think your work is done :-) thanks for bearing with me on the back-and-forth!
@papandreou it would be nice to pull in the changes here - want to do that in this PR, or in a separate one? |
Both done now. Looking forward to see this merged so I can move on to body-parser ;) @ljharb @dougwilson, thanks for the thorough reviews! |
Body-parser only accepts options that work in both qs and querystring equally. |
@dougwilson, hmm, that's bad news for completing the work laid out in expressjs/body-parser#194 :( |
I'm not sure what the implementation would look like but I'm not sure it's impossible to do at all. What I think you implemented here was a two pass solution, right? First look through the parameters for sentinal and second decode. I believe you should be able to do those things already with querystring, yes? It supports a custom decoding function. |
Basically unless I'm missing something here, the current implementation of the parse changes in this PR could be implemented by a user without ever have modifying the qs module at all. |
@dougwilson, damn, I think you're right. That must have been added after I did the original work. I should probably rethink the whole thing then. |
Yea, I'm only thinking of the current implementation, of course. I think once it transformed into a two pass is when it changed. Now, don't get me wrong, it's still useful being part of qs directly if it lands as folks can use it directly (especially for query strings -- body parser only operates on the request entity not the url). It could also just pass to the native impl for qs and use a wrapper for querystring. I believe body parser is doing that currently for at least one of the options (wrapping one of the impls that is different). |
@dougwilson, seems like that worked, I've opened a PR here: expressjs/body-parser#326 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I should have this merged and released by this weekend.
README.md
Outdated
var sentinel = qs.stringify({ a: '☺' }, { utf8Sentinel: true }); | ||
assert.equal(sentinel, 'utf8=%E2%9C%93&a=%E2%98%BA'); | ||
|
||
var isoSentinel = qs.stringify({ a: 'æ' }, { utf8Sentinel: true, charset: 'iso-8859-1' }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, this seems a little strange - why would you want the sentinel if you were encoding in “not utf-8”?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Sorry about my earlier comment, I thought this was in qs.parse
:)
It might be a slim use case, but including a sentinel that explicitly says that the query string is in iso-8859-1
could be useful in some scenarios. Obviously it's much more useful in the inverse case, because all of this has to do with legacy services that just assume it's iso-8859-1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, but the option is called utf8Sentinel
- why would that appear when it's not utf-8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only because the parameter is called utf8
. I agree that the name of the option isn’t ideal, but couldn’t come up with anything else. I’ve tried googling it to see if there’s an established name for the concept, but nothing turned up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
charsetSentinel? encodingSentinel?
When you say "could be useful", is this a use case you've ever seen anywhere? Or are we just planning for something that might happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like charsetSentinel
better, let's go with that instead, fixed in 2efe555.
Well, the test suite for that FormMail.pl
-compatible service could have used the ability to encode a query string and an application/x-www-form-urlencoded
body with an iso-8859-1 sentinel, as that's one of the cases that explicitly had to be supported.
In other words, it's useful whenever you want to emulate what a browser sends when a form containing utf8=✓
is submitted. So mostly in test suites for web servers that cannot assume that everything is utf-8.
lib/utils.js
Outdated
@@ -124,6 +130,12 @@ var encode = function encode(str) { | |||
|
|||
var string = typeof str === 'string' ? str : String(str); | |||
|
|||
if (charset === 'iso-8859-1') { | |||
return escape(string).replace(/%u[0-9a-f]{4}/gi, function ($0) { | |||
return '%26%23' + parseInt($0.substr(2), 16) + '%3B'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should use slice; substr is nonstandard or deprecated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good shout, fixed in 6ff0e39
2efe555
to
0226bec
Compare
Means that it doesn't have to be the first parameter to affect the parsing of the entire query string. ljharb#268 (comment)
0226bec
to
4a8d202
Compare
Means that it doesn't have to be the first parameter to affect the parsing of the entire query string. ljharb#268 (comment)
4a8d202
to
7bcf2dd
Compare
…326) * urlencoded, in extended mode: Support iso-8859-1 encoded requests, and also accept iso-8859-1 as a default encoding. * urlencoded: Support an utf8 sentinel to detect the charset. * Pass the interpretNumericEntities option through to qs. * Fix lint * Support charsets, sentinels etc. via custom decoders Works in both extended and simple mode. * Simplify * Fix empty parameter issue with utf8Sentinel in simple mode * Run all the charset/sentinel tests in both extended and simple modes * utf8Sentinel => charsetSentinel ljharb/qs#268 (comment) * Update qs to 6.9.1 * Always use the qs module, even in simple mode #326 (comment) * Create the simple and extended parser with the same function, reducing duplication * Don't mention the querystring module in the README * Fix lint * Update qs to 6.9.4 * Consistently call it "utf8 sentinel" * Simplify by relying on the qs module's support for detecting the charset #326 (comment) * Simplify further * Put back debug option * Document that defaultCharset defaults to utf-8 #326 (comment)
Context: expressjs/body-parser#194
I've rebased the branch now, and the tests pass.
Looking at the documentation now I learn that support for custom encoders and decoders has landed in the mean time. I'm not quite sure if I should try to change this PR to use that, or if it's OK to have
iso-8859-1
as a "first class" charset.A point in favor of having built-in support is that all the heavy lifting is done by the arcane
unescape
function so it doesn't require iconv-lite or iconv.I don't think it would be possible to implement the charset switching via the utf8 sentinel within that, as it requires state keeping between the parsing of the individual params.