url: fast path ascii domains, do not run ToASCII #13030

zimbabao · 2017-05-15T00:41:43Z

To match browser behavior, fast-path ASCII-only domains and
do not run ToASCII on them.

Fixes: #12965
Refs: #12966
Refs: whatwg/url#309

Checklist

make -j4 test (UNIX), or vcbuild test (Windows) passes
tests and/or benchmarks are included
documentation is changed or added
commit message follows commit guidelines

Affected core subsystem(s)


TimothyGu

LGTM other than the nits below:

/cc @domenic, @annevk

TimothyGu · 2017-05-15T02:53:32Z

src/node_url.cc

+static inline bool IsAllASCII(std::string* input) {
+  for (size_t n = 0; n < input->size(); n++) {
+    const char ch = (*input)[n];
+    if (ch & 0x80) {


Please declare a new CHAR_TEST in the dedicated section above (grep CHAR_TEST to find where it is):

// https://infra.spec.whatwg.org/#ascii-code-point
CHAR_TEST(8, IsASCIICodePoint, (ch >= '\0' && ch <= '\x7f'))

And use IsASCIICodePoint(ch) here.

@TimothyGu: Made all the suggested changes.
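For readers following along, here is a minimal standalone sketch of the reviewed helper with both suggestions applied: a code-point predicate and a `const std::string&` parameter. It does not use Node's `CHAR_TEST` macro. The plain inline function below is only a stand-in for what that macro would generate, and the names simply mirror the ones used in this thread.

```cpp
#include <string>

// https://infra.spec.whatwg.org/#ascii-code-point
// Stand-in for the CHAR_TEST-generated predicate suggested above.
inline bool IsASCIICodePoint(unsigned char ch) {
  return ch <= 0x7f;
}

// Returns true when every byte of input is an ASCII code point.
inline bool IsAllASCII(const std::string& input) {
  for (const char ch : input) {
    if (!IsASCIICodePoint(static_cast<unsigned char>(ch)))
      return false;
  }
  return true;
}
```

Taking the argument by const reference lets the call site pass `decoded` directly instead of `&decoded`.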

TimothyGu · 2017-05-15T02:54:05Z

src/node_url.cc

@@ -829,6 +829,16 @@ static url_host_type ParseOpaqueHost(url_host* host,
  return type;
 }

+static inline bool IsAllASCII(std::string* input) {


You can use a const std::string& here instead of std::string* too.

Perhaps we can use contains_non_ascii() from string_bytes.cc instead as an optimization?

@mscdex: That function is inside an anonymous namespace and is static. Do you know offhand of a place where we could refactor it into a utility?

@mscdex: I checked the function; it optimizes for length >= 16. Most domain names will be shorter, so will it help? We can try it and benchmark.

Well if nothing else it's good to reuse existing functionality where possible. That way if further optimizations are ever made to that function, everyone benefits automatically.

Actually, it may be even better to just fold the checking and lowercasing into the same function and loop (reading and lowercasing 4 bytes at a time as long as possible). That way you're only iterating over the string once.

@mscdex: I made changes to use the existing contains_non_ascii.

I will look into folding checking and lowercasing into one.

@mscdex: Folding both checking and lowercasing into one function will result in either:

Code duplication, since ContainsNonAscii is used in one more place, or

A function pointer as an argument, which is a separation-of-concerns problem. IIRC, since the function pointer refers to a statically known function, the call will get inlined.

What's your opinion on this?

I made changes to re-use ContainsNonAscii. Have a look at that when you get time.
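As an illustration of the "fold checking and lowercasing into one loop" idea discussed above, a byte-at-a-time sketch could look like this. `LowercaseWhileASCII` is a hypothetical name, not a function from the PR, and the sketch assumes it is acceptable for a prefix of the string to be lowercased already when a non-ASCII byte is found and the caller falls back to ToASCII.

```cpp
#include <string>

// Hypothetical helper: lowercases ASCII letters in place and reports whether
// the whole string was ASCII. On a false return, bytes before the first
// non-ASCII byte may already have been lowercased.
inline bool LowercaseWhileASCII(std::string* s) {
  for (char& ch : *s) {
    const unsigned char u = static_cast<unsigned char>(ch);
    if (u & 0x80)
      return false;  // non-ASCII byte: caller falls back to ToASCII
    if (u >= 'A' && u <= 'Z')
      ch = static_cast<char>(u | 0x20);  // set the ASCII case bit
  }
  return true;
}
```

This iterates over the string once, at the cost of mixing the two concerns raised in the reply above.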

TimothyGu · 2017-05-15T02:54:22Z

src/node_url.cc

+  // Match browser behavior for ASCII only domains
+  // and do not run them through ToASCII algorithm.
+  if (IsAllASCII(&decoded)) {
+    // Lowercase aschii domains


TimothyGu · 2017-05-15T02:54:39Z

src/node_url.cc

+      decoded[n] = std::tolower(decoded[n]);
+    }
+  } else {
+    // Then we have to punycode toASCII


s/punycode/Unicode IDNA/

TimothyGu · 2017-05-15T02:55:16Z

src/node_url.cc

+  if (IsAllASCII(&decoded)) {
+    // Lowercase aschii domains
+    for (size_t n = 0; n < decoded.size(); n++) {
+      decoded[n] = std::tolower(decoded[n]);


ASCIILowercase

mscdex · 2017-05-15T03:36:49Z

src/node_url.cc

+  // and do not run them through ToASCII algorithm.
+  if (IsAllASCII(&decoded)) {
+    // Lowercase aschii domains
+    for (size_t n = 0; n < decoded.size(); n++) {


For converting to lowercase, consider converting multiple bytes at a time instead of 1 by 1. Here is one good example of how to do that.

I'm not sure if this is that performance-sensitive, and even if it is, C-level SIMD optimization still sounds like overkill to me.

Technically it's not "proper" SIMD (e.g. SSE) unless the compiler is somehow smart enough to automatically convert it to that. Anyway, I still think it would be worth benchmarking...
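The example linked in the comment above is not preserved here, so the following is only a sketch of one common way to lowercase several bytes at a time with plain integer arithmetic (sometimes called SWAR). `ASCIILowercaseWordwise` is a hypothetical name; whether this beats a simple loop for short domain names would need benchmarking, as noted in the thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: lowercases ASCII letters eight bytes at a time.
// Bytes with the high bit set (non-ASCII) are left unchanged.
inline void ASCIILowercaseWordwise(std::string* s) {
  constexpr uint64_t kHigh = 0x8080808080808080ULL;
  char* p = &(*s)[0];
  size_t n = s->size();
  while (n >= sizeof(uint64_t)) {
    uint64_t w;
    std::memcpy(&w, p, sizeof(w));  // avoids unaligned loads
    const uint64_t low7 = w & ~kHigh;
    // Per byte, the high bit ends up set when the 7-bit value is
    // >= 'A' (first sum) or > 'Z' (second sum); sums never carry across bytes.
    const uint64_t ge_A = low7 + 0x3f3f3f3f3f3f3f3fULL;
    const uint64_t gt_Z = low7 + 0x2525252525252525ULL;
    // Uppercase letters: >= 'A', not > 'Z', and original byte was ASCII.
    const uint64_t upper = (ge_A ^ gt_Z) & ~w & kHigh;
    w |= upper >> 2;  // 0x80 >> 2 == 0x20, the ASCII case bit
    std::memcpy(p, &w, sizeof(w));
    p += sizeof(w);
    n -= sizeof(w);
  }
  for (; n > 0; ++p, --n) {  // remaining tail, one byte at a time
    if (*p >= 'A' && *p <= 'Z')
      *p = static_cast<char>(*p | 0x20);
  }
}
```

As the reply above points out, this is not SIMD in the SSE sense; it only packs eight bytes into one 64-bit integer operation.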

joyeecheung · 2017-05-15T05:03:22Z

CI: https://ci.nodejs.org/job/node-test-pull-request/8077/

joyeecheung · 2017-05-15T05:06:48Z

test/parallel/test-whatwg-url-domainto.js

@@ -35,6 +36,13 @@ const tests = require('../fixtures/url-idna.js');
 }

 {
+  for (const [i, { ascii, unicode }] of testsHyphenDomains.valid.entries()) {
+    assert.strictEqual(ascii, domainToASCII(unicode),


If it's only for testing that those domains won't get converted, maybe just make the test cases an array of strings and assert.strictEqual(domain, domainToASCII(domain))?

Oh, never mind, it is supposed to check that uppercase characters are converted to lowercase ones... (are there any in the test cases?)

I had made the change; reverting it. Thanks for catching that.

joyeecheung · 2017-05-15T05:11:42Z

src/node_url.cc

+  if (IsAllASCII(decoded)) {
+    // Lowercase ASCII domains
+    for (size_t n = 0; n < decoded.size(); n++) {
+      decoded[n] = ASCIILowercase(decoded[n]);


Another optimization would be testing if the decoded domain contains only lowercase characters first. If it does, just use the original string and avoid assignments.

joyeecheung · 2017-05-15T05:13:11Z

test/parallel/test-whatwg-url-domainto.js

@@ -35,6 +36,13 @@ const tests = require('../fixtures/url-idna.js');
 }

 {
+  for (const [i, { ascii, unicode }] of testsHyphenDomains.valid.entries()) {
+    assert.strictEqual(ascii, domainToASCII(unicode),


Oh, never mind, it is supposed to check that uppercase characters are converted to lowercase ones... (are there any in the test cases?)

TimothyGu · 2017-05-15T06:16:05Z

src/node_url.cc

@@ -829,6 +834,16 @@ static url_host_type ParseOpaqueHost(url_host* host,
  return type;
 }

+static inline bool IsAllASCII(const std::string& input) {


Have you seen #13030 (comment)?

TimothyGu · 2017-05-15T06:18:05Z

src/node_url.cc

+  if (IsAllASCII(decoded)) {
+    // Lowercase ASCII domains
+    for (size_t n = 0; n < decoded.size(); n++) {
+      if (!IsLowerCaseASCII(decoded[n])) {


No need for this check; ASCIILowercase will do nothing if the code unit is already lower case or not a letter.

Won't this prevent idempotent writes to lowercase chars? @joyeecheung, that was the intent, right?

I was actually talking about something like

bool needs_lowercase = false;
for (size_t n = 0; n < decoded.size(); n++) {
  if (!IsLowerCaseASCII(decoded[n])) {
    needs_lowercase = true;
    break;
  }
}
if (needs_lowercase) {
  // convert decoded
}
// otherwise no need to mutate decoded, which should be the common case
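A self-contained version of this scan-then-mutate idea might look like the sketch below. The definition of `IsLowerCaseASCII` in the PR is not shown in this thread, so the sketch uses a hypothetical `IsASCIIUpper` predicate instead, on the assumption that digits, dots, and hyphens should not trigger a rewrite. `LowercaseIfNeeded` is likewise a made-up name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical predicate: true only for 'A'..'Z'.
inline bool IsASCIIUpper(char ch) {
  return ch >= 'A' && ch <= 'Z';
}

// Lowercases ASCII letters only if at least one uppercase letter is present.
// Returns true if the string was modified.
inline bool LowercaseIfNeeded(std::string* decoded) {
  std::string& s = *decoded;
  size_t n = 0;
  // Read-only scan up to the first uppercase letter.
  while (n < s.size() && !IsASCIIUpper(s[n]))
    n++;
  if (n == s.size())
    return false;  // common case: already lowercase, no writes at all
  // Resume from the first uppercase letter and convert the rest.
  for (; n < s.size(); n++) {
    if (IsASCIIUpper(s[n]))
      s[n] = static_cast<char>(s[n] | 0x20);
  }
  return true;
}
```

For an already-lowercase domain such as `example.com`, the function only reads the string and never assigns to it.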

TimothyGu · 2017-05-15T14:40:18Z

src/string_utils.cc

+      return false;
+  }
+
+  bool ContainsNonAscii(const char* src, size_t len) {


Should this be inlined?

Done, along with other changes.
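For context, a header-friendly inline check in the spirit of the function being discussed could be sketched as follows. This is not the Node.js implementation from `string_bytes.cc` or from this PR; `ContainsNonAsciiSketch` is a hypothetical name, and the eight-byte stride is an assumption chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: returns true if any byte in [src, src + len) has its
// high bit set, testing eight bytes per iteration where possible.
inline bool ContainsNonAsciiSketch(const char* src, size_t len) {
  constexpr uint64_t kHigh = 0x8080808080808080ULL;
  while (len >= sizeof(uint64_t)) {
    uint64_t w;
    std::memcpy(&w, src, sizeof(w));  // avoids unaligned loads
    if (w & kHigh)
      return true;
    src += sizeof(w);
    len -= sizeof(w);
  }
  for (; len > 0; ++src, --len) {  // remaining tail, one byte at a time
    if (static_cast<unsigned char>(*src) & 0x80)
      return true;
  }
  return false;
}
```

Marking the function `inline` allows the definition to live in a header without violating the one-definition rule when several translation units include it.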

TimothyGu · 2017-05-15T14:40:35Z

src/node_url.cc

+  if (!stringutils::ContainsNonAscii(buf, strlen(buf))) {
+    // Lowercase ASCII domains
+    for (size_t n = 0; n < decoded.size(); n++) {
+      if (!IsLowerCaseASCII(decoded[n])) {


Again, the IsLowerCaseASCII is not needed.

TimothyGu · 2017-05-15T14:44:11Z

src/string_utils.h

+
+#include "env.h"
+#include "env-inl.h"
+#include "util.h"


My guess is that some of these includes may not be needed.

addaleax · 2017-05-19T21:43:41Z

ping @TimothyGu, have your comments been addressed?

zimbabao · 2017-05-19T21:47:43Z

@addaleax: I have another PR for the same problem that follows the very latest changes in the spec (#12966). I'll close the current one. /cc @TimothyGu

url: fast path ascii domains, do not run ToASCII

8233c34

nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. dont-land-on-v4.x whatwg-url Issues and PRs related to the WHATWG URL implementation. labels May 15, 2017

zimbabao mentioned this pull request May 15, 2017

url: ignore IDN errors when domainname have two hyphens #12966

Closed

TimothyGu approved these changes May 15, 2017

mscdex reviewed May 15, 2017

url: fast path ascii domains, do not run ToASCII

35e901d

joyeecheung reviewed May 15, 2017

zimbabao added 2 commits May 14, 2017 22:35

url: fast path ascii domains, do not run ToASCII

a13c377

url: fast path ascii domains, do not run ToASCII

d816988

TimothyGu suggested changes May 15, 2017

url: fast path ascii domains, do not run ToASCII

e86297c

TimothyGu reviewed May 15, 2017

zimbabao force-pushed the fastpath-ascii-domains-2 branch 2 times, most recently from 0cdbd66 to e0a9a33 Compare May 16, 2017 15:12

url: fast path ascii domains, do not run ToASCII

861604f

zimbabao force-pushed the fastpath-ascii-domains-2 branch from e0a9a33 to 861604f Compare May 16, 2017 15:27

mscdex added the performance Issues and PRs related to the performance of Node.js. label May 16, 2017

zimbabao closed this May 19, 2017
