Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

url: Improve url module performance #3 #2303

Closed
wants to merge 5 commits into from

Conversation

STRML
Copy link
Contributor

@STRML STRML commented Aug 5, 2015

This is a rebase of #1650, which is a stripped version of #1561 that does not break any compat.

As @petkaantonov commented he's short on time, so I went ahead and rebased the PR so it can be merged.

Thanks @petkaantonov for your hard work, hoping this can make it into iojs soon.

@STRML STRML changed the title Faster url parser 3 url: Improve url module performance #3 Aug 5, 2015
@Fishrock123
Copy link
Contributor

@STRML do you think you could re-add the whatwg tests ala https://github.com/nodejs/io.js/pull/1561/files#diff-5632d0587131eb38265a594bcc1bdf6dR2921?

@Fishrock123 Fishrock123 added the url Issues and PRs related to the legacy built-in url module. label Aug 5, 2015
@STRML
Copy link
Contributor Author

STRML commented Aug 5, 2015

@Fishrock123 That's doable, however the overrides are failing - I assume this is because the new PR has decided not to break any backcompat?

@Fishrock123
Copy link
Contributor

@STRML Could be, maybe see if the whatwg tests pass or not without the url improvements?

@STRML
Copy link
Contributor Author

STRML commented Aug 5, 2015

It appears that even without the overrides, a few are failing. Subtle things, like:

AssertionError: 'http://2001::1/' == 'http://2001:1/:'

or

AssertionError: 'http://[1::2]:3:4/' == 'http://[1::2]/:3:4'

Without these changes (on master), most of the overrides fail as expected. Yet, two of the "nodejs" entries fail (which actually happen to match the override):

AssertionError: 'file:/example.com/' == 'file://example.com/'
AssertionError: 'file:///example.com/' == 'file://example.com/'

So it appears the behavior has changed subtly since #1561 was written.

@Fishrock123
Copy link
Contributor

@STRML right. The important thing is that it works like it did before applying the patches though, so that's the point we need to test from. :)

@STRML
Copy link
Contributor Author

STRML commented Aug 5, 2015

Yes. There are some differences. Should I enumerate them?

@Fishrock123
Copy link
Contributor

@STRML Yes please. They may (probably) indicate differences we may need to fix.

@STRML
Copy link
Contributor Author

STRML commented Aug 5, 2015

Behavior changes from master are below. I've marked the ones that match an override from the original PR (indicating that they are intended / better behavior) with a ✓,

{base} + {input}
{error}: {actual (#2303)} == {expected (master)}
----

"http://example.org/foo/bar" + "http://[1::2]:3:4",
✓ AssertionError: 'http://[1::2]:3:4/' == 'http://:4/[1::2]:3' 

"http://example.org/foo/bar" + "http://2001::1",
AssertionError: 'http://2001::1/' == 'http://2001:1/:'

"http://example.org/foo/bar" + "http://2001::1]",
AssertionError: 'http://2001::1/]' == 'http://2001/::1]'

"http://example.org/foo/bar" + "http://2001::1]:80",
✓ AssertionError: 'http://2001::1/]:80' == 'http://2001:80/::1]' 

"file:///tmp/mock/path" + "  File:c|////foo\\bar.html"
✓ AssertionError: 'file:///tmp/mock/c%7C////foo/bar.html' == 'file://c/%7C////foo/bar.html' 

"file:///tmp/mock/path" + "//"
✓ AssertionError: 'file:////' == 'file://' 

@silverwind
Copy link
Contributor

I think the best option here is to fix the remaining issues so all WHATWG tests pass (valueing spec conformity over speed) and then land this on the next (or is it next+1?) branch for 4.0.

@STRML STRML force-pushed the faster-url-parser-3 branch from bbe2157 to 6ba195d Compare August 6, 2015 15:45
@STRML
Copy link
Contributor Author

STRML commented Aug 6, 2015

Rebased on newest master.

@STRML
Copy link
Contributor Author

STRML commented Aug 6, 2015

Okay. If we're going for better compat with the WHATWG tests, I've gone through them. I'm just going to refer to them in the order they appear in the comment above:

output WHATWG Ref issue
http://[1::2]:3:4/ link ✓ Matches the refimpl.
http://2001::1/ link Matches Chrome only. The refimpl doesn't have a trailing slash. Regardless it is better behavior than the current nodejs output of 'http://2001:1/:'.
http://2001::1/] link Output is wrong. For reference, the refimpl is http://2001::1], and Chrome is http://2001::1]/. The issue is that our trailing slash is inside the bracket. Note that Node today does 'http://2001/::1]' which is arguably more wrong.
http://2001::1/]:80 link Has the same issue as 3 with the slash.
file:///tmp/mock/c%7C////foo/bar.html link Matches Chrome only. No browser gets the refimpl correct which is file:///c:////foo/bar.html.
file://// link Wrong. All browsers output file:///. Node master outputs file://. This PR outputs file:////.

@domenic
Copy link
Contributor

domenic commented Aug 6, 2015

I don't think it's generally a good idea to be consulting Chrome without consulting a quorum of other browsers. Chrome has no special status here.

@STRML
Copy link
Contributor Author

STRML commented Aug 6, 2015

I'm just going from the published test results. I'll edit the comment and link to them.

@domenic
Copy link
Contributor

domenic commented Aug 6, 2015

Right. So, for example, for 2, the spec matches 2 browsers, whereas the behavior in this patch matches 1 (Chrome). That's the kind of clarity I think is important.

@STRML
Copy link
Contributor Author

STRML commented Sep 2, 2015

I've updated the table above to be more descriptive. Hopefully someone can take it from here.

In summary, this parser is faster and more accurate than Node master but not perfect. In some tests, no browsers pass.

@silverwind silverwind added the semver-major PRs that contain breaking changes and should be released in the next major version. label Sep 5, 2015
@silverwind
Copy link
Contributor

Some breakage will be unavoidable here, so I'll label this as major.

@STRML
Copy link
Contributor Author

STRML commented Sep 8, 2015

What needs to be done to get this in for v5?

@silverwind
Copy link
Contributor

@STRML to recap, are the above 5 examples the only failing WHATWG tests?

@STRML
Copy link
Contributor Author

STRML commented Sep 8, 2015

The 6 listed are the ones that do not match the previous behavior. The url module had several failing WHATWG tests before this patch.

If I can find a better dump of the tests and their expected values I'll update.

@silverwind
Copy link
Contributor

I don't think matching previous behaviour should be a concern if we aim for a semver-major change. I'd sign this off if it passes the full up-to-date WHATWG test suite, and I think @domenic would agree.

@STRML
Copy link
Contributor Author

STRML commented Sep 8, 2015

That may take some work but I agree that this should be the goal. It appears, however, that there are no major implementations, anywhere, that pass all the tests.

For reference, this appears to be the most up-to-date repository for the tests: https://github.com/w3c/web-platform-tests/tree/master/url

Unfortunately the test data is in a non-standard format and needs to be parsed. Best this becomes a separate url module test.

@domenic
Copy link
Contributor

domenic commented Sep 8, 2015

https://github.com/jsdom/whatwg-url should pass

@jasnell jasnell added the stalled Issues and PRs that are stalled. label Nov 16, 2015
@STRML STRML force-pushed the faster-url-parser-3 branch from b6b57e8 to 018659b Compare February 6, 2016 23:17
@STRML
Copy link
Contributor Author

STRML commented Feb 6, 2016

I've rebased this PR. There's one failing test I'm working on:

Path: parallel/test-url
actual:
 {
  "protocol": "http:",
  "slashes": true,
  "auth": null,
  "host": "x.y.com",
  "port": null,
  "hostname": "x.y.com",
  "hash": "#f%20g%3Ch%3Ei",
  "search": "?d=e",
  "query": "d=e",
  "pathname": "/b/c",
  "path": "/b/c?d=e",
  "href": "http://x.y.com/b/c?d=e#f%20g%3Ch%3Ei",
  "_prependSlash": false
}
expected:
 {
  "href": "http://x.y.com/;a/b/c?d=e#f%20g%3Ch%3Ei",
  "protocol": "http:",
  "slashes": true,
  "host": "x.y.com",
  "hostname": "x.y.com",
  "pathname": ";a/b/c",
  "search": "?d=e",
  "query": "d=e",
  "hash": "#f%20g%3Ch%3Ei",
  "path": ";a/b/c?d=e",
  "auth": null,
  "port": null,
  "_prependSlash": false
}

Benches look good. I've run them for comparison with mscdex's PR (#4892) which is more conservative but easier to merge.

With petka's PR
---------

url/url-parse.js
url/url-parse.js type=one n=250000: 242334.90973
url/url-parse.js type=two n=250000: 510975.71127
url/url-parse.js type=three n=250000: 334966.59023
url/url-parse.js type=four n=250000: 1371166.72654
url/url-parse.js type=five n=250000: 1084797.65178
url/url-parse.js type=six n=250000: 369123.32686

url/url-resolve.js
url/url-resolve.js type=one n=100000: 88020.45870

With mscdex's PR
----------

url/url-parse.js
url/url-parse.js type=one n=250000: 159077.70473
url/url-parse.js type=two n=250000: 188850.73800
url/url-parse.js type=three n=250000: 153934.26544
url/url-parse.js type=four n=250000: 444040.32985
url/url-parse.js type=five n=250000: 466854.53702
url/url-parse.js type=six n=250000: 161929.07658

url/url-resolve.js
url/url-resolve.js type=one n=100000: 66938.81149

Baseline (master)
----------

url/url-parse.js
url/url-parse.js type=one n=250000: 94871.96053
url/url-parse.js type=two n=250000: 111151.28242
url/url-parse.js type=three n=250000: 99248.53141
url/url-parse.js type=four n=250000: 444074.46607
url/url-parse.js type=five n=250000: 224690.52837
url/url-parse.js type=six n=250000: 96083.83463

url/url-resolve.js
url/url-resolve.js type=one n=100000: 57167.49271

@STRML
Copy link
Contributor Author

STRML commented Feb 6, 2016

@fansworld-claudio I agree that as part of a semver-major change that bug should be fixed, but I'd rather not put it in this particular PR so that we have a better hope of merging it.

@STRML
Copy link
Contributor Author

STRML commented Feb 7, 2016

@rvagg Re: the reversion of these changes, is there still interest in getting it merged?

@STRML
Copy link
Contributor Author

STRML commented Feb 7, 2016

I'm working through the WHATWG tests. We're failing a lot of them. We're better on about 10 of them with this PR than we were before, what we're missing some major functionality (like backtracking on ..).

I believe this is mergeable as a major but after it does, we should concentrate on attempting to achieve as near to 100% WHATWG compatibility as possible.

I've marked most of the failures (starting to get tired, there are very many) with links to the spec and expected results.

All of our tests are passing in this PR.

@silverwind
Copy link
Contributor

@STRML Thanks! By the way, we recently got a conflicting PR #4892 that also claims to improve performance. Maybe you and @mscdex can work something out about this situation?

@ChALkeR ChALkeR added the performance Issues and PRs related to the performance of Node.js. label Feb 16, 2016
@benjamingr
Copy link
Member

Status @STRML?

@STRML
Copy link
Contributor Author

STRML commented Mar 23, 2016

Looks like since #4892 / #5300 landed we have quite a bit of rebase work to do if we want to get this in.

@benjamingr
Copy link
Member

Still interested?

@silverwind
Copy link
Contributor

Is it worth it? I think at this point, it'd be a pretty big rewrite. Are there still perf gains to be had?

@STRML
Copy link
Contributor Author

STRML commented Mar 23, 2016

Still significant performance benefits, and the new parser is more compliant with the WHATWG tests than the current ones. We have a chance with this overhaul to become much more standards, and to be nearly an order of magnitude faster as well.

Unfortunately I have a lot of work on my plate right now so it may be some time before I'm able to rebase this.

@silverwind
Copy link
Contributor

Alright, let's leave this open.

@jasnell
Copy link
Member

jasnell commented Feb 28, 2017

Given the existence of the new WHATWG URL parser implmentation and the lack of forward progress on this, I'm closing. Can reopen if necessary.

@jasnell jasnell closed this Feb 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Issues and PRs related to the performance of Node.js. semver-major PRs that contain breaking changes and should be released in the next major version. stalled Issues and PRs that are stalled. url Issues and PRs related to the legacy built-in url module.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants