Fix for url.parse() leaving trailing ":" on the protocol/scheme #1580

jordansissel · 2011-08-23T21:50:07Z

Before:
% node -e 'require("url").parse("http://www.google.com/").protocol'
http:

After:
% node -e 'require("url").parse("http://www.google.com/").protocol'
http


jordansissel · 2011-08-23T22:16:50Z

Scratch that, I'll update all the tests, too.

isaacs · 2011-08-23T22:21:32Z

Why?

jordansissel · 2011-08-24T00:22:14Z

Bad day at the office? Or is today "close-without-reason Tuesday"? I'll resubmit tomorrow - hopefully with better luck. Perhaps there's good science to be had in such experiments!

In the meantime, you can mend your possibly bad day with some UCB skits, like this one: http://www.ucbcomedy.com/videos/play/6904/who-poisoned-whose-tea

isaacs · 2011-08-24T04:20:53Z

Sorry, my terseness was not coming from any sort of disrespect or grumpiness. What I mean is, why is this a good change?

There's been zero discussion of this idea. It is purely cosmetic but will break lots of node programs. What's the benefit?

I closed the issue because, in lieu of some very compelling reason, there's no way this is happening. There's a "reopen" button if such reason can be found, but otherwise, it may as well not take up space on the list.

So...

Why?

jordansissel · 2011-08-24T06:01:53Z

Well, I filed this because I believe I found a bug. I sent it as a pull because it was an easy patch. There's been zero discussion because nobody had filed this bug before. The comment sections of GitHub issues and pulls are a great arena for such discussion (and code review).

As for a compelling reason, here's a few:

python's urlparse module says "http"
ruby's uri lib says "http"
java's java.net.URL class says "http"
perl's URI module says "http"
url.js is nowhere near RFC1738, and that sucks (X)

Code samples for each of 4 languages mentioned above:

% python -c 'import urlparse; print urlparse.urlparse("http://www.google.com").scheme'
http
% ruby -ruri -e 'puts URI.parse("http://www.google.com/").scheme'
http
% jruby -rjava -e 'puts java.net.URL.new("http://www.google.com/").getProtocol'
http
% perl -mURI -le 'print URI->new("http://www.google.com")->scheme'
http

(X) url.js is pretty broken with respect to RFC1738 and real-world url stuff, but much of that is out of scope for this issue.

jordansissel · 2011-08-24T06:09:17Z

Specific snippets of grammar from RFC1738:

 ; The generic form of a URL is:
genericurl     = scheme ":" schemepart

; the scheme is in lower case; interpreters should use case-ignore
scheme         = 1*[ lowalpha | digit | "+" | "-" | "." ]

"scheme" above is what url.js calls "protocol" - note how the colon isn't part of the scheme. The scheme grammar includes no allowance for colon characters.
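To make the grammar point concrete, here is a minimal sketch that tests candidate strings against the RFC1738 scheme production quoted above. The helper name `isRfc1738Scheme` is made up for illustration and is not part of node's url module; lowercasing first is an assumption based on the "case-ignore" note in the grammar comment.

```javascript
// Sketch only: checks a string against the RFC1738 production
//   scheme = 1*[ lowalpha | digit | "+" | "-" | "." ]
// isRfc1738Scheme is a hypothetical helper, not a node API.
function isRfc1738Scheme(candidate) {
  // The RFC asks interpreters to ignore case, so normalize first.
  return /^[a-z0-9+.-]+$/.test(candidate.toLowerCase());
}

console.log(isRfc1738Scheme('http'));    // true
console.log(isRfc1738Scheme('svn+ssh')); // true
console.log(isRfc1738Scheme('http:'));   // false, ":" is not in the grammar
```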

chjj · 2011-08-24T07:02:02Z

To be fair, I can think of one URL parser that does keep the trailing colon: the browser.

F12 + window.location.protocol
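For anyone without a browser console handy, roughly the same thing can be seen with the WHATWG `URL` class. This assumes a runtime that exposes `URL` as a global (browsers and later Node releases do; the node of this thread did not), so treat it as an illustration of the browser convention rather than of url.js itself.

```javascript
// Assumes a global WHATWG URL class is available in the runtime.
const loc = new URL('http://www.google.com/');

console.log(loc.protocol);              // "http:"
// Dropping the trailing colon yields what the RFCs call the scheme.
console.log(loc.protocol.slice(0, -1)); // "http"
```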

isaacs · 2011-08-24T07:26:26Z

Node's url.js borrows its naming conventions from the location object in the browser, extended (but not changed) in the following ways:

add query, since that's such a common use case, and we have a querystring parser as well, so it's really easy
add auth, and special handling for mailto:, file:, javascript: and a few others, since node sees these, but client-side JS never does
lastly, parser is designed to handle a "path-only" url, as is most commonly found on HTTP requests.

The browser is the reference implementation as far as url parsing and resolving is concerned. People who aren't familiar with every relevant RFC (ie, almost everyone) expect it to work the same.
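A short sketch of the extensions listed above, using node's `url.parse()`. The example URLs are made up; passing `true` as the second argument asks `url.parse()` to run the querystring parser on the query.

```javascript
const url = require('url');

// A "path-only" url, as commonly found on an HTTP request line.
const req = url.parse('/search?q=node&page=2', true);
console.log(req.protocol); // null
console.log(req.pathname); // "/search"
console.log(req.query.q);  // "node"

// A full url with auth; protocol keeps the browser-style trailing colon.
const full = url.parse('http://user:pass@example.com:8080/a?b=c#d');
console.log(full.protocol); // "http:"
console.log(full.auth);     // "user:pass"
console.log(full.host);     // "example.com:8080"
console.log(full.hostname); // "example.com"
```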

slaskis · 2011-08-24T14:23:23Z

@jordansissel I found this a bit strange as well and it made me write https://github.com/publicclass/addressable which is basically Ruby's extended URI gem "addressable" for js. It parses urls closer to the RFC with some extra features found in the ruby gem...

jordansissel · 2011-08-24T18:16:58Z

Even if you're ignoring everything else, and only using 'the browser' as your specification/reference, url.js still falls down pretty hard.

Further, I'm not sure what you're calling "the browser" (which browser? which version?). For example, google chrome 13 fails to recognize "svn+ssh://foo.com/" as a valid URL, but Firefox 4.0.1 does fine. Additionally, url.js doesn't properly parse data urls, while most modern browsers seem to handle this fine, so is url.js really based on a browser, or was it based on some fantasy of what some browser somewhere at some time might maybe have done?

Whatever happens, it would be nice if folks like @slaskis didn't have to look at the core / standard library of a given tool and go "that's mad broken, dog" and have to fix a standard, broken thing by making a new third-party thing.

(edit: google chrome might recognize svn+ssh://, but the behavior of unknown url schemes seems to be to search them instead of saying 'i don't know what this is')

chjj · 2011-08-24T18:34:53Z

If we're going to bring IETF specs into this, here is the regex that the URI RFC recommends for parsing URIs: http://tools.ietf.org/html/rfc3986#page-51 (I use a modified version of this myself, it's pretty fast and it works well.)

/^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/

Which for the following URL:

http://www.ics.uci.edu/pub/ietf/uri/#Related

Results in the following captures:

$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related

They capture both the trailing colon and the protocol without it. What to do now? ;)
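For reference, a small sketch of running that regex in JavaScript. The slashes are escaped here so that it is a valid JavaScript regex literal; the name `URI_RE` is just a label for this example.

```javascript
// RFC 3986 Appendix B regex, with "/" escaped for use as a JS literal.
const URI_RE = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

const m = URI_RE.exec('http://www.ics.uci.edu/pub/ietf/uri/#Related');

console.log(m[1]); // "http:"  (scheme plus the delimiter)
console.log(m[2]); // "http"   (scheme alone)
console.log(m[4]); // "www.ics.uci.edu"
console.log(m[5]); // "/pub/ietf/uri/"
console.log(m[7]); // undefined (no query in this URL)
console.log(m[9]); // "Related"
```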

isaacs · 2011-08-24T18:43:20Z

It'd be nice if url.parse handled + in protocols. I don't think there's an open issue on this, feel free to post one. It's not strictly allowed for hyperlinks, but as you point out, urls are more than href attributes.

I'd also like to figure out some deterministic way to handle urls that separate the hostname from a cwd-relative path with :, like ssh://foo@bar.com:some-dir. It would make it easier to parse git remote urls in npm.

It would also be nice if the return value of url.parse was an instance of a class that had a toString method which returned the formatted href, since that's the only omission from what the browser implementations provide. But that's not terribly important. (Incidentally, since new is often faster in v8 than creating an object literal, it might be a slight speed improvement, but it'd be slight, and I doubt that any node program is spending enough time parsing urls to notice.)

> Whatever happens, it would be nice if folks like @slaskis didn't have to look at the core / standard library of a given tool and go "that's mad broken, dog" and have to fix a standard, broken thing by making a new third-party thing.

It is not the intent of node to be a batteries-included platform. https://github.com/joyent/node/wiki/node-core-vs-userland

Url parsing is an extremely common task that most web servers and clients need to do. There is an existing quorum in the leading JavaScript implementations on how urls are to be parsed, so we're following it at least as closely as any browser follows any other.

> google chrome might recognize svn+ssh://

It doesn't fetch it, and hyperlinks with svn+ssh hrefs aren't followed. If you configure chrome to open a different application for this protocol, then it'll open that application.

> Further, I'm not sure what you're calling "the browser" (which browser? which version?).

window.location parsing is fairly consistent across browsers I've worked with. I originally tested url.js against Firefox 3 and 4, Chrome stable and dev (10-ish, I think?), Safari 4/Webkit nightly, and MSIE 6 and 7. Typically, Chrome and Firefox are considered authoritative in this case.

jordansissel · 2011-08-24T21:44:01Z

> It is not the intent of node to be a batteries-included platform. https://github.com/joyent/node/wiki/node-core-vs-userland

Foot-stomp and 'you shall not pass' heard and noted. I withdraw my patch and bug report.

isaacs · 2011-08-25T01:12:27Z

> Foot-stomp and 'you shall not pass' heard and noted.

Hahah. I think @ry is the gandalf character here. You know he once removed url parsing entirely?

slaskis · 2011-08-25T09:13:40Z

@chjj yep, that's the regex I use in addressable. Works a treat :)

jpluscplusm · 2011-08-28T23:41:02Z

@chjj and @isaacs: $1 only collects the scheme's trailing ":" because the grammar for "collecting-parens-but-who-cares" hadn't yet been implemented in POSIX regex :D

Read the next section of the RFC; the sentence starting "Therefore, we can determine the value of the five components as ..." is kind of important. "$1" is irrelevant. It is an ex-regex. It has perished. It does not matter.

How about the ascii-art section "Syntax Components", or section 3? Given that none of the art there notes the trailing ":" as being remotely important, how about dumping it? How about noting that the only purpose that that section imparts the ":" is "the character between the first part we care about and the second part we care about"? It's a string literal - a character that could, should and must be thrown away in order for code at a higher level to use the library code sensibly!

[Edited to remove alcoholic content]

mikeal · 2011-08-29T00:16:55Z

we need to match expectations, the expectation is that this behaves like the browser.

we live with the browser's mistakes, that's the web.

wjessop · 2011-08-29T00:34:24Z

A URL parsing lib including a trailing : in the scheme part looks like a bug to me, regardless of the mistakes "the browsers" have made in the past.

gilesbowkett · 2011-08-29T01:05:44Z

yeah, that shit's fucking insane. I don't give a damn about your ideology, any ideology which leads you to write something like that has its head up its ass.

gilesbowkett · 2011-08-29T01:06:19Z

Sorry, nothing personal. Immensely useful library. I use it and love it. But you have got to be fucking kidding me.

starwed · 2011-08-29T01:08:50Z

Heh, I read a blog post a few days ago about the differences between various URL parsing/slicing.

http://tantek.com/2011/238/b1/many-ways-slice-url-name-pieces

Quick reference image:
http://farm7.static.flickr.com/6203/6082913622_c953b1fc96_o.png

donpark · 2011-08-29T01:12:05Z

well, I'll take practical sense over logical sense, head up its ass or not.

xenomonkey · 2011-08-29T01:41:34Z

Um, guys. Why not just include protocol with the colon (eg "http:") and scheme without it ("http"). That way everyone wins.

jwatte · 2011-08-29T01:51:03Z

I'm a reasonably experienced developer, and my expectation was not that it would include the colon.
How about adding a "scheme" property, without the colon, to the parser result?
Also, staying compatible with all software in all versions is how you end up with technologies like Windows ME, or PHP. Successful, but painful. Is that the goal of Node?

isaacs · 2011-08-29T02:04:44Z

@xenomonkey Adding scheme without the colon is not out of the question. But why? (Seriously. Make a compelling case beyond "ruby and python users will stop telling node users that node is unusable.")

@gilesbowkett I'm not sure who you're insulting.

@jwatte How long did it take you to adapt your program to handle the ":" in the protocol property? A whole minute? Less?

I'm really not seeing a bug here.

wjessop · 2011-08-29T02:09:11Z

> But why?

@isaacs: For me @jordansissel already covered it, but to summarise, convention, standards and some sense of correctness.

xenomonkey · 2011-08-29T02:19:51Z

@isaacs Upon reflection I think my suggestion is crap (results in bloated code for, as you say, no gain). You're quite right the only reason for doing it would be to make node more similar to other languages and slightly more compliant with the RFC. Either scheme (without the colon) or protocol (with the colon) should be supported but not both. Personally I don't care either way (I'm quite capable of adding/stripping a character from a string if I need to).

I would suspect that since node is based on javascript that you should stick with the most expected solution. For javascript programmers that would be to use protocol and leave the colon on the end.

ELLIOTTCABLE · 2011-08-29T02:26:14Z

Yeah, what the hell was @gilesbowkett on about?

Anyway, I don’t know why everybody’s flipping out about this. @isaacs’ point is perfectly appropriate: Node is JavaScript; Node is intended to parallel the browser; Node is intended to keep everything in userspace code. There is no real argument for the change, except that “some other languages do it this way.” What browsers do ≥ what other languages do, because Node is intended to ape the browser, not those other languages.

isaacs · 2011-09-02T17:59:28Z

@paulbjensen wins 9000 internets!

mikeal · 2011-09-02T18:11:12Z

I HAVE AN OPINION!

tj · 2011-09-02T18:14:35Z

MOAR COLONS

chjj · 2011-09-02T18:17:19Z

What are colons? Are they webscale? 9000 internets is not enough for webscale.

tamzinblake · 2011-09-02T19:33:12Z

I: think: this: issue: has: not: received: enough: attention::

Too: many: developers: have: been: failing: to: end: every: string: with: a: colon::

BroDotJS · 2011-09-02T20:15:41Z

Losers always whine about RFCs and back compat. Bros go home with the prom queen:

require.colonsblow = function (moduleName) {
    var mod = require(moduleName);
    var k, fn;

    function colonic(x) {
        return (typeof x === 'string' && x.slice(-1) === ':') ? x.slice(0, -1) : x;
    }

    function wrap(fn) {
        return function () {
            var result = fn.apply(this, arguments);
            var k;

            if (result instanceof Array) {
                result = result.map(function (v, i, a) {
                    return colonic(v);
                });
            } else if (result && typeof result === 'object') {
                for (k in result) {
                    if (result.hasOwnProperty(k)) {
                        result[k] = colonic(result[k]);
                    }
                }
            }

            return colonic(result);
        };
    }

    for (k in mod) {
        if (mod.hasOwnProperty(k) && typeof mod[k] === 'function') {
            mod[k] = wrap(mod[k]);
        }
    }

    return mod;
};

var ballinURLz = require.colonsblow('url');

console.log(ballinURLz.parse('http://your.mom.com/so/fat'));

Doneski. This bro is headed to Twin Peaks for shots and steaks. Who's in? No nerds.

Marak · 2011-09-02T20:26:52Z

Isaacs was the prom queen.

TooTallNate · 2011-09-02T20:30:45Z

<ref>The Rock</ref>

chjj · 2011-09-02T20:31:49Z

The Rock

Ah, beat me to it.

tj · 2011-09-02T20:31:53Z

Marak · 2011-09-02T20:36:20Z

mmalecki · 2011-09-02T21:16:48Z

sbussard · 2011-09-02T22:29:25Z

+1 for getting rid of useless chars

BonsaiDen · 2011-09-02T22:50:39Z

Compatibility with non-standard APIs is stupid. 99% of the users will perform a replace() on the thing anyways, just get rid of it now.

mikeal · 2011-09-02T22:57:33Z

@BonsaiDen @sbussard this thread is now closed to serious comments, only jokes are allowed now

tanepiper · 2011-09-02T23:14:17Z

At this rate, we might as well just get rid of semicolons too

sbussard · 2011-09-02T23:19:16Z

what other things end with a colon? ... :(){ :|:& };:

donpark · 2011-09-03T01:06:33Z

How do I unparticipate from this 4chanish thread?

Marak · 2011-09-03T01:08:02Z

Qard · 2011-09-03T01:12:59Z

Below the comment box. Click "Disable notifications for this Pull Request"

donpark · 2011-09-03T01:15:50Z

Doh. Thx @Qard & @Marak

ghost · 2012-09-02T16:01:25Z

What the hell did I just read?

johan · 2012-12-08T18:27:59Z

+1 keeping protocol as-is
+1 adding colon-less scheme property

The first is useful for sharing code and APIs between front and back end code.
The second is useful because it's a reasonable expectation for those of us coming from the RFCs.
We can have both, just as we can have both .hostname and .host (with port), which are also useful.

/ web front-and-back-end developer since 15 years
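A minimal userland sketch of the "have both" idea, assuming nothing beyond node's existing `url.parse()`. The wrapper name `parseWithScheme` and the `scheme` property are hypothetical; neither is part of node's url module.

```javascript
const url = require('url');

// Hypothetical wrapper: keeps the browser-style protocol ("http:")
// and adds an RFC-style scheme ("http") alongside it.
function parseWithScheme(input) {
  const parsed = url.parse(input);
  // protocol is null for path-only urls, so scheme is null as well.
  parsed.scheme = parsed.protocol ? parsed.protocol.replace(/:$/, '') : null;
  return parsed;
}

const u = parseWithScheme('http://www.google.com:8080/');
console.log(u.protocol); // "http:"
console.log(u.scheme);   // "http"
console.log(u.host);     // "www.google.com:8080"
console.log(u.hostname); // "www.google.com"
```

This mirrors the existing .host / .hostname pairing: both forms are available and callers pick whichever matches their expectations.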

vicary · 2012-12-08T20:11:06Z

What's the point of digging up arguments from last year? Especially when it comes to punctuation, it's strictly personal and your opinion is not likely to work.

isaacs · 2012-12-09T17:40:53Z

@github Can you please please give us the ability to close comments on issues after some period of time? This is a textbook example of a thing that is well beyond the point where any good can possibly come from additional conversation.


billy-grates · 2015-07-18T04:06:22Z

@github -1 on the ability to close comments after some period of time. As a firm believer in free speech, we should not let people stifle constructive conversation.

I've been a fan of using github as a source of high quality entertainment for over 5 years, you would be doing your key demographic a disservice in implementing said feature.

Also I feel like @gilesbowkett comments were on point.

revmischa · 2015-07-18T07:07:14Z

thanks guys for going so deep into this colon issue. the library really needed some cleansing. felt like it was getting full of waste.
those colons can be a real pain in the you-know-what! too bad that this thread had to go down that dark hole.

Fix url.parse() leaving trailing ':' on protocol

7fd62d9


isaacs closed this Aug 23, 2011

rubys pushed a commit to webspecs/url that referenced this pull request Nov 27, 2014

Remove a note per @zcorpan and add an exciting logo!

09d621f

Background behind the colon is nodejs/node-v0.x-archive#1580 (be sure to read it all the way)

kenany mentioned this pull request Mar 22, 2015

lib: remove : from protocol in Url.parse(). nodejs/node#1237

Closed

bascht mentioned this pull request Apr 27, 2016

Supply full URL instead of param bits. fhemberger/good-logstash#1

Merged
