`item` on Strings has questionable value #20

michaelficarra · 2020-09-08T22:46:39Z

I get that strings expose an array-like interface to their code units and that this proposal is supposed to have some sort of simple desugaring as an indexing operation that accounts for negative indices. But given how easy it is to [].item.call("string", codeUnitIndex), I think the more useful default would be code point based. In my opinion, any of

defaulting to code points
providing two named variants
not providing item on Strings at all

would all be preferable to "".item being basically identical to [].item.

The text was updated successfully, but these errors were encountered:

bakkot · 2020-09-08T23:09:04Z

not providing item on Strings at all

That's my preference, FWIW. Most of the reasons @tabatkins gives for not wanting strings to be iterable in this essay apply here as well.

ljharb · 2020-09-08T23:22:53Z

I find it quite important to provide this on strings, with the current semantics. Strings are treated in the spec as an indexed collection type just like all the others - iterable, arraylike, bracketable - and it would be a confusing inconsistency to have a non-string method that appears on all the other iterable arraylikes.

bakkot · 2020-09-08T23:37:36Z

Well, fortunately, we can and should omit it for arguments, which is the other iterable arraylike. And then this will be present only on arrays and typed arrays, just like .filter and friends.

ljharb · 2020-09-08T23:40:19Z

Arguments is a special case because it lacks a prototype.

bakkot · 2020-09-08T23:45:14Z

It is one of only four iterable arraylikes in the language. The claim that "it would be a confusing inconsistency to have a non-string method that appears on all the other iterable arraylikes" gets pretty weak if you are implicitly excluding one of them.

In any case, if we're excluding arguments from consideration, then your argument is "any method which is present on both arrays and typed arrays must also be present on strings", yes? But that's already not true of many methods, like .filter.

syg · 2020-09-08T23:46:44Z

I am fine with leaving item out of String.prototype, but that's predicated on that UTF16 code unit indexing on Strings is a legacy feature that we as a committee agree to not keep for future proposals. I don't have too strong feelings on this case.

I find Tab's essay persuasive and think there are good reasons to not have strings be collections. But as it stands, UTF16 code unit indexing via array-like operations is already a thing on Strings in more ways than just being iterable. slice() existing is a big one. There is a DX advantage to let people avoid typing str.slice(-42)[0] for the UTF16 code unit case.

Pragmatically I think I'd feel differently if our code units were, say, UTF8.

Overall, my takeaway is that if need be, I'd rather split out String#item to its own proposal than have get hung up on the code unit vs code point point.

ljharb · 2020-09-08T23:51:34Z

I think slice is definitely the thing motivating my thinking here.

I don’t think code points makes more sense than code units given the whole grapheme clusters debate.

tabatkins · 2020-09-09T15:51:15Z

I don't have a strong preference on whether this goes on Strings or not; that is not the primary purpose of this proposal, just a nice-to-have consistency.

That said, I do on occasion index strings, and in particular, checking the last character in a string isn't too uncommon - I have a couple of examples of it in my code. And in general I think it's a good thing for languages to have consistency across similar things, to reduce learning burden. ("[].item.call("string", codeUnitIndex) is easy" is not a compelling argument, I'm afraid.) So I'm weakly in favor of keeping .item() on Strings here; I'd prefer to only drop it if there's a reasonable blocker, not just "eh, doesn't seem that useful".

I don't think my opinions on why Strings shouldn't be iterable have any bearing here. First, "indexable" is different than "iterable" (tho parts of my essay certainly still transfer over); second, they're already indexable in JS, so that argument is already lost, and we should prefer consistency with other similar types.

I am strongly opposed to keeping it in but making it refer to code points; that again breaks the consistency with other types, where .item() with a non-negative value is equivalent to the [] operator. We already have a method that indexes codepoints (.codepointAt()), which can likely be trivially amended to take negative values and have these semantics as well. I'd be quite supportive of that!

bakkot · 2020-09-09T16:31:50Z

We already have a method that indexes codepoints

FWIW, .codePointAt doesn't actually index by code points. It indexes by code units (as with []), and then returns the code point beginning at the specified code unit. (And making it take negative values would be weird - either foo.codePointAt(foo.length - 1) would be different from foo.codePointAt(-1), or we would have the awkward situation that '🗾'.codePointAt(-1) would give you 0xDDFE rather than 0x1F5FE.)

I don't think having item index strings by code points would be a good idea, just saying, that's not a thing we currently have.

I'd prefer to only drop it if there's a reasonable blocker, not just "eh, doesn't seem that useful".

The argument isn't "that doesn't seem useful", it's "every time we expose another place where it's easy to accidentally end up dealing with UTF-16 code units rather than Unicode code points we are building a footgun, and we should avoid doing that without a good reason".

In particular, if we expect people to use this method for looking at the last character in a string, that's a very strong reason not to add it with code unit semantics, because that will break as soon as they start dealing with strings which include non-BMP characters - exactly the sort of bug which is hard to catch in testing and then ends up screwing over production users from non-European cultures (and also people who like emoji).

tabatkins · 2020-09-09T16:50:23Z

FWIW, .codePointAt doesn't actually index by code points.

Ooof, I actually wasn't aware of this. I mostly just use the method to get the codepoint of weird characters I copy-paste from a page, so I only ever use 0 and wouldn't have seen the difference.

In particular, if we expect people to use this method for looking at the last character in a string, that's a very strong reason not to add it with code unit semantics, because that will break as soon as they start dealing with strings which include non-BMP characters - exactly the sort of bug which is hard to catch in testing and then ends up screwing over production users from non-European cultures (and also people who like emoji).

Yeah, that's a reasonable concern.

Well, I do still object to .item() having semantics other than what [] does, so either it needs to stay as it is, or we drop. Again, I lean weakly towards keeping it, but don't have a strong opinion on it.

syg · 2020-09-09T16:51:26Z

The argument isn't "that doesn't seem useful", it's "every time we expose another place where it's easy to accidentally end up dealing with UTF-16 code units rather than Unicode code points we are building a footgun, and we should avoid doing that without a good reason".

Well, I guess the empirical question is: is UTF16 code unit indexing well entrenched to the point where it may not be accidental use? item() isn't going to the thing that encourages accidental use, the presence of indexing itself is.

kmiller68 · 2020-09-09T20:13:08Z

Wouldn't code point indexing have to be O(n) indexing for a non-utf32 string? That seems unexpected for a generic indexed accessor into a string, which is my understanding of what we want this to be. If we don't think utf16 code unit indexing is acceptable, I'd rather we don't do .item on strings at all. That said, I think the prevalence of [] is such a common idiom that I'd guess not doing utf-16 code units would lead to more than anything else...

I'd also guess that an npm package adding String.prototype.item will become very popular if we decide to exclude it. So, its not clear to me how much developers will be spared by excluding String.prototype.item.

bakkot · 2020-09-09T20:16:43Z

I'd also guess that an npm package adding String.prototype.item will become very popular if we decide to exclude it.

I'd take the other side of that bet.

codehag · 2020-09-18T08:52:42Z

We were discussing this internally. Our preference leans towards using code points. Removing or spliting item for string out is also a good direction to go, since this is not controversial for the other types. cc @evilpie and @jswalden

tabatkins · 2020-09-18T16:11:10Z

I still strongly favor leaving out String.item entirely, rather than giving it different semantics than []. One of the big useful things about .item() over [] is that it gives you negative indexing, and if String.item() instead gives you something that's sometimes different, that's a huge footgun.

[] indexing of Strings is a footgun to start with, so I understand the desire, but keeping the mental model simple ([] == .item()) is better than making one instance slightly better.

If codepoint indexing is desired (and the current hack of "spread into an Array, then index" isn't sufficient), we should introduce (in a separate proposal) a .codePoint() method. (Figuring out naming to avoid confusion with .codePointAt() is left as an exercise for the reader.)

jswalden · 2020-09-18T19:58:49Z

I brought up the "this should be code points" idea in Mozilla discussion, with the idea that at least it should be a deliberate decision as to whether it should be units or points. Time was short, so we didn't dive into it terribly deeply.

On slightly more thinking by me, tho -- independent of the discussion here, not that it matters -- the overlapping problems of return type (exposing UTF-16 guts more places is not good if you return code units), nature of the input index (if it returns code points), and algorithmic complexity (if it returns code points) seem to motivate to me not having item on strings. For the reasons in @tabatkins's link, or in https://hsivonen.fi/string-length/ separately (but along similar lines), and probably other places.

hax · 2020-09-22T07:44:24Z

In today's meeting, this proposal already achieve stage 3 with String.prototype.item() using code unit semantic, because no delegate (include me) in the meeting would like to be a bad guy to block the whole proposal. Such things happened many times, I feel this is not a healthy process, and the result may be just random.

Personally I really hope we could advance the proposal without string item() and use more time to discuss the possible solutions. Even the final result is still land item() with code unit semantic, it could have much solid rationale.

Jack-Works · 2020-09-22T09:36:52Z

Yeah, maybe split them into two proposals

syg · 2020-09-22T15:40:50Z

I'm not sure what new information could be surfaced here that would better inform the debate: it comes down to "indexing is already a thing + consistency" vs "UTF16 code unit indexing considered harmful".

There are telegraphed possible blockers on both sides, and so we chose one.

Edit: To be clear, there were strongish feelings for both including it and not including (the "not including" option includes splitting into two proposals).

hax · 2020-09-22T16:10:19Z

@syg I think we still could check other option like codepoint-safe version of item(). It could be consistent with [] in most cases, only inconsistent in surrogate case (to fix the isolated surrogate), and mitigate the harm of UTF16 code unit indexing.

bakkot · 2020-09-22T16:15:46Z

I don't think that there's more information about the proposal itself to be surfaced, but more time for discussion would allow delegates who did not have an opinion to form one and for those who didn't express an opinion to do so.

I really dislike that "telegraphing a blocker" effectively means you can get whatever outcome you like. If it turns out that only a couple people (yourself, Jordan, Waldemar) preferred adding it, and everyone else preferred not adding it, but the process ends up adding it anyway because Jordan is more willing to imply he'll block a popular proposal if it doesn't include a part he wants, I think that's a bad outcome. And I think that might be pretty much what happened. But we can't tell because most delegates didn't weigh in, because the timebox was tight.

syg · 2020-09-22T16:35:54Z

@hax No, I am strongly against any version of item() that doesn't have the exact semantics as using []. I'm for some method that does that, but I am very against that method being called item.

@bakkot Agreed on our process being bad for situations like this. I really dislike the possibility of lone objectors at all, but this isn't the right thread for that. I want viable escalation paths beyond "appeasing or laborious efforts for people who are willing to block". Without alternatives, those are the tools our consensus system gives us.

(FWIW in this particular case, Mark Miller also expressed strong feelings preferring inclusion.)

hax · 2020-09-22T16:44:24Z

I'm for some method that does that, but I am very against that method being called item.

@syg If we add such method in the future, for example named it as uItem, it will also introduce confusion for developers. Obviously developers should be encouraged to use a unicode codepoint safe version, which implies item() should be deprecated, so why add it in the first place?

syg · 2020-09-22T16:55:31Z

@hax uItem is a strawperson. String already has many methods with codePoint in the name, and this new method probably wouldn't have item in the name. I'd find that argument compelling if we were designing a new String, but the argument remains that UTF16 code unit indexing on Strings is already too entrenched, and not only will we not be moving the needle much by omitting item from String we'll also be hurting consistency with other indexables.

There's no future where item itself is deprecated without deprecating the rest of String indexing. IOW I don't see it ever being deprecated. You can, of course, continue to say "Don't index Strings at all", but for that you'd already have to enumerate all the ways Strings can be indexed, and the addition of item is unfortunate but relatively minor in that framing.

hax · 2020-09-22T17:12:54Z

To be clear I'm not proposing "don't index Strings" but suggest not adding codepoint-unsafe methods anymore. We can add new methods which codepoint-safe but still codeunit-indexed. With these new methods, string is still indexable, just won't return isolated surrogate.

syg · 2020-09-22T17:13:25Z

@hax I'd welcome that as a separate proposal!

ljharb · 2020-09-22T17:15:02Z

As was discussed during the meeting, "code-point-safe" is not, in fact, safe - it's just arguably safer than code units for many use cases.

The language doesn't have a truly safe (grapheme-cluster-based, i assume) mechanism for interacting with strings. Code points are not the way forward here.

hax · 2020-09-22T17:24:10Z

It's a slippery slope that because it's not truly safe, so let's still add unsafe version.

And it's definitely safer than codepoint-unsafe version. Unsafe version could give u a corrupted string, causing many problem. For example split it in surrogate pair, encode these two string to utf8 and concat it, it's a invalid utf8... Grapheme-cluster is much high-level and don't have such issue.

ExE-Boss · 2020-09-22T19:18:07Z

I’m personally in favour of splitting %String.prototype.item% into a future proposal.

…pe.item in Nightly. r=anba We don't implement String.prototype.item: tc39/proposal-relative-indexing-method#20 Differential Revision: https://phabricator.services.mozilla.com/D90732

…pe.item in Nightly. r=anba We don't implement String.prototype.item: tc39/proposal-relative-indexing-method#20 Differential Revision: https://phabricator.services.mozilla.com/D90732 UltraBlame original commit: f43cdaab86113c78acea1e51d858335a0e7cc013

…pe.item in Nightly. r=anba We don't implement String.prototype.item: tc39/proposal-relative-indexing-method#20 Differential Revision: https://phabricator.services.mozilla.com/D90732

mathiasbynens · 2020-10-01T17:52:16Z

I'd also guess that an npm package adding String.prototype.item will become very popular if we decide to exclude it. So, its not clear to me how much developers will be spared by excluding String.prototype.item.

This is opposite to how we usually deal with proposals for new built-ins: we wait for userland patterns to emerge, and then based on their prevalence, attempt to standardize them. We shouldn't standardize APIs just because we think it would be popular on npm.

I think String.prototype.item does not pay for itself — it has massive potential to cause confusion (“does it index by code unit like str[n] or by code point like [...str][n] or by grapheme cluster like Intl.Segmenter”?), and it would best be excluded from the language. At the very least, String.prototype.item should be pursued as a separate proposal (because adding the method for Array and TypedArray is uncontroversial).

Instead of "an .item() method on all the built-in indexable classes", this proposal would then be "an .item() method on all the built-in array classes".

syg · 2020-10-01T18:10:32Z

I think String.prototype.item does not pay for itself — it has massive potential to cause confusion

Code unit indexing might be undesirable but I don't think in this case it's confusing because .item() is clearly aligned with how [] behaves on the receiver.

zloirock · 2020-11-17T20:34:28Z

Related this thread: #33 (comment)

syg · 2020-11-20T17:29:26Z

We have discussed this in TC39 and reached consensus to include the method on String.prototype.

leobalter mentioned this issue Sep 8, 2020

Stage 3 spec reviewers #19

Closed

6 tasks

syg closed this as completed Nov 20, 2020

codehag mentioned this issue Dec 1, 2020

Relative indexing method on JS indexables (.at) mozilla/standards-positions#458

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`item` on Strings has questionable value #20

`item` on Strings has questionable value #20

michaelficarra commented Sep 8, 2020

bakkot commented Sep 8, 2020

ljharb commented Sep 8, 2020

bakkot commented Sep 8, 2020

ljharb commented Sep 8, 2020

bakkot commented Sep 8, 2020

syg commented Sep 8, 2020 •

edited

Loading

ljharb commented Sep 8, 2020

tabatkins commented Sep 9, 2020

bakkot commented Sep 9, 2020

tabatkins commented Sep 9, 2020

syg commented Sep 9, 2020 •

edited

Loading

kmiller68 commented Sep 9, 2020

bakkot commented Sep 9, 2020

codehag commented Sep 18, 2020 •

edited

Loading

tabatkins commented Sep 18, 2020

jswalden commented Sep 18, 2020 •

edited

Loading

hax commented Sep 22, 2020

Jack-Works commented Sep 22, 2020

syg commented Sep 22, 2020 •

edited

Loading

hax commented Sep 22, 2020

bakkot commented Sep 22, 2020

syg commented Sep 22, 2020

hax commented Sep 22, 2020

syg commented Sep 22, 2020

hax commented Sep 22, 2020

syg commented Sep 22, 2020

ljharb commented Sep 22, 2020

hax commented Sep 22, 2020

ExE-Boss commented Sep 22, 2020

mathiasbynens commented Oct 1, 2020

syg commented Oct 1, 2020

zloirock commented Nov 17, 2020

syg commented Nov 20, 2020

item on Strings has questionable value #20

item on Strings has questionable value #20

Comments

michaelficarra commented Sep 8, 2020

bakkot commented Sep 8, 2020

ljharb commented Sep 8, 2020

bakkot commented Sep 8, 2020

ljharb commented Sep 8, 2020

bakkot commented Sep 8, 2020

syg commented Sep 8, 2020 • edited Loading

ljharb commented Sep 8, 2020

tabatkins commented Sep 9, 2020

bakkot commented Sep 9, 2020

tabatkins commented Sep 9, 2020

syg commented Sep 9, 2020 • edited Loading

kmiller68 commented Sep 9, 2020

bakkot commented Sep 9, 2020

codehag commented Sep 18, 2020 • edited Loading

tabatkins commented Sep 18, 2020

jswalden commented Sep 18, 2020 • edited Loading

hax commented Sep 22, 2020

Jack-Works commented Sep 22, 2020

syg commented Sep 22, 2020 • edited Loading

hax commented Sep 22, 2020

bakkot commented Sep 22, 2020

syg commented Sep 22, 2020

hax commented Sep 22, 2020

syg commented Sep 22, 2020

hax commented Sep 22, 2020

syg commented Sep 22, 2020

ljharb commented Sep 22, 2020

hax commented Sep 22, 2020

ExE-Boss commented Sep 22, 2020

mathiasbynens commented Oct 1, 2020

syg commented Oct 1, 2020

zloirock commented Nov 17, 2020

syg commented Nov 20, 2020

`item` on Strings has questionable value #20

`item` on Strings has questionable value #20

syg commented Sep 8, 2020 •

edited

Loading

syg commented Sep 9, 2020 •

edited

Loading

codehag commented Sep 18, 2020 •

edited

Loading

jswalden commented Sep 18, 2020 •

edited

Loading

syg commented Sep 22, 2020 •

edited

Loading