Skip to content
This repository has been archived by the owner on Jan 26, 2022. It is now read-only.

item on Strings has questionable value #20

Closed
michaelficarra opened this issue Sep 8, 2020 · 33 comments
Closed

item on Strings has questionable value #20

michaelficarra opened this issue Sep 8, 2020 · 33 comments

Comments

@michaelficarra
Copy link
Member

I get that strings expose an array-like interface to their code units and that this proposal is supposed to have some sort of simple desugaring as an indexing operation that accounts for negative indices. But given how easy it is to [].item.call("string", codeUnitIndex), I think the more useful default would be code point based. In my opinion, any of

  • defaulting to code points
  • providing two named variants
  • not providing item on Strings at all

would all be preferable to "".item being basically identical to [].item.

@bakkot
Copy link

bakkot commented Sep 8, 2020

not providing item on Strings at all

That's my preference, FWIW. Most of the reasons @tabatkins gives for not wanting strings to be iterable in this essay apply here as well.

@leobalter leobalter mentioned this issue Sep 8, 2020
6 tasks
@ljharb
Copy link
Member

ljharb commented Sep 8, 2020

I find it quite important to provide this on strings, with the current semantics. Strings are treated in the spec as an indexed collection type just like all the others - iterable, arraylike, bracketable - and it would be a confusing inconsistency to have a non-string method that appears on all the other iterable arraylikes.

@bakkot
Copy link

bakkot commented Sep 8, 2020

Well, fortunately, we can and should omit it for arguments, which is the other iterable arraylike. And then this will be present only on arrays and typed arrays, just like .filter and friends.

@ljharb
Copy link
Member

ljharb commented Sep 8, 2020

Arguments is a special case because it lacks a prototype.

@bakkot
Copy link

bakkot commented Sep 8, 2020

It is one of only four iterable arraylikes in the language. The claim that "it would be a confusing inconsistency to have a non-string method that appears on all the other iterable arraylikes" gets pretty weak if you are implicitly excluding one of them.

In any case, if we're excluding arguments from consideration, then your argument is "any method which is present on both arrays and typed arrays must also be present on strings", yes? But that's already not true of many methods, like .filter.

@syg
Copy link
Collaborator

syg commented Sep 8, 2020

I am fine with leaving item out of String.prototype, but that's predicated on that UTF16 code unit indexing on Strings is a legacy feature that we as a committee agree to not keep for future proposals. I don't have too strong feelings on this case.

I find Tab's essay persuasive and think there are good reasons to not have strings be collections. But as it stands, UTF16 code unit indexing via array-like operations is already a thing on Strings in more ways than just being iterable. slice() existing is a big one. There is a DX advantage to let people avoid typing str.slice(-42)[0] for the UTF16 code unit case.

Pragmatically I think I'd feel differently if our code units were, say, UTF8.

Overall, my takeaway is that if need be, I'd rather split out String#item to its own proposal than have get hung up on the code unit vs code point point.

@ljharb
Copy link
Member

ljharb commented Sep 8, 2020

I think slice is definitely the thing motivating my thinking here.

I don’t think code points makes more sense than code units given the whole grapheme clusters debate.

@tabatkins
Copy link
Collaborator

I don't have a strong preference on whether this goes on Strings or not; that is not the primary purpose of this proposal, just a nice-to-have consistency.

That said, I do on occasion index strings, and in particular, checking the last character in a string isn't too uncommon - I have a couple of examples of it in my code. And in general I think it's a good thing for languages to have consistency across similar things, to reduce learning burden. ("[].item.call("string", codeUnitIndex) is easy" is not a compelling argument, I'm afraid.) So I'm weakly in favor of keeping .item() on Strings here; I'd prefer to only drop it if there's a reasonable blocker, not just "eh, doesn't seem that useful".

I don't think my opinions on why Strings shouldn't be iterable have any bearing here. First, "indexable" is different than "iterable" (tho parts of my essay certainly still transfer over); second, they're already indexable in JS, so that argument is already lost, and we should prefer consistency with other similar types.

I am strongly opposed to keeping it in but making it refer to code points; that again breaks the consistency with other types, where .item() with a non-negative value is equivalent to the [] operator. We already have a method that indexes codepoints (.codepointAt()), which can likely be trivially amended to take negative values and have these semantics as well. I'd be quite supportive of that!

@bakkot
Copy link

bakkot commented Sep 9, 2020

We already have a method that indexes codepoints

FWIW, .codePointAt doesn't actually index by code points. It indexes by code units (as with []), and then returns the code point beginning at the specified code unit. (And making it take negative values would be weird - either foo.codePointAt(foo.length - 1) would be different from foo.codePointAt(-1), or we would have the awkward situation that '🗾'.codePointAt(-1) would give you 0xDDFE rather than 0x1F5FE.)

I don't think having item index strings by code points would be a good idea, just saying, that's not a thing we currently have.

I'd prefer to only drop it if there's a reasonable blocker, not just "eh, doesn't seem that useful".

The argument isn't "that doesn't seem useful", it's "every time we expose another place where it's easy to accidentally end up dealing with UTF-16 code units rather than Unicode code points we are building a footgun, and we should avoid doing that without a good reason".

In particular, if we expect people to use this method for looking at the last character in a string, that's a very strong reason not to add it with code unit semantics, because that will break as soon as they start dealing with strings which include non-BMP characters - exactly the sort of bug which is hard to catch in testing and then ends up screwing over production users from non-European cultures (and also people who like emoji).

@tabatkins
Copy link
Collaborator

FWIW, .codePointAt doesn't actually index by code points.

Ooof, I actually wasn't aware of this. I mostly just use the method to get the codepoint of weird characters I copy-paste from a page, so I only ever use 0 and wouldn't have seen the difference.

In particular, if we expect people to use this method for looking at the last character in a string, that's a very strong reason not to add it with code unit semantics, because that will break as soon as they start dealing with strings which include non-BMP characters - exactly the sort of bug which is hard to catch in testing and then ends up screwing over production users from non-European cultures (and also people who like emoji).

Yeah, that's a reasonable concern.

Well, I do still object to .item() having semantics other than what [] does, so either it needs to stay as it is, or we drop. Again, I lean weakly towards keeping it, but don't have a strong opinion on it.

@syg
Copy link
Collaborator

syg commented Sep 9, 2020

The argument isn't "that doesn't seem useful", it's "every time we expose another place where it's easy to accidentally end up dealing with UTF-16 code units rather than Unicode code points we are building a footgun, and we should avoid doing that without a good reason".

Well, I guess the empirical question is: is UTF16 code unit indexing well entrenched to the point where it may not be accidental use? item() isn't going to the thing that encourages accidental use, the presence of indexing itself is.

@kmiller68
Copy link

Wouldn't code point indexing have to be O(n) indexing for a non-utf32 string? That seems unexpected for a generic indexed accessor into a string, which is my understanding of what we want this to be. If we don't think utf16 code unit indexing is acceptable, I'd rather we don't do .item on strings at all. That said, I think the prevalence of [] is such a common idiom that I'd guess not doing utf-16 code units would lead to more than anything else...

I'd also guess that an npm package adding String.prototype.item will become very popular if we decide to exclude it. So, its not clear to me how much developers will be spared by excluding String.prototype.item.

@bakkot
Copy link

bakkot commented Sep 9, 2020

I'd also guess that an npm package adding String.prototype.item will become very popular if we decide to exclude it.

I'd take the other side of that bet.

@codehag
Copy link

codehag commented Sep 18, 2020

We were discussing this internally. Our preference leans towards using code points. Removing or spliting item for string out is also a good direction to go, since this is not controversial for the other types. cc @evilpie and @jswalden

@tabatkins
Copy link
Collaborator

I still strongly favor leaving out String.item entirely, rather than giving it different semantics than []. One of the big useful things about .item() over [] is that it gives you negative indexing, and if String.item() instead gives you something that's sometimes different, that's a huge footgun.

[] indexing of Strings is a footgun to start with, so I understand the desire, but keeping the mental model simple ([] == .item()) is better than making one instance slightly better.

If codepoint indexing is desired (and the current hack of "spread into an Array, then index" isn't sufficient), we should introduce (in a separate proposal) a .codePoint() method. (Figuring out naming to avoid confusion with .codePointAt() is left as an exercise for the reader.)

@jswalden
Copy link

jswalden commented Sep 18, 2020

I brought up the "this should be code points" idea in Mozilla discussion, with the idea that at least it should be a deliberate decision as to whether it should be units or points. Time was short, so we didn't dive into it terribly deeply.

On slightly more thinking by me, tho -- independent of the discussion here, not that it matters -- the overlapping problems of return type (exposing UTF-16 guts more places is not good if you return code units), nature of the input index (if it returns code points), and algorithmic complexity (if it returns code points) seem to motivate to me not having item on strings. For the reasons in @tabatkins's link, or in https://hsivonen.fi/string-length/ separately (but along similar lines), and probably other places.

@hax
Copy link
Member

hax commented Sep 22, 2020

In today's meeting, this proposal already achieve stage 3 with String.prototype.item() using code unit semantic, because no delegate (include me) in the meeting would like to be a bad guy to block the whole proposal. Such things happened many times, I feel this is not a healthy process, and the result may be just random.

Personally I really hope we could advance the proposal without string item() and use more time to discuss the possible solutions. Even the final result is still land item() with code unit semantic, it could have much solid rationale.

@Jack-Works
Copy link
Member

Yeah, maybe split them into two proposals

@syg
Copy link
Collaborator

syg commented Sep 22, 2020

I'm not sure what new information could be surfaced here that would better inform the debate: it comes down to "indexing is already a thing + consistency" vs "UTF16 code unit indexing considered harmful".

There are telegraphed possible blockers on both sides, and so we chose one.

Edit: To be clear, there were strongish feelings for both including it and not including (the "not including" option includes splitting into two proposals).

@hax
Copy link
Member

hax commented Sep 22, 2020

@syg I think we still could check other option like codepoint-safe version of item(). It could be consistent with [] in most cases, only inconsistent in surrogate case (to fix the isolated surrogate), and mitigate the harm of UTF16 code unit indexing.

@bakkot
Copy link

bakkot commented Sep 22, 2020

I don't think that there's more information about the proposal itself to be surfaced, but more time for discussion would allow delegates who did not have an opinion to form one and for those who didn't express an opinion to do so.

I really dislike that "telegraphing a blocker" effectively means you can get whatever outcome you like. If it turns out that only a couple people (yourself, Jordan, Waldemar) preferred adding it, and everyone else preferred not adding it, but the process ends up adding it anyway because Jordan is more willing to imply he'll block a popular proposal if it doesn't include a part he wants, I think that's a bad outcome. And I think that might be pretty much what happened. But we can't tell because most delegates didn't weigh in, because the timebox was tight.

@syg
Copy link
Collaborator

syg commented Sep 22, 2020

@hax No, I am strongly against any version of item() that doesn't have the exact semantics as using []. I'm for some method that does that, but I am very against that method being called item.

@bakkot Agreed on our process being bad for situations like this. I really dislike the possibility of lone objectors at all, but this isn't the right thread for that. I want viable escalation paths beyond "appeasing or laborious efforts for people who are willing to block". Without alternatives, those are the tools our consensus system gives us.

(FWIW in this particular case, Mark Miller also expressed strong feelings preferring inclusion.)

@hax
Copy link
Member

hax commented Sep 22, 2020

I'm for some method that does that, but I am very against that method being called item.

@syg If we add such method in the future, for example named it as uItem, it will also introduce confusion for developers. Obviously developers should be encouraged to use a unicode codepoint safe version, which implies item() should be deprecated, so why add it in the first place?

@syg
Copy link
Collaborator

syg commented Sep 22, 2020

@hax uItem is a strawperson. String already has many methods with codePoint in the name, and this new method probably wouldn't have item in the name. I'd find that argument compelling if we were designing a new String, but the argument remains that UTF16 code unit indexing on Strings is already too entrenched, and not only will we not be moving the needle much by omitting item from String we'll also be hurting consistency with other indexables.

There's no future where item itself is deprecated without deprecating the rest of String indexing. IOW I don't see it ever being deprecated. You can, of course, continue to say "Don't index Strings at all", but for that you'd already have to enumerate all the ways Strings can be indexed, and the addition of item is unfortunate but relatively minor in that framing.

@hax
Copy link
Member

hax commented Sep 22, 2020

To be clear I'm not proposing "don't index Strings" but suggest not adding codepoint-unsafe methods anymore. We can add new methods which codepoint-safe but still codeunit-indexed. With these new methods, string is still indexable, just won't return isolated surrogate.

@syg
Copy link
Collaborator

syg commented Sep 22, 2020

@hax I'd welcome that as a separate proposal!

@ljharb
Copy link
Member

ljharb commented Sep 22, 2020

As was discussed during the meeting, "code-point-safe" is not, in fact, safe - it's just arguably safer than code units for many use cases.

The language doesn't have a truly safe (grapheme-cluster-based, i assume) mechanism for interacting with strings. Code points are not the way forward here.

@hax
Copy link
Member

hax commented Sep 22, 2020

It's a slippery slope that because it's not truly safe, so let's still add unsafe version.

And it's definitely safer than codepoint-unsafe version. Unsafe version could give u a corrupted string, causing many problem. For example split it in surrogate pair, encode these two string to utf8 and concat it, it's a invalid utf8... Grapheme-cluster is much high-level and don't have such issue.

@ExE-Boss
Copy link

I’m personally in favour of splitting %String.prototype.item% into a future proposal.

moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Sep 27, 2020
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this issue Sep 28, 2020
…pe.item in Nightly. r=anba

We don't implement String.prototype.item: tc39/proposal-relative-indexing-method#20

Differential Revision: https://phabricator.services.mozilla.com/D90732

UltraBlame original commit: f43cdaab86113c78acea1e51d858335a0e7cc013
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this issue Sep 28, 2020
…pe.item in Nightly. r=anba

We don't implement String.prototype.item: tc39/proposal-relative-indexing-method#20

Differential Revision: https://phabricator.services.mozilla.com/D90732

UltraBlame original commit: f43cdaab86113c78acea1e51d858335a0e7cc013
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this issue Sep 28, 2020
…pe.item in Nightly. r=anba

We don't implement String.prototype.item: tc39/proposal-relative-indexing-method#20

Differential Revision: https://phabricator.services.mozilla.com/D90732

UltraBlame original commit: f43cdaab86113c78acea1e51d858335a0e7cc013
sidvishnoi pushed a commit to sidvishnoi/gecko-webmonetization that referenced this issue Oct 1, 2020
@mathiasbynens
Copy link
Member

I'd also guess that an npm package adding String.prototype.item will become very popular if we decide to exclude it. So, its not clear to me how much developers will be spared by excluding String.prototype.item.

This is opposite to how we usually deal with proposals for new built-ins: we wait for userland patterns to emerge, and then based on their prevalence, attempt to standardize them. We shouldn't standardize APIs just because we think it would be popular on npm.

I think String.prototype.item does not pay for itself — it has massive potential to cause confusion (“does it index by code unit like str[n] or by code point like [...str][n] or by grapheme cluster like Intl.Segmenter”?), and it would best be excluded from the language. At the very least, String.prototype.item should be pursued as a separate proposal (because adding the method for Array and TypedArray is uncontroversial).

Instead of "an .item() method on all the built-in indexable classes", this proposal would then be "an .item() method on all the built-in array classes".

@syg
Copy link
Collaborator

syg commented Oct 1, 2020

I think String.prototype.item does not pay for itself — it has massive potential to cause confusion

Code unit indexing might be undesirable but I don't think in this case it's confusing because .item() is clearly aligned with how [] behaves on the receiver.

@zloirock
Copy link

Related this thread: #33 (comment)

@syg
Copy link
Collaborator

syg commented Nov 20, 2020

We have discussed this in TC39 and reached consensus to include the method on String.prototype.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests