item on Strings has questionable value #20
Comments
That's my preference, FWIW. Most of the reasons @tabatkins gives for not wanting strings to be iterable in this essay apply here as well. |
I find it quite important to provide this on strings, with the current semantics. Strings are treated in the spec as an indexed collection type just like all the others - iterable, arraylike, bracketable - and it would be a confusing inconsistency to have a non-string method that appears on all the other iterable arraylikes. |
Well, fortunately, we can and should omit it for |
Arguments is a special case because it lacks a prototype. |
It is one of only four iterable arraylikes in the language. The claim that "it would be a confusing inconsistency to have a non-string method that appears on all the other iterable arraylikes" gets pretty weak if you are implicitly excluding one of them. In any case, if we're excluding |
I am fine with leaving I find Tab's essay persuasive and think there are good reasons to not have strings be collections. But as it stands, UTF16 code unit indexing via array-like operations is already a thing on Strings in more ways than just being iterable. Pragmatically I think I'd feel differently if our code units were, say, UTF8. Overall, my takeaway is that if need be, I'd rather split out |
I think slice is definitely the thing motivating my thinking here. I don’t think code points makes more sense than code units given the whole grapheme clusters debate. |
I don't have a strong preference on whether this goes on Strings or not; that is not the primary purpose of this proposal, just a nice-to-have consistency. That said, I do on occasion index strings, and in particular, checking the last character in a string isn't too uncommon - I have a couple of examples of it in my code. And in general I think it's a good thing for languages to have consistency across similar things, to reduce learning burden. (" I don't think my opinions on why Strings shouldn't be iterable have any bearing here. First, "indexable" is different than "iterable" (tho parts of my essay certainly still transfer over); second, they're already indexable in JS, so that argument is already lost, and we should prefer consistency with other similar types. I am strongly opposed to keeping it in but making it refer to code points; that again breaks the consistency with other types, where |
FWIW, I don't think having
The argument isn't "that doesn't seem useful", it's "every time we expose another place where it's easy to accidentally end up dealing with UTF-16 code units rather than Unicode code points we are building a footgun, and we should avoid doing that without a good reason". In particular, if we expect people to use this method for looking at the last character in a string, that's a very strong reason not to add it with code unit semantics, because that will break as soon as they start dealing with strings which include non-BMP characters - exactly the sort of bug which is hard to catch in testing and then ends up screwing over production users from non-European cultures (and also people who like emoji). |
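To make the concern concrete, here is a minimal sketch of what "the last character" looks like under code unit indexing versus code point iteration. It uses only plain bracket indexing and the built-in string iterator, not the proposed method; the sample string and variable names are illustrative assumptions.

```javascript
// "😀" (U+1F600) is outside the BMP, so it occupies two UTF-16 code units.
const text = "hi😀";

// Code unit view: length counts code units, not characters.
const lengthInCodeUnits = text.length;

// Indexing the last code unit yields a lone low surrogate, not the emoji.
const lastCodeUnit = text[text.length - 1];
const unitValue = lastCodeUnit.charCodeAt(0);
const isLoneSurrogate = unitValue >= 0xdc00 && unitValue <= 0xdfff;

// Code point view: the string iterator walks whole code points.
const codePoints = Array.from(text);
const lastCodePoint = codePoints[codePoints.length - 1];

console.log(lengthInCodeUnits, isLoneSurrogate, lastCodePoint === "😀");
// prints: 4 true true
```

A test suite that only uses BMP text would never hit the surrogate case, which is why this kind of bug tends to surface late.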
Ooof, I actually wasn't aware of this. I mostly just use the method to get the codepoint of weird characters I copy-paste from a page, so I only ever use
Yeah, that's a reasonable concern. Well, I do still object to |
Well, I guess the empirical question is: is UTF16 code unit indexing well entrenched to the point where it may not be accidental use? |
Wouldn't code point indexing have to be O(n) indexing for a non-utf32 string? That seems unexpected for a generic indexed accessor into a string, which is my understanding of what we want this to be. If we don't think utf16 code unit indexing is acceptable, I'd rather we don't do I'd also guess that an npm package adding |
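The complexity point can be sketched as follows. Because a UTF-16 string has no table of code point offsets, a code point index has to be found by walking from the start. The helper name `codePointAtIndex` is hypothetical and only illustrates the cost; it is not part of the proposal.

```javascript
// Hypothetical helper: returns the code point (as a string) at a
// zero-based code point index, or undefined if out of range.
function codePointAtIndex(str, index) {
  let i = 0;
  // The string iterator yields whole code points, one step at a time,
  // so reaching position `index` takes O(index) work.
  for (const ch of str) {
    if (i === index) return ch;
    i++;
  }
  return undefined;
}

const sample = "a😀b";
console.log(codePointAtIndex(sample, 2)); // prints: b
console.log(sample[2] === "b");           // prints: false (index 2 is a low surrogate)
```

An engine could cache offsets to soften this, but that is an implementation assumption, not something the thread relies on.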
I'd take the other side of that bet. |
I still strongly favor leaving out
If codepoint indexing is desired (and the current hack of "spread into an Array, then index" isn't sufficient), we should introduce (in a separate proposal) a |
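For reference, the "spread into an Array, then index" hack mentioned above might look like this when combined with relative (negative) indices. The function name `codePointItem` is a hypothetical illustration, and the sketch assumes an integer index.

```javascript
// Hypothetical helper: relative indexing over code points using the spread hack.
function codePointItem(str, n) {
  const points = [...str];                 // one array entry per code point
  const k = n < 0 ? points.length + n : n; // negative n counts from the end
  return points[k];                        // undefined when out of range
}

console.log(codePointItem("abc😀", -1) === "😀"); // prints: true
console.log(codePointItem("abc😀", -2));          // prints: c
```

This allocates a full array on every call, which is part of why it reads as a hack rather than a real accessor.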
I brought up the "this should be code points" idea in Mozilla discussion, with the idea that at least it should be a deliberate decision as to whether it should be units or points. Time was short, so we didn't dive into it terribly deeply. On slightly more thinking by me, tho -- independent of the discussion here, not that it matters -- the overlapping problems of return type (exposing UTF-16 guts more places is not good if you return code units), nature of the input index (if it returns code points), and algorithmic complexity (if it returns code points) seem to motivate to me not having |
In today's meeting, this proposal already achieved stage 3 with Personally I really hope we could advance the proposal without string |
Yeah, maybe split them into two proposals |
I'm not sure what new information could be surfaced here that would better inform the debate: it comes down to "indexing is already a thing + consistency" vs "UTF16 code unit indexing considered harmful". There are telegraphed possible blockers on both sides, and so we chose one. Edit: To be clear, there were strongish feelings for both including it and not including (the "not including" option includes splitting into two proposals). |
@syg I think we could still check other options, like a codepoint-safe version of |
I don't think that there's more information about the proposal itself to be surfaced, but more time for discussion would allow delegates who did not have an opinion to form one and for those who didn't express an opinion to do so. I really dislike that "telegraphing a blocker" effectively means you can get whatever outcome you like. If it turns out that only a couple people (yourself, Jordan, Waldemar) preferred adding it, and everyone else preferred not adding it, but the process ends up adding it anyway because Jordan is more willing to imply he'll block a popular proposal if it doesn't include a part he wants, I think that's a bad outcome. And I think that might be pretty much what happened. But we can't tell because most delegates didn't weigh in, because the timebox was tight. |
@hax No, I am strongly against any version of @bakkot Agreed on our process being bad for situations like this. I really dislike the possibility of lone objectors at all, but this isn't the right thread for that. I want viable escalation paths beyond "appeasing or laborious efforts for people who are willing to block". Without alternatives, those are the tools our consensus system gives us. (FWIW in this particular case, Mark Miller also expressed strong feelings preferring inclusion.) |
@syg If we add such a method in the future, for example named |
@hax There's no future where |
To be clear, I'm not proposing "don't index Strings"; I'm suggesting we stop adding codepoint-unsafe methods. We can add new methods which are codepoint-safe but still code-unit-indexed. With these new methods, strings are still indexable; they just won't return an isolated surrogate. |
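One possible reading of "codepoint-safe but still code-unit-indexed" is sketched below. The name `codePointSafeAt` and the rule of stepping back from a low surrogate to the start of its pair are assumptions made for illustration; the thread does not specify either.

```javascript
// Hypothetical helper: takes a (possibly negative) code unit index,
// but returns the whole code point that covers that position.
function codePointSafeAt(str, codeUnitIndex) {
  const len = str.length;
  let k = codeUnitIndex < 0 ? len + codeUnitIndex : codeUnitIndex;
  if (k < 0 || k >= len) return undefined;

  const unit = str.charCodeAt(k);
  const isLowSurrogate = unit >= 0xdc00 && unit <= 0xdfff;
  if (isLowSurrogate && k > 0) {
    const prev = str.charCodeAt(k - 1);
    // If this low surrogate completes a pair, step back to the pair's start.
    if (prev >= 0xd800 && prev <= 0xdbff) k -= 1;
  }
  // codePointAt combines a valid pair starting at k into one code point.
  // A surrogate that is already unpaired in the input is returned as-is.
  return String.fromCodePoint(str.codePointAt(k));
}

console.log(codePointSafeAt("hi😀", -1) === "😀"); // prints: true
console.log(codePointSafeAt("hi😀", 1));           // prints: i
```

Indexing stays O(1) because the index is still in code units; only the returned value changes.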
@hax I'd welcome that as a separate proposal! |
As was discussed during the meeting, "code-point-safe" is not, in fact, safe - it's just arguably safer than code units for many use cases. The language doesn't have a truly safe (grapheme-cluster-based, i assume) mechanism for interacting with strings. Code points are not the way forward here. |
It's a slippery slope to argue that because it's not truly safe, we should still add the unsafe version. And it's definitely safer than the codepoint-unsafe version. The unsafe version could give you a corrupted string, causing many problems. For example, split a string in the middle of a surrogate pair, encode the two resulting strings to UTF-8, and concatenate them: the result is invalid UTF-8... Grapheme clusters are a much higher-level concept and don't have this issue. |
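The surrogate-pair split described above can be reproduced with the standard `TextEncoder`, assuming an environment that provides it as a global (browsers and recent Node.js do). One caveat: `TextEncoder` replaces each lone surrogate with U+FFFD, so the output here is well-formed UTF-8 that no longer decodes to the original text; an encoder that passed lone surrogates through unchanged would instead emit invalid bytes.

```javascript
const emoji = "😀";              // U+1F600, two UTF-16 code units
const head = emoji.slice(0, 1);  // lone high surrogate
const tail = emoji.slice(1);     // lone low surrogate

const encoder = new TextEncoder();

// Encoding the intact string gives the 4-byte UTF-8 sequence for U+1F600.
const whole = Array.from(encoder.encode(emoji));

// Encoding the halves separately turns each lone surrogate into U+FFFD
// (bytes EF BF BD), so concatenating the byte arrays loses the emoji.
const pieces = [...encoder.encode(head), ...encoder.encode(tail)];

console.log(whole.map((b) => b.toString(16)).join(" "));  // prints: f0 9f 98 80
console.log(pieces.map((b) => b.toString(16)).join(" ")); // prints: ef bf bd ef bf bd
```

Concatenating the two halves as JavaScript strings before encoding would restore the pair, so the damage only becomes permanent once each piece crosses an encoding boundary on its own.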
I’m personally in favour of splitting |
…pe.item in Nightly. r=anba We don't implement String.prototype.item: tc39/proposal-relative-indexing-method#20 Differential Revision: https://phabricator.services.mozilla.com/D90732
…pe.item in Nightly. r=anba We don't implement String.prototype.item: tc39/proposal-relative-indexing-method#20 Differential Revision: https://phabricator.services.mozilla.com/D90732 UltraBlame original commit: f43cdaab86113c78acea1e51d858335a0e7cc013
This is opposite to how we usually deal with proposals for new built-ins: we wait for userland patterns to emerge, and then based on their prevalence, attempt to standardize them. We shouldn't standardize APIs just because we think it would be popular on npm. I think Instead of "an |
Code unit indexing might be undesirable but I don't think in this case it's confusing because |
Related this thread: #33 (comment) |
We have discussed this in TC39 and reached consensus to include the method on |
I get that strings expose an array-like interface to their code units and that this proposal is supposed to have some sort of simple desugaring as an indexing operation that accounts for negative indices. But given how easy it is to [].item.call("string", codeUnitIndex), I think the more useful default would be code point based. In my opinion, any of item on Strings at all would all be preferable to "".item being basically identical to [].item.
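The point about borrowing the array method can be illustrated with a stand-in. The `item` function below is a hypothetical approximation of a generic relative-indexing method (integer indices only), not the proposal's specified algorithm; it reads only `length` and indexed properties from `this`, which is why it works on any array-like, strings included.

```javascript
// Hypothetical stand-in for a generic relative-indexing method.
function item(n) {
  const len = this.length;
  const k = n < 0 ? len + n : n; // negative n counts from the end
  return k < 0 || k >= len ? undefined : this[k];
}

console.log(item.call([1, 2, 3], -1));  // prints: 3
console.log(item.call("string", -1));   // prints: g

// On a string, the borrowed method indexes code units, so a trailing
// non-BMP character comes back as a lone surrogate rather than itself.
console.log(item.call("hi😀", -1) === "😀"); // prints: false
```

Since the array version already gives code unit semantics on strings for free, a dedicated string method with identical behaviour adds little beyond convenience.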