Add isemoji function to Unicode stdlib and export it #38458

Closed
wants to merge 15 commits into from


archermarx
Contributor

@archermarx archermarx commented Nov 16, 2020

Resolves #38063

@archermarx archermarx changed the title Add isemoji function and export it Add isemoji function to Unicode stdlib and export it Nov 16, 2020
@archermarx archermarx marked this pull request as draft November 16, 2020 17:00
@archermarx archermarx marked this pull request as ready for review November 16, 2020 18:27
@archermarx
Contributor Author

I'm not sure why the tests are failing here. Looking at the buildbot results, the failures don't seem to have anything to do with the code I introduced. The Windows builds fail on a different test, and the Linux and Mac builds pass all tests but error at a later point due to something involving Python (I think?). Does anyone know what's going on?

archermarx and others added 2 commits November 16, 2020 15:25
Co-authored-by: Stefan Karpinski <stefan@karpinski.org>
Handle empty strings

Co-authored-by: Stefan Karpinski <stefan@karpinski.org>
0x02600:0x026FF, # Misc symbols
0x02700:0x027BF, # Dingbats
0x0FE00:0x0FE0F, # Variation Selectors
0x1F900:0x1F9FF, # Supplemental Symbols and Pictographs
Member

Should this be generated from https://unicode.org/Public/emoji/13.1/emoji-sequences.txt somehow? Or at least have a test that checks that everything from https://unicode.org/Public/emoji/13.1/emoji-sequences.txt is in this set?

Contributor Author

That's a good idea. I'll get to work on this. Is it ok to include a local copy of this file in the /Unicode/test/ directory?

Member

Can't we just download it during the test?

Member

We can but it may make CI flaky. Let's try that and see how it goes. If it's flaky we may need to cache a copy but that can be dealt with separately.

Contributor Author

I implemented y'all's suggestions, so now we download https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt from the Unicode website and use it to generate a const containing emoji ranges, which are then tested against. In the tests, we download the https://unicode.org/Public/emoji/13.1/emoji-sequences.txt and https://www.unicode.org/Public/emoji/13.1/emoji-zwj-sequences.txt files and use them to build a list of all valid emoji, which we then test the isemoji function against.
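The range-generation step described in this comment could look roughly like the sketch below. It is not the code from the PR: `parse_emoji_ranges` and `in_ranges` are hypothetical names, and the embedded `sample` is a short illustrative excerpt in the layout of emoji-data.txt (code point or `lo..hi` range, a semicolon, the property name, then a `#` comment) rather than the downloaded file.

```julia
# Hypothetical helper (not from the PR): parse text in the emoji-data.txt
# layout into code point ranges for one property.
function parse_emoji_ranges(text::AbstractString; property::AbstractString = "Emoji")
    ranges = UnitRange{UInt32}[]
    for line in split(text, '\n')
        # Drop the trailing comment; skip blank and comment-only lines.
        data = strip(first(split(line, '#'; limit = 2)))
        isempty(data) && continue
        fields = split(data, ';')
        length(fields) == 2 || continue
        strip(fields[2]) == property || continue
        # A field is either a single code point or "lo..hi", both in hex.
        bounds = split(strip(fields[1]), "..")
        lo = parse(UInt32, bounds[1]; base = 16)
        hi = parse(UInt32, bounds[end]; base = 16)
        push!(ranges, lo:hi)
    end
    return ranges
end

# Illustrative excerpt only; the real file is much longer.
sample = """
# emoji-data.txt (excerpt)
0023          ; Emoji                # number sign
0030..0039    ; Emoji                # digit zero..digit nine
1F600         ; Emoji                # grinning face
1F3FB..1F3FF  ; Emoji_Modifier       # skin tone modifiers
"""

emoji_ranges = parse_emoji_ranges(sample)

# Hypothetical membership check against the parsed ranges.
in_ranges(c::AbstractChar, ranges) = any(r -> codepoint(c) in r, ranges)
```

Note that the excerpt lists the ASCII digits under the `Emoji` property, so a predicate built directly on these ranges would accept `'5'`.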

archermarx and others added 5 commits November 16, 2020 15:33
Co-authored-by: Stefan Karpinski <stefan@karpinski.org>
The previous range list turned out not to be exhaustive. Now, we read the single emoji ranges directly from the unicode consortium website. Additionally added support for emojis that contain variation selectors ('\uFE0F')
@archermarx
Contributor Author

Something's broken now and I'm not sure why.

@@ -89,4 +89,86 @@ letter combined with an accent mark is a single grapheme.)
"""
graphemes(s::AbstractString) = Base.Unicode.GraphemeIterator{typeof(s)}(s)

const emoji_data = download("https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt")
Member

I don’t think we want to store this entire string in the compiled library. You should just download it when you parse the data, maybe?

Contributor Author

Yeah, that makes sense.

Member

Your old code that just looked at certain code point blocks was a lot more compact and independent of the Unicode version. I’m just not sure if it is standard conforming?

Could we just use the old code combined with checking the category code to see if the code point is assigned?

Member

(I still think we should use the data file, but only for tests.)

Contributor Author

@archermarx archermarx Nov 17, 2020

As I looked more into the emoji codepoints, I found some corner cases which the old version didn't catch. It would be possible to patch those, but the secondary issue was that there are a bunch of unassigned codepoints in the blocks used by the old version. If we're OK with saying "yes, this is an emoji" to characters in those larger blocks which are not (yet) emoji, then we can go with the old system. If not, we would need to restrict those ranges to the codepoints which are currently assigned, either manually or by parsing the emoji_data file like I do here.

I do agree the old version was a lot simpler, so I'm not sure what the best solution is.

Member

@stevengj stevengj Nov 17, 2020

I agree that we should return false for currently unassigned codepoints, but we can check for that simply by returning false if Unicode.category_code(char) == Unicode.UTF8PROC_CATEGORY_CN

(Or even be more restrictive: only allow category So or Sk.)
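A minimal sketch of this suggestion, combining the block ranges quoted earlier in the review with the unassigned-category check. The predicate name `in_assigned_block` is hypothetical, and the sketch reaches `category_code` and `UTF8PROC_CATEGORY_CN` through `Base.Unicode`, where they are unexported internals, so treat the exact spelling as an assumption about the Julia version in use.

```julia
# Block ranges quoted earlier in the review (not an exhaustive emoji list).
const CANDIDATE_BLOCKS = (
    0x02600:0x026FF,  # Misc symbols
    0x02700:0x027BF,  # Dingbats
    0x0FE00:0x0FE0F,  # Variation Selectors
    0x1F900:0x1F9FF,  # Supplemental Symbols and Pictographs
)

# Hypothetical predicate: the character lies in a candidate block and its
# general category is not Cn (unassigned).
function in_assigned_block(c::AbstractChar)
    cp = codepoint(c)
    any(r -> cp in r, CANDIDATE_BLOCKS) || return false
    return Base.Unicode.category_code(c) != Base.Unicode.UTF8PROC_CATEGORY_CN
end
```

Because the category lookup follows the Unicode tables bundled with Julia, the result for a recently assigned codepoint can differ between Julia versions.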

Contributor Author

Oh neat, didn't know about that. I'll get back with a new version later today.

Contributor Author

So looking at this, there are things in the emoji data file which fall under Unicode categories Sm and Po (◾ and ‼, respectively), and things in So which do not qualify as emoji (Ⓕ). To be honest, I'm not sure why some of these symbols get called emoji and some don't, but I think the only way to be complete here is to use the full list of ranges. We don't necessarily have to download it (we could take the output from the file and manually include that in the Unicode.jl file, but that's a 702-element array), but I'm not sure of another way to make sure we catch all emoji without false positives.
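The category mismatch for the three characters named in this comment can be checked with a short sketch. It relies on `Base.Unicode.category_abbrev`, an unexported helper, and the `abbrev` wrapper is a hypothetical name used only here.

```julia
# `category_abbrev` is an unexported helper in Base.Unicode.
abbrev(c::AbstractChar) = Base.Unicode.category_abbrev(c)

# The three characters named in the comment, written as escapes.
examples = [
    ('\u25fe', "listed in the emoji data file"),
    ('\u203c', "listed in the emoji data file"),
    ('\u24bb', "not listed as an emoji"),
]

for (c, note) in examples
    println(c, "  ", abbrev(c), "  ", note)
end
```

A filter on category So or Sk alone would therefore reject the first two and accept the third.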

@stevengj
Member

What is the isemoji function good for, especially if the set of "emoji" characters is a bit fuzzy and doesn't include all dingbats?

@StefanKarpinski
Member

Another question: suppose we had a good definition of a single character being an emoji; given that, what is the utility of isemoji on a string? Why would one want to know that a string only consists of emoji? Wouldn't it be more useful to ask if any of the characters in a string are emoji or not? The zero-width joiner issue complicates things, but it's a little unclear to me what the right predicate should be. Most questions you could ask about a string could presumably be answered by calling graphemes and then applying a predicate to individual graphemes. Maybe that's why this PR's predicate makes sense: it's only intended to be applied to single-grapheme strings?

@archermarx
Contributor Author

After some thought and retooling yesterday, I agree. The function is most useful for identifying a single emoji, not for checking whether a string is all emoji. I have a new version I'll upload in a bit that does that well and catches all the emoji in the dataset. We could separately have a contains_emoji function. My use case for this has been parsing text and compiling statistics of emoji usage in large corpora of internet conversation, so I think splitting out individual emoji makes far more sense than just saying "everything in this string is an emoji".

The remaining question is how we want to do the check for single characters. The most effective way is to parse emoji_data.txt, but do we want to download it? Have a cached copy? A pre-made list of ranges in a separate file? I think the last would be the most space-efficient; it would also remove the need for emoji file parsing utilities in the main body of the Unicode module and let them live in the tests instead.

@StefanKarpinski
Member

The problem with only having isemoji as a character predicate would seem to be that some emoji are not a single character but a whole grapheme, so if you want to count them you probably want to count such a sequence as a single emoji, right? One way to do this would be to call graphemes to iterate the graphemes in a string and then have a predicate that can be applied to each grapheme string to indicate whether it is an emoji grapheme or not. Of course, that's not the only way that predicate can be called, so the question is what is the most useful generalization from a predicate like that to arbitrary strings?

@archermarx
Contributor Author

archermarx commented Nov 18, 2020

I have a version (to be uploaded shortly) that has isemoji as a character predicate and a string predicate, so that we get this behavior

julia> teststring = "My family looks like this: 👩‍👩‍👦‍👦 and I ❤️ them"

julia> isemoji(teststring)
false

julia> filter(isemoji, [g for g in graphemes(teststring)])
2-element Array{SubString{String},1}:
 "👩\u200d👩\u200d👦\u200d👦"
 "❤️"

julia> isemoji("😈😘")
false

julia> isemoji('👦')
true

julia> isemoji("")
false

This seems to me the most intuitive way to address the question.

@archermarx
Contributor Author

archermarx commented Nov 18, 2020

Ok, I've uploaded my new version. I'm not sure if the graphemes function is working wholly correctly. For instance,

julia> [g for g in graphemes("👨🏻‍🤝‍👨🏽")]
1-element Array{SubString{String},1}:
 "👨🏻\u200d"

julia> [c for c in "👨🏻‍🤝‍👨🏽"]
7-element Array{Char,1}:
 '👨': Unicode U+1F468 (category So: Symbol, other)
 '🏻': Unicode U+1F3FB (category Sk: Symbol, modifier)
 '\u200d': Unicode U+200D (category Cf: Other, format)
 '🤝': Unicode U+1F91D (category So: Symbol, other)
 '\u200d': Unicode U+200D (category Cf: Other, format)
 '👨': Unicode U+1F468 (category So: Symbol, other)
 '🏽': Unicode U+1F3FD (category Sk: Symbol, modifier)

The python graphemes package does catch this one correctly

>>> import grapheme
>>> str = "👨🏻‍🤝‍👨🏽"
>>> list(grapheme.graphemes(str))
['👨🏻\u200d🤝\u200d👨🏽']

EDIT:

Double regional indicators seem not to work either:

julia> [g for g in graphemes("🇸🇪🇸🇪")]
2-element Array{SubString{String},1}:
 "🇸🇪🇸"
 "🇪"

@archermarx
Contributor Author

The problem with the graphemes seems to be related to #37680, and stems from a bug in the utf8proc library.

@StefanKarpinski
Member

I think the new behavior makes sense but now I worry about the name: "emoji" is both plural and singular, so it might be surprising to someone that this function considers a string that consists of two emoji to not be emoji. So perhaps this predicate should be called isoneemoji? Or perhaps it would be better to provide some kind of find-like interface instead, such as findemoji(s::AbstractString, i::Integer), which returns nothing if there are no emoji in the given string or a pair of indices if there are?
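One possible reading of the find-like interface is sketched below. Nothing here comes from the PR: `findemoji_sketch`, `looks_like_emoji`, and `SKETCH_RANGES` are hypothetical, the predicate is a crude stand-in that only inspects the first code point of each grapheme, and returning `start => stop` as the "pair of indices" is an assumption about what was intended.

```julia
using Unicode  # stdlib; provides `graphemes`

# Illustrative stand-in predicate (an assumption, not a conforming emoji test).
const SKETCH_RANGES = (0x02600:0x027BF, 0x1F300:0x1FAFF)
looks_like_emoji(g::AbstractString) =
    !isempty(g) && any(r -> codepoint(first(g)) in r, SKETCH_RANGES)

# Hypothetical `findemoji`: search `s` from character index `i` and return
# `start => stop`, the indices of the first and last character of the first
# emoji grapheme found, or `nothing` if there is none. Assumes `i` is a valid
# character index that does not fall inside a grapheme.
function findemoji_sketch(s::AbstractString, i::Integer = firstindex(s))
    i > lastindex(s) && return nothing
    offset = i - 1  # code units of `s` that precede the current grapheme
    for g in graphemes(SubString(s, i))
        if looks_like_emoji(g)
            return (offset + 1) => (offset + lastindex(g))
        end
        offset += ncodeunits(g)
    end
    return nothing
end

s = "ab \u2600 cd \u2602"
first_hit = findemoji_sketch(s)
second_hit = findemoji_sketch(s, nextind(s, last(first_hit)))
```

Calling it again from the index after the previous hit walks through the emoji in a string one at a time, similar in spirit to `findnext`.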

@stevengj
Member

stevengj commented Nov 30, 2020

My basic concern here is about adding a nontrivial new function to the stdlib without clear explanation of what it is for. What is the application of this function?

@StefanKarpinski
Member

I think the application is counting emoji in text. You would do it by calling graphemes and then calling isemoji on each grapheme. It seems like this might be better suited to a package which can evolve its API as necessary.
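The counting workflow described here can be sketched as follows. The predicate `is_emoji_grapheme` is a hypothetical stand-in for the proposed isemoji (it checks only the first code point against two illustrative ranges), and `emoji_counts` is likewise a made-up name.

```julia
using Unicode  # stdlib; provides `graphemes`

# Hypothetical stand-in for `isemoji`: a grapheme counts as an emoji if its
# first code point lies in one of a few illustrative ranges.
const DEMO_RANGES = (0x02600:0x027BF, 0x1F300:0x1FAFF)
is_emoji_grapheme(g::AbstractString) =
    !isempty(g) && any(r -> codepoint(first(g)) in r, DEMO_RANGES)

# Tally emoji by iterating graphemes and applying the predicate to each one.
function emoji_counts(text::AbstractString)
    counts = Dict{String,Int}()
    for g in graphemes(text)
        is_emoji_grapheme(g) || continue
        key = String(g)
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

counts = emoji_counts("sun \u2600 and rain \u2602, then sun \u2600 again")
```

Multi-codepoint emoji are only counted as one unit when `graphemes` segments them correctly, which is the limitation raised earlier in the thread.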

@archermarx
Contributor Author

Yeah my main application for this has been parsing and analyzing message data to determine things like emoji frequencies, usage patterns etc. I have also been thinking a package might be easier for this at least for now.

@StefanKarpinski StefanKarpinski marked this pull request as draft December 3, 2020 17:12
@vtjnash
Member

vtjnash commented Apr 16, 2021

Closing, per apparent consensus among participants that this should be a separate package.

@vtjnash vtjnash closed this Apr 16, 2021
Development

Successfully merging this pull request may close these issues.

Feature request: isemoji function
4 participants