Some emoji are not rendered as images #548

Open
CanePlayz opened this issue Apr 12, 2021 · 14 comments

Comments

@CanePlayz
Contributor

CanePlayz commented Apr 12, 2021

Some emojis aren't rendered in Discord's style as they are in Discord itself, but in the Windows style instead. (I assume the style depends on the system you're on; for me, in Chromium Edge on Windows, it's displayed like that.)
Instead of:
image
It's shown like that:
image

@CanePlayz CanePlayz changed the title Show all emotes in the Discord-sytle Show all emojis in the Discord-sytle Apr 12, 2021
@Tyrrrz Tyrrrz changed the title Show all emojis in the Discord-sytle Some emojis are not rendered as images Apr 12, 2021
@96-LB
Contributor

96-LB commented Apr 18, 2021

// Capture any country flag emoji (two regional indicator surrogate pairs)
// ... or "miscellaneous symbol" character
// ... or surrogate pair
// ... or digit followed by enclosing mark
// (this does not match all emojis in Discord but it's reasonably accurate enough)
private static readonly IMatcher<MarkdownNode> StandardEmojiNodeMatcher = new RegexMatcher<MarkdownNode>(
    new Regex("((?:[\\uD83C][\\uDDE6-\\uDDFF]){2}|[\\u2600-\\u26FF]|\\p{Cs}{2}|\\d\\p{Me})",
        DefaultRegexOptions),
    m => new EmojiNode(m.Groups[1].Value)
);

The current emoji matcher appears to be very crude. Unfortunately, it turns out that emoji codepoints are a mess. The Unicode values for Discord's current version of emoji can be found here. The most glaring issue with the current matcher seems to be the omission of all the emoji that fit in a single UTF-16 code unit, i.e. need no surrogate pair (except those from \u2600 to \u26FF). Other discrepancies include:

  • Specifically excluding ♂️, ♀️, and ♾️ from being rendered in IgnoredEmojiTextNodeMatcher (I believe Discord used to skip these but the current version seems to have images for them)
  • False positives for \u26** non-emoji such as ☭ or ✃
  • Attempting to render surrogate pairs that don't map to a valid Unicode code point

I'm in favor of some support being added for the missed single-code-unit emoji since there are quite a few of those, but it looks like it'd be very difficult to get an entirely accurate mapping. It really doesn't help that Unicode's official files are quite messy (and the one I linked isn't even ordered fully by codepoint despite claiming that it is). I think at this point it's a matter of deciding how much is worth the effort. The first bullet point above is trivial to correct, but the second and third are both a significant amount of hardcoding to fix a relatively small issue.
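
To make the discrepancies above concrete, here is a rough JavaScript approximation of the quoted pattern. It is only a sketch, not the project's code: `\p{Cs}{2}` is assumed to be equivalent to any two UTF-16 surrogate code units, `\p{Me}` is narrowed to the enclosing marks in U+20DD..U+20E4, and `\d` matches only ASCII digits here, whereas .NET also accepts other Unicode digits. The names `approxEmojiPattern` and `findMatches` are made up for illustration.

```javascript
// Approximate port of the StandardEmojiNodeMatcher regex quoted above.
// Assumptions: \p{Cs}{2} -> two surrogate code units; \p{Me} -> U+20DD..U+20E0, U+20E2..U+20E4.
const approxEmojiPattern =
  /(?:\uD83C[\uDDE6-\uDDFF]){2}|[\u2600-\u26FF]|[\uD800-\uDFFF]{2}|\d[\u20DD-\u20E0\u20E2-\u20E4]/g;

// Returns every substring the approximate pattern would treat as an emoji.
function findMatches(text) {
  return text.match(approxEmojiPattern) || [];
}

const samples = {
  fastForward: "\u23E9",          // ⏩ one code unit, outside U+2600..U+26FF
  hammerAndSickle: "\u262D",      // ☭ not an emoji per the list above, inside U+2600..U+26FF
  wink: "\uD83D\uDE09",           // 😉 a valid surrogate pair
  loneSurrogates: "\uD800\uD800", // two high surrogates, not a valid code point
};

for (const [name, text] of Object.entries(samples)) {
  console.log(name, findMatches(text).length > 0 ? "matched" : "missed");
}
```

With these samples the sketch reports `fastForward` as missed and the other three as matched, which lines up with the omission, the false positive, and the invalid-pair case described above.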

@Tyrrrz
Owner

Tyrrrz commented Apr 18, 2021

Yes, that matcher used to be much more greedy, which resulted in lots of false matches. The export would show broken images for random unicode sequences that the parser assumed were emoji but actually weren't.

I believe Discord used to skip these but the current version seems to have images for them

That's probably the case.

I'm in favor of some support being added for the missed single-code-unit emoji since there are quite a few of those, but it looks like it'd be very difficult to get an entirely accurate mapping.

I was unable to come up with a reliable way to match emoji in the same way that Discord does. It's worth noting that Discord's emoji support is particularly extensive: for example, it supports compound emoji, where multiple separate emoji combine to render a different one (e.g. the emoji for man, woman, and child render together as the emoji for family).

Overall I just thought it wasn't worth the effort because not getting some emoji was not as bad as incorrectly matching unrelated text and breaking the export. After all, the emoji still gets rendered, just not using Twitter's/Discord's image set.

@Tyrrrz
Owner

Tyrrrz commented Jun 19, 2021

With #549 and #599 closed, we have the EmojiIndex class with the emoji mappings provided by @96-LB and @CanePlayz. We might be able to use this here too, although it's a bit more complicated. The most naive approach would be to use our list of emoji (which has over 4000 items in it) to check every series of characters we encounter, prioritizing from longest to shortest. This is obviously not very efficient and we need to find a better way to do it. One potential idea: relax the existing regex pattern to match surrogate pairs and then refine the match using EmojiIndex. Open to ideas.
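
The longest-to-shortest lookup described above could be sketched roughly as follows, in JavaScript for brevity. This is an illustration under assumptions, not the project's implementation: `knownEmoji` is a tiny hypothetical stand-in for the real EmojiIndex list, and `tokenize` is a made-up name. A set of first code units is used as a cheap pre-filter so that most positions in ordinary text are rejected without trying any candidates.

```javascript
// Hypothetical stand-in for EmojiIndex: a handful of entries, not the real list.
const knownEmoji = new Set([
  "\u23E9",                                  // ⏩
  "\u{1F468}",                               // 👨
  "\u{1F469}",                               // 👩
  "\u{1F467}",                               // 👧
  "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}", // 👨‍👩‍👧 (family, joined with U+200D)
]);

// The longest entry (in UTF-16 code units) bounds how far ahead we look.
const maxLength = Math.max(...[...knownEmoji].map((e) => e.length));

// First code units of all entries, so most positions are rejected immediately.
const firstUnits = new Set([...knownEmoji].map((e) => e[0]));

// Splits text into { type: "text" | "emoji", value } runs,
// trying the longest candidate first at each position.
function tokenize(text) {
  const tokens = [];
  let buffer = "";
  let i = 0;
  while (i < text.length) {
    let matched = null;
    if (firstUnits.has(text[i])) {
      for (let len = Math.min(maxLength, text.length - i); len > 0; len--) {
        const candidate = text.slice(i, i + len);
        if (knownEmoji.has(candidate)) {
          matched = candidate;
          break;
        }
      }
    }
    if (matched) {
      if (buffer) tokens.push({ type: "text", value: buffer });
      buffer = "";
      tokens.push({ type: "emoji", value: matched });
      i += matched.length;
    } else {
      buffer += text[i];
      i++;
    }
  }
  if (buffer) tokens.push({ type: "text", value: buffer });
  return tokens;
}

console.log(tokenize("hi \u{1F468}\u200D\u{1F469}\u200D\u{1F467}!"));
```

Because the longest candidate is tried first, the family sequence comes out as one emoji token rather than three. Text that is not in the set, including unknown surrogate pairs, passes through unchanged as plain text, so nothing is turned into a broken image. A trie keyed on code units would avoid the repeated slicing if this turned out to be too slow.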

@Tyrrrz Tyrrrz changed the title Some emojis are not rendered as images Some emoji are not rendered as images Jun 21, 2021
@CanePlayz
Contributor Author

CanePlayz commented Jan 28, 2023

Since this issue has recently popped up again:

I doubt it could be useful since this is a JS library, but I think this concept could technically be what we need: https://www.npmjs.com/package/twemoji-parser?activeTab=readme

It basically checks a series of characters for emojis and returns the emojis it finds, along with links to the Twemoji CDN.

Here's an example:

This

const { parse } = require('twemoji-parser');
const entities = parse('I am ☺️. A bit more complicated one: 👩🏿‍🤝‍👨🏻');
console.log(entities);

returns

[
  {
    url: 'https://twemoji.maxcdn.com/v/latest/svg/263a.svg',
    indices: [ 5, 7 ],
    text: '☺️',
    type: 'emoji'
  },
  {
    url: 'https://twemoji.maxcdn.com/v/latest/svg/1f469-1f3ff-200d-1f91d-200d-1f468-1f3fb.svg',
    indices: [ 37, 49 ],
    text: '👩🏿‍🤝‍👨🏻',
    type: 'emoji'
  }
]
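
The file names in those URLs look like the emoji's code points in lowercase hex joined by hyphens. A small sketch of that mapping follows. The rule that U+FE0F is dropped when the sequence has no zero-width joiner is an assumption inferred from `263a.svg` above, not something confirmed here, and `toTwemojiName` is a hypothetical helper rather than part of twemoji-parser's API.

```javascript
const ZWJ = "\u200D";           // zero-width joiner
const VARIATION_16 = /\uFE0F/g; // emoji presentation selector

// Builds a Twemoji-style asset name from an emoji string.
// Assumption: U+FE0F is stripped unless the sequence contains a ZWJ.
function toTwemojiName(emoji) {
  const normalized = emoji.includes(ZWJ) ? emoji : emoji.replace(VARIATION_16, "");
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(normalized, (ch) => ch.codePointAt(0).toString(16)).join("-");
}

const text = "I am \u263A\uFE0F. A bit more complicated one: " +
  "\u{1F469}\u{1F3FF}\u200D\u{1F91D}\u200D\u{1F468}\u{1F3FB}";

// The indices in the output above are UTF-16 code unit offsets into the input.
console.log(toTwemojiName(text.slice(5, 7)));
console.log(toTwemojiName(text.slice(37, 49)));
```

For the two slices this produces `263a` and `1f469-1f3ff-200d-1f91d-200d-1f468-1f3fb`, matching the URLs in the output above.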

Oh wait, I just noticed that this is probably the same as

The most naive approach would be to use our list of emoji (which has over 4000 items in it) to check every series of characters we encounter, prioritizing from longest to shortest.

I will send this comment anyway, no harm in doing so. 😉

There is also this database https://emojibase.dev/docs/datasets, but I think it just contains what a version of our emoji list updated to Emoji 14.0 would contain.

Lastly, it's also worth noting that our emoji list needs an update as well, since it's missing emojis from Emoji 14.0 and 15.0, which will soon be released on Discord. This would also prevent #599 from becoming an issue again. (There seems to be a problem with that anyway, but that needs a bit more investigation on my side.) Getting that list or updating the current one probably won't be as easy as last time: I couldn't find the emoji.json in the new Discord APKs. This seems like a good alternative; it's up to date with Emoji 14.0 and just formatted a bit differently than our list.

@Tyrrrz
Owner

Tyrrrz commented Feb 14, 2023

@CanePlayz was this issue fixed by your earlier PR, by any chance?

Also, if you can update the emoji list, that would be nice too :)

@CanePlayz
Contributor Author

@CanePlayz was this issue fixed by your earlier PR, by any chance?

Unfortunately not, it just fixed false positives. Emoji codepoints for some emojis are just too complex for our regex. For example these two:

image

image

render as this:

image

image
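
The images are not available here, so as a hypothetical example of why such sequences trip up a pair-based pattern, this sketch takes a family emoji, lists its code points, and then counts what a surrogate-pair-only regex finds. The names `codePoints` and `pairOnly` are made up for illustration.

```javascript
// Lists the code points of a string as U+XXXX labels.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

// 👨‍👩‍👧: three emoji glued together by zero-width joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(codePoints(family));

// A pattern that only knows about surrogate pairs sees three separate emoji
// and leaves the joiners between them behind as plain text.
const pairOnly = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
console.log(family.match(pairOnly).length);
```

The sequence is five code points long (man, joiner, woman, joiner, girl), and the pair-only pattern reports three matches, so each person would be rendered as a separate image instead of one combined emoji.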

Also, if you can update the emoji list, that would be nice too :)

Yeah, I will do that in the coming days.

@Tyrrrz
Owner

Tyrrrz commented Feb 14, 2023

Ah yes, the compound emoji are not handled properly. But in your original post, you pointed out that "⏩" didn't render as an emoji – is that not fixed now? Along with other cases pointed out by @96-LB.

As for the compound emoji (skin color, family) not working, I think we can just let it be. Unless there's an easy way to fix it that I missed somehow.

@CanePlayz
Contributor Author

CanePlayz commented Feb 14, 2023

Ah yes, the compound emoji are not handled properly. But in your original post, you pointed out that "⏩" didn't render as an emoji – is that not fixed now? Along with other cases pointed out by @96-LB.

No, unfortunately, it's not. And unless I'm missing something integral, I don't know why it should be. My pull request just narrowed down which Unicode sequences we render as emojis. But all the ones that are now rendered were previously rendered as well.

As for the compound emoji (skin color, family) not working, I think we can just let it be. Unless there's an easy way to fix it that I missed somehow.

Yeah, if it's too complex that's totally fine.

@Tyrrrz
Owner

Tyrrrz commented Feb 14, 2023

No, unfortunately, it's not. And unless I'm missing something integral, I don't know why it should be. My pull request just narrowed down which Unicode sequences we render as emojis. But all the ones that are now rendered were previously rendered as well.

Yeah it makes sense. I just thought I'd ask to make sure.

@CanePlayz
Contributor Author

Also, if you can update the emoji list, that would be nice too :)

Turns out that's a bit more complex since not only does the APK not have that file anymore, but the more up-to-date list I provided earlier is not complete either. I will continue looking for alternatives though.

@CanePlayz
Contributor Author

It will take some time but I will add them manually. 👍

@Tyrrrz
Owner

Tyrrrz commented Feb 15, 2023

No worries

@Tyrrrz
Owner

Tyrrrz commented Aug 5, 2023

Is this still relevant? 🙂

@CanePlayz
Contributor Author

Is this still relevant? 🙂

I guess so, but if the emojis that consist of multiple Unicode code points are too complex to be caught by our regex, it's fine if you close it.
