janlelis/unicode-emoji

Unicode::Emoji

Provides regular expressions to find Emoji in strings, incorporating the latest Unicode and Emoji standards.

Additional features:

  • A categorized list of recommended Emoji
  • Retrieve Emoji property data for specific codepoints (Emoji_Modifier, Emoji_Presentation, etc.)

Emoji version: 16.0 (September 2024)

CLDR version (used for sub-region flags): 45 (April 2024)

Gemfile

gem "unicode-emoji"

Usage – Regex Matching

The gem includes multiple Emoji regexes, which are compiled out of various Emoji Unicode data sources.

require "unicode/emoji"

string = "String which contains all kinds of emoji:

- Singleton Emoji: 😴
- Textual singleton Emoji with Emoji variation: ▶️
- Emoji with skin tone modifier: 🛌🏽
- Region flag: 🇵🇹
- Sub-Region flag: 🏴󠁧󠁢󠁳󠁣󠁴󠁿
- Keycap sequence: 2️⃣
- Sequence using ZWJ (zero width joiner): 🤾🏽‍♀️

"

string.scan(Unicode::Emoji::REGEX) # => ["😴", "▶️", "🛌🏽", "🇵🇹", "🏴󠁧󠁢󠁳󠁣󠁴󠁿", "2️⃣", "🤾🏽‍♀️"]

There are multiple levels of Emoji detection:

Main Regexes

| Regex | Description | Example Matches | Example Non-Matches |
| --- | --- | --- | --- |
| Unicode::Emoji::REGEX | Use this one if unsure! Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of recommended Emoji sequences | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🤾🏽‍♀️ | 😴︎, ▶, 🏻, 🇵🇵, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤠‍🤢, 1, 1⃣ |
| Unicode::Emoji::REGEX_VALID | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of valid Emoji sequences | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢 | 😴︎, ▶, 🏻, 🇵🇵, 1, 1⃣ |
| Unicode::Emoji::REGEX_WELL_FORMED | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of well-formed Emoji sequences | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢, 🇵🇵 | 😴︎, ▶, 🏻, 1, 1⃣ |
| Unicode::Emoji::REGEX_POSSIBLE | Matches all singleton Emoji, singleton components, all kinds of Emoji sequences, and even single digits (except for: unqualified keycap sequences) | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢, 🇵🇵, 😴︎, ▶, 🏻, 1 | 1⃣ |

Include Text Emoji

By default, textual Emoji (Emoji characters followed by a text variation selector, or those with a default text presentation) are not matched by the regexes above (except by REGEX_POSSIBLE). To match them as well, use the regex variants with the _INCLUDE_TEXT suffix:

| Regex | Description | Example Matches | Example Non-Matches |
| --- | --- | --- | --- |
| Unicode::Emoji::REGEX_INCLUDE_TEXT | REGEX + REGEX_TEXT | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🤾🏽‍♀️, 😴︎, ▶, 1⃣ | 🏻, 🇵🇵, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤠‍🤢, 1 |
| Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT | REGEX_VALID + REGEX_TEXT | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢, 😴︎, ▶, 1⃣ | 🏻, 🇵🇵, 1 |
| Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT | REGEX_WELL_FORMED + REGEX_TEXT | 😴, ▶️, 🛌🏽, 🇵🇹, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢, 🇵🇵, 😴︎, ▶, 1⃣ | 🏻, 1 |

Singleton Regexes

These regexes match only simple one-codepoint Emoji (plus an optional variation selector):

| Regex | Description | Example Matches | Example Non-Matches |
| --- | --- | --- | --- |
| Unicode::Emoji::REGEX_BASIC | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all | 😴, ▶️ | 😴︎, ▶, 🏻, 🛌🏽, 🇵🇹, 🇵🇵, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢, 1 |
| Unicode::Emoji::REGEX_TEXT | Matches only textual singleton Emoji (except for singleton components, like digits) | 😴︎, ▶ | 😴, ▶️, 🏻, 🛌🏽, 🇵🇹, 🇵🇵, 2️⃣, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁡󠁧󠁢󠁿, 🤾🏽‍♀️, 🤠‍🤢, 1 |

Comparison

  1. Fully-qualified RGI Emoji ZWJ sequence
  2. Minimally-qualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selectors, but not in the first Emoji character)
  3. Unqualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selector, including in the first Emoji character)
  4. Non-RGI Emoji ZWJ sequence
  5. Valid Region made from pair of Regional Indicators
  6. Any Region made from pair of Regional Indicators
  7. RGI Flag Emoji Tag Sequences (England, Scotland, Wales)
  8. Valid Flag Emoji Tag Sequences (any known sub-division)
  9. Any Flag Emoji Tag Sequences (any tag sequence)
  10. Basic Default Emoji Presentation Characters or Text characters with Emoji Presentation Selector
  11. Basic Default Text Presentation Characters or Basic Emoji with Text Presentation Selector
  12. Non-Emoji (unqualified) keycap sequence

| Regex | 1 RGI/FQE | 2 RGI/MQE | 3 RGI/UQE | 4 Non-RGI | 5 Valid Region | 6 Any Region | 7 RGI Tag | 8 Valid Tag | 9 Any Tag | 10 Basic Emoji | 11 Basic Text | 12 Text Keycap |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| REGEX | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
| REGEX_INCLUDE_TEXT | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ |
| REGEX_VALID | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
| REGEX_VALID_INCLUDE_TEXT | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
| REGEX_WELL_FORMED | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| REGEX_WELL_FORMED_INCLUDE_TEXT | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| REGEX_POSSIBLE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| REGEX_BASIC | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| REGEX_TEXT | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |

¹ Matches all unqualified Emoji, except for textual singleton Emoji (see columns 11, 12)

See spec files for detailed examples about which regex matches which kind of Emoji.

Picking the Right Emoji Regex

  • Usually you just want REGEX (RGI set)
  • If you want broader matching (any ZWJ sequences, more sub-region flags), choose REGEX_VALID
  • If you need to match any region flag and any tag sequence, choose REGEX_WELL_FORMED
  • Use the _INCLUDE_TEXT suffix with any of the above, if you want to also match basic textual Emoji
  • Finally, there is REGEX_POSSIBLE, a simplified test for possible Emoji. It can produce false positives, but the regex is less complex, and the Unicode standard itself suggests it as a first check.

Examples

| Desc | Emoji | Escaped | REGEX (RGI) | REGEX_VALID (Valid) | REGEX_WELL_FORMED (Well-formed) | REGEX_POSSIBLE |
| --- | --- | --- | --- | --- | --- | --- |
| RGI ZWJ Sequence | 🤾🏽‍♀️ | \u{1F93E 1F3FD 200D 2640 FE0F} | ✅ | ✅ | ✅ | ✅ |
| Valid ZWJ Sequence | 🤠‍🤢 | \u{1F920 200D 1F922} | ❌ | ✅ | ✅ | ✅ |
| Known Region | 🇵🇹 | \u{1F1F5 1F1F9} | ✅ | ✅ | ✅ | ✅ |
| Unknown Region | 🇵🇵 | \u{1F1F5 1F1F5} | ❌ | ❌ | ✅ | ✅ |
| RGI Tag Sequence | 🏴󠁧󠁢󠁳󠁣󠁴󠁿 | \u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F} | ✅ | ✅ | ✅ | ✅ |
| Valid Tag Sequence | 🏴󠁧󠁢󠁡󠁧󠁢󠁿 | \u{1F3F4 E0067 E0062 E0061 E0067 E0062 E007F} | ❌ | ✅ | ✅ | ✅ |
| Well-formed Tag Sequence | 😴󠁧󠁢󠁡󠁡󠁡󠁿 | \u{1F634 E0067 E0062 E0061 E0061 E0061 E007F} | ❌ | ❌ | ✅ | ✅ |

Please see the standard for more details, examples, and explanations.

More info about valid vs. recommended Emoji can also be found in this blog article on Emojipedia.

Extended Pictographic Regex

Unicode::Emoji::REGEX_PICTO matches single codepoints with the Extended_Pictographic property. For example, it will match ✀ BLACK SAFETY SCISSORS.

Unicode::Emoji::REGEX_PICTO_NO_EMOJI matches single codepoints with the Extended_Pictographic property, but excludes Emoji characters.

See character.construction/picto for a list of all non-Emoji pictographic characters.

Partial Regexes

Please note: REGEX_ANY might be removed or renamed in the future. It is the same as \p{Emoji}.

Matches potential Emoji parts (often, this is not what you want):

| Regex | Description | Example Matches | Example Non-Matches |
| --- | --- | --- | --- |
| Unicode::Emoji::REGEX_ANY | Matches any Emoji-related codepoint (but no variation selectors, tags, or zero-width joiners). Please note that this will match Emoji parts rather than complete Emoji, for example, single digits! | 😴, ▶, 🏻, 🛌, 🏽, 🇵, 🇹, 2, 🏴, 🤾, ♀, 🤠, 🤢 | - |
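
Since this regex is described as equivalent to `\p{Emoji}`, the pitfall can be shown with Ruby's built-in property class alone, without loading the gem. In this sketch (the sample string is made up), a plain digit is reported as a match, and an Emoji with a skin tone modifier is split into its two codepoints:

```ruby
# No gem required: \p{Emoji} is a built-in Unicode property in Ruby's regex engine.
text = "Room 2 \u{1F6CC 1F3FD}" # digit 2, then PERSON IN BED + medium skin tone

parts = text.scan(/\p{Emoji}/)

# The digit counts as an Emoji codepoint, and the modifier sequence is split in two.
parts.each { |part| puts format("U+%04X", part.ord) }
```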

Usage – List

Use Unicode::Emoji::LIST or the list method to get an ordered and categorized list of Emoji:

Unicode::Emoji.list.keys
# => ["Smileys & Emotion", "People & Body", "Component", "Animals & Nature", "Food & Drink", "Travel & Places", "Activities", "Objects", "Symbols", "Flags"]

Unicode::Emoji.list("Food & Drink").keys
# => ["food-fruit", "food-vegetable", "food-prepared", "food-asian", "food-marine", "food-sweet", "drink", "dishware"]

Unicode::Emoji.list("Food & Drink", "food-asian")
# => ["🍱", "🍘", "🍙", "🍚", "🍛", "🍜", "🍝", "🍠", "🍢", "🍣", "🍤", "🍥", "🥮", "🍡", "🥟", "🥠", "🥡"]

Please note that categories might change with future versions of the Emoji standard, although this has not happened often.

A list of all Emoji (generated from this gem) can be found at character.construction/emoji.

Usage – Properties Data

Allows you to access the codepoint data from Unicode's emoji-data.txt file:

require "unicode/emoji"

Unicode::Emoji.properties "☝" # => ["Emoji", "Emoji_Modifier_Base"]

Also See

MIT