(Go: >> BACK << -|- >> HOME <<)

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support Unicode grapheme clusters #54

Closed
tbu- opened this issue Mar 7, 2015 · 14 comments
Closed

support Unicode grapheme clusters #54

tbu- opened this issue Mar 7, 2015 · 14 comments

Comments

@tbu-
Copy link
tbu- commented Mar 7, 2015

The regex engine doesn't consider characters (graphemes) that consist of multiple code points correctly.

For example the letter 'ä' has two representations, that should both be matched by the regex ., howver only the latter is.

Bash                 | Rust       | Codepoints
echo $'\x61\xcc\x88' | "\u{e4}"   | U+00e4
echo $'\xc3\xa4'     | "a\u{308}" | U+0061 U+0308
@BurntSushi
Copy link
Member

Yup. The regex crate assumes a single character is a Unicode scalar value.

Do you know of other regex libraries that handle graphemes correctly?

@kwantam
Copy link
Contributor
kwantam commented Apr 16, 2015

@tbu- one way to handle this is to apply a normalization to strings before matching. For example, converting a string to NFC form (or was it NFKC? See UAX#15 and the unicode-normalizations crate) should convert equivalent representations like the one you've listed to a single, canonical form. Then, of course, your regex would have to match against the canonical version of the character.

@tbu-
Copy link
Author
tbu- commented Apr 16, 2015

@kwantam I believe that this still does not work for the . pattern which should match a single grapheme (which might well be more than one codepoint, even after normalization).

@kwantam
Copy link
Contributor
kwantam commented Apr 16, 2015

@tbu- UAX#18§2.2 discusses matching grapheme clusters. The rule here, RL2.2, does not say that . should match a grapheme cluster.

For example, an implementation could interpret \X as matching any
extended grapheme cluster, while interpreting "." as matching any
single code point. It could interpret \b{g} as a zero-width match
against any extended grapheme cluster boundary, and \B{g} as
the negation of that.

I agree that it would be nice to have \b{g} (and \b{w}, \b{l}, and \b{s}) defined. I haven't yet implemented UAX#14 or the UAX#29 sentence break boundary algorithms in the unicode-segmentation crate, but it's on the roadmap. At that point, it would be pretty straightforward to implement \X as .+?\b{g}, as discussed in UAX#18§2.2.

As regards other languages, the perlunicode man page has a nice summary of perl's support for UAX#18. One assumes if perl's implementation is missing it, most everyone else's is, too, and indeed they do not support RL2.2 at least as of perl5.20.1.

@tbu-
Copy link
Author
tbu- commented Apr 16, 2015

Oh, I incorrectly assumed it would be described in "UAX#18§2.2.1 Grapheme Cluster Mode".

@kwantam
Copy link
Contributor
kwantam commented Apr 16, 2015

Nothing incorrect about that; it would be nice to have that mode available, too! :)

(And, I think, once RL2.2 is implemented, it would be pretty easy to add a switch for grapheme cluster mode.)

@BurntSushi BurntSushi changed the title Unicode grapheme clusters not detected correctly support Unicode grapheme clusters Jun 19, 2015
@BurntSushi
Copy link
Member

I changed the title of this issue because this crate doesn't support grapheme clusters, so what you're seeing is intended behavior.

I actually don't know much about graphemes and how they're encoded, but if they can be made to fit into the Input abstraction, then it should be relatively easy to actually experiment with this. (The representation of InputAt would have to change.)

@BurntSushi
Copy link
Member

While I'm not against potentially adding this, I do think it would be a mighty big undertaking and it's not quite clear it could be feasibly done in the DFA. If someone wanted to move forward on working on this, then we can re-open this issue. Before writing any code, I would like to see a plan for how it would be added to the DFA though and at least some analysis on the memory required.

@Trolldemorted
Copy link

Any chance for grapheme cluster support in the near future? One of our projects requires that we match them, and I really would like to stick with rust and not switch to python, which unfortunately can do it:

import regex

haystack = "🏳️‍🌈F́́̂꣩⃧꣩ᷭ⳱꣪L⃒̲̀̀̀ᷛᷩ⃰᷃À̀̀̀꣢᪱̅̾ᷥG︪̯̰̀̀̀̀ͭ̇"
needle = "🏳️‍🌈\\X\\X\\X\\X"

foo = regex.search(needle, haystack)
print(foo)

prints

PS C:\Users\Benni\repositories\graphemeclusters> python .\sample.py
<regex.Match object; span=(0, 44), match='🏳️\u200d🌈F́́̂꣩⃧꣩ᷭ⳱꣪L⃒̲̀̀̀ᷛᷩ⃰᷃À̀̀̀꣢᪱̅̾ᷥG︪̯̰̀̀̀̀ͭ̇'>

@BurntSushi
Copy link
Member

No, I don't see it happening. You can write the regex yourself if you want, which is here: https://github.com/BurntSushi/bstr/blob/master/scripts/regex/grapheme.sh

The regex is quite gnarly though, but that's exactly what this crate would do if it supported \X. My guess is that it will be pretty slow, but you can try it.

If you describe the high level problem you're trying to solve, then there are almost certainly other approaches you could take here. It would be pretty surprising to me to switch your entire programming language based on this one thing.

You could also use the pcre2 crate, which I believe supports the \X syntax explicitly.

@phiresky
Copy link

If you describe the high level problem you're trying to solve, then there are almost certainly other approaches you could take here.

I'm also interested in having this. For me the use case is finding emojis in text. Sounds like a simple problem, but consider these strings found in the wild:

  • 👨\u200d👩\u200d👦.
    This one is rendered as a family of three people when supported: 👨‍👩‍👦 . Note that this works for any combination of genders and you can also add skin color modifiers etc. The zero-width joiner shows that just doing emoji + modifiers is not enough, because you can join an arbitrary amount of emojis together.

  • \ud83e\udd26\u200d\u2642\ufe0f\ud83c\udfff A sequence of the following:
    ['FACE PALM',
    'ZERO WIDTH JOINER',
    'MALE SIGN',
    'VARIATION SELECTOR-16',
    'EMOJI MODIFIER FITZPATRICK TYPE-6']
    So it's a black male person facepalming. The variation-selector-16 apparently means "interpret the previous character as an emoji and not a text symbol, referring to rendering the "male sign" as "\u2642\ufe0f" = ♂️ and not "\u2642\ufe0e" = '♂︎'.

The best algorithm I found to find these (and all other emoji) is this one:

import emoji
import regex

def split_count(text):

    emoji_list = []
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_list.append(word)

    return emoji_list

Everything else I've seen basically uses a whitelist of known emojis , but that misses variants that are supported in different devices by combining other emoji as shown above.

So for each grapheme cluster, if at least one of the characters is a known emoji character, the whole cluster is counted as an emoji. This python code uses the PCRE2 library with the \\X escape mentioned above, as well as a list of "base-emoji" generated from the unicode consortium website.

The algorithm PCRE2 uses is described in detail in the Extended grapheme clusters section of the docs. I guess it's pretty similar to the .sh file @BurntSushi linked above, though it looks somewhat simpler

@BurntSushi
Copy link
Member
BurntSushi commented Jan 16, 2021

@phiresky I don't see why you need a regex for that. Just iterate over all graphemes in the text and you can use the same detection logic.

I'm not an expert on emojis but I eould consult Unicode's specification for them. They likely have an algorithm for detecting them already written.

And are you sure the regex package in Python uses PCRE2? I thought it didn't. On mobile, otherwise I would check.

@phiresky
Copy link
phiresky commented Jan 16, 2021

@BurntSushi

The regex package in python is confusingly not the internal regex engine (called re). The internal one does not support \X.

I don't see why you need a regex for that. Just iterate over all graphemes in the text and you can use the same detection logic.

That's true. It seems like the unicode_segmentation Rust crate does implement the same algorithm:
https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/ so I'll try that, thanks! I guess that might also be a solution for other people looking for grapheme clusters in the rust regex crate.

They likely have an algorithm for detecting them already written.

I guess that's off topic here then, but as far as I have found they only have a list of base emojis, and a list of "known" zero-width-joined sequences, as well as the general grapheme splitting algorithm. But sequences not in this list are not invalid, and they already don't have all combinations of skin tones etc in there I think. I think phone manufacturers etc just add emojis when they want so it's not really possible to create a comprehensive list.

https://unicode.org/emoji/charts/index.html
https://unicode.org/Public/emoji/13.1/emoji-zwj-sequences.txt

Edit: You're right, the regex pypi package does implement \X, but they have their own implementation separate from PCRE2.

@BurntSushi
Copy link
Member

That's true. It seems like the unicode_segmentation Rust crate does implement the same algorithm:
https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/ so I'll try that, thanks! I guess that might also be a solution for other people looking for grapheme clusters in the rust regex crate.

Right. As does bstr, with the advantage of working on &[u8].

Also, it's worth mentioning that there are two reasonable interpretations to what "support Unicode grapheme clusters" actually means in the context of a regex engine. The simplest is support for \X, which matches a single grapheme cluster. That could be reasonably added to the regex crate via a simple substitution of \X with the complex regex linked to above. However, it's not clear to me just how useful this really is. As shown in your case, it would be better to just iterate over the graphemes.

The other interpretation is more pervasive, whereby the regex engine itself becomes "grapheme aware." That is, things like . and characters classes more generally become able to match grapheme clusters. This is much harder to implement, or rather, hard to implement efficiently. I feel confident that this aspect of grapheme support will never make it into the regex crate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants