
Add support for regular expressions in Tokenizers.PreTokenizer.split/3 #54

Merged
merged 5 commits into from
Oct 10, 2023

Conversation

mruoss
Contributor

@mruoss mruoss commented Sep 30, 2023

Hey,

This is a proposal to support regular expressions as a pattern in Tokenizers.PreTokenizer.split/3, as discussed in #53. As I mentioned in #53, I'm not a Rust developer, so while I could successfully test my changes, I'm not sure this code is "good to go". What do you think?

Thanks,
Michael

@mruoss
Contributor Author

mruoss commented Sep 30, 2023

This is the code I used to test the pre-tokenizer in IEx:

alias Tokenizers.{Model, Normalizer, PreTokenizer, Tokenizer, Trainer}

create_tokenizer = fn -> 
  {:ok, model} = Model.WordLevel.init(%{}, unk_token: "[UNK]")
  {:ok, tokenizer} = Tokenizer.init(model)
  
  tokenizer =
    tokenizer
    |> Tokenizer.set_normalizer(Normalizer.lowercase())
    |> Tokenizer.set_pre_tokenizer(
      PreTokenizer.sequence([
        PreTokenizer.split(~r/\?\d{2}\?/, :removed), # <---- here I can pass a Regex now.
        PreTokenizer.whitespace_split()
      ])
    )
  
  {:ok, trainer} = Trainer.wordlevel(special_tokens: ["[UNK]"])
  Tokenizer.train_from_files(tokenizer, ["path_to_file.txt"], trainer: trainer)
end

{:ok, tokenizer} = create_tokenizer.()
Tokenizer.encode(tokenizer, "These?11?Are?22?Four?33?Tokens")

Output:

{:ok, #Tokenizers.Encoding<[length: 4, ids: [16, 17, 18, 19]]>}

@Virviil
Contributor

Virviil commented Oct 8, 2023

Great work!

I think it's probably possible to pass the regex into the NIF directly and parse the struct in Rust code. That would keep the code DRYer, since there would be no need for the additional :use_regex option.

iex> Map.to_list ~r/.*/ui
[
  __struct__: Regex,
  opts: "ui",
  re_pattern: {:re_pattern, 0, 1, 0,
   <<69, 82, 67, 80, 73, 0, 0, 0, 1, 8, 0, 32, 1, 129, 0, 0, 255, 255, 255, 255,
     255, 255, 255, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0,
     ...>>},
  re_version: {"8.44 2020-02-12", :little},
  source: ".*"
]
#[derive(Debug, NifStruct)]
#[module = "Regex"]
struct RegexStruct {
   opts: String,
   re_pattern: String,
   ...
}

@mruoss @jonatanklosko @josevalim WDYT?

@josevalim
Contributor

My biggest concern is whether they use the same underlying regular expression engine. If they don't, we may get subtle differences, so I don't think we should pass a regex downstream, but rather an escaped string, such as split_regex(..., ~S"(this|that)").

@Virviil
Contributor

Virviil commented Oct 8, 2023

@josevalim
They are using https://github.com/kkos/oniguruma as their regex engine, so the two probably don't match.

If we want to pass a string, there are several options:

PreTokenizer.split(~S(\?\d{2}\?), :removed, use_regex: true)
//
PreTokenizer.split_regex(~S(\?\d{2}\?), :removed)
//
PreTokenizer.split({:regex, ~S(\?\d{2}\?)}, :removed)

@josevalim
Contributor

josevalim commented Oct 8, 2023

Good call on the API. The first option is my least favorite. Both a tuple and a separate function are fine by me. We will probably hit this same issue in github.com/elixir-explorer/explorer, so we should try to find a unified solution.

@Virviil
Contributor

Virviil commented Oct 8, 2023

@josevalim It seems that the first option, with an option flag, is the best in this situation.

# With regex
PreTokenizer.split(~S(\?\d{2}\?), :removed, use_regex: true) 

# vs 

# Can be specified explicitly
PreTokenizer.split("token", :removed, use_regex: false)

# A raw string passed with use_regex: false is treated literally, which may be undesired
PreTokenizer.split(~S(\?\d{2}\?), :removed, use_regex: false)

# default is false
PreTokenizer.split("token", :removed)

@mruoss Would you like to implement the changes yourself, or do you need some help?

@mruoss
Contributor Author

mruoss commented Oct 8, 2023

I can do it and let you review my Rust code.
So did I get you right: we keep the :use_regex option? That means we keep passing the option down to Rust and distinguish the cases there, right?

@josevalim
Contributor

Sorry, I meant to say "the first option is my least favorite" (now edited). I am not necessarily a big fan of having an option drastically change the meaning of the input. split may also take different options depending on whether the input is a string or a regex, so a separate function may be best.

@Virviil
Contributor

Virviil commented Oct 8, 2023

I don't see the possibility of having different options depending on the type of the input, just from the way the Rust library is built.

Maybe the best idea would be to go with a separate type, then? It could probably be used in other functions too (I'm not sure, but maybe it can be passed to other builders).

Also, this type could be properly documented with links to https://github.com/kkos/oniguruma and further explanations about building regexes specifically for this library.

Something like:

%PreTokenizer.SplitRegex{expression: ~S(\?\d{2}\?)}
|> PreTokenizer.split(:removed)

@mruoss
Contributor Author

mruoss commented Oct 8, 2023

Okay, I have dug a bit further into Rustler and here is a new proposal (to support the discussion, not necessarily meant as a final approach). The code that comes closest to what Rust expects if we use #[derive(NifTaggedEnum)] looks as follows:

  @spec split(String.t() | {:string, String.t()} | {:regex, String.t()}, split_delimiter_behaviour(), keyword()) :: t()
  def split(pattern, behavior, opts \\ [])

  def split(pattern, behavior, opts) when is_binary(pattern) do
    split({:string, pattern}, behavior, opts)
  end

  def split(pattern, behavior, opts) do
    Tokenizers.Native.pre_tokenizers_split(pattern, behavior, opts)
  end

which is basically the third option in your proposal. In Rust we could then get rid of the :use_regex option too and simply define a new local pattern enum:

#[derive(NifTaggedEnum)]
pub enum LocalSplitPattern {
    String(String),
    Regex(String)
}

#[rustler::nif]
pub fn pre_tokenizers_split(
    pattern: LocalSplitPattern,
    behavior: SplitDelimiterBehavior,
    options: Vec<SplitOption>,
) -> Result<ExTokenizersPreTokenizer, rustler::Error> {
    struct Opts {
        invert: bool,
    }
    let mut opts = Opts { invert: false };
    let final_pattern = match pattern {
        LocalSplitPattern::String(pattern) => SplitPattern::String(pattern),
        LocalSplitPattern::Regex(pattern) => SplitPattern::Regex(pattern),
    };

    for option in options {
        match option {
            SplitOption::Invert(invert) => opts.invert = invert,
        }
    }

    Ok(ExTokenizersPreTokenizer::new(
        tokenizers::pre_tokenizers::split::Split::new(final_pattern, behavior.into(), opts.invert)
            .map_err(|_| rustler::Error::BadArg)?,
    ))
}

I have pushed these changes on top of this PR so you can see them. What do you think? Maybe there is an even better way than the match expression I used to convert the LocalSplitPattern to tokenizers::pre_tokenizers::split::SplitPattern?
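(As an aside on that last question: one idiomatic alternative to an inline match is a `From` impl, so the NIF body can just call `pattern.into()`. Below is a minimal, self-contained sketch; both enums here are hypothetical stand-ins, since the real `SplitPattern` lives in the `tokenizers` crate and `LocalSplitPattern` carries Rustler derives in the actual PR.)

```rust
// Stand-in for the NifTaggedEnum decoded from the Elixir tagged tuple.
#[derive(Debug, PartialEq)]
enum LocalSplitPattern {
    String(String),
    Regex(String),
}

// Stand-in mimicking tokenizers::pre_tokenizers::split::SplitPattern.
#[derive(Debug, PartialEq)]
enum SplitPattern {
    String(String),
    Regex(String),
}

// Moving the match into a From impl keeps the conversion in one place
// and lets the NIF call site shrink to `pattern.into()`.
impl From<LocalSplitPattern> for SplitPattern {
    fn from(pattern: LocalSplitPattern) -> Self {
        match pattern {
            LocalSplitPattern::String(s) => SplitPattern::String(s),
            LocalSplitPattern::Regex(r) => SplitPattern::Regex(r),
        }
    }
}

fn main() {
    let final_pattern: SplitPattern =
        LocalSplitPattern::Regex(String::from(r"\?\d{2}\?")).into();
    assert_eq!(
        final_pattern,
        SplitPattern::Regex(String::from(r"\?\d{2}\?"))
    );
}
```

Whether this is clearer than the explicit match in a function this small is a matter of taste; it mainly pays off if the conversion is needed in more than one NIF.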

@mruoss
Contributor Author

mruoss commented Oct 8, 2023

The same Rust code would also support the second approach in your proposal. This would be closer to the current API, and we would not have to add a special case to keep it backwards compatible...

  @spec split(String.t(), split_delimiter_behaviour(), keyword()) :: t()
  def split(pattern, behavior, opts \\ []) when is_binary(pattern) do
    Tokenizers.Native.pre_tokenizers_split({:string, pattern}, behavior, opts)
  end

  @spec split_regex(String.t(), split_delimiter_behaviour(), keyword()) :: t()
  def split_regex(pattern, behavior, opts \\ []) when is_binary(pattern) do
    Tokenizers.Native.pre_tokenizers_split({:regex, pattern}, behavior, opts)
  end

I have pushed this variant now. Feels like the best to me so far.

@Virviil
Contributor

Virviil commented Oct 10, 2023

LGTM

@josevalim @jonatanklosko can we proceed?

@mruoss
Contributor Author

mruoss commented Oct 10, 2023

Yeah, sorry. I've run the linter locally now. Should be green this time.

@josevalim josevalim merged commit a8a7464 into elixir-nx:main Oct 10, 2023
@josevalim
Contributor

💚 💙 💜 💛 ❤️
