
Add support for regular expressions in Tokenizers.PreTokenizer.split/3 #54

Merged
merged 5 commits into from
Oct 10, 2023

Conversation

mruoss
Contributor

@mruoss mruoss commented Sep 30, 2023

Hey,

This is a proposal to support regular expressions as a pattern in Tokenizers.PreTokenizer.split/3, as discussed in #53. As I mentioned in #53, I'm not a Rust developer, so while I could successfully test my changes, I'm not sure this code is "good to go". What do you think?

Thanks,
Michael

@mruoss
Contributor Author

mruoss commented Sep 30, 2023

This is the code I used to test the pre-tokenizer in IEx:

alias Tokenizers.{Model, Normalizer, PreTokenizer, Tokenizer, Trainer}

create_tokenizer = fn -> 
  {:ok, model} = Model.WordLevel.init(%{}, unk_token: "[UNK]")
  {:ok, tokenizer} = Tokenizer.init(model)
  
  tokenizer =
    tokenizer
    |> Tokenizer.set_normalizer(Normalizer.lowercase())
    |> Tokenizer.set_pre_tokenizer(
      PreTokenizer.sequence([
        PreTokenizer.split(~r/\?\d{2}\?/, :removed), # <---- here I can pass a Regex now.
        PreTokenizer.whitespace_split()
      ])
    )
  
  {:ok, trainer} = Trainer.wordlevel(special_tokens: ["[UNK]"])
  Tokenizer.train_from_files(tokenizer, ["path_to_file.txt"], trainer: trainer)
end

{:ok, tokenizer} = create_tokenizer.()
Tokenizer.encode(tokenizer, "These?11?Are?22?Four?33?Tokens")

Output:

{:ok, #Tokenizers.Encoding<[length: 4, ids: [16, 17, 18, 19]]>}

@Virviil
Contributor

Virviil commented Oct 8, 2023

Great work!

I think it's probably possible to pass the regex into the NIF directly and parse the struct in Rust code. That would keep the code DRYer, since there would be no need for the additional :use_regex option.

iex> Map.to_list ~r/.*/ui
[
  __struct__: Regex,
  opts: "ui",
  re_pattern: {:re_pattern, 0, 1, 0,
   <<69, 82, 67, 80, 73, 0, 0, 0, 1, 8, 0, 32, 1, 129, 0, 0, 255, 255, 255, 255,
     255, 255, 255, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0,
     ...>>},
  re_version: {"8.44 2020-02-12", :little},
  source: ".*"
]
#[derive(Debug, NifStruct)]
#[module = "Regex"]
struct RegexStruct {
   opts: String,
   re_pattern: String,
   ...
}

@mruoss @jonatanklosko @josevalim WDYT?

@josevalim
Contributor

My biggest concern is whether they use the same underlying regular expression engine. If they don't, we may get subtle differences, so I don't think we should pass a regex downstream, but rather an escaped string, such as split_regex(..., ~S"(this|that)").

@Virviil
Contributor

Virviil commented Oct 8, 2023

@josevalim
They are using https://github.com/kkos/oniguruma as their regex engine, so the two probably don't match.

If we want to pass a string, there are several options:

PreTokenizer.split(~S(\?\d{2}\?), :removed, use_regex: true)
//
PreTokenizer.split_regex(~S(\?\d{2}\?), :removed)
//
PreTokenizer.split({:regex, ~S(\?\d{2}\?)}, :removed)

@josevalim
Contributor

josevalim commented Oct 8, 2023

Good call on the API. The first option is my least favorite. Both a tuple and a separate function are fine by me. We will probably hit this same issue in github.com/elixir-explorer/explorer, so we should try to find a unified solution.

@Virviil
Contributor

Virviil commented Oct 8, 2023

@josevalim It seems that the first option, with an option flag, is the best in this situation.

# With regex
PreTokenizer.split(~S(\?\d{2}\?), :removed, use_regex: true) 

# vs 

# Can be specified explicitly
PreTokenizer.split("token", :removed, use_regex: false)

# A raw string passed with use_regex: false is treated literally, which may be undesired
PreTokenizer.split(~S(\?\d{2}\?), :removed, use_regex: false)

# default is false
PreTokenizer.split("token", :removed)

@mruoss Would you like to implement the changes yourself, or do you need some help?

@mruoss
Contributor Author

mruoss commented Oct 8, 2023

I can do it and let you review my Rust code.
So did I get you right: we keep the :use_regex option? That means we keep passing the option down to Rust and distinguish the cases there, right?

@josevalim
Contributor

Sorry, I meant to say "the first option is my least favorite" (now edited). I am not necessarily a big fan of having an option drastically change the meaning of the input. split may also take different options depending on whether the input is a string or a regex, so a separate function may be best.

@Virviil
Contributor

Virviil commented Oct 8, 2023

I don't see the possibility of having different options depending on the type of the input, just from the way the Rust library is built.

Maybe the best idea would be to go with a separate type, then? It could probably be used in other functions too (I'm not sure, but maybe it can be passed to other builders).

Also, this type could be properly documented with links to https://github.com/kkos/oniguruma and further explanations about building regexes specifically for this library.

Something like:

%PreTokenizer.SplitRegex{expression: ~S(\?\d{2}\?)}
|> PreTokenizer.split(:removed)

@mruoss
Contributor Author

mruoss commented Oct 8, 2023

Okay, I have dug a bit further into Rustler and here is a new proposal (to support the discussion, not necessarily meant as a final approach). The code that comes closest to what Rust expects if we use #[derive(NifTaggedEnum)] looks as follows:

  @spec split(String.t() | {:string, String.t()} | {:regex, String.t()}, split_delimiter_behaviour(), keyword()) :: t()
  def split(pattern, behavior, opts \\ [])

  def split(pattern, behavior, opts) when is_binary(pattern) do
    split({:string, pattern}, behavior, opts)
  end

  def split(pattern, behavior, opts) do
    Tokenizers.Native.pre_tokenizers_split(pattern, behavior, opts)
  end

which is basically the third option in your proposal. In Rust we could then get rid of the :use_regex option too and simply define a new local pattern enum:

#[derive(NifTaggedEnum)]
pub enum LocalSplitPattern {
    String(String),
    Regex(String)
}

#[rustler::nif]
pub fn pre_tokenizers_split(
    pattern: LocalSplitPattern,
    behavior: SplitDelimiterBehavior,
    options: Vec<SplitOption>,
) -> Result<ExTokenizersPreTokenizer, rustler::Error> {
    struct Opts {
        invert: bool,
    }
    let mut opts = Opts { invert: false };
    let final_pattern = match pattern {
        LocalSplitPattern::String(pattern) => SplitPattern::String(pattern),
        LocalSplitPattern::Regex(pattern) => SplitPattern::Regex(pattern),
    };

    for option in options {
        match option {
            SplitOption::Invert(invert) => opts.invert = invert,
        }
    }

    Ok(ExTokenizersPreTokenizer::new(
        tokenizers::pre_tokenizers::split::Split::new(final_pattern, behavior.into(), opts.invert)
            .map_err(|_| rustler::Error::BadArg)?,
    ))
}

I have pushed these changes on top of this PR so you can see them. What do you think? Maybe there is an even better way than the match expression I used to convert the LocalSplitPattern to tokenizers::pre_tokenizers::split::SplitPattern?
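(As an aside on that last question: one idiomatic alternative to an inline match is a `From` impl, so the NIF body can just call `pattern.into()`. Below is a minimal, self-contained sketch; both enums here are hypothetical stand-ins, since the real `SplitPattern` lives in the `tokenizers` crate and `LocalSplitPattern` carries Rustler derives in the actual PR.)

```rust
// Stand-in for the NifTaggedEnum decoded from the Elixir tagged tuple.
#[derive(Debug, PartialEq)]
enum LocalSplitPattern {
    String(String),
    Regex(String),
}

// Stand-in mimicking tokenizers::pre_tokenizers::split::SplitPattern.
#[derive(Debug, PartialEq)]
enum SplitPattern {
    String(String),
    Regex(String),
}

// Moving the match into a From impl keeps the conversion in one place
// and lets the NIF call site shrink to `pattern.into()`.
impl From<LocalSplitPattern> for SplitPattern {
    fn from(pattern: LocalSplitPattern) -> Self {
        match pattern {
            LocalSplitPattern::String(s) => SplitPattern::String(s),
            LocalSplitPattern::Regex(r) => SplitPattern::Regex(r),
        }
    }
}

fn main() {
    let final_pattern: SplitPattern =
        LocalSplitPattern::Regex(String::from(r"\?\d{2}\?")).into();
    assert_eq!(
        final_pattern,
        SplitPattern::Regex(String::from(r"\?\d{2}\?"))
    );
}
```

Whether this is clearer than the explicit match in a function this small is a matter of taste; it mainly pays off if the conversion is needed in more than one NIF.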

@mruoss
Contributor Author

mruoss commented Oct 8, 2023

The same Rust code would also support the second approach in your proposal. This would be closer to the current API, and we would not have to add a special case to keep it backwards compatible...

  @spec split(String.t(), split_delimiter_behaviour(), keyword()) :: t()
  def split(pattern, behavior, opts \\ []) when is_binary(pattern) do
    Tokenizers.Native.pre_tokenizers_split({:string, pattern}, behavior, opts)
  end

  @spec split_regex(String.t(), split_delimiter_behaviour(), keyword()) :: t()
  def split_regex(pattern, behavior, opts \\ []) when is_binary(pattern) do
    Tokenizers.Native.pre_tokenizers_split({:regex, pattern}, behavior, opts)
  end

I have pushed this variant now. Feels like the best to me so far.

@Virviil
Contributor

Virviil commented Oct 10, 2023

LGTM

@josevalim @jonatanklosko can we proceed?

@mruoss
Contributor Author

mruoss commented Oct 10, 2023

Yeah, sorry. I've run the linter locally now. Should be green this time.

@josevalim josevalim merged commit a8a7464 into elixir-nx:main Oct 10, 2023
@josevalim
Contributor

💚 💙 💜 💛 ❤️
