Add support for regular expressions in Tokenizers.PreTokenizer.split/3 #54
Conversation
This is the code I used to test the pre-tokenizer in IEx:

```elixir
alias Tokenizers.{Model, Normalizer, Tokenizer, PreTokenizer, Trainer}

create_tokenizer = fn ->
  {:ok, model} = Model.WordLevel.init(%{}, unk_token: "[UNK]")
  {:ok, tokenizer} = Tokenizer.init(model)

  tokenizer =
    tokenizer
    |> Tokenizer.set_normalizer(Normalizer.lowercase())
    |> Tokenizer.set_pre_tokenizer(
      PreTokenizer.sequence([
        # <---- here I can pass a Regex now
        PreTokenizer.split(~r/\?\d{2}\?/, :removed),
        PreTokenizer.whitespace_split()
      ])
    )

  {:ok, trainer} = Trainer.wordlevel(special_tokens: ["[UNK]"])
  Tokenizer.train_from_files(tokenizer, ["path_to_file.txt"], trainer: trainer)
end

{:ok, tokenizer} = create_tokenizer.()
Tokenizer.encode(tokenizer, "These?11?Are?22?Four?33?Tokens")
```

Output:

```
{:ok, #Tokenizers.Encoding<[length: 4, ids: [16, 17, 18, 19]]>}
```

---
It's great work! I think it's probably possible to pass the regex into the NIF directly; we can parse the struct in the Rust code. That would likely keep the code more DRY, because there would be no need to add an additional option:

```elixir
iex> Map.to_list ~r/.*/ui
[
  __struct__: Regex,
  opts: "ui",
  re_pattern: {:re_pattern, 0, 1, 0,
   <<69, 82, 67, 80, 73, 0, 0, 0, 1, 8, 0, 32, 1, 129, 0, 0, 255, 255, 255, 255,
     255, 255, 255, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0,
     ...>>},
  re_version: {"8.44 2020-02-12", :little},
  source: ".*"
]
```

```rust
#[derive(Debug, NifStruct)]
#[module = "Regex"]
struct RegexStruct {
    opts: String,
    re_pattern: String,
    // ...
}
```

@mruoss @jonatanklosko @josevalim WDYT?
---

My biggest concern is whether they use the same underlying regular expression engine. If they don't, then we may have subtle differences, and therefore I don't think we should pass a regex downstream, but rather an escaped string.
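A toy sketch (not the project's code) of the "escaped string" idea: if the Elixir side escapes literal text before sending a plain pattern string downstream, engine differences are confined to patterns that were genuinely regexes. Real implementations of this exist as Elixir's `Regex.escape/1` and the Rust `regex` crate's `regex::escape`; the character set below is an illustrative assumption, not either engine's exact list.

```rust
// Toy escaping: prefix every regex metacharacter with a backslash so the
// downstream engine treats the string as a literal, whatever its dialect.
fn escape_regex(literal: &str) -> String {
    // Illustrative metacharacter set, not the exact list of any real engine.
    let special = r"\.+*?()|[]{}^$#&-~";
    let mut out = String::with_capacity(literal.len());
    for ch in literal.chars() {
        if special.contains(ch) {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

fn main() {
    // The literal delimiter "?11?" becomes a pattern matching it verbatim.
    assert_eq!(escape_regex("?11?"), r"\?11\?");
    // Plain words pass through unchanged.
    assert_eq!(escape_regex("token"), "token");
}
```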
---

@josevalim If we want to pass a string, there are several options:

```elixir
# Option 1: an extra option
PreTokenizer.split(~S(\?\d{2}\?), :removed, use_regex: true)

# Option 2: a separate function
PreTokenizer.split_regex(~S(\?\d{2}\?), :removed)

# Option 3: a tagged tuple
PreTokenizer.split({:regex, ~S(\?\d{2}\?)}, :removed)
```
---

Good call on the API. The first option is my least favorite. Both the tuple and a separate function are fine by me. We will probably face this same issue in github.com/elixir-explorer/explorer, so we should try to find a unified solution.
---

@josevalim It seems that the first option, with options, is the best in this situation:

```elixir
# With a regex
PreTokenizer.split(~S(\?\d{2}\?), :removed, use_regex: true)

# vs a plain string, which can be specified explicitly
PreTokenizer.split("token", :removed, use_regex: false)

# A raw string can also be used with use_regex: false (with undesired behaviour)
PreTokenizer.split(~S(\?\d{2}\?), :removed, use_regex: false)

# The default is false
PreTokenizer.split("token", :removed)
```

@mruoss Would you like to implement the changes yourself, or do you need some help?
---

I can do it and let you review my Rust code.
---

Sorry, I meant to say "the first option is my least favorite" (now edited). I am not necessarily a big fan of an option changing the meaning of the input drastically.
---

I don't see a way to support different options depending on the type of the input, just from how the Rust library is built. Maybe the best idea is to go with a separate type then? It could probably then be reused in other functions (I'm not sure, but maybe it can be passed to other builders). Also, this type could be properly documented, with links. Something like:

```elixir
%PreTokenizer.SplitRegex{expression: ~S(\?\d{2}\?)}
|> PreTokenizer.split(:removed)
```
---

Okay, I have dug a bit further into Rustler and here is a new proposal (to support the discussion, not necessarily meant as a final approach). This is the code that comes closest to what Rust expects, if we use a tagged enum:

```elixir
@spec split(
        String.t() | {:string, String.t()} | {:regex, String.t()},
        split_delimiter_behaviour(),
        keyword()
      ) :: t()
def split(pattern, behavior, opts \\ [])

def split(pattern, behavior, opts) when is_binary(pattern) do
  split({:string, pattern}, behavior, opts)
end

def split(pattern, behavior, opts) do
  Tokenizers.Native.pre_tokenizers_split(pattern, behavior, opts)
end
```

which is basically the third option in your proposal. In Rust we could then get rid of the […]:

```rust
#[derive(NifTaggedEnum)]
pub enum LocalSplitPattern {
    String(String),
    Regex(String),
}

#[rustler::nif]
pub fn pre_tokenizers_split(
    pattern: LocalSplitPattern,
    behavior: SplitDelimiterBehavior,
    options: Vec<SplitOption>,
) -> Result<ExTokenizersPreTokenizer, rustler::Error> {
    struct Opts {
        invert: bool,
    }
    let mut opts = Opts { invert: false };

    let final_pattern = match pattern {
        LocalSplitPattern::String(pattern) => SplitPattern::String(pattern),
        LocalSplitPattern::Regex(pattern) => SplitPattern::Regex(pattern),
    };

    for option in options {
        match option {
            SplitOption::Invert(invert) => opts.invert = invert,
        }
    }

    Ok(ExTokenizersPreTokenizer::new(
        tokenizers::pre_tokenizers::split::Split::new(final_pattern, behavior.into(), opts.invert)
            .map_err(|_| rustler::Error::BadArg)?,
    ))
}
```

I have pushed these changes on top of this PR so you can see them. What do you think? Maybe there is even a better way than the match expression I have used to convert the `LocalSplitPattern`.
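One possible answer to the question above: the inline match could move into a `From` impl, so the NIF body reduces to `pattern.into()`. A hedged sketch with simplified stand-in types (the real `LocalSplitPattern` and `SplitPattern` carry Rustler and tokenizers machinery not shown here):

```rust
// Sketch: replacing the inline match with a From impl. The two enums below
// are simplified stand-ins for the real Rustler/tokenizers types.
#[derive(Debug, PartialEq)]
enum LocalSplitPattern {
    String(String),
    Regex(String),
}

#[derive(Debug, PartialEq)]
enum SplitPattern {
    String(String),
    Regex(String),
}

impl From<LocalSplitPattern> for SplitPattern {
    fn from(pattern: LocalSplitPattern) -> Self {
        match pattern {
            LocalSplitPattern::String(s) => SplitPattern::String(s),
            LocalSplitPattern::Regex(s) => SplitPattern::Regex(s),
        }
    }
}

fn main() {
    // The NIF body could then read: let final_pattern: SplitPattern = pattern.into();
    let pattern = LocalSplitPattern::Regex(r"\?\d{2}\?".to_string());
    assert_eq!(
        SplitPattern::from(pattern),
        SplitPattern::Regex(r"\?\d{2}\?".to_string())
    );
}
```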
---

The same Rust code would also support the second approach in your proposal. It would be closer to the current API, and we would not have to add a special case to keep it backwards compatible:

```elixir
@spec split(String.t(), split_delimiter_behaviour(), keyword()) :: t()
def split(pattern, behavior, opts \\ []) when is_binary(pattern) do
  Tokenizers.Native.pre_tokenizers_split({:string, pattern}, behavior, opts)
end

@spec split_regex(String.t(), split_delimiter_behaviour(), keyword()) :: t()
def split_regex(pattern, behavior, opts \\ []) when is_binary(pattern) do
  Tokenizers.Native.pre_tokenizers_split({:regex, pattern}, behavior, opts)
end
```

I have pushed this variant now. It feels like the best so far.
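For intuition about what `NifTaggedEnum` does with the `{:string, ...}` / `{:regex, ...}` tuples both variants send down, here is a stdlib-only simulation. The `(tag, value)` pair of strings stands in for a real Elixir term; this is not the actual Rustler decoding API.

```rust
// Simulated tagged-tuple decoding: dispatch on the tuple's leading atom.
#[derive(Debug, PartialEq)]
enum LocalSplitPattern {
    String(String),
    Regex(String),
}

// Rustler derives this from the Elixir term; here the term is simulated
// as a (tag, value) pair of plain strings.
fn decode(tag: &str, value: &str) -> Option<LocalSplitPattern> {
    match tag {
        "string" => Some(LocalSplitPattern::String(value.to_string())),
        "regex" => Some(LocalSplitPattern::Regex(value.to_string())),
        _ => None, // an unknown tag would surface as BadArg in a real NIF
    }
}

fn main() {
    assert_eq!(
        decode("regex", r"\?\d{2}\?"),
        Some(LocalSplitPattern::Regex(r"\?\d{2}\?".to_string()))
    );
    assert_eq!(decode("whitespace", "x"), None);
}
```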
---

LGTM. @josevalim @jonatanklosko, can we proceed?
---

Yeah, sorry. I've run the linter locally now, so it should be green this time.
---

💚 💙 💜 💛 ❤️
---

Hey,

This is a proposal for supporting regular expressions as `pattern` in `Tokenizers.PreTokenizer.split/3`, as discussed in #53. As I already mentioned there, I'm not a Rust developer, so while I could successfully test my changes, I'm not sure this code is "good to go". What do you think?

Thanks,
Michael