Commit

implement regex split pre-tokenizer as separate function

mruoss committed Oct 8, 2023
1 parent 171ebc1 commit 9cf5aa2
Showing 2 changed files with 36 additions and 20 deletions.
41 changes: 27 additions & 14 deletions lib/tokenizers/pre_tokenizer.ex
```diff
@@ -134,31 +134,44 @@ defmodule Tokenizers.PreTokenizer do
           | :contiguous
 
   @doc """
-  Creates a Split pre-tokenizer.
+  Creates a Split pre-tokenizer using a string as split pattern.
 
   Versatile pre-tokenizer that splits on provided pattern and according
-  to provided behavior. The pattern should be in the form of a tuple
-  `{:string, pattern}` or `{:regex, pattern}` depending on whether the tuple is
-  a regular expression or not. For convenience, a simple binary is accepted
-  as well in which case the pattern is converted to the tuple
-  `{:string, pattern}`.
+  to provided behavior.
 
   ## Options
 
     * `:invert` - whether to invert the split or not. Defaults to `false`
 
   """
-  @spec split(String.t() | {:string, String.t()} | {:regex, String.t()}, split_delimiter_behaviour(), keyword()) :: t()
-  def split(pattern, behavior, opts \\ [])
-
-  def split(pattern, behavior, opts) when is_binary(pattern) do
-    split({:string, pattern}, behavior, opts)
+  @spec split(String.t(), split_delimiter_behaviour(), keyword()) :: t()
+  def split(pattern, behavior, opts \\ []) when is_binary(pattern) do
+    Tokenizers.Native.pre_tokenizers_split({:string, pattern}, behavior, opts)
   end
 
-  def split(pattern, behavior, opts) do
-    Tokenizers.Native.pre_tokenizers_split(pattern, behavior, opts)
-  end
+  @doc ~S"""
+  Creates a Split pre-tokenizer using a regular expression as split pattern.
+
+  Versatile pre-tokenizer that splits on provided regex pattern and according
+  to provided behavior.
+
+  The `pattern` should be a string representing a regular expression
+  according to the [Oniguruma Regex Engine](https://github.com/kkos/oniguruma).
+
+  ## Options
+
+    * `:invert` - whether to invert the split or not. Defaults to `false`
+
+  ## Example
+
+      iex> Tokenizers.PreTokenizer.split_regex(~S(\?\d{2}\?), :removed)
+      #Tokenizers.PreTokenizer<[pre_tokenizer_type: "Split"]>
+
+  """
+  @spec split_regex(String.t(), split_delimiter_behaviour(), keyword()) :: t()
+  def split_regex(pattern, behavior, opts \\ []) when is_binary(pattern) do
+    Tokenizers.Native.pre_tokenizers_split({:regex, pattern}, behavior, opts)
+  end
 
   @doc """
   Creates a Punctuation pre-tokenizer.
```
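For readers skimming the diff, here is a short sketch of how the two constructors are meant to be called after this commit. The `pre_tokenize/2` call is an assumption about the library's wider public API, not something this commit touches.

```elixir
# Split on a literal string; `split/3` wraps the pattern as
# {:string, pattern} internally, so no characters are special.
string_split = Tokenizers.PreTokenizer.split(" ", :removed)

# Split on a regular expression; `split_regex/3` wraps the pattern as
# {:regex, pattern}, which the Rust side hands to the Oniguruma engine.
digit_split = Tokenizers.PreTokenizer.split_regex("\\d+", :isolated)

# Assumed usage: apply a pre-tokenizer directly to an input string.
Tokenizers.PreTokenizer.pre_tokenize(digit_split, "abc123def")
# the digits form their own fragment, separate from "abc" and "def"
```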
15 changes: 9 additions & 6 deletions test/tokenizers/pre_tokenizer_test.exs
```diff
@@ -15,21 +15,24 @@ defmodule Tokenizers.PreTokenizerTest do
 
   describe "Split pretokenizer" do
     test "accepts no parameters" do
-      assert %Tokenizers.PreTokenizer{} = Tokenizers.PreTokenizer.split({:string, " "}, :removed)
+      assert %Tokenizers.PreTokenizer{} = Tokenizers.PreTokenizer.split(" ", :removed)
     end
 
-    test "accepts regular expressions" do
+    test "accepts options" do
       assert %Tokenizers.PreTokenizer{} =
-               Tokenizers.PreTokenizer.split({:regex, ~S/.*/}, :removed)
+               Tokenizers.PreTokenizer.split(" ", :removed, invert: true)
     end
   end
 
-    test "accepts binaries" do
-      assert %Tokenizers.PreTokenizer{} = Tokenizers.PreTokenizer.split(" ", :removed)
+  describe "Regex split pretokenizer" do
+    test "accepts regular expressions" do
+      assert %Tokenizers.PreTokenizer{} =
+               Tokenizers.PreTokenizer.split_regex(".*", :removed)
     end
 
     test "accepts options" do
       assert %Tokenizers.PreTokenizer{} =
-               Tokenizers.PreTokenizer.split(" ", :removed, invert: true)
+               Tokenizers.PreTokenizer.split_regex(".*", :removed, invert: true)
     end
   end
```
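The test split above hinges on the point of this commit: the same pattern string now means different things depending on which constructor receives it. A sketch of the distinction, following the semantics described in the docs of the first file:

```elixir
# `split/3` always wraps its pattern as {:string, pattern}, so ".*" is the
# literal two-character sequence "." followed by "*".
literal = Tokenizers.PreTokenizer.split(".*", :removed)

# `split_regex/3` wraps the same string as {:regex, pattern}, handing it to
# Oniguruma, where ".*" greedily matches almost any run of characters.
regex = Tokenizers.PreTokenizer.split_regex(".*", :removed)
```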
