Update tokenizers to 0.15.0 #55

Merged
merged 4 commits into from
Dec 13, 2023
Conversation

cigrainger
Member

As it says on the tin. This gets us various improvements and bugfixes, as detailed in the release notes.

Member

@jonatanklosko jonatanklosko left a comment

LGTM, if you want to add the option that would be great!

@cigrainger
Member Author

Okay @jonatanklosko exposed the option <3

Member

@jonatanklosko jonatanklosko left a comment

🐑

@cigrainger cigrainger merged commit f560e8f into main Dec 13, 2023
3 checks passed
@cigrainger cigrainger deleted the cg/upgrade-tokenizers branch December 13, 2023 16:22

let byte_fallback = match options
    .iter()
    .find(|opt| matches!(opt, UnigramOption::ByteFallback(_)))
Contributor

@Virviil Virviil Dec 14, 2023

Calling iter().find() once for each option is $O(n^2)$ complexity.
It's probably better to find another solution.

Member

With two options it's fine, but we could also do the same as here:

struct Opts {
    prefix: Option<String>,
}
// Default values
let mut opts = Opts { prefix: None };
options.into_iter().for_each(|option| match option {
    ModelSaveOption::Prefix(prefix) => opts.prefix = Some(prefix),
});

Member Author

Yeah, I mean we're talking about a negligible bit of performance, but mutating an opts struct is probably more performant :).
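
The single-pass approach discussed in this thread can be sketched as follows. This is a minimal sketch, not the code from the PR: the `UnkId` variant, the `UnigramOpts` struct, and the `collect_unigram_opts` function are hypothetical names chosen for illustration; only `UnigramOption::ByteFallback` appears in the diff.

```rust
// Hypothetical option enum; only `ByteFallback` is named in the diff.
#[derive(Debug, Clone, PartialEq)]
enum UnigramOption {
    UnkId(usize),
    ByteFallback(bool),
}

// Hypothetical struct holding the resolved options and their defaults.
#[derive(Debug, Default, PartialEq)]
struct UnigramOpts {
    unk_id: Option<usize>,
    byte_fallback: bool,
}

// Fold the option list into the struct in a single pass.
// If an option appears more than once, the last occurrence wins.
fn collect_unigram_opts(options: Vec<UnigramOption>) -> UnigramOpts {
    let mut opts = UnigramOpts::default();
    for option in options {
        match option {
            UnigramOption::UnkId(id) => opts.unk_id = Some(id),
            UnigramOption::ByteFallback(flag) => opts.byte_fallback = flag,
        }
    }
    opts
}
```

One behavioural difference is worth keeping in mind: `find` returns the first matching option, whereas this loop keeps the last one, so the two approaches only agree when each option appears at most once.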

_ => None,
};

let byte_fallback = match options
Contributor

Are we matching the same thing twice (here and inside matches!)? Is that optimised by the compiler?

Member Author

Nah, the outer match runs once, on the result of find (the element that was found, if any), while the inner matches! runs on each element.
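
To make the two levels concrete, here is a minimal sketch of the pattern. The `UnkId` variant and both function names are hypothetical; the second function shows `Iterator::find_map` from the standard library as one way to fold the check and the extraction into a single step.

```rust
// Hypothetical option enum; only `ByteFallback` is named in the diff.
#[derive(Debug, Clone, PartialEq)]
enum UnigramOption {
    UnkId(usize),
    ByteFallback(bool),
}

fn byte_fallback(options: &[UnigramOption]) -> Option<bool> {
    // Inner check: `matches!` runs on each element until one matches.
    let found = options
        .iter()
        .find(|opt| matches!(opt, UnigramOption::ByteFallback(_)));

    // Outer match: runs once, on the `Option<&UnigramOption>` from `find`,
    // to pull the value out of the variant.
    match found {
        Some(UnigramOption::ByteFallback(flag)) => Some(*flag),
        _ => None,
    }
}

// Same result with a single pattern match per element.
fn byte_fallback_find_map(options: &[UnigramOption]) -> Option<bool> {
    options.iter().find_map(|opt| match opt {
        UnigramOption::ByteFallback(flag) => Some(*flag),
        _ => None,
    })
}
```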

@@ -125,7 +125,7 @@ fn apply_load_options(mut tokenizer: ExTokenizerImpl, options: Vec<LoadOption>)
     }

     if opts.disable_truncation {
-        tokenizer.with_truncation(None);
+        tokenizer.with_truncation(None).unwrap();
Contributor

Did the API change to return an error? Why? I'm not sure unwrap handles it properly (maybe it does).
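
The diff suggests that `with_truncation` now returns a `Result`. The sketch below illustrates why unwrapping can be safe when `None` is passed. It uses hypothetical stand-in types rather than the real `tokenizers` crate, and the validation rule (stride must be smaller than the maximum length) is an assumption made for illustration only.

```rust
// Hypothetical stand-ins; these are NOT the real `tokenizers` types.
#[derive(Debug, Clone, PartialEq)]
struct TruncationParams {
    max_length: usize,
    stride: usize,
}

#[derive(Debug, Default)]
struct Tokenizer {
    truncation: Option<TruncationParams>,
}

impl Tokenizer {
    // Fallible setter: rejects parameters that cannot work together.
    // The rule checked here is assumed, not taken from the library.
    fn with_truncation(
        &mut self,
        params: Option<TruncationParams>,
    ) -> Result<&mut Self, String> {
        if let Some(p) = &params {
            if p.stride >= p.max_length {
                return Err(format!(
                    "stride ({}) must be smaller than max_length ({})",
                    p.stride, p.max_length
                ));
            }
        }
        self.truncation = params;
        Ok(self)
    }
}

// Disabling truncation passes `None`, so the validation branch is never
// taken and `unwrap` cannot panic in this sketch.
fn disable_truncation(tokenizer: &mut Tokenizer) {
    tokenizer.with_truncation(None).unwrap();
}
```

If the real setter can fail for reasons that do not depend on its argument, propagating the error with `?` would be the safer choice than `unwrap`.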
