Support for offset mapping? #22

Closed
xuxiaoxia96 opened this issue Aug 7, 2024 · 1 comment

Comments

@xuxiaoxia96

Hey!
Thanks for this great library! It let us avoid installing the whole transformers library just to use the tokenizer.

How can I map the tokens I get from the Hugging Face DistilBertTokenizer back to their positions in the input text?
e.g. "I have a new GPU" -> ["i", "have", "a", "new", "gp", "##u"] -> [(0, 1), (2, 6), ...]

I'm interested in this because, given attention values assigned to each token, I would like to show which part of the original text each token corresponds to, since the tokenized form is not friendly to non-ML people.

I have not found a solution to this. The library only supports the Encode and Decode methods. Any insights would be appreciated. Thank you!

@daulet
Owner

daulet commented Aug 9, 2024

@xuxiaoxia96 check out the latest release, and in particular this PR
