Support for offset mapping? #22

xuxiaoxia96 · 2024-08-07T11:29:00Z

Hey!
Thanks for this great library. It let us use the tokenizer without installing the whole transformers library!

How can I map the tokens I get from the Hugging Face DistilBertTokenizer back to their positions in the input text?
e.g. "I have a new GPU" -> ["i", "have", "a", "new", "gp", "##u"] -> [(0, 1), (2, 6), ...]

I'm interested in this because, if I have attention values assigned to each token, I would like to show which part of the original text each one corresponds to, since the tokenized version is not friendly to non-ML people.

I have not found a solution to this. The library only supports the Encode and Decode methods. Any insights would be appreciated. Thank you!


daulet · 2024-08-09T23:23:08Z

@xuxiaoxia96 check out the latest release, and in particular this PR

xuxiaoxia96 closed this as completed Aug 10, 2024
