I found the previous discussion in Multi-word segmentation #220 really interesting, and I learned that the project members have already experimented with segmentation beyond the word level on MT datasets without seeing significant improvement.

I think this is because the sub-word vocabulary was already trained on the MT data, so there is little room to improve effectiveness by changing the granularity, although increasing the granularity can bring an efficiency boost. In the era of pretrained models, however, I have been rethinking how to change the granularity and compositionality of generation in the downstream domain.

Recently, our work (https://arxiv.org/abs/2310.05317) provides a way to let a pretrained model adopt a task-adaptive tokenizer that supports variable segmentation optimized on the downstream data. It then allows multiple coarser-grained segmentations (still at the sub-word level) to be sampled. This brings significant improvements in both generation effectiveness and efficiency for tasks where task-specific terminology frequently appears (e.g., medical, mental health).

The improvement comes from two sources: 1. the gap between the pretraining vocabulary (for example, the BERT vocabulary is optimized on the GNMT benchmark, which may suit MT but not other tasks) and the downstream language style; 2. the potential of variable segmentation for efficiency.

To build a task-adaptive tokenizer, I currently sew the pretraining vocabulary and the downstream vocabulary together manually, using the ProtoBuf APIs provided by sentencepiece_model_pb2.py and sentencepiece_pb2.py, and then build a new tokenizer compatible with HuggingFace (see the sketch below). I was wondering whether your project would be interested in providing a function that lets researchers easily build a task-adaptive tokenizer.
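For context, here is a rough sketch of the manual "sewing" step I described. This is not the exact code from our paper, just an illustration using the SentencePiece ProtoBuf API; the model file names are placeholders, it assumes both models are unigram models, and it simply appends downstream pieces that are missing from the pretrained vocabulary without re-normalizing their scores:

```python
# Minimal sketch: merge a downstream SentencePiece vocabulary into a
# pretrained one via the protobuf API (sentencepiece_model_pb2).
# File names are placeholders; score re-normalization is omitted.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_model

pretrained = sp_model.ModelProto()
with open("pretrained.model", "rb") as f:
    pretrained.ParseFromString(f.read())

downstream = sp_model.ModelProto()
with open("downstream.model", "rb") as f:
    downstream.ParseFromString(f.read())

existing = {p.piece for p in pretrained.pieces}

# Append downstream pieces not already in the pretrained vocabulary,
# keeping the unigram scores from the downstream model.
for p in downstream.pieces:
    if p.piece in existing:
        continue
    new_piece = pretrained.pieces.add()
    new_piece.piece = p.piece
    new_piece.score = p.score
    new_piece.type = sp_model.ModelProto.SentencePiece.NORMAL

with open("merged.model", "wb") as f:
    f.write(pretrained.SerializeToString())

# The merged model loads like any other SentencePiece model and can back a
# HuggingFace tokenizer that wraps SentencePiece.
sp = spm.SentencePieceProcessor(model_file="merged.model")
print(sp.encode("task-specific terminology", out_type=str))
```

A built-in helper that does this merging (and optionally rescales the unigram scores) would make it much easier for researchers to experiment with task-adaptive tokenizers.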
I read the paper; is there any code available that showcases the algorithm?
@RubyBit Hello. I am currently organizing the code, but I can add you to our repository in advance. If you would like to join the repository, please give me your GitHub account.
Yes, that would be great (this is my GitHub account: RubyBit).