Distillation involves training a smaller model, known as a student model, to mimic the behavior of a larger model, known as a teacher model. The teacher model is usually a state-of-the-art LLM like GPT-4, which has a large number of parameters and requires significant computational resources. The smaller student model can be trained faster and with fewer computational resources.
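As a rough illustration, the classic distillation objective trains the student to match the teacher's softened output distribution in addition to the hard labels. The sketch below assumes generic `teacher`/`student` classifiers that expose logits; the temperature and mixing weight are illustrative hyperparameters, not values from any particular system.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft loss (mimic the teacher) with a hard loss (true labels)."""
    # Soften both distributions with the temperature, then match them via KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean")
    soft_loss = soft_loss * temperature ** 2  # standard rescaling of soft-loss gradients

    # Ordinary cross-entropy against the hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```

Note that when the teacher is an API-only LLM such as GPT-4, its logits are typically unavailable, so distillation in practice often reduces to training the student on teacher-generated labels, as discussed next.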
To circumvent the deployment challenges posed by such large models, practitioners often choose to deploy smaller specialized models instead. These smaller models are trained using one of two common paradigms: fine-tuning or distillation. Fine-tuning updates a pre-trained smaller model using downstream manually annotated data, while distillation trains the same smaller models on labels generated by a larger LLM. Unfortunately, to achieve performance comparable to LLMs, fine-tuning requires human-generated labels, which are expensive and tedious to obtain, while distillation requires large amounts of unlabeled data, which can also be hard to collect.
Distilling step-by-step consists of two main stages:
In the first stage, we leverage few-shot CoT prompting to extract rationales from LLMs.
Specifically, given a task, we prepare few-shot exemplars in the LLM input prompt, where each exemplar is a triplet of (1) input, (2) rationale, and (3) output.
Given the prompt, an LLM can mimic the triplet demonstrations to generate a rationale for any new input. For instance, in a commonsense question answering task, given the input question “Sammy wanted to go to where the people are. Where might he go? Answer Choices: (a) populated areas, (b) race track, (c) desert, (d) apartment, (e) roadblock”, the LLM produces the correct answer, “(a) populated areas”, paired with a rationale that connects the question to the answer: “The answer must be a place with a lot of people. Of the above choices, only populated areas have a lot of people.” By providing CoT exemplars paired with rationales in the prompt, in-context learning enables LLMs to output corresponding rationales for future unseen inputs.
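As a minimal sketch of this rationale-extraction stage, the snippet below builds a few-shot CoT prompt from (input, rationale, output) triplets and parses a completion back into a rationale and a label. The exemplar content comes from the example above; the `Q:`/`A:` formatting and the “So the answer is” delimiter are assumed conventions, and any LLM completion API could consume the resulting prompt.

```python
EXEMPLARS = [
    {
        "input": ("Sammy wanted to go to where the people are. Where might he go? "
                  "Answer Choices: (a) populated areas, (b) race track, (c) desert, "
                  "(d) apartment, (e) roadblock"),
        "rationale": ("The answer must be a place with a lot of people. Of the above "
                      "choices, only populated areas have a lot of people."),
        "output": "(a) populated areas",
    },
    # ... additional (input, rationale, output) triplets for the task
]

def build_cot_prompt(new_input: str) -> str:
    """Format the triplets so the LLM imitates the rationale-then-answer pattern."""
    parts = [
        f"Q: {ex['input']}\nA: {ex['rationale']} So the answer is {ex['output']}."
        for ex in EXEMPLARS
    ]
    parts.append(f"Q: {new_input}\nA:")  # the LLM completes this final block
    return "\n\n".join(parts)

def parse_completion(completion: str) -> tuple[str, str]:
    """Split an LLM completion into (rationale, label) for the training set."""
    rationale, _, label = completion.rpartition("So the answer is")
    return rationale.strip(), label.strip().rstrip(".")
```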
In the second stage, these extracted rationales are used alongside the task labels to train the smaller model in a multi-task setup: the student learns both to predict the label and to generate the rationale. As a result, distilling step-by-step reduces both the amount of training data needed to curate task-specific smaller models and the model size required to match, or even surpass, the performance of a few-shot prompted LLM.
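As a minimal sketch of this second stage, assuming a seq2seq student such as a T5-style model loaded via Hugging Face transformers: the student is trained on two tasks distinguished by textual prefixes, with a weight `lam` on the rationale-generation loss. The `[label]`/`[rationale]` prefixes and the field names in `batch` are illustrative assumptions about the setup, not a definitive implementation.

```python
def multitask_step(student, tokenizer, batch, lam=1.0):
    """One training step: label-prediction loss + rationale-generation loss."""
    # Task 1 -- label prediction: "[label] <input>" -> "<label>"
    enc = tokenizer([f"[label] {x}" for x in batch["inputs"]],
                    return_tensors="pt", padding=True)
    tgt = tokenizer(batch["labels"], return_tensors="pt", padding=True)
    # (In practice, pad token ids in the targets would be replaced with -100
    #  so the loss ignores them.)
    label_loss = student(**enc, labels=tgt.input_ids).loss

    # Task 2 -- rationale generation: "[rationale] <input>" -> "<rationale>"
    enc = tokenizer([f"[rationale] {x}" for x in batch["inputs"]],
                    return_tensors="pt", padding=True)
    tgt = tokenizer(batch["rationales"], return_tensors="pt", padding=True)
    rationale_loss = student(**enc, labels=tgt.input_ids).loss

    # Weighted sum of the two objectives; gradients flow through both tasks.
    return label_loss + lam * rationale_loss
```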