Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling the distillation of generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models with generating programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, obtains performance comparable to or better than large language model-based annotation on a range of tasks for a fraction of the cost: on average, a 12.9% improvement, while total labeling costs across all datasets are reduced by a factor of approximately 500x.
Three modes for generating labeling programs are currently supported:
- ScriptoriumWS mode: One stage. An LLM of your choice is prompted to generate labeling functions directly for the given dataset.
- Alchemist without RAG mode: Two stages. In the first stage, it prompts a language model (such as GPT or Claude) to generate heuristics from its prior knowledge and the user's mission statement. In the second stage, it uses these heuristics to prompt a CodeLLM, which produces labeling functions for the given dataset.
- Alchemist with RAG mode: Two stages. In the first stage, it retrieves relevant information from a vector store holding pre-embedded documents or data related to the dataset and uses that knowledge to generate heuristics. In the second stage, it uses those heuristics to prompt a CodeLLM to create labeling functions.
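
In every mode, the end product is a set of small labeling programs that can be stored and applied locally. For illustration only, the snippet below sketches what generated labeling functions might look like for the youtube spam task; the exact signatures and label conventions that Alchemist emits may differ.

```python
# Illustrative sketch only: the exact format of the generated labeling
# functions may differ from what Alchemist actually produces.
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_subscribe(text: str) -> int:
    """Keyword heuristic: comments urging viewers to subscribe are often spam."""
    return SPAM if "subscribe" in text.lower() else ABSTAIN

def lf_short_comment_without_link(text: str) -> int:
    """Heuristic: short comments that contain no link tend to be legitimate."""
    if len(text.split()) < 5 and "http" not in text.lower():
        return HAM
    return ABSTAIN
```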
- Create a `.env` file in the Alchemist folder and add `export OPENAI_API_KEY=<your API key>` to it (a minimal sketch of how such a key is read appears after this list).
- If desired, modify the `LF_saved_dir` field in `config.json` to the directory where you wish the generated labeling functions to be saved.
- Run the `python main.py` command in the terminal. This should start an interactive command-line interface; see the following section on user inputs.
- Type `exit` at any time to stop execution of the program.
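
The repository's own key-loading code is not reproduced here. As a hedged illustration, a key placed in `.env` is typically picked up along these lines; the use of `python-dotenv` below is an assumption, not a statement about how Alchemist actually reads it.

```python
# Hypothetical illustration of reading the key configured in .env.
# Alchemist may load it differently (e.g., by sourcing the file in the shell).
import os
from dotenv import load_dotenv  # assumes the python-dotenv package is installed

load_dotenv()  # parses key=value (and `export key=value`) lines from .env
api_key = os.environ["OPENAI_API_KEY"]
```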
Upon running `main.py`, you will be prompted to provide various inputs through the terminal. Here's what to expect:

- Dataset and Task Description: Initially, you will describe your dataset and the labeling task at hand. This ensures that the generated labeling functions are aligned with your specific requirements.
- Dataset Selection: Choose the dataset that you'd like to generate labeling functions for:
  - youtube (spam review classification)
  - sms (spam text classification)
  - imdb (sentiment classification)
  - yelp (sentiment classification)
  - agnews (topic classification)
  - medabs (topic classification)
  - cancer (topic classification)
  - reddit (topic classification)
  - french (sentiment classification)
  - finance (sentiment classification)
- Mode Selection: Choose the mode that best fits your needs:
  - ScriptoriumWS mode
  - Alchemist without RAG mode
  - Alchemist with RAG mode
- LLM Selection for Code Generation: Choose the model you'd like to use for labeling function code generation:
  - gpt-3.5-turbo (gpt-3.5-turbo-0125)
  - gpt-4 (gpt-4-0613)
  - claude 2.1 (claude-2.1)
  - claude 3 Sonnet (claude-3-sonnet-20240229)
- LLM Selection for Prior Knowledge Generation: If you selected "Alchemist without RAG mode" or "Alchemist with RAG mode", you will be prompted to select the model you'd like to use for prior knowledge generation:
  - gpt-3.5-turbo (gpt-3.5-turbo-0125)
  - gpt-4 (gpt-4-0613)
  - claude 2.1 (claude-2.1)
  - claude 3 Sonnet (claude-3-sonnet-20240229)
- Specify RAG Path: If you selected "Alchemist with RAG mode", you will be prompted to specify the file path to your external knowledge.
- Specify Heuristics for Generating Prior Knowledge: If you selected "Alchemist without RAG mode" or "Alchemist with RAG mode", you must select which type of prior knowledge to generate:
  - labeling heuristics, rules, and guidance
  - keywords
  - dataset and class description
  - 5 data examples for each class
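
After these selections, the generated labeling functions are written to `LF_saved_dir` and can be applied to unlabeled data entirely offline. The sketch below is a hypothetical downstream example, not code shipped with this repository: it combines the votes of the generated functions by simple majority vote (a weak-supervision label model could be used instead).

```python
# Hypothetical downstream usage (not part of the Alchemist codebase):
# apply saved labeling functions to a text and combine their votes.
from collections import Counter
from typing import Callable, List

ABSTAIN = -1

def majority_vote(text: str, lfs: List[Callable[[str], int]]) -> int:
    """Return the most common non-abstain vote, or ABSTAIN if none fired."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]
```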
The repository is organized as follows:

- `main.py`: Main entry point; starts the interactive command line and launches the different modes based on user inputs.
- `config.py`: Manages the initial setup by capturing user inputs to configure the application accordingly.
- `executor.py`: Instantiates the corresponding mode object and initiates the execution process tailored to that mode.
- `modes/`: Directory containing the code for the different modes; see above for descriptions of the last three modes (a simplified sketch of this class hierarchy appears after this list).
  - `base_mode.py`: Parent class for all modes.
  - `scriptoriumws_mode.py`
  - `alchemist_without_RAG_mode.py`
  - `alchemist_with_RAG_mode.py`
- `pricing.py`: Parses the generated labeling function files and returns the total cost by dataset, mode, and heuristic mode.
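
As a hedged, simplified sketch of the mode layout described above (class and method names are illustrative assumptions, not the repository's actual API):

```python
# Simplified, illustrative outline of the mode hierarchy described above.
# Class and method names are assumptions; the repository's actual API may differ.
class BaseMode:
    """Parent class for all modes (cf. base_mode.py)."""

    def __init__(self, config: dict):
        self.config = config

    def run(self) -> None:
        raise NotImplementedError


class ScriptoriumWSMode(BaseMode):
    def run(self) -> None:
        # Single stage: prompt a code LLM to generate labeling functions directly.
        ...


class AlchemistWithoutRAGMode(BaseMode):
    def run(self) -> None:
        # Stage 1: prompt an LLM for heuristics from prior knowledge and the
        # user's mission statement. Stage 2: prompt a CodeLLM for labeling functions.
        ...


class AlchemistWithRAGMode(BaseMode):
    def run(self) -> None:
        # Stage 1: retrieve context from a vector store and generate heuristics.
        # Stage 2: prompt a CodeLLM for labeling functions.
        ...
```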
Please cite our paper if you find the repository helpful.
@inproceedings{huang2024the,
title={The {ALCHE}mist: Automated Labeling 500x {CHE}aper than {LLM} Data Annotators},
author={Tzu-Heng Huang and Catherine Cao and Vaishnavi Bhargava and Frederic Sala},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=T0glCBw28a}
}