LLM Annotations

With factgenie, you can query LLMs to annotate spans in the generated outputs:

[Image: LLM eval scheme]

✨ LLM APIs

The LLMs need to run externally: factgenie will query them through an API. Currently, factgenie supports the following APIs:

  • OpenAI API,
  • Ollama API.

💡 In principle, factgenie can operate with any API as long as the response is in JSON format: see factgenie/metrics.py. If you wish to contribute here, please see the Contributing guidelines.
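
For example, if you use Ollama, you can quickly check that your server is reachable and returns JSON with a curl call such as the one below. The host, port, and model name are placeholders; replace them with your own values:

curl http://my-server.com:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Reply with a JSON object containing an empty list under the key \"annotations\".",
  "format": "json",
  "stream": false
}'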

🚦 Setting up an LLM evaluation campaign

You can easily set up a new LLM eval (i.e., a campaign using LLMs for annotations) through the web interface:

  1. Go to /llm_campaign?mode=llm_eval and click on New LLM campaign.
  2. Insert a unique campaign ID.
  3. Configure the LLM evaluator.
  4. Select the datasets and splits you want to annotate.

Let us now look at steps 3 and 4 in more detail.

🛠️ Configuring the LLM evaluator

The fields you need to configure are the following:

  • API: The API used to query the model (e.g. OpenAI or Ollama).
  • Model: The identifier of the model you are querying.
  • Prompt template: The prompt for the model.
    • ❗️ See below for instructions on setting up the prompt.
  • System message: For the OpenAI API, this will be sent in the system_message parameter.
  • API URL: For the Ollama API, this is the API URL, e.g. http://my-server.com:11434/api/generate. The parameter is ignored for the OpenAI API.
  • Model arguments: Additional arguments for the model API.
  • Annotation categories: Names and colors of the categories you want to annotate. You need to specify the details about the categories in the prompt (see below for details). The colors will be later used for the highlights in the web interface.
  • Extra arguments: Additional arguments for the metric class.

The pre-defined YAML configurations for LLM campaigns are stored in factgenie/config/llm-eval. If you wish, you can edit these files manually; you can also save a configuration to a YAML file through the web interface.
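
For illustration, a configuration could look roughly like the sketch below. The field names are only indicative; refer to the pre-defined files in factgenie/config/llm-eval for the exact format used by your factgenie version:

api: openai                          # the model API (OpenAI or Ollama)
model: gpt-4o-mini                   # identifier of the queried model
system_message: You are an expert error annotator for data-to-text outputs.
api_url: http://my-server.com:11434/api/generate   # used by Ollama, ignored by OpenAI
model_args:
  temperature: 0.0
annotation_span_categories:
  - name: Incorrect fact
    color: "#e74c3c"
  - name: Not checkable
    color: "#f39c12"
prompt_template: "...see the prompt instructions below..."
extra_args: {}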

After creating the campaign, all the configuration parameters will be saved in the file factgenie/annotations/<llm-eval-id>/metadata.json.

💬 Configuring the prompt

It is important to set up the prompt for the model correctly so that you get accurate results from the LLM evaluator.

If you wish to include the input data and the generated output in the prompt, use the following placeholders:

  • {data} for inserting the raw representation of the input data,
  • {text} for inserting the output text.

The placeholders will be replaced with the actual values for each example.
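
For example, a minimal (purely illustrative) prompt template using both placeholders might start like this:

Given the following input data:

{data}

Annotate all problematic spans in the text that was generated from the data:

{text}

Output the annotations as a JSON object in the format described below.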

Factgenie also needs to parse the model response. Even though our example API calls should return valid JSON, you still need to prompt the model to produce JSON outputs in a specific format:

{
  "annotations": [
    { 
      "text": [TEXT_SPAN],
      "type": [ANNOTATION_SPAN_CATEGORY]
    },
    ...
  ]
}

where:

  • TEXT_SPAN is the actual text snippet from the output.
  • ANNOTATION_SPAN_CATEGORY is a number from the list of annotation categories.

💡 Optionally, you can also ask the model for the field reason (or note, which is equivalent), which is a string containing a reasoning trace about the annotation. The content of this field will be visible on hover in the web interface.

For instructing the model about the annotation categories, you can include a variant of the following snippet in the prompt (customized to your needs):

The value of "type" is one of {0, 1, 2, 3} based on the following list:
- 0: Incorrect fact: The fact in the text contradicts the data.
- 1: Not checkable: The fact in the text cannot be checked in the data.
- 2: Misleading: The fact in the text is misleading in the given context.
- 3: Other: The text is problematic for another reason, e.g. grammatically or stylistically incorrect, irrelevant, or repetitive.
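
Putting it together, a well-formed model response using these categories (including the optional reason field) might look as follows; the spans and reasons are made up for illustration:

{
  "annotations": [
    {
      "text": "scored 4 goals",
      "type": 0,
      "reason": "The data lists only 2 goals for this player."
    },
    {
      "text": "in front of a record crowd",
      "type": 1,
      "reason": "The attendance is not mentioned in the data."
    }
  ]
}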

Note that we do not ask for indices since the model cannot reliably index the output text. Instead, we perform forward string matching (see the sketch below):

  • If we match the string in the output, we shift the initial position for matching the next string to the next character after the currently matched string.
  • If we do not match the string in the output, we ignore the annotation.

❗️ You should ask the model to order the annotations sequentially for this algorithm to work properly.
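
As an illustration, the forward matching described above can be sketched in a few lines of Python. This is a simplified standalone version, not factgenie's actual implementation:

def match_annotations(output_text: str, annotations: list[dict]) -> list[dict]:
    """Assign character offsets to annotated spans via forward string matching."""
    matched = []
    search_from = 0  # position from which the next span is searched
    for ann in annotations:
        span = ann["text"]
        start = output_text.find(span, search_from)
        if start == -1:
            # span not found after the current position: ignore the annotation
            continue
        matched.append({**ann, "start": start, "end": start + len(span)})
        # shift the search position to the character right after the matched span
        search_from = start + len(span)
    return matched

output = "The home team scored 4 goals in front of a record crowd."
annotations = [
    {"text": "scored 4 goals", "type": 0},
    {"text": "in front of a record crowd", "type": 1},
]
print(match_annotations(output, annotations))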

💡 The provided examples should help you with setting up the prompt.

📚️ Configuring the data

In the next step, you can select the datasets and splits you want to annotate.

Note that to make the selection process easier, we always select the Cartesian product of the selected datasets, splits, and model outputs (existing combinations only).

You can then filter the selected combinations in the box below.

[Image: Data selection]

After the campaign is created, the selected examples will be listed in factgenie/annotations/<llm-eval-id>/db.csv. You can edit this file before starting the campaign if you wish to customize the data selection.
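
Each row of db.csv corresponds to one example to be annotated. A hypothetical excerpt (the column names are illustrative; check the header of the generated file for the actual ones) might look like this:

dataset,split,setup_id,example_idx,status
quintd1-ice_hockey,test,llama2,0,free
quintd1-ice_hockey,test,llama2,1,free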

🤖 Running an LLM eval

After the LLM evaluation campaign is created, it will appear in the list on the /llm_eval page:

[Image: LLM eval]

Now you need to go to the campaign details and run the evaluation. The annotated examples will be marked as finished:

[Image: LLM eval detail]

You can view the annotations from the model as soon as they are received.

💻 Command line interface

Alternatively, you can run an evaluation campaign from the command line using factgenie run-llm-campaign.

First, make sure that you have prepared a .yaml configuration file in factgenie/config/llm-eval with the corresponding parameters.

Then run the following command:

factgenie run-llm-campaign \
  --mode llm_eval \
  --campaign_id "campaign_id" \
  --dataset_id "dataset_id" \
  --split "split" \
  --setup_id "setup_id" \
  --llm_metric_config "config_file_name" \
  [--overwrite]

Example:

factgenie run-llm-campaign --mode llm_eval --campaign_id llm-eval-test --dataset_id quintd1-ice_hockey --split test --setup_id llama2 --llm_metric_config openai-gpt3.5.yaml

If you use the --overwrite flag, any campaign with the existing campaign_id will be removed first.
