
02 Data Management

Zdenek Kasner edited this page Jul 31, 2024 · 11 revisions

Data Management

📔 Terminology

Before you start using factgenie, you need to have a dataset and the corresponding generated outputs. You will need to load each of these separately into factgenie.

We assume that you already have these from your previous experiments.

Let us first dive into these concepts!

📊 Dataset

A dataset is one of the basic concepts in factgenie. It is represented by a Python class that can load the input data, visualize them, and manage the corresponding outputs.

Each dataset contains one or more splits. For example, a dataset can have train, dev and test splits, as is common in a machine learning setup. Note that in factgenie, the splits can be named arbitrarily: they only serve as a more fine-grained division of the data.

Each split contains 1 to N examples. The examples are our "input data": the data that were presented to the model to generate the output. More importantly, it is something that will help the annotators to assess the factual accuracy of the output.

What can the examples look like? Generally, the examples can be anything that can be displayed in the web interface, from plain text to images, tables, or charts. However, if you are not that familiar with Python and want to use one of the pre-defined data loaders, you will need to get by with one of several basic data formats.

📄 Generated outputs

The generated outputs are plain text strings corresponding to input examples.

Each output was generated with a specific setup, for example an LLM with a particular prompt and a set of hyperparameters.

Later, we will learn how to annotate the outputs, i.e. assign categories to specific text spans.

🥽 Overview

Here is a brief overview of the concepts as displayed on the Browse page in the web interface:

[Image: Main screen]

๐Ÿ—‚๏ธ Adding datasets

There are two ways to integrate your data into factgenie:

  • basic: using pre-defined data loaders
  • advanced: using custom data loaders

The pre-defined data loaders support a limited set of data visualizations for common formats such as .txt, .jsonl, .csv or .html. If you decide to use the pre-defined loaders, adding a dataset is as simple as uploading the data file through the web interface.

To add a custom data loader, you will need to write a Python class implementing the data loader yourself. The advantage is that you can customize the data loading and rendering methods.

โœ๏ธ Basic

To use a pre-defined data loader, navigate to /datasets and select Add dataset. You can select from the following formats:

| Format | Input | Notes |
|--------|-------|-------|
| Text | Plain text file containing one input example per row. | Each row will be displayed as a separate example. If the text contains newlines, make sure to escape them with \\n. |
| JSONL | JSONL (JSON Lines) file containing one input example per row. | Each JSON object will be visualized using the json2table package. |
| CSV | CSV file with headers containing one input example per row. | Each row will be displayed as a set of key-value pairs where the values are the row values and the keys are the corresponding column headings. |
| HTML | ZIP archive with a set of HTML files in the root directory (and optionally external files). | Each of the HTML pages will be displayed as a separate example. The files will be sorted by name, taking numbers into account. You can include external files by using relative paths, e.g. for images in the ./img subdirectory use <img src="img/xyz.png"/>. |

Note that you will need to upload each split in a separate file.
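To illustrate the formats above, here is a minimal sketch that writes the same split as a text, a JSONL and a CSV file using only the Python standard library. The example records, the file names and the temporary output directory are made up for illustration. We also assume that the escape written as \\n above means a backslash followed by the letter n.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical input examples; replace them with your own data.
examples = [
    {"city": "Prague", "temperature_c": 25},
    {"city": "Oslo", "temperature_c": 12},
]
# Plain-text examples may span several lines before escaping.
text_examples = ["Prague\nSunny, 25 C", "Oslo\nRain, 12 C"]

out_dir = Path(tempfile.mkdtemp())  # any writable directory works

# Text: one example per line, newlines inside an example are escaped.
text_path = out_dir / "test.txt"
with text_path.open("w", encoding="utf-8") as f:
    for example in text_examples:
        f.write(example.replace("\n", "\\n") + "\n")

# JSONL: one JSON object per line.
jsonl_path = out_dir / "test.jsonl"
with jsonl_path.open("w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

# CSV: a header row followed by one example per row.
csv_path = out_dir / "test.csv"
with csv_path.open("w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["city", "temperature_c"])
    writer.writeheader()
    writer.writerows(examples)
```

Each of the resulting files holds a single split, so a dataset with several splits needs one such file per split.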

๐Ÿ–Œ๏ธ Advanced

You can also write a custom data loader. This lets you load data in arbitrary formats and customize the data visualization to your needs.

To write a data loader, you need to:

  1. Create a new file <your-module>.py in factgenie/loaders.
  2. Add a class to <your-module>.py. The class needs to be a subclass of the Dataset class from factgenie/loaders/dataset.py.
  3. Implement the load_data() and render() methods:
    • The load_data() method receives two arguments: split and data_path.
      • By default, data_path is factgenie/data/<dataset_id>.
      • You can ignore this argument if you are loading your dataset from elsewhere, e.g. from Huggingface Datasets.
      • The method needs to return a list of examples for the given split.
    • The render() method receives a single example and needs to return a str containing an HTML representation of the example.
  4. Optionally override any other methods of the Dataset class.
  5. Add your dataset to datasets.yml (see below).

To make your job easier, we provide a set of example data loaders in factgenie/loaders: see the Example Datasets page.
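The steps above can be sketched roughly as follows. This is a minimal illustration, not one of the bundled loaders: the Dataset class below is only a stand-in so that the snippet runs on its own, and in a real loader you would subclass the Dataset class from factgenie/loaders/dataset.py instead (its constructor may take arguments that are not shown here). The class name MyDataset and the <split>.jsonl file layout are assumptions made for the example.

```python
import html
import json
import tempfile
from pathlib import Path


class Dataset:
    """Stand-in for factgenie's Dataset base class (illustration only)."""


class MyDataset(Dataset):
    def load_data(self, split, data_path):
        # Assumed layout (hypothetical): <data_path>/<split>.jsonl,
        # with one JSON object per line.
        path = Path(data_path) / f"{split}.jsonl"
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def render(self, example):
        # Return a str with an HTML representation of one example.
        # Keys and values are escaped so that they are displayed literally.
        rows = "".join(
            f"<tr><th>{html.escape(str(key))}</th>"
            f"<td>{html.escape(str(value))}</td></tr>"
            for key, value in example.items()
        )
        return f"<table>{rows}</table>"


# Usage with a throwaway directory standing in for factgenie/data/<dataset_id>.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "dev.jsonl").write_text(
    '{"name": "Ada", "note": "a < b"}\n', encoding="utf-8"
)
dataset = MyDataset()
examples = dataset.load_data("dev", data_dir)
print(dataset.render(examples[0]))
```

The sketch renders every example as a two-column table. Since render() only needs to return a string of HTML, you can replace the table with any markup that suits your data.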

๐Ÿ—’๏ธ Adding model outputs

Similarly to adding datasets, there are two ways to add model outputs to factgenie:

  • through the web interface,
  • manually.

๐Ÿ•ธ๏ธ Web interface

To add the model outputs through the web interface, prepare a plain text file with one output per line:

... output for example 1 ...
... output for example 2 ...
.
.
.

... output for example N ...

If the text contains newlines, make sure to escape them with \\n.

In factgenie, navigate to /model_outputs and select Upload outputs. In the modal window, select the corresponding dataset and split, and input a unique identifier for the outputs.

Note that the total number of lines needs to match the total number of examples in the split.
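If your outputs live in a Python list, a small helper along these lines can produce the upload file. It is only a sketch: write_outputs, the sample outputs and the file name are hypothetical, and we again assume that the escape written as \\n above means a backslash followed by the letter n. The helper refuses to write the file when the number of outputs does not match the number of examples in the split.

```python
import tempfile
from pathlib import Path


def write_outputs(outputs, path, expected_count):
    """Write one output per line, escaping newlines inside each output."""
    if len(outputs) != expected_count:
        raise ValueError(
            f"got {len(outputs)} outputs for {expected_count} examples"
        )
    with Path(path).open("w", encoding="utf-8") as f:
        for output in outputs:
            # Normalize Windows line endings, then escape the newlines.
            escaped = output.replace("\r\n", "\n").replace("\n", "\\n")
            f.write(escaped + "\n")


# Hypothetical outputs for a split with two examples.
outputs = ["It will be sunny.\nBring a hat.", "Expect rain."]
path = Path(tempfile.mkdtemp()) / "outputs.txt"
write_outputs(outputs, path, expected_count=2)
```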

๐Ÿ‘ Manually

You can also add the model outputs manually by placing a JSON file with the outputs in the factgenie/outputs/<dataset_id>/<split> directory.

The JSON file with the outputs needs to have the following structure:

{
  "dataset": "<dataset_id>",
  "setup": {"id": "<setup_id>"},
  "generated": [
        {"out": "<output_1>"},
        {"out": "<output_2>"},
        ...
        {"out": "<output_N>"}
  ]
}

You can optionally include any other fields in the JSON file.
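As an illustration, a file with this structure could be generated as follows. The identifiers my-dataset and my-model-v1, the helper name build_outputs_record and the output file name are assumptions made for the example, and a temporary directory stands in for factgenie/outputs.

```python
import json
import tempfile
from pathlib import Path


def build_outputs_record(dataset_id, setup_id, outputs):
    """Arrange the outputs in the structure shown above."""
    return {
        "dataset": dataset_id,
        "setup": {"id": setup_id},
        "generated": [{"out": text} for text in outputs],
    }


# Hypothetical identifiers and outputs.
record = build_outputs_record(
    "my-dataset", "my-model-v1", ["First output.", "Second output."]
)

# Stand-in for factgenie/outputs/<dataset_id>/<split>.
out_dir = Path(tempfile.mkdtemp()) / "my-dataset" / "test"
out_dir.mkdir(parents=True)
out_path = out_dir / "my-model-v1.json"  # the file name is an assumption
out_path.write_text(
    json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
)
```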

🧤 Managing datasets and model outputs

Factgenie provides a user-friendly web interface for managing both the datasets and model outputs.

This is the web interface for managing datasets that you can find under /datasets:

[Image: Main screen]

Similarly, you can find the interface for managing model outputs under /model_outputs.

As mentioned above, you can use the web interface to add new datasets and outputs (as long as they are in a pre-defined format). You can also delete datasets or model outputs, among several other options.

Of course, you can also click through to view the datasets and model outputs on the /browse page:

[Image: Main screen]

datasets.yml

The dataset configuration is stored in factgenie/loaders/datasets.yml.

The file has the following structure:

datasets:
  <dataset_id>:
    class: <module>.<class> # module is the filename (without .py)
    description: <dataset_description> # can contain HTML tags
    enabled: <boolean>
    splits:
    - <split1>
    - ...
    - <splitN>
    type: {json,table,text,default} # for now determines just an icon

You can manage the datasets by editing this file. Note that you may need to restart the factgenie server after the edit.
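When editing the file by hand, a quick sanity check of an entry can catch typos before you restart the server. The sketch below works on an entry that has already been parsed into a Python dict (for example with a YAML parser of your choice). The function name, the sample entries and the checks themselves are assumptions derived from the template above, not factgenie's own validation. In particular, treating every field in the template as required is our assumption.

```python
ALLOWED_TYPES = {"json", "table", "text", "default"}
EXPECTED_KEYS = {"class", "description", "enabled", "splits", "type"}


def check_dataset_entry(entry):
    """Return a list of problems found in one parsed dataset entry."""
    problems = [
        f"missing key: {key}" for key in sorted(EXPECTED_KEYS - set(entry))
    ]
    if "class" in entry:
        # The value has the form <module>.<class>; split on the last dot.
        module, _, class_name = str(entry["class"]).rpartition(".")
        if not module or not class_name:
            problems.append("class should have the form <module>.<class>")
    if "enabled" in entry and not isinstance(entry["enabled"], bool):
        problems.append("enabled should be a boolean")
    if "splits" in entry and not (
        isinstance(entry["splits"], list) and entry["splits"]
    ):
        problems.append("splits should be a non-empty list")
    if "type" in entry and entry["type"] not in ALLOWED_TYPES:
        problems.append(f"type should be one of {sorted(ALLOWED_TYPES)}")
    return problems


# Hypothetical entries: the first follows the template, the second does not.
good_entry = {
    "class": "my_module.MyDataset",
    "description": "Weather <b>forecasts</b>",
    "enabled": True,
    "splits": ["dev", "test"],
    "type": "json",
}
bad_entry = {"class": "MyDataset", "enabled": "yes", "splits": [], "type": "image"}

good_problems = check_dataset_entry(good_entry)
bad_problems = check_dataset_entry(bad_entry)
```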
