diff --git a/README.md b/README.md
index 04970ee..8fd95a1 100644
--- a/README.md
+++ b/README.md
@@ -36,6 +36,215 @@ To start InspectorRAGet in production mode, please run the following command.
 yarn start
 ```
 
+## Usage
+
+Once you've started the InspectorRAGet application, the next step is to format your experiment results in the prescribed format.
+
+The experiment result file can be broadly split into six sections along functional boundaries. The first section captures general details about the experiment in the `name`, `description`, and `timestamp` fields. The second and third sections describe the sets of models and metrics used in the experiment via the `models` and `metrics` fields, respectively. The last three sections cover the dataset and the outcome of the evaluation experiment in the form of the `documents`, `tasks`, and `evaluations` fields.
+
+#### 1. Metadata
+
+```json
+{
+  "name": "Sample experiment name",
+  "description": "Sample experiment description",
+  ...
+```
+
+#### 2. Models
+
+```json
+  "models": [
+    {
+      "model_id": "model_1",
+      "name": "Model 1",
+      "owner": "Model 1 owner"
+    },
+    {
+      "model_id": "model_2",
+      "name": "Model 2",
+      "owner": "Model 2 owner"
+    }
+  ],
+```
+
+Notes:
+
+1. Each model must have a unique `model_id` and `name`.
+
+#### 3. Metrics
+
+```json
+  "metrics": [
+    {
+      "name": "metric_a",
+      "display_name": "Metric A",
+      "description": "Metric A description",
+      "author": "algorithm | human",
+      "type": "numerical",
+      "aggregator": "average",
+      "range": [0, 1, 0.1]
+    },
+    {
+      "name": "metric_b",
+      "display_name": "Metric B",
+      "description": "Metric B description",
+      "author": "algorithm | human",
+      "type": "categorical",
+      "aggregator": "majority | average",
+      "values": [
+        {
+          "value": "value_a",
+          "display_value": "A",
+          "numeric_value": 1
+        },
+        {
+          "value": "value_b",
+          "display_value": "B",
+          "numeric_value": 0
+        }
+      ]
+    },
+    {
+      "name": "metric_c",
+      "display_name": "Metric C",
+      "description": "Metric C description",
+      "author": "algorithm | human",
+      "type": "text"
+    }
+  ],
+```
+
+Notes:
+
+1. Each metric must have a unique `name`.
+2. A metric can be of `numerical`, `categorical`, or `text` type.
+3. Numerical metrics must specify the `range` field in `[start, end, bin_size]` format.
+4. Categorical metrics must specify the `values` field, where each value must have `value` and `numeric_value` fields.
+5. Text metrics are only accessible in the instance-level view and are not used in any experiment-level aggregate statistics or visual elements.
+
+#### 4. Documents
+
+```json
+  "documents": [
+    {
+      "document_id": "GUID 1",
+      "text": "document text 1",
+      "title": "document title 1"
+    },
+    {
+      "document_id": "GUID 2",
+      "text": "document text 2",
+      "title": "document title 2"
+    },
+    {
+      "document_id": "GUID 3",
+      "text": "document text 3",
+      "title": "document title 3"
+    }
+  ],
+```
+
+Notes:
+
+1. Each document must have a unique `document_id` field.
+2. Each document must have a `text` field.
+
+#### 5. Tasks
+
+```json
+  "filters": ["category"],
+  "tasks": [
+    {
+      "task_id": "task_1",
+      "task_type": "rag",
+      "category": "grounded",
+      "input": [
+        {
+          "speaker": "user",
+          "text": "Sample user query"
+        }
+      ],
+      "contexts": [
+        {
+          "document_id": "GUID 1"
+        }
+      ],
+      "targets": [
+        {
+          "text": "Sample response"
+        }
+      ]
+    },
+    {
+      "task_id": "task_2",
+      "task_type": "rag",
+      "category": "random",
+      "input": [
+        {
+          "speaker": "user",
+          "text": "Hello"
+        }
+      ],
+      "contexts": [
+        {
+          "document_id": "GUID 2"
+        }
+      ],
+      "targets": [
+        {
+          "text": "How can I help you?"
+        }
+      ]
+    }
+  ],
+```
+
+Notes:
+
+1. Each task must have a unique `task_id`.
+2. A task can be of `question_answering`, `conversation`, or `rag` type.
+3. `input` is an array of utterances. An utterance's speaker can be either `user` or `agent`. Each utterance must have a `text` field.
+4. The `contexts` field references the subset of documents from the `documents` field that is relevant to the `input` and available to the generative models.
+5. The `targets` field is an array of expected gold or reference texts.
+6. `category` is an optional field that represents the type of task and can be used to group similar tasks.
+7. `filters` is a top-level field (parallel to `tasks`) that specifies an array of fields defined inside `tasks` which can be used to filter tasks during analysis.
+
+#### 6. Evaluations
+
+```json
+  "evaluations": [
+    {
+      "task_id": "task_1 | task_2",
+      "model_id": "model_1 | model_2",
+      "model_response": "Model response",
+      "annotations": {
+        "metric_a": {
+          "system": {
+            "value": 0.233766233766233
+          }
+        },
+        "metric_b": {
+          "system": {
+            "value": "value_a | value_b"
+          }
+        },
+        "metric_c": {
+          "system": {
+            "value": "text"
+          }
+        }
+      }
+    }
+  ]
+```
+
+Notes:
+
+1. The `evaluations` field must contain an evaluation for every model defined in the `models` section on every task in the `tasks` section. Thus, the total number of evaluations equals the number of models (M) times the number of tasks (T), i.e., M x T.
+2. Each evaluation must be associated with a single task and a single model.
+3. Each evaluation must capture the model's prediction on the task in the `model_response` field.
+4. The `annotations` field captures ratings of the model's response on the given task for every metric specified in the `metrics` field.
+5. Each metric annotation is a dictionary keyed by worker IDs. In the example above, `system` is a worker ID.
+6. Each worker's annotation for a metric must itself be a dictionary. At a minimum, this dictionary contains a `value` key capturing the worker's rating of the model's response for that metric.
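+
+Before loading a results file into InspectorRAGet, it can be helpful to sanity-check it against the constraints listed in the notes above: unique identifiers, context documents that actually exist, one evaluation per model-task pair, and annotations that carry a `value` for each worker. The short script below is only an illustrative sketch and is not part of InspectorRAGet; the file name `experiment.json` and the exact set of checks are assumptions based on this section.
+
+```python
+# Illustrative structural checks for an experiment results file, based on the
+# notes in this section. Not part of InspectorRAGet; adapt as needed.
+import json
+import sys
+from itertools import product
+
+path = sys.argv[1] if len(sys.argv) > 1 else "experiment.json"  # hypothetical file name
+with open(path, encoding="utf-8") as f:
+    experiment = json.load(f)
+
+# Sections 2-5: model_id, metric name, document_id, and task_id must be unique.
+model_ids = [m["model_id"] for m in experiment["models"]]
+metric_names = [m["name"] for m in experiment["metrics"]]
+document_ids = [d["document_id"] for d in experiment["documents"]]
+task_ids = [t["task_id"] for t in experiment["tasks"]]
+for label, ids in [("model_id", model_ids), ("metric name", metric_names),
+                   ("document_id", document_ids), ("task_id", task_ids)]:
+    assert len(ids) == len(set(ids)), f"duplicate {label} found"
+
+# Section 5: every task context must reference a document defined in `documents`.
+known_documents = set(document_ids)
+for task in experiment["tasks"]:
+    for context in task.get("contexts", []):
+        assert context["document_id"] in known_documents, \
+            f"task {task['task_id']} references an unknown document"
+
+# Section 6: one evaluation per (model, task) pair, i.e. M x T in total.
+pairs = {(e["model_id"], e["task_id"]) for e in experiment["evaluations"]}
+missing = set(product(model_ids, task_ids)) - pairs
+assert not missing, f"missing evaluations for pairs: {sorted(missing)}"
+
+# Section 6: each evaluation carries a model_response, and each annotation is a
+# dictionary keyed by worker ID whose entries contain at least a `value`.
+known_metrics = set(metric_names)
+for evaluation in experiment["evaluations"]:
+    assert "model_response" in evaluation, "evaluation without a model_response"
+    for metric_name, workers in evaluation.get("annotations", {}).items():
+        assert metric_name in known_metrics, f"unknown metric {metric_name}"
+        for worker_id, rating in workers.items():
+            assert "value" in rating, \
+                f"{worker_id} rating for {metric_name} is missing a value"
+
+print(f"{path}: {len(experiment['evaluations'])} evaluations look structurally valid")
+```
+
+If the script finishes without raising an `AssertionError`, the file satisfies the structural requirements described above; it does not validate metric values beyond their presence.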
+
 ## Citation
 
 If you use InspectorRAGet in your research, please cite our paper: