Neurobagel's annotation-tool-ai project takes BIDS-style phenotypic data and the corresponding data description files and gives users a first-pass annotation by employing LLMs. The annotations follow the Neurobagel data model, in preparation for injecting the modeled data into Neurobagel's graph database for federated querying.
We are attempting to achieve this automation using LLMs (at present Gemma) and various libraries such as Pydantic.
- Clone the repo:
```bash
git clone https://github.com/neurobagel/annotation-tool-ai
```
- Create and activate a virtual environment:
```bash
python3 -m venv venv
source venv/bin/activate
```
- Set up pre-commit (flake8, black, mypy):
```bash
pre-commit install
```
- Install Ollama (currently the tool is supported only on Linux):
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
- Install the remaining dependencies:
```bash
pip install -r requirements.txt
```
- The tool can be deployed locally - to do so please follow the instructions here.
- The tool can be run via a Docker container - to do so please follow the instructions here.
Further information: Details of the codebase | License
To run the current version of the LLM-based annotation tool locally, execute the following command to start the uvicorn server:
```bash
python3 app/api.py --host 127.0.0.1 --port 9000
```
- For accessing the API via the browser please follow the instructions here.
- For running the tool via the command line please follow the instructions here.
Since the annotation tool uses Ollama to run the LLM, Ollama has to be provided by the Docker container. This is done by extending the available Ollama container. These instructions assume that Docker is installed.
- Clone the repository or download the Docker compose file.
- Start the tool by running the following command in the repo directory or wherever the `docker-compose.yaml` is stored. If you want to use the tool with GPU support, make sure to uncomment the respective section in the `docker-compose.yaml`:
```bash
docker compose up
```
- Access the tool via `localhost:3000`.
If you run the tool and it responds with an empty JSON file or a network error, please follow the steps described in the Troubleshooting section.
The container can be built from the Dockerfile available in the repo:
```bash
docker build -t annotation-tool-ai .
```
Let's break down the command:

- `docker build`: This command builds the image according to the instructions in the `Dockerfile`.
- `-t`: This flag allows you to name (or tag) an image.
- `annotation-tool-ai`: This is the (nice) name for the image.
- CPU only:
```bash
docker run -d -v ollama:/root/.ollama -v /some/local/path/output:/app/output/ --name instance_name -p 9000:9000 annotation-tool-ai
```
- Nvidia GPU (the Nvidia container toolkit has to be installed first):
```bash
docker run -d --gpus=all -v ollama:/root/.ollama -v /some/local/path/output:/app/output/ --name instance_name -p 9000:9000 annotation-tool-ai
```
Let's break down the commands:

- `docker run -d`: The `-d` flag runs the container in the background without any output in the terminal.
- `--gpus=all`: GPUs should be used to run the model.
- `-v ollama:/root/.ollama`: The `-v` flag mounts external volumes into the container. In this case the models used within the container are stored in a Docker volume - these are created and managed by Docker itself and are not directly accessible via the local file system.
- `-v /path/to/some/local/folder/:/app/output/`: This is a bind mount (also indicated by the `-v` flag) and makes a local directory accessible to the container. The input and output files (i.e. the `.tsv` input and `.json` output files) are passed to the container via this folder, and since the directory is mounted it is also locally accessible. Within the container the files are located in `app/output/`. For more information about Docker volumes vs. bind mounts see here.
- `--name instance_name`: Here you choose a (nice) name for the container created from the image built in the step above.
- `-p 9000:9000`: Maps the port for API requests into the container.
- `annotation-tool-ai`: Name of the image we create the instance of.
NOTE
If you only want to access the API from outside the container (which is usually the case), it is not necessary to mount a directory when running the container. However, it has been kept in the command since it might be useful for debugging purposes.
After successful deployment there are three options for accessing the tool: accessing the API directly, accessing it via the UI, or running it via the command line. Independent of the access mode, there are two parameters that have to be set:
| Parameter | Value | Info |
|---|---|---|
| `code_system` | `cogatlas` | If assessment tools are identified within the provided `.tsv` file, the TermURLs and Labels from the Cognitive Atlas are assigned (if available). `cogatlas` is the default value. |
| | `snomed` | If assessment tools are identified within the provided `.tsv` file, the TermURLs and Labels from SNOMED CT are assigned (if available). |
| `response_type` | `file` | After categorization and annotation the API provides a `.json` file ready to download. `file` is the default value. |
| | `json` | After categorization and annotation the API provides the raw JSON output. |
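For illustration, here is a minimal Python sketch of a request that sets both parameters. It mirrors the curl commands shown in the command-line section below; the use of the `requests` library and the input file name are assumptions for this example, not part of the repo:

```python
import requests

# Hypothetical input file; endpoint and query parameters mirror the
# curl examples in the command-line section below.
with open("participants.tsv", "rb") as tsv:
    response = requests.post(
        "http://127.0.0.1:9000/process/",
        params={"code_system": "snomed", "response_type": "json"},
        files={"file": tsv},
    )
print(response.json())
```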
Once the `docker run` command or the `app/api.py` script has been executed, the uvicorn server for the FastAPI application will be initiated. To access the GUI for the API, please enter the following in your browser and follow the instructions provided:
```
http://127.0.0.1:9000/docs
```
If `file` is the chosen `response_type`, a `.json` file will be provided for download:

If `json` is the chosen `response_type`, the direct JSON output will be provided by the API:
Well done - you have annotated your tabular file!
If you don't want to access the tool directly through the API, but rather through a more user-friendly interface, you can set up the integrated UI locally on your machine.
First, since the UI is a React application, `nodejs` and `npm` (the Node package manager) need to be installed on the system:
```bash
sudo apt-get update
sudo apt-get install nodejs
sudo apt-get install npm
```
Second, to access the interface, the application must be started locally. This is done from the `ui-integration` directory of the repository:
```bash
cd annotation-tool-ai/ui-integration
npm start
```
If this was successful, the terminal shows:
and the user interface is accessible via `http://localhost:3000`. Please follow the instructions there.
If `json` is the chosen response type, after running your data you should get something like:
If `file` is the chosen response type, a file will be automatically downloaded.
The following command runs the script for the annotation process if you deployed the tool via Docker (i.e. access is from INSIDE the Docker container):
Please choose the `code_system` and `response_type`, and indicate the correct `instance_name` and file paths.
```bash
docker exec -it instance_name curl -X POST "http://127.0.0.1:9000/process/?code_system=<snomed | cogatlas>&response_type=<file | json>" \
  -F "file=@<filepath-to-tsv-inside-container>.tsv" \
  -o <filepath-to-output-file-inside-container>.json
```
If you chose the local deployment or you want to access the container from outside of it, you can run the tool via this command. The input file is the to-be-annotated `.tsv` file and the output file is the `.json` file.
```bash
curl -X POST "http://127.0.0.1:9000/process/?code_system=<snomed | cogatlas>&response_type=<file | json>" \
  -F "file=@<filepath-to-tsv-outside-container>.tsv" \
  -o <filepath-to-output-file-outside-container>.json
```
Let's break this down again (for local/outside-Docker deployment, ignore the first three list items):

- `docker exec`: This command is used to execute a command in a running Docker container.
- `-it`: Here the `-i` and `-t` flags are combined, which allows for an interactive terminal session. It is needed, for example, when you run commands that require input.
- `instance_name`: Name of the container instance (e.g. `api_test`).
- `curl -X POST "http://127.0.0.1:9000/process/?code_system=<snomed | cogatlas>" -F "file=@<filepath-to-tsv-inside/outside-container>.tsv" -o <filepath-to-output-file-inside/outside-container>.json`: This is the command that makes a POST request to the API. The input file is the to-be-annotated `.tsv` file and the output file is the `.json` file.
NOTE
The `-o <filepath-to-output-file-inside/outside-container>.json` option is only necessary if `file` is chosen as the `response_type` parameter.
- An empty JSON file is returned (only column headers)

Sometimes the model is not available in the container, which results in empty output (only the column headers are displayed).
In this case you can start an interactive terminal session inside the running annotation tool container:
```bash
docker exec -it annotation-tool-ai-app /bin/bash
```
By executing `ollama list` the currently available models are shown. If this list is empty, you can pull the respective model using `ollama pull gemma`.
- The UI does not work via an SSH tunnel in VS Code (network error)

Because of VS Code's internal forwarding logic, automatically forwarded ports are sometimes mapped to different ports (e.g. 9001 instead of 9000).
In VS Code, under `Ports`, you can delete the automatically forwarded ports and add them again manually.
Additional resources:
- Mapping ports in docker compose
- Publishing and forwarding ports in containers (VSCode)
- Ports forwarding in remote settings (SSH)
Currently the development of the tool is divided into two aspects: parsing and categorization.
The codebase is designed to handle and annotate TSV data by converting it into JSON format. It leverages the Pydantic library to enforce data structures, ensuring consistent and valid data throughout the annotation process. The main components of the code include defining data structures for the various annotation categories, handling the annotation of each category, and creating a JSON file containing the annotations. The scope of this milestone was to create annotations for the entities available in the current Neurobagel data model (i.e., ParticipantID, SessionID, Age, Sex, Diagnosis, and Assessment Tool).
Since we have separated the categorization and parsing steps, some assumptions have been made about the format of the LLM response for the different data model entities. The goal was to produce correct annotations with a minimum of information. Thus, the following LLM responses are assumed for the different entities:
- Participant ID:
```json
{"TermURL": "nb:ParticipantID"}
```
- Session ID:
```json
{"TermURL": "nb:Session"}
```
- Age:
```json
{
  "TermURL": "nb:Age",
  "Format": "europeanDecimalValue"
}
```
- Sex:
```json
{
  "TermURL": "nb:Sex",
  "Levels": {
    "M": "male",
    "F": "female"
  }
}
```
- Diagnosis:
```json
{
  "TermURL": "nb:Diagnosis",
  "Levels": {
    "MDD": "Major depressive disorder",
    "CTRL": "healthy control"
  }
}
```
- Assessment Tool:
```json
{
  "TermURL": "nb:AssessmentTool",
  "AssessmentTool": "future events structured interview"
}
```
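As a minimal illustration of these assumptions (not repo code; the response string is one of the examples above), such a response can be parsed into a plain dictionary before the Pydantic-based parsing step takes over:

```python
import json

# Assumed LLM response for a sex column, as listed above
raw_response = '{"TermURL": "nb:Sex", "Levels": {"M": "male", "F": "female"}}'
parsed = json.loads(raw_response)
assert parsed["TermURL"] == "nb:Sex"
print(parsed["Levels"])  # {'M': 'male', 'F': 'female'}
```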
By now the code depends on:

- `typing` for providing type hints for defined functions
- `pandas` for TSV file handling
- `json` for JSON handling
- `pydantic` for data validation and modularized parsing
Conceptually, the desired JSON output has a common base for all entities (i.e. the `IsAbout` section), but depending on the entity being handled, different additional fields are present. For instance, the example below demonstrates that `participant_id` contains the additional field `Identifies`, while `age` contains the additional fields `Transformation` and `MissingValues` but does not contain the `Identifies` entry.
```json
{
  "participant_id": {
    "Description": "A participant ID",
    "Annotations": {
      "IsAbout": {
        "Label": "Subject Unique Identifier",
        "TermURL": "nb:ParticipantID"
      },
      "Identifies": "participant"
    }
  },
  "age": {
    "Annotations": {
      "IsAbout": {
        "Label": "Age",
        "TermURL": "nb:Age"
      },
      "Transformation": {
        "Label": "integer value",
        "TermURL": "nb:FromInt"
      },
      "MissingValues": []
    },
    "Description": "The age of the participant at data acquisition"
  }
}
```
To keep the code readable, a separate class was defined for each entity to be annotated. The common base in the desired output - the `IsAbout` section, which contains two strings (a TermURL and a Label) - serves as the initial data structure for all entities in the data model. Subsequently, the `IsAbout` base model is differentiated for each data model entity (e.g., `IsAboutParticipant`, `IsAboutSession`, etc.), each of which adds its specific (static) label.
```python
from pydantic import BaseModel, Field

class IsAboutBase(BaseModel):
    Label: str
    TermURL: str

class IsAboutParticipant(IsAboutBase):
    Label: str = Field(default="Subject Unique Identifier")
    TermURL: str
```
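For illustration, a hypothetical instantiation (not from the repo) of the classes above, showing how the subclass pins the static label while the `TermURL` comes from the LLM response:

```python
# TermURL taken from an assumed LLM response for a participant column
entity = IsAboutParticipant(TermURL="nb:ParticipantID")
print(entity.Label)    # "Subject Unique Identifier" (the pinned default)
print(entity.TermURL)  # "nb:ParticipantID"
```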
The `IsAbout` section is encapsulated in the `Annotations` section, where potentially different additional fields can be included.
To implement this, another data structure has been defined that contains the specific `IsAbout` section (depending on the `TermURL` entry returned by the LLM) and all the optional fields.
```python
from typing import Dict, List, Optional, Union

class Annotations(BaseModel):
    IsAbout: Union[
        IsAboutParticipant,
        IsAboutSex,
        IsAboutAge,
        IsAboutSession,
        IsAboutGroup,
        IsAboutAssessmentTool,
    ]
    Identifies: Optional[str] = None
    Levels: Optional[Dict[str, Dict[str, str]]] = None
    Transformation: Optional[Dict[str, str]] = None
    IsPartOf: Optional[Union[List[Dict[str, str]], Dict[str, str], str]] = None
```
As a final step in creating the JSON output format, fields such as `Description` and, in the case of categorical variables, the (optional) `Levels` present in the TSV file should be added. To implement this, another data structure was introduced that contains these fields and the `Annotations` (which in turn contains the `IsAbout`).
```python
class TSVAnnotations(BaseModel):
    Description: str
    Levels: Optional[Dict[str, str]] = None
    Annotations: Annotations
```
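To illustrate how these models nest, here is a hypothetical composition (field values are illustrative and mirror the `participant_id` example above) built from the classes defined in this section:

```python
annotation = TSVAnnotations(
    Description="A participant ID",
    Annotations=Annotations(
        IsAbout=IsAboutParticipant(TermURL="nb:ParticipantID"),
        Identifies="participant",
    ),
)
# Serialize, dropping the unused optional fields; with Pydantic v2 this
# would be annotation.model_dump_json(exclude_none=True, indent=2)
print(annotation.json(exclude_none=True, indent=2))
```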
Based on the data structures described above, annotations can be composed for each entity. Here is a graphical representation of the data structures and how they are used for the final TSV annotation. The pink boxes represent fields that depend on the LLM response.
```mermaid
%%{init: {'theme': 'forest', "flowchart" : { "curve" : "basis" } } }%%
flowchart LR
subgraph TSV-Annotations
Description([Description:\n set for each entity])
Levels-Description([Levels-Description:\n used in Sex and Diagnosis, responded by \n the LLM, mapped to the pre-defined terms \nand used for annotation in Levels-Explanation])
subgraph Annotations
subgraph Identifies
identifies([used for ParticipantID \nand SessionID])
end
subgraph Levels-Explanation
levels-explanation([used to provide a TermURL and \nLabel for the Elements of \nLevels-Description])
end
subgraph Transformation
transformation([used for Age, \nresponded by the LLM \nand used for annotation.])
end
subgraph IsPartOf
ispartof([used for AssessmentTool,\n provides TermURL and Label\n for the Assessment Tool.])
end
subgraph IsAbout
isabout([TermURL responded by \n the LLM categorization \n serves as controller \nfor further annotation])
end
end
end
style isabout fill:#f542bc
style transformation fill:#f542bc
style Levels-Description fill:#f542bc
style ispartof fill:#f542bc
```
Creating the desired JSON output requires several steps. These include:
| Function | Purpose | Parameters |
|---|---|---|
| `convert_tsv_to_dict` | Extracts the original column names (and their contents - for the LLM queries) from the TSV file. This serves as a preparation step for the query passed to the LLM, as well as for the creation of the "raw" JSON file. | Input: `tsv_file: str`<br>Output: `column_strings: Dict[str, str]` |
| `tsv_to_json` | Initializes a JSON file with the columns of the TSV file as keys and empty strings as values. | Input: `tsv_file: str`, `json_file: str`<br>Output: `None` |
| LLM categorization (see here) | | |
| `process_parsed_output` | Decides which handler function to call based on the TermURL of the LLM response. | Input: `llm_output: Dict[str, Union[str, Dict[str, str], None]]`, `code_system: str`<br>Output: `TSVAnnotations: Union[str, Any]` |
| `handle_participant`, `handle_age`, `handle_categorical`, `handle_session`, `handle_assessmentTool` | These functions create specific annotation instances to ensure that each annotated column contains only the fields required for it. | Input (ParticipantID, Session, Age): `llm_response: Dict[str, Any]`<br>Input (Categorical, Assessment Tool): `llm_response: Dict[str, Any]`, `mapping: Mapping[str, Dict[str, str]]`<br>Output: `TSVAnnotations` |
| `get_assessment_label` | In many cases the assessment tool columns of a tabular file are represented via acronyms or standardized abbreviations. If a column is categorized as an assessment tool, the header is checked for the commonly used abbreviations of the assessment tools available in the respective coding system (SNOMED CT or Cognitive Atlas). | Input: `key: str`, `code_system: str`<br>Output: `Union[str, List[str]]` |
| `SexLevel` | If a column is categorized as a sex column, this function tries to map the content of the column to the respective SNOMED entities for male, female, and other. For now, all formats recommended by BIDS are included. Sex columns are always annotated with the SNOMED TermURLs. | Input: `result_dict: Dict[str, str]`, `key: str`<br>Output: `Dict[str, Any]` |
| `AgeFormat` | Within the Neurobagel data model the format of the original column values of age columns is also annotated via `Transformation`. By now integer values, float values, European decimal values, bounded values, as well as ISO 8601 and year-unit coded values can be annotated. | Input: `result_dict: Dict[str, str]`, `key: str`<br>Output: `Dict[str, Any]` |
| `load_levels_mapping`, `load_assessmenttool_mapping` | These helper functions provide the mappings (i.e., the corresponding TermURLs for a specific label such as "Male" or "Alexia"). For diagnosis and sex the `load_levels_mapping` is used; for the assessment tool the `load_assessmenttool_mapping` is used. Two functions are needed because the structure of the underlying files (`diagnosisTerms.json` and `toolTerms.json`) is slightly different. | Input: `mapping_file: str`<br>Output: `levels_mapping` \| `assessmenttool_mapping: Mapping[str, Dict[str, str]]` |
| `update_json_file` | Updates the "raw" JSON file with the processed data under the specific key (i.e. the original column name). | Input: `data: Union[str, TSVAnnotations]`, `filename: str`, `target_key: str`<br>Output: `None` |
Here the main script demonstrates the complete process of annotating each column of the TSV file.
```python
def process_file(
    file_path: str, json_file: str, code_system: str
) -> Dict[str, str]:
    columns_dict = convert_tsv_to_dict(file_path)
    tsv_to_json(file_path, json_file)  # creates a JSON file with the column headers
    results = {}
    for key, value in columns_dict.items():
        try:
            input_dict = {key: value}
            llm_response = llm_invocation(input_dict, code_system)  # calls the LLM to categorize each column
            result = process_parsed_output(llm_response, code_system)  # creates the JSON output depending on the column
            results[key] = result
            update_json_file(result, json_file, key)  # final annotations are written to the JSON file
        except Exception as e:
            results[key] = {"error": str(e)}
    return results
```
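A hypothetical invocation (the file names are placeholders, not from the repo) could then look like this:

```python
# Annotate every column of an example TSV file; the JSON file is created
# and updated in place by the helper functions described above.
results = process_file("participants.tsv", "participants.json", "snomed")
print(results)
```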
Categorization is carried out using LLMs, and this step returns the output structure required for further processing. The codebase categorizes/classifies the columns present in the `.tsv` input file into classes according to the categories already present in the Neurobagel annotation tool. The LLM makes its predictions for a specific input string, consisting of the column header and the column contents, based on the examples provided to it beforehand in its prompt template. The various tasks carried out by this codebase mainly utilize LangChain, the `json` library from Python, and the LLM Gemma from Ollama.
The choice of LLM was a crucial aspect, as it was necessary to select an LLM that hallucinates the least and can be used without any request limits.
- The LLMs from Hugging Face: Even though many LLMs like Flan-T5 can be used for free on Hugging Face, they come with an API request limit, so we concluded that this approach is not feasible for something to be sent to production. Since then we have been using LLMs from Ollama for the project.
- Llama2/Llama3: Even though the Llama models appeared to work fine at first, hallucination was detected when tested further. The attached screenshots show the LLM responses at intervals of about 10-20 seconds.
- Gemma: The Gemma model, also from Ollama, has been working fine with no visible hallucinations so far, and further tasks are being carried out using the Gemma model.

However, we keep an open mind regarding the choice of LLM and continue to test new LLMs that we come across, since there is always scope for improvement in this regard.
Utilizing LangChain components like PromptTemplate and ChatOllama and implementing chains provides a robust framework for developing and deploying our application, as sketched after this list.

- PromptTemplate: The PromptTemplate specifies the input-form examples on which the LLM bases its response and the form in which the output is expected. It also separately defines the input variables that will be given to the LLM.
- ChatOllama: `ChatOllama` from `langchain_community.chat_models` was used to run the various LLMs from Ollama.
- The chain: In LangChain, the concept of chains refers to a sequence or pipeline of operations applied to data. Chains in LangChain are flexible, allowing developers to incorporate various components (like ChatOllama, PromptTemplate, data parsers, etc.) into customized workflows. The `chain.invoke(...)` function executes each operation in the defined chain sequentially.
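The following is a minimal sketch of how these three components combine. The exact prompt wording and the example column are invented for illustration; the repo's actual prompt template contains many more examples:

```python
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# A toy prompt template; literal braces in the expected JSON are escaped
# by doubling them, while {column} is the declared input variable.
prompt = PromptTemplate(
    template=(
        "You categorize columns of phenotypic TSV files.\n"
        'Example: "participant_id: sub-01 sub-02" -> {{"TermURL": "nb:ParticipantID"}}\n'
        "Column: {column}\n"
        "Respond with JSON only:"
    ),
    input_variables=["column"],
)

llm = ChatOllama(model="gemma")  # the model currently used by the tool
chain = prompt | llm | StrOutputParser()  # prompt -> LLM -> plain string

# chain.invoke(...) runs each step of the chain sequentially
print(chain.invoke({"column": "age: 22 34 28"}))
```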
This depends on the type of values present in the column, i.e. categorical, defined numeric indices, continuous, etc.
The LLM output is an 'AI_Model' object, so it had to be converted into a string in order to further process the LLM output and obtain the required structured output from the code.
Functions (`SexLevel(...)` and `AgeFormat(...)`) are therefore defined for the types of columns that require extra identification beyond the labelling done by the LLM, and the output is structured in the way required by the parser using the `json.dumps()` function, as sketched below.
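A tiny sketch of that post-processing idea (the values are invented; the real functions also handle the BIDS-recommended formats mentioned above):

```python
import json

# Assumed raw text extracted from the LLM response for a sex column
llm_text = '{"TermURL": "nb:Sex"}'
parsed = json.loads(llm_text)
# Extra identification a SexLevel-style function would add from the column contents
parsed["Levels"] = {"M": "male", "F": "female"}
structured = json.dumps(parsed)  # the string form expected by the parser
print(structured)
```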
The following screenshot shows the output of a test script before the categorization codebase was integrated with the parsing code:
The Neurobagel Annotation-tool-AI uses the MIT License.