Neurobagel's annotation-tool-ai project takes BIDS-style phenotypic data and the corresponding data description files and gives users a first-pass annotation by employing LLMs. The annotations follow the Neurobagel data model, in preparation for injecting the modeled data into Neurobagel's graph database for federated querying.
We are attempting to achieve this automation using LLMs (at present Gemma) and various libraries such as Pydantic.
- Clone the repo:
```bash
git clone https://github.com/neurobagel/annotation-tool-ai
```
- Create and activate a virtual environment:
```bash
python3 -m venv venv
source venv/bin/activate
```
- Set up pre-commit (flake8, black, mypy):
```bash
pre-commit install
```
- Install Ollama (currently the tool is supported only on Linux):
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
- Install the remaining dependencies:
```bash
pip install -r requirements.txt
```
- The tool can be deployed locally - to do so please follow the instructions here.
- The tool can be run via a Docker container - to do so please follow the instructions here.
Further information: Details of the codebase | License
To run the current version of the LLM-based annotation tool locally, execute the following command to start the uvicorn server:
```bash
python3 app/api.py --host 127.0.0.1 --port 9000
```
- For accessing the API via the browser please follow the instructions here.
- For running the tool via the command line please follow the instructions here.
Since the annotation tool uses Ollama to run the LLM, Ollama has to be provided by the Docker container. This is done by extending the available Ollama container. These instructions assume that Docker is installed.
- Clone the repository or download the Docker compose file.
- Start the tool by running the following command in the repo directory or wherever the `docker-compose.yaml` is stored. If you want to use the tool with GPU support, make sure to uncomment the respective section in the `docker-compose.yaml`:
```bash
docker compose up
```
- Access the tool via `localhost:3000`.
If you run the tool and it responds with an empty JSON file or a network error, please follow the steps described in the Troubleshooting section.
The container can be built from the Dockerfile available in the repo:
```bash
docker build -t annotation-tool-ai .
```
Let's break down the command:

- `docker build`: This command builds the image according to the instructions in the `Dockerfile`.
- `-t`: This flag allows you to name (or tag) an image.
- `annotation-tool-ai`: This is the (nice) name for the image.
- CPU only:
```bash
docker run -d -v ollama:/root/.ollama -v /some/local/path/output:/app/output/ --name instance_name -p 9000:9000 annotation-tool-ai
```
- Nvidia GPU (the Nvidia container toolkit has to be installed first):
```bash
docker run -d --gpus=all -v ollama:/root/.ollama -v /some/local/path/output:/app/output/ --name instance_name -p 9000:9000 annotation-tool-ai
```
Let's break down the commands:

- `docker run -d`: The `-d` flag runs the container in the background without any output in the terminal.
- `--gpus=all`: GPUs should be used to run the model.
- `-v ollama:/root/.ollama`: The `-v` flag mounts external volumes into the container. In this case the models used within the container are stored in a Docker volume - these are created and managed by Docker itself and are not directly accessible via the local file system.
- `-v /path/to/some/local/folder/:/app/output/`: This is a bind mount (also indicated by the `-v` flag) and makes a local directory accessible to the container. The input and output files (i.e. the `.tsv` input and `.json` output files) are passed to the container via this folder, and since the directory is mounted it is also locally accessible. Within the container the files are located in `app/output/`. For more information about Docker volumes vs. bind mounts see here.
- `--name instance_name`: Here you choose a (nice) name for the container created from the image built in the step above.
- `-p 9000:9000`: Maps the port for API requests into the container.
- `annotation-tool-ai`: Name of the image we create the instance of.
NOTE
If you only want to access the API from outside the container (which is usually the case), it is not necessary to mount a directory when running the container. However, it has been kept in the command since it might be useful for debugging purposes.
After successful deployment there are three options for accessing the tool: accessing the API directly, accessing it via the UI, or running it via the command line. Independent of the access mode, there are two parameters that have to be set:
| Parameter | Value | Info |
|---|---|---|
| `code_system` | `cogatlas` | If assessment tools are identified within the provided `.tsv` file, the TermURLs and Labels from the Cognitive Atlas are assigned (if available). `cogatlas` is the default value. |
| | `snomed` | If assessment tools are identified within the provided `.tsv` file, the TermURLs and Labels from SNOMED CT are assigned (if available). |
| `response_type` | `file` | After categorization and annotation the API provides a `.json` file ready to download. `file` is the default value. |
| | `json` | After categorization and annotation the API provides the raw JSON output. |
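For illustration, here is a minimal Python sketch of a request that sets both parameters. It mirrors the curl commands shown in the command-line section below; the use of the `requests` library and the input file name are assumptions for this example, not part of the repo:

```python
import requests

# Hypothetical input file; endpoint and query parameters mirror the
# curl examples in the command-line section below.
with open("participants.tsv", "rb") as tsv:
    response = requests.post(
        "http://127.0.0.1:9000/process/",
        params={"code_system": "snomed", "response_type": "json"},
        files={"file": tsv},
    )
print(response.json())
```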
Once the `docker run` command or the `app/api.py` script has been executed, the uvicorn server for the FastAPI application will be initiated. To access the GUI for the API, please enter the following in your browser and follow the instructions provided:
```
http://127.0.0.1:9000/docs
```
If `file` is the chosen `response_type`, a `.json` file will be provided for download:

If `json` is the chosen `response_type`, the direct JSON output will be provided by the API:
Well done - you have annotated your tabular file!
If you don't want to access the tool directly through the API, but rather through a more user-friendly interface, you can set up the integrated UI locally on your machine.
First, since the UI is a React application, `nodejs` and `npm` (the Node package manager) need to be installed on the system:
```bash
sudo apt-get update
sudo apt-get install nodejs
sudo apt-get install npm
```
Second, to access the interface, the application must be started locally. This is done from the `ui-integration` directory of the repository:
```bash
cd annotation-tool-ai/ui-integration
npm start
```
If this was successful, the terminal shows:
and the user interface is accessible via `http://localhost:3000`. Please follow the instructions there.
If `json` is the chosen response type, after running your data you should get something like:
If `file` is the chosen response type, a file will be automatically downloaded.
The following command runs the script for the annotation process if you deployed the tool via Docker (i.e. access is from INSIDE the Docker container):
Please choose the `code_system` and `response_type`, and indicate the correct `instance_name` and file paths.
```bash
docker exec -it instance_name curl -X POST "http://127.0.0.1:9000/process/?code_system=<snomed | cogatlas>&response_type=<file | json>" \
  -F "file=@<filepath-to-tsv-inside-container>.tsv" \
  -o <filepath-to-output-file-inside-container>.json
```
If you chose the local deployment or you want to access the container from outside of it, you can run the tool via this command. The input file is the to-be-annotated `.tsv` file and the output file is the `.json` file.
```bash
curl -X POST "http://127.0.0.1:9000/process/?code_system=<snomed | cogatlas>&response_type=<file | json>" \
  -F "file=@<filepath-to-tsv-outside-container>.tsv" \
  -o <filepath-to-output-file-outside-container>.json
```
Let's break this down again (for local/outside-Docker deployment, ignore the first three list items):

- `docker exec`: This command is used to execute a command in a running Docker container.
- `-it`: Here the `-i` and `-t` flags are combined, which allows for an interactive terminal session. It is needed, for example, when you run commands that require input.
- `instance_name`: Name of the container instance (e.g. `api_test`).
- `curl -X POST "http://127.0.0.1:9000/process/?code_system=<snomed | cogatlas>" -F "file=@<filepath-to-tsv-inside/outside-container>.tsv" -o <filepath-to-output-file-inside/outside-container>.json`: This is the command that makes a POST request to the API. The input file is the to-be-annotated `.tsv` file and the output file is the `.json` file.
NOTE
The `-o <filepath-to-output-file-inside/outside-container>.json` option is only necessary if `file` is chosen as the `response_type` parameter.
- An empty JSON file is returned (only column headers)

Sometimes the model is not available in the container, which results in empty output (only the column headers are displayed).
In this case you can start an interactive terminal session inside the running annotation tool container:
```bash
docker exec -it annotation-tool-ai-app /bin/bash
```
By executing `ollama list` the currently available models are shown. If this list is empty, you can pull the respective model using `ollama pull gemma`.
- The UI does not work via an SSH tunnel in VS Code (network error)

Because of VS Code's internal forwarding logic, automatically forwarded ports are sometimes mapped to different ports (e.g. 9001 instead of 9000).
In VS Code, under `Ports`, you can delete the automatically forwarded ports and add them again manually.
Additional resources:
- Mapping ports in docker compose
- Publishing and forwarding ports in containers (VSCode)
- Ports forwarding in remote settings (SSH)
Currently the development of the tool is divided into two aspects: parsing and categorization.
The codebase is designed to handle and annotate TSV data by converting it into JSON format. It leverages the Pydantic library to enforce data structures, ensuring consistent and valid data throughout the annotation process. The main components of the code include defining data structures for the various annotation categories, handling the annotation of each category, and creating a JSON file containing the annotations. The scope of this milestone was to create annotations for the entities available in the current Neurobagel data model (i.e., ParticipantID, SessionID, Age, Sex, Diagnosis, and Assessment Tool).
Since we have separated the categorization and parsing steps, some assumptions have been made about the format of the LLM response for the different data model entities. The goal was to produce correct annotations with a minimum of information. Thus, the following LLM responses are assumed for the different entities:
- Participant ID:
```json
{"TermURL": "nb:ParticipantID"}
```
- Session ID:
```json
{"TermURL": "nb:Session"}
```
- Age:
```json
{
  "TermURL": "nb:Age",
  "Format": "europeanDecimalValue"
}
```
- Sex:
```json
{
  "TermURL": "nb:Sex",
  "Levels": {
    "M": "male",
    "F": "female"
  }
}
```
- Diagnosis:
```json
{
  "TermURL": "nb:Diagnosis",
  "Levels": {
    "MDD": "Major depressive disorder",
    "CTRL": "healthy control"
  }
}
```
- Assessment Tool:
```json
{
  "TermURL": "nb:AssessmentTool",
  "AssessmentTool": "future events structured interview"
}
```
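As a minimal illustration of these assumptions (not repo code; the response string is one of the examples above), such a response can be parsed into a plain dictionary before the Pydantic-based parsing step takes over:

```python
import json

# Assumed LLM response for a sex column, as listed above
raw_response = '{"TermURL": "nb:Sex", "Levels": {"M": "male", "F": "female"}}'
parsed = json.loads(raw_response)
assert parsed["TermURL"] == "nb:Sex"
print(parsed["Levels"])  # {'M': 'male', 'F': 'female'}
```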
By now the code depends on:

- `typing` for providing type hints for defined functions
- `pandas` for TSV file handling
- `json` for JSON handling
- `pydantic` for data validation and modularized parsing
Conceptually, the desired JSON output has a common base for all entities (i.e. the `IsAbout` section), but depending on the entity being handled, different additional fields are present. For instance, the example below demonstrates that `participant_id` contains the additional field `Identifies`, while `age` contains the additional fields `Transformation` and `MissingValues` but does not contain the `Identifies` entry.
```json
{
  "participant_id": {
    "Description": "A participant ID",
    "Annotations": {
      "IsAbout": {
        "Label": "Subject Unique Identifier",
        "TermURL": "nb:ParticipantID"
      },
      "Identifies": "participant"
    }
  },
  "age": {
    "Annotations": {
      "IsAbout": {
        "Label": "Age",
        "TermURL": "nb:Age"
      },
      "Transformation": {
        "Label": "integer value",
        "TermURL": "nb:FromInt"
      },
      "MissingValues": []
    },
    "Description": "The age of the participant at data acquisition"
  }
}
```
To keep the code readable, a separate class was defined for each entity to be annotated. The common base in the desired output - the `IsAbout` section, which contains two strings (a TermURL and a Label) - serves as the initial data structure for all entities in the data model. Subsequently, the `IsAbout` base model is differentiated for each data model entity (e.g., `IsAboutParticipant`, `IsAboutSession`, etc.), each of which adds its specific (static) label.
```python
from pydantic import BaseModel, Field

class IsAboutBase(BaseModel):
    Label: str
    TermURL: str

class IsAboutParticipant(IsAboutBase):
    Label: str = Field(default="Subject Unique Identifier")
    TermURL: str
```
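For illustration, a hypothetical instantiation (not from the repo) of the classes above, showing how the subclass pins the static label while the `TermURL` comes from the LLM response:

```python
# TermURL taken from an assumed LLM response for a participant column
entity = IsAboutParticipant(TermURL="nb:ParticipantID")
print(entity.Label)    # "Subject Unique Identifier" (the pinned default)
print(entity.TermURL)  # "nb:ParticipantID"
```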
The `IsAbout` section is encapsulated in the `Annotations` section, where potentially different additional fields can be included.
To implement this, another data structure has been defined that contains the specific `IsAbout` section (depending on the `TermURL` entry returned by the LLM) and all the optional fields.
```python
from typing import Dict, List, Optional, Union

class Annotations(BaseModel):
    IsAbout: Union[
        IsAboutParticipant,
        IsAboutSex,
        IsAboutAge,
        IsAboutSession,
        IsAboutGroup,
        IsAboutAssessmentTool,
    ]
    Identifies: Optional[str] = None
    Levels: Optional[Dict[str, Dict[str, str]]] = None
    Transformation: Optional[Dict[str, str]] = None
    IsPartOf: Optional[Union[List[Dict[str, str]], Dict[str, str], str]] = None
```
As a final step in creating the JSON output format, fields such as `Description` and, in the case of categorical variables, the (optional) `Levels` present in the TSV file should be added. To implement this, another data structure was introduced that contains these fields and the `Annotations` (which in turn contains the `IsAbout`).
```python
class TSVAnnotations(BaseModel):
    Description: str
    Levels: Optional[Dict[str, str]] = None
    Annotations: Annotations
```
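To illustrate how these models nest, here is a hypothetical composition (field values are illustrative and mirror the `participant_id` example above) built from the classes defined in this section:

```python
annotation = TSVAnnotations(
    Description="A participant ID",
    Annotations=Annotations(
        IsAbout=IsAboutParticipant(TermURL="nb:ParticipantID"),
        Identifies="participant",
    ),
)
# Serialize, dropping the unused optional fields; with Pydantic v2 this
# would be annotation.model_dump_json(exclude_none=True, indent=2)
print(annotation.json(exclude_none=True, indent=2))
```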
Based on the data structures described above, annotations can be composed for each entity. Here is a graphical representation of the data structures and how they are used for the final TSV annotation. The pink boxes represent fields that depend on the LLM response.
```mermaid
%%{init: {'theme': 'forest', "flowchart" : { "curve" : "basis" } } }%%
flowchart LR
subgraph TSV-Annotations
Description([Description:\n set for each entity])
Levels-Description([Levels-Description:\n used in Sex and Diagnosis, responded by \n the LLM, mapped to the pre-defined terms \nand used for annotation in Levels-Explanation])
subgraph Annotations
subgraph Identifies
identifies([used for ParticipantID \nand SessionID])
end
subgraph Levels-Explanation
levels-explanation([used to provide a TermURL and \nLabel for the Elements of \nLevels-Description])
end
subgraph Transformation
transformation([used for Age, \nresponded by the LLM \nand used for annotation.])
end
subgraph IsPartOf
ispartof([used for AssessmentTool,\n provides TermURL and Label\n for the Assessment Tool.])
end
subgraph IsAbout
isabout([TermURL responded by \n the LLM categorization \n serves as controller \nfor further annotation])
end
end
end
style isabout fill:#f542bc
style transformation fill:#f542bc
style Levels-Description fill:#f542bc
style ispartof fill:#f542bc
```
Creating the desired JSON output requires several steps. These include:
| Function | Purpose | Parameters |
|---|---|---|
| `convert_tsv_to_dict` | Extracts the original column names (and their contents - for the LLM queries) from the TSV file. This serves as a preparation step for the query passed to the LLM, as well as for the creation of the "raw" JSON file. | Input: `tsv_file: str`<br>Output: `column_strings: Dict[str, str]` |
| `tsv_to_json` | Initializes a JSON file with the columns of the TSV file as keys and empty strings as values. | Input: `tsv_file: str`, `json_file: str`<br>Output: `None` |
| LLM categorization (see here) | | |
| `process_parsed_output` | Decides which handler function to call based on the TermURL of the LLM response. | Input: `llm_output: Dict[str, Union[str, Dict[str, str], None]]`, `code_system: str`<br>Output: `TSVAnnotations: Union[str, Any]` |
| `handle_participant`, `handle_age`, `handle_categorical`, `handle_session`, `handle_assessmentTool` | These functions create specific annotation instances to ensure that each annotated column contains only the fields required for it. | Input (ParticipantID, Session, Age): `llm_response: Dict[str, Any]`<br>Input (Categorical, Assessment Tool): `llm_response: Dict[str, Any]`, `mapping: Mapping[str, Dict[str, str]]`<br>Output: `TSVAnnotations` |
| `get_assessment_label` | In many cases the assessment tool columns of a tabular file are represented via acronyms or standardized abbreviations. If a column is categorized as an assessment tool, the header is checked for the commonly used abbreviations of the assessment tools available in the respective coding system (SNOMED CT or Cognitive Atlas). | Input: `key: str`, `code_system: str`<br>Output: `Union[str, List[str]]` |
| `SexLevel` | If a column is categorized as a sex column, this function tries to map the content of the column to the respective SNOMED entities for male, female, and other. For now, all formats recommended by BIDS are included. Sex columns are always annotated with the SNOMED TermURLs. | Input: `result_dict: Dict[str, str]`, `key: str`<br>Output: `Dict[str, Any]` |
| `AgeFormat` | Within the Neurobagel data model the format of the original column values of age columns is also annotated via `Transformation`. By now integer values, float values, European decimal values, bounded values, as well as ISO 8601 and year-unit coded values can be annotated. | Input: `result_dict: Dict[str, str]`, `key: str`<br>Output: `Dict[str, Any]` |
| `load_levels_mapping`, `load_assessmenttool_mapping` | These helper functions provide the mappings (i.e., the corresponding TermURLs for a specific label such as "Male" or "Alexia"). For diagnosis and sex the `load_levels_mapping` is used; for the assessment tool the `load_assessmenttool_mapping` is used. Two functions are needed because the structure of the underlying files (`diagnosisTerms.json` and `toolTerms.json`) is slightly different. | Input: `mapping_file: str`<br>Output: `levels_mapping` \| `assessmenttool_mapping: Mapping[str, Dict[str, str]]` |
| `update_json_file` | Updates the "raw" JSON file with the processed data under the specific key (i.e. the original column name). | Input: `data: Union[str, TSVAnnotations]`, `filename: str`, `target_key: str`<br>Output: `None` |
Here the main script demonstrates the complete process of annotating each column of the TSV file.
```python
def process_file(
    file_path: str, json_file: str, code_system: str
) -> Dict[str, str]:
    columns_dict = convert_tsv_to_dict(file_path)
    tsv_to_json(file_path, json_file)  # creates a JSON file with the column headers
    results = {}
    for key, value in columns_dict.items():
        try:
            input_dict = {key: value}
            llm_response = llm_invocation(input_dict, code_system)  # calls the LLM to categorize each column
            result = process_parsed_output(llm_response, code_system)  # creates the JSON output depending on the column
            results[key] = result
            update_json_file(result, json_file, key)  # final annotations are written to the JSON file
        except Exception as e:
            results[key] = {"error": str(e)}
    return results
```
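A hypothetical invocation (the file names are placeholders, not from the repo) could then look like this:

```python
# Annotate every column of an example TSV file; the JSON file is created
# and updated in place by the helper functions described above.
results = process_file("participants.tsv", "participants.json", "snomed")
print(results)
```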
Categorization is carried out using LLMs, and this step returns the output structure required for further processing. The codebase categorizes/classifies the columns present in the `.tsv` input file into classes according to the categories already present in the Neurobagel annotation tool. The LLM makes its predictions for a specific input string, consisting of the column header and the column contents, based on the examples provided to it beforehand in its prompt template. The various tasks carried out by this codebase mainly utilize LangChain, the `json` library from Python, and the LLM Gemma from Ollama.
The choice of LLM was a crucial aspect, as it was necessary to select an LLM that hallucinates the least and can be used without any request limits.
- The LLMs from Hugging Face: Even though many LLMs like Flan-T5 can be used for free on Hugging Face, they come with an API request limit, so we concluded that this approach is not feasible for something to be sent to production. Since then we have been using LLMs from Ollama for the project.
- Llama2/Llama3: Even though the Llama models appeared to work fine at first, hallucination was detected when tested further. The attached screenshots show the LLM responses at intervals of about 10-20 seconds.
- Gemma: The Gemma model, also from Ollama, has been working fine with no visible hallucinations so far, and further tasks are being carried out using the Gemma model.

However, we keep an open mind regarding the choice of LLM and continue to test new LLMs that we come across, since there is always scope for improvement in this regard.
Utilizing LangChain components like PromptTemplate and ChatOllama and implementing chains provides a robust framework for developing and deploying our application, as sketched after this list.

- PromptTemplate: The PromptTemplate specifies the input-form examples on which the LLM bases its response and the form in which the output is expected. It also separately defines the input variables that will be given to the LLM.
- ChatOllama: `ChatOllama` from `langchain_community.chat_models` was used to run the various LLMs from Ollama.
- The chain: In LangChain, the concept of chains refers to a sequence or pipeline of operations applied to data. Chains in LangChain are flexible, allowing developers to incorporate various components (like ChatOllama, PromptTemplate, data parsers, etc.) into customized workflows. The `chain.invoke(...)` function executes each operation in the defined chain sequentially.
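The following is a minimal sketch of how these three components combine. The exact prompt wording and the example column are invented for illustration; the repo's actual prompt template contains many more examples:

```python
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# A toy prompt template; literal braces in the expected JSON are escaped
# by doubling them, while {column} is the declared input variable.
prompt = PromptTemplate(
    template=(
        "You categorize columns of phenotypic TSV files.\n"
        'Example: "participant_id: sub-01 sub-02" -> {{"TermURL": "nb:ParticipantID"}}\n'
        "Column: {column}\n"
        "Respond with JSON only:"
    ),
    input_variables=["column"],
)

llm = ChatOllama(model="gemma")  # the model currently used by the tool
chain = prompt | llm | StrOutputParser()  # prompt -> LLM -> plain string

# chain.invoke(...) runs each step of the chain sequentially
print(chain.invoke({"column": "age: 22 34 28"}))
```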
This depends on the type of values present in the column, i.e. categorical, defined numeric indices, continuous, etc.
The LLM output is an 'AI_Model' object, so it had to be converted into a string in order to further process the LLM output and obtain the required structured output from the code.
Functions (`SexLevel(...)` and `AgeFormat(...)`) are therefore defined for the types of columns that require extra identification beyond the labelling done by the LLM, and the output is structured in the way required by the parser using the `json.dumps()` function, as sketched below.
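A tiny sketch of that post-processing idea (the values are invented; the real functions also handle the BIDS-recommended formats mentioned above):

```python
import json

# Assumed raw text extracted from the LLM response for a sex column
llm_text = '{"TermURL": "nb:Sex"}'
parsed = json.loads(llm_text)
# Extra identification a SexLevel-style function would add from the column contents
parsed["Levels"] = {"M": "male", "F": "female"}
structured = json.dumps(parsed)  # the string form expected by the parser
print(structured)
```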
The following screenshot shows the output of a test script before the categorization codebase was integrated with the parsing code:
The Neurobagel Annotation-tool-AI uses the MIT License.