IITK at SemEval-2024 Task 2: Exploring the Capabilities of LLMs for Safe Biomedical Natural Language Inference for Clinical Trials
Official code implementation
View Paper
·
Report Bug
·
Request Feature
Large Language Models (LLMs) have demonstrated state-of-the-art performance in various natural language processing (NLP) tasks across multiple domains, yet they are prone to shortcut learning and factual inconsistencies. This research investigates LLMs' robustness, consistency, and faithful reasoning when performing Natural Language Inference (NLI) on breast cancer Clinical Trial Reports (CTRs) in the context of SemEval 2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. We examine the reasoning capabilities of LLMs and their adeptness at logical problem-solving. A comparative analysis is conducted on pre-trained language models (PLMs), GPT-3.5, and Gemini Pro under zero-shot settings using a Retrieval-Augmented Generation (RAG) framework, integrating various reasoning chains.
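For context, each NLI4CT instance pairs a natural-language statement with one or two CTR sections and is labeled Entailment or Contradiction. An illustrative example of the dataset's JSON layout is sketched below - the field names follow the public NLI4CT release, but the identifier and statement values here are invented for illustration:

```json
{
  "example-uuid-0001": {
    "Type": "Single",
    "Section_id": "Intervention",
    "Primary_id": "NCT00000000",
    "Statement": "All participants in the primary trial received letrozole.",
    "Label": "Entailment"
  }
}
```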
📂 NLI4CT
|_📁 Gemini
  |_📄 run-gemini-chain.py # Multi-turn conversation using Gemini Pro
  |_📄 prep_results.py # Converting the labels to Entailment/Contradiction
  |_📄 Gemini_results.json # Output of Gemini Pro - explanations and labels
  |_📄 results.json # Final labels
|_📁 GPT-3.5 # Experimentation with GPT-3.5
  |_📄 GPT3.5.py
  |_📄 ChatGPT_results.json
|_📁 training-data # Training data - Clinical Trial Reports (CTRs)
|_📁 Experiments # Experimentation with other models - Flan T5 and Pre-trained Language Models (PLMs)
  |_📄 flant5-label.ipynb
  |_📄 PLMs.ipynb
|_📄 Makefile # Creating conda environment and installing dependencies
|_📄 LICENSE
|_📄 requirements.txt
|_📄 .gitignore
Run the following command -
make
This will create a new conda environment and install the required dependencies. If you do not use Anaconda, run the following command instead to install the dependencies.
pip install -r requirements.txt
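For reference, a minimal sketch of what such a Makefile can look like - the environment name and target below are assumptions for illustration, not the repository's actual Makefile:

```makefile
# Illustrative sketch only; the repository's actual Makefile may differ.
# Note: recipe lines must be indented with a tab character.
ENV_NAME = nli4ct   # hypothetical environment name

.PHONY: all
all:
	conda create -y -n $(ENV_NAME) python=3.10
	conda run -n $(ENV_NAME) pip install -r requirements.txt
```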
Create a .env file in the main directory. Fetch the API keys for GPT-3.5 and Gemini Pro and put them in the .env file as follows -
GOOGLE_API_KEY = "..."
OPENAI_API_KEY = "..."
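The scripts can then pick up these keys at runtime. A minimal sketch of loading them with python-dotenv (assuming the repository's scripts read the keys from the environment; the variable names match the .env entries above):

```python
import os

from dotenv import load_dotenv

# Read the .env file in the current directory into the process environment.
load_dotenv()

google_api_key = os.getenv("GOOGLE_API_KEY")  # used for Gemini Pro
openai_api_key = os.getenv("OPENAI_API_KEY")  # used for GPT-3.5
```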
Run the multi-turn conversation chain using the following command -
python run-gemini-chain.py
Gemini Pro will generate an explanation and a label (Yes/No) for each statement in the dataset; prep_results.py then converts these labels to Entailment/Contradiction to produce the final results.json. A sketch of how the pipeline fits together is shown below.
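For illustration, a minimal sketch of such a multi-turn chain using the google-generativeai SDK - the input file name (test.json), field names (evidence, Statement), prompt wording, and two-turn structure are assumptions, not the repository's exact implementation:

```python
import json
import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
model = genai.GenerativeModel("gemini-pro")

# Hypothetical input: a dict of instances keyed by UUID.
with open("test.json") as f:
    instances = json.load(f)

predictions = {}
for uuid, inst in instances.items():
    chat = model.start_chat(history=[])

    # Turn 1: give the retrieved CTR evidence and ask for step-by-step reasoning.
    chat.send_message(
        "Clinical trial report section:\n"
        f"{inst['evidence']}\n\n"
        "Reason step by step about whether this statement is supported:\n"
        f"{inst['Statement']}"
    )

    # Turn 2: ask for a final Yes/No label conditioned on the reasoning above.
    answer = chat.send_message(
        "Based on your reasoning, answer with one word: Yes if the statement "
        "is entailed by the report, No if it contradicts the report."
    ).text.strip()

    # Map Yes/No to the task's label set, as prep_results.py does.
    predictions[uuid] = {
        "Prediction": "Entailment" if answer.lower().startswith("yes")
        else "Contradiction"
    }

with open("results.json", "w") as f:
    json.dump(predictions, f, indent=2)
```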
The zero-shot evaluation of Gemini Pro yielded an F1 score of 0.69, a consistency score of 0.71, and a faithfulness score of 0.90 on the official test dataset. Our system ranked fifth on faithfulness, sixteenth on consistency, and twenty-first on F1 score. Gemini Pro outperforms GPT-3.5 with a +1.9% improvement in F1 score while maintaining a nearly identical consistency score, and its faithfulness score is +3.5% higher than that of GPT-3.5.