CMIRB (Chinese Medical Information Retrieval Benchmark) is a specialized multi-task dataset designed specifically for medical information retrieval. It consists of data collected from various medical online websites, encompassing 5 tasks and 10 datasets, and has practical application scenarios. The data is available at CMIRB.
CMIRB is developed based on C-MTEB. You can refer to C-MTEB for setting up the environment.
These packages are necessary: transformers, datasets, beir, mteb.
- With our scripts
You can reproduce the results of baai-general-embedding (bge) using the provided python script (see eval_CMIRB.py )
python eval_CMIRB.py --model_name_or_path BAAI/bge-large-zh-v1.5
- With C-MTEB scripts
You can directly evaluate MIR tasks within the C-MTEB code framework. Simply create a new task in Retrieval.py, following the structure of the MedExamRetrieval.
Then, you can conduct evaluations
evaluation = MTEB(tasks=["MedExamRetrieval"], task_langs=['zh', 'zh-CN'])
Model | Dim. | Avg. | MedExam | DuBaike | DXYDisease | Medical | Cmedqa | DXYConsult | Covid | IIYiPost | CSLCite | CSLRel |
---|---|---|---|---|---|---|---|---|---|---|---|---|
text2vec-large-zh | 1024 | 30.56 | 41.39 | 21.13 | 41.52 | 30.93 | 15.53 | 21.92 | 60.48 | 29.47 | 20.21 | 23.01 |
mcontriever(masmarco) | 768 | 35.20 | 51.5 | 22.25 | 44.34 | 38.5 | 22.71 | 20.04 | 56.01 | 28.11 | 34.59 | 33.95 |
bm25 | - | 35.35 | 31.95 | 17.89 | 40.12 | 29.33 | 6.83 | 17.78 | 78.9 | 66.95 | 33.74 | 29.97 |
text-embedding-ada-002 | - | 42.55 | 53.48 | 43.12 | 58.72 | 37.92 | 22.36 | 27.69 | 57.21 | 48.6 | 32.97 | 43.4 |
m3e-large | 768 | 45.25 | 33.29 | 46.48 | 62.57 | 48.66 | 30.73 | 41.05 | 61.33 | 45.03 | 35.79 | 47.54 |
multilingual-e5-large | 1024 | 52.08 | 53.96 | 53.27 | 72.1 | 51.47 | 28.67 | 41.35 | 75.54 | 63.86 | 42.65 | 37.94 |
piccolo-large-zh | 1024 | 54.75 | 43.11 | 45.91 | 70.69 | 59.04 | 41.99 | 47.35 | 85.04 | 65.89 | 44.31 | 44.21 |
gte-large-zh | 1024 | 55.40 | 41.22 | 42.66 | 70.59 | 62.88 | 43.15 | 46.3 | 88.41 | 63.02 | 46.4 | 49.32 |
bge-large-zh-v1.5 | 1024 | 55.40 | 58.61 | 44.26 | 71.71 | 59.6 | 42.57 | 47.73 | 73.33 | 67.13 | 43.27 | 45.79 |
peg | 1024 | 57.46 | 52.78 | 51.68 | 77.38 | 60.96 | 44.42 | 49.3 | 82.56 | 70.38 | 44.74 | 40.38 |
The data preprocessing process can be seen in data_collection_and_processing.
An overview datasets available in CMIRB is provided in the following table:
Name | Hub URL | Description | Query #Samples | Doc #Samples |
---|---|---|---|---|
MedExamRetrieval | CMIRB/MedExamRetrieval | Medical multi-choice exam | 697 | 27,871 |
DuBaikeRetrieval | CMIRB/DuBaikeRetrieval | Medical search query from BaiDu Search | 318 | 56,441 |
DXYDiseaseRetrieval | CMIRB/DXYDiseaseRetrieval | Disease question from medical website | 1,255 | 54,021 |
MedicalRetrieval | CMIRB/MedicalRetrieval | Passage retrieval dataset collected from Alibaba search engine systems in medical domain | 1,000 | 100,999 |
CmedqaRetrieval | CMIRB/CmedqaRetrieval | Online medical consultation text | 3,999 | 100,001 |
DXYConsultRetrieval | CMIRB/DXYConsultRetrieval | Online medical consultation text | 943 | 12,577 |
CovidRetrieval | CMIRB/CovidRetrieval | COVID-19 news articles | 949 | 100,001 |
IIYiPostRetrieval | CMIRB/IIYiPostRetrieval | Medical post articles | 789 | 27,570 |
CSLCiteRetrieval | CMIRB/CSLCiteRetrieval | Medical literature citation prediction | 573 | 36,703 |
CSLRelatedRetrieval | CMIRB/CSLRelatedRetrieval | Medical similar literatue | 439 | 36,758 |
We thank the great tool from FlagEmbedding and the open-source datasets from Chinese NLP community.
If you find this repository useful, please consider citation
@misc{li2024automireffectivezeroshotmedical,
title={AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels},
author={Lei Li and Xiangxu Zhang and Xiao Zhou and Zheng Liu},
year={2024},
eprint={2410.20050},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2410.20050},
}