This repository contains code and instructions for reproducing the experiments in the paper Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding (Findings of EMNLP 2023).
We use one NVIDIA RTX A6000 GPU to run the evaluation code in our experiments. The code is written in Python 3.8. You can install the dependencies as follows.
```bash
git clone --recurse-submodules https://github.com/yuzhimanhua/SciMult
cd SciMult

# get the DPR codebase
mkdir third_party
cd third_party
git clone https://github.com/facebookresearch/DPR.git
cd ../

# create the sandbox
conda env create --file=environment.yml --name=scimult
conda activate scimult

# add `src/` and `third_party/DPR/` to the list of places Python searches for packages
conda develop src/ third_party/DPR/

# download spaCy models
python -m spacy download en_core_web_sm
```
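To confirm the environment is set up correctly before running anything, a quick check like the following can help. This is a minimal sketch, not part of the repository: the `dpr` import assumes the standard package layout of the facebookresearch/DPR codebase cloned above, made importable by `conda develop`.

```python
# env_check.py -- a minimal sanity check for the scimult conda environment (sketch).
import torch
import spacy
import dpr  # resolvable thanks to `conda develop third_party/DPR/` (assumed layout)

print(f"PyTorch {torch.__version__}; CUDA available: {torch.cuda.is_available()}")
nlp = spacy.load("en_core_web_sm")  # fails if the spaCy model was not downloaded
print([token.text for token in nlp("SciMult embeds scientific text.")])
```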
You need to first download the evaluation datasets and the pre-trained models. After you unzip the dataset file, put the folder (i.e., `data/`) under the repository main folder `./`. After you download the four model checkpoints (i.e., `scimult_vanilla.ckpt`, `scimult_moe.ckpt`, `scimult_moe_pmcpatients_par.ckpt`, and `scimult_moe_pmcpatients_ppr.ckpt`), put them under the model folder `./model/`.
Then, you can run the evaluation code for each task:
```bash
cd src

# evaluate fine-grained classification (MAPLE [CS-Conference, Chemistry-MeSH, Geography, Psychology])
./eval_classification_fine.sh

# evaluate coarse-grained classification (SciDocs [MAG, MeSH])
./eval_classification_coarse.sh

# evaluate link prediction under the retrieval setting (SciDocs [Cite, Co-cite], PMC-Patients [PPR])
./eval_link_prediction_retrieval.sh

# evaluate link prediction under the reranking setting (Recommendation)
./eval_link_prediction_reranking.sh

# evaluate search (SciRepEval [Search, TREC-COVID], BEIR [TREC-COVID, SciFact, NFCorpus])
./eval_search.sh
```
The metrics will be shown at the end of the terminal output as well as in `scores.txt`.
If you have some documents (e.g., scientific papers) and want to get an embedding for each of them using SciMult, we provide the following sample code for your reference:
```bash
cd src
python3.8 get_embedding.py
```
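For intuition, here is a minimal sketch of the general bi-encoder recipe such a script follows: encode each document with a Transformer and take its `[CLS]` vector as the embedding. The `bert-base-uncased` encoder below is a placeholder, not SciMult's actual checkpoint-loading code; `get_embedding.py` remains the authoritative reference.

```python
# embed_sketch.py -- a minimal sketch of CLS-pooled document embedding.
# NOTE: `bert-base-uncased` is a placeholder encoder; see get_embedding.py
# for how the released SciMult .ckpt checkpoints are actually loaded.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

docs = [
    "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.",
    "Dense Passage Retrieval for Open-Domain Question Answering.",
]
with torch.no_grad():
    batch = tokenizer(docs, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    out = model(**batch)
    embeddings = out.last_hidden_state[:, 0]  # [CLS] vector per document

print(embeddings.shape)  # (2, 768) for a BERT-base-sized encoder
```

Dense retrievers in the DPR family typically score a query-candidate pair by the dot product (or cosine similarity) of their two vectors.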
NOTE: The performance of SciMult on PMC-Patients reported in our paper is based on the old version of PMC-Patients (i.e., the version available when we wrote the SciMult paper). The PMC-Patients Leaderboard at that time can be found here.
To reproduce our reported performance on the "old" PMC-Patients Leaderboard:
```bash
cd src
./eval_pmc_patients.sh
```
The metrics will be shown at the end of the terminal output as well as in `scores.txt`. The similarity scores that we submitted to the leaderboard can be found at `../output/PMCPatientsPAR_test_out.json` and `../output/PMCPatientsPPR_test_out.json`.
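If you want to inspect those output files programmatically, a schema-agnostic peek like the following works. This is a sketch: it prints the top-level structure and one sample entry rather than assuming a particular JSON schema for the score files.

```python
# peek_scores.py -- a sketch for inspecting the submitted leaderboard score
# files; it reports the structure instead of assuming a fixed schema.
import json

for path in ("../output/PMCPatientsPAR_test_out.json",
             "../output/PMCPatientsPPR_test_out.json"):
    with open(path) as f:
        scores = json.load(f)
    print(f"{path}: top-level {type(scores).__name__} with {len(scores)} entries")
    sample = next(iter(scores.items())) if isinstance(scores, dict) else scores[0]
    print(f"  sample entry: {str(sample)[:100]}")
```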
For the performance of SciMult on the new version of PMC-Patients, please refer to the up-to-date PMC-Patients Leaderboard.
To reproduce our performance on the SciDocs benchmark:
```bash
cd src
./eval_scidocs.sh
```
The output embedding files can be found at `../output/cls.jsonl` and `../output/user-citation.jsonl`.
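Each line of these files is one JSON record. Before handing them to the SciDocs code, you can sanity-check the first record of each; the sketch below prints only the field names, since the exact fields are whatever `eval_scidocs.sh` writes and are not assumed here.

```python
# peek_jsonl.py -- a sketch that prints the keys of the first record in each
# embedding file, without assuming specific field names.
import json

for path in ("../output/cls.jsonl", "../output/user-citation.jsonl"):
    with open(path) as f:
        record = json.loads(f.readline())
    print(path, sorted(record.keys()))
```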
Then, run the adapted SciDocs evaluation code:
```bash
cd ../
git clone https://github.com/yuzhimanhua/SciDocs.git
cd scidocs

# install dependencies
conda deactivate
conda create -y --name scidocs python==3.7
conda activate scidocs
conda install -y -q -c conda-forge numpy pandas scikit-learn=0.22.2 jsonlines tqdm sklearn-contrib-lightning pytorch
pip install pytrec_eval awscli allennlp==0.9 overrides==3.1.0
python setup.py install

# run evaluation
python eval.py
```
The metrics will be shown at the end of the terminal output.
The preprocessed evaluation datasets can be downloaded from here. The aggregate version is released under the ODC-By v1.0 License. By downloading this version you acknowledge that you have read and agreed to all the terms in this license.
Similar to TensorFlow Datasets and Hugging Face's `datasets` library, we only downloaded and prepared publicly available datasets. We distribute these datasets in a specific format, but we do not vouch for their quality or fairness, nor claim that you have a license to use them. It remains your responsibility to determine whether you have permission to use each dataset under its license and to cite its rightful owner.
More details about each constituent dataset are as follows.
| Dataset | Folder | #Queries | #Candidates | Source | License |
| --- | --- | --- | --- | --- | --- |
| MAPLE (CS-Conference) | `classification_fine/` | 261,781 | 15,808 | Link | ODC-By v1.0 |
| MAPLE (Chemistry-MeSH) | `classification_fine/` | 762,129 | 30,194 | Link | ODC-By v1.0 |
| MAPLE (Geography) | `classification_fine/` | 73,883 | 3,285 | Link | ODC-By v1.0 |
| MAPLE (Psychology) | `classification_fine/` | 372,954 | 7,641 | Link | ODC-By v1.0 |
| SciDocs (MAG Fields) | `classification_coarse/` | 25,001 | 19 | Link | CC BY 4.0 |
| SciDocs (MeSH Diseases) | `classification_coarse/` | 23,473 | 11 | Link | CC BY 4.0 |
| SciDocs (Cite) | `link_prediction_retrieval/` | 92,214 | 142,009 | Link | CC BY 4.0 |
| SciDocs (Co-cite) | `link_prediction_retrieval/` | 54,543 | 142,009 | Link | CC BY 4.0 |
| PMC-Patients (PPR, Zero-shot) | `link_prediction_retrieval/` | 100,327 | 155,151 | Link | CC BY-NC-SA 4.0 |
| PMC-Patients (PAR, Supervised) | `pmc_patients/` | 5,959 | 1,413,087 | Link | CC BY-NC-SA 4.0 |
| PMC-Patients (PPR, Supervised) | `pmc_patients/` | 2,812 | 155,151 | Link | CC BY-NC-SA 4.0 |
| SciDocs (Co-view) | `scidocs/` | 1,000 | reranking; 29.98 per query on average | Link | CC BY 4.0 |
| SciDocs (Co-read) | `scidocs/` | 1,000 | reranking; 29.98 per query on average | Link | CC BY 4.0 |
| SciDocs (Cite) | `scidocs/` | 1,000 | reranking; 29.93 per query on average | Link | CC BY 4.0 |
| SciDocs (Co-cite) | `scidocs/` | 1,000 | reranking; 29.95 per query on average | Link | CC BY 4.0 |
| Recommendation | `link_prediction_reranking/` | 137 | reranking; 16.28 per query on average | Link | N/A |
| SciRepEval-Search | `search/` | 2,637 | reranking; 10.00 per query on average | Link | ODC-By v1.0 |
| TREC-COVID in SciRepEval | `search/` | 50 | reranking; 1,386.36 per query on average | Link | ODC-By v1.0 |
| TREC-COVID in BEIR | `search/` | 50 | 171,332 | Link | Apache License 2.0 |
| SciFact | `search/` | 1,109 | 5,183 | Link | Apache License 2.0, CC BY-NC 2.0 |
| NFCorpus | `search/` | 3,237 | 3,633 | Link | Apache License 2.0 |
Our pre-trained models can be downloaded from here. Please refer to the Hugging Face README for more details about the models.
If you find SciMult useful in your research, please cite the following paper:
```bibtex
@inproceedings{zhang2023pre,
  title={Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding},
  author={Zhang, Yu and Cheng, Hao and Shen, Zhihong and Liu, Xiaodong and Wang, Ye-Yi and Gao, Jianfeng},
  booktitle={Findings of EMNLP'23},
  pages={12259--12275},
  year={2023}
}
```