Introduction

Information overload from the ever-increasing volume of scientific publications is a critical issue confronting the whole research community. Searching for highly similar/relevant articles in vast article collections (e.g., PubMed) with traditional information retrieval systems is difficult, because a short query (usually a few keywords) is rarely informative enough to reflect the user's information need. In this context, the similar article recommendation task is very useful: it mitigates information overload by recommending highly similar/relevant articles to users based on an informative query, namely a whole article. In other words, the aim of similar article recommendation is to recommend further articles related to the query articles that users are interested in. This project provides an evaluation framework for this task and benchmarks nearly 20 approaches, covering most existing methods, web systems, and a variety of text representation models.

Evaluation framework

The evaluation framework consists of four modules: dataset preparation, training, scoring, and evaluating, which together form a complete evaluation workflow. In the first stage, the framework requires datasets, which can either be newly curated by experimenters or selected from existing ones. In the training stage, new models can be implemented in the trainer folder in the same way, or existing ones can simply be chosen from the provided list of models. This stage trains multiple models and saves the best checkpoint for each during training. In the third stage, the saved models are reloaded for inference on the test datasets. In the last stage, performance evaluation is conducted on the predictions of the previous stage; commonly used ranking metrics such as MAP and NDCG have been implemented in this framework. After all stages finish, the framework yields the evaluation metrics of the different models on the different datasets in both scenarios.
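For reference, the following is a minimal, self-contained sketch of what the MAP@k and NDCG@k metrics mentioned above compute for a single ranked list of candidates with binary relevance labels. It is an illustration of the metrics themselves, not the repository's actual evaluation code.

```python
import math
from typing import Sequence

def average_precision_at_k(relevance: Sequence[int], k: int) -> float:
    """AP@k for one ranked list of binary relevance labels (1 = relevant).
    MAP@k is the mean of AP@k over all queries."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

def ndcg_at_k(relevance: Sequence[int], k: int) -> float:
    """NDCG@k for one ranked list of binary relevance labels."""
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(relevance[:k], start=1))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0

# Example: one query whose top-5 ranked candidates are labelled 1 (relevant)
# or 0 (irrelevant).
print(average_precision_at_k([1, 0, 1, 1, 0], k=5))  # ~0.806
print(ndcg_at_k([1, 0, 1, 1, 0], k=5))               # ~0.906
```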

[Figure: overview of the evaluation framework]

The framework considers two recommendation scenarios: article-oriented (AO) and user-oriented (UO), as depicted in the framework diagram. In the article-oriented scenario, candidate articles are ranked by their similarity to the query article. In the user-oriented scenario, the ranking of candidates is determined by their similarities to two sets of articles (positives and negatives). To facilitate reuse, the framework abstracts several programming interfaces so that the scoring process of each method can be implemented independently; a sketch of such an interface is given below.
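To make the two scenarios concrete, here is a hedged sketch of the kind of scoring interface the framework abstracts. The class names, the `embed` callable, and the way the UO score combines positive and negative similarities are illustrative assumptions, not the project's actual interfaces.

```python
import numpy as np

class ArticleOrientedScorer:
    """AO: rank candidates by cosine similarity to a single query article."""

    def __init__(self, embed):
        self.embed = embed  # assumed callable: article text -> 1-D numpy vector

    def rank(self, query: str, candidates: list[str]) -> list[int]:
        q = self.embed(query)
        sims = [self._cos(q, self.embed(c)) for c in candidates]
        return list(np.argsort(sims)[::-1])  # candidate indices, best first

    @staticmethod
    def _cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class UserOrientedScorer(ArticleOrientedScorer):
    """UO: rank candidates by similarity to the positive article set minus
    similarity to the negative set; no explicit query article is used.
    (The mean-difference scoring rule here is an illustrative assumption.)"""

    def rank_by_sets(self, positives: list[str], negatives: list[str],
                     candidates: list[str]) -> list[int]:
        pos = [self.embed(p) for p in positives]
        neg = [self.embed(n) for n in negatives]
        scores = []
        for c in candidates:
            v = self.embed(c)
            scores.append(np.mean([self._cos(v, p) for p in pos])
                          - np.mean([self._cos(v, n) for n in neg]))
        return list(np.argsort(scores)[::-1])
```

A concrete method then only needs to supply an `embed` callable, e.g., averaged word vectors or the output of a BERT encoder.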

Benchmarked approaches and performance

The dataset

The evaluation is currently conducted on RELISH, a large dataset for benchmarking biomedical similar-article recommenders. It was curated via crowdsourcing, with more than 1,500 biomedical scientists worldwide from various research areas participating, and its rigorous quality control ensures a high standard of quality. To benchmark AO methods, the queries of RELISH are split into standard training/test/validation folds with a ratio of 8:1:1. Benchmarking UO methods is slightly different, however, as they learn to recommend from relevant and irrelevant articles instead of query-candidate pairs. To cope with this, the candidates under each query are split into training/test/validation parts to build the evaluation dataset for UO methods.
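The two splitting schemes can be illustrated as follows. This is a simplified sketch, not the repository's data preparation code; in particular, the source only states the 8:1:1 ratio for the AO query split, so reusing it for the per-query candidate split is an assumption made here for illustration.

```python
import random

def split_ao(query_ids: list[str], seed: int = 42):
    """AO benchmark: split the query articles 8:1:1 into train/valid/test."""
    rng = random.Random(seed)
    ids = query_ids[:]
    rng.shuffle(ids)
    n_train, n_valid = int(0.8 * len(ids)), int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]

def split_uo(candidates_by_query: dict[str, list[str]], seed: int = 42):
    """UO benchmark: split each query's candidate list into train/valid/test,
    so models learn from relevant/irrelevant articles rather than
    query-candidate pairs. (The 8:1:1 ratio here is assumed.)"""
    rng = random.Random(seed)
    folds = {"train": {}, "validation": {}, "test": {}}
    for query, cands in candidates_by_query.items():
        cands = cands[:]
        rng.shuffle(cands)
        n_train, n_valid = int(0.8 * len(cands)), int(0.1 * len(cands))
        folds["train"][query] = cands[:n_train]
        folds["validation"][query] = cands[n_train:n_train + n_valid]
        folds["test"][query] = cands[n_train + n_valid:]
    return folds
```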

Fine-tuning BERT models as article recommenders

In addition to several shallow text representation models, BERT, the state-of-the-art language modelling technique in NLP, is also applied to this task. Two pre-trained models, AllenAI's SPECTER and BioBERT, are considered in the evaluation. To apply BERT to article recommendation, a triplet loss is used to optimize the BERT models for the AO scenario, where a valid training instance is a triplet: a query article and two kinds of candidates, i.e., a relevant candidate and an irrelevant candidate. Tuning with the triplet loss in this scenario minimizes the distance between the query article and the relevant article while maximizing the distance between the query article and the irrelevant article. In the UO scenario, the tuning process must instead learn user preference from the relevant and irrelevant article sets, since explicit queries are lacking. In this regard, a contrastive loss is used to minimize the distances among the positives while maximizing the distances between positives and negatives.
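The two objectives can be sketched in PyTorch as below. These are the textbook formulations of the triplet margin loss and the contrastive (margin) loss, not the repository's exact training code; the embeddings are assumed to come from encoding article text with the BERT model being tuned.

```python
import torch
import torch.nn.functional as F

def triplet_loss(query_emb: torch.Tensor,
                 pos_emb: torch.Tensor,
                 neg_emb: torch.Tensor,
                 margin: float = 1.0) -> torch.Tensor:
    """AO objective: the query should be closer to the relevant candidate
    than to the irrelevant one, by at least `margin`."""
    return F.triplet_margin_loss(query_emb, pos_emb, neg_emb, margin=margin)

def contrastive_loss(emb_a: torch.Tensor,
                     emb_b: torch.Tensor,
                     label: torch.Tensor,
                     margin: float = 1.0) -> torch.Tensor:
    """UO objective: pairs with label 1 (both relevant to the user) are pulled
    together; pairs with label 0 are pushed at least `margin` apart."""
    dist = F.pairwise_distance(emb_a, emb_b)
    return torch.mean(label * dist.pow(2)
                      + (1 - label) * torch.clamp(margin - dist, min=0).pow(2))
```

In training, these losses would simply be backpropagated through the BERT encoder in the usual fine-tuning loop.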

Benchmark

The following table shows the experimental results of article-oriented (AO) recommenders on the RELISH dataset:

| Approach Group | Approach | MAP@5 | MAP@10 | MAP@15 | NDCG@5 | NDCG@10 | NDCG@15 | Avg. |
|---|---|---|---|---|---|---|---|---|
| Random | Random | 79.33 | 77.22 | 75.41 | 80.7 | 77.67 | 76.4 | 77.79 |
| Lexicon | XPRC | 84.34 | 81.98 | 80.59 | 85.32 | 82.43 | 81.78 | 82.74 |
| | BM25 | 88.91 | 86.72 | 84.54 | 89.48 | 87.39 | 86.21 | 87.21 |
| | PMRA | 90.3 | 87.57 | 85.75 | 90.95 | 88.4 | 87.45 | 88.4 |
| Word Embedding | fastText | 85.75 | 82.81 | 81.79 | 86.79 | 83.79 | 83.12 | 84.01 |
| | GloVe | 86.71 | 83.72 | 82.27 | 87.5 | 84.24 | 83.83 | 84.71 |
| | BioWordVec | 89.84 | 86.51 | 84.67 | 89.9 | 86.67 | 85.53 | 87.19 |
| Sentence Embedding | InferSent-v2 | 82.14 | 79.93 | 78.74 | 83.97 | 81.45 | 80.76 | 81.17 |
| | InferSent-v1 | 85.21 | 82.16 | 80.41 | 86.56 | 83.31 | 82.35 | 83.33 |
| | WikiSentVec | 87.92 | 85.23 | 83.4 | 88.65 | 85.74 | 84.81 | 85.96 |
| | BioSentVec | 90.76 | 88.1 | 86.16 | 90.05 | 87.76 | 86.89 | 88.29 |
| Document Embedding | LDA | 85.44 | 82.66 | 80.36 | 86.51 | 82.91 | 81.31 | 83.2 |
| | Doc2vec | 88.31 | 85.83 | 84.61 | 89.02 | 86.23 | 85.57 | 86.6 |
| BERT | BioBERT | 88.14 | 85.81 | 83.9 | 88.97 | 86.29 | 85.1 | 86.37 |
| | SPECTER | 92.27 | 90 | 88.36 | 91.47 | 89.12 | 88.42 | 89.94 |
| BERT with fine-tuning | BioBERT (tuned) | 94.11 | 92.1 | 90.64 | 92.85 | 90.72 | 89.93 | 91.73 |
| | SPECTER (tuned) | 93.76 | 91.65 | 90.39 | 93.4 | 91.2 | 90.52 | 91.82 |

Although BM25 and PMRA show better performance than most of the methods, the AO benchmark suggests that they are still inferior to several text representation methods; the tuned SPECTER and BioBERT models outperform them by a large margin. Domain-specific models trained specifically for biomedicine (e.g., BioSentVec and BioWordVec) are superior to their generic counterparts (e.g., InferSent, WikiSentVec, fastText, and GloVe). As shown in this benchmark, fine-tuning brings significant improvements over the original models, suggesting that fine-tuning is helpful for this task. After fine-tuning, SPECTER is the best-performing method on average among all the AO methods: it is not only better than the original SPECTER but, more importantly, significantly outperforms PMRA, the underlying method of the "Similar Articles" feature in the PubMed interface.

Similarly, the experimental results of user-oriented (UO) recommenders on the RELISH dataset are as follows:

| Approach Group | Approach | MAP@5 | MAP@10 | MAP@15 | NDCG@5 | NDCG@10 | NDCG@15 | Avg. |
|---|---|---|---|---|---|---|---|---|
| Random | Random | 78.14 | 76.32 | 75.72 | 80.73 | 77.65 | 76.71 | 77.55 |
| Web System | MScanner | 87.19 | 84.92 | 83.73 | 87.16 | 84.48 | 83.21 | 85.12 |
| | BioReader | 88.23 | 86.02 | 85.21 | 87.83 | 85.13 | 84.41 | 86.14 |
| | MedlineRanker | 88.69 | 86.33 | 85.32 | 88.1 | 85.6 | 84.36 | 86.4 |
| Word Embedding | GloVe | 88.59 | 86.51 | 85.31 | 88.19 | 85.61 | 84.45 | 86.44 |
| | fastText | 88.88 | 86.73 | 85.23 | 88.35 | 85.79 | 84.13 | 86.52 |
| | BioWordVec | 89.24 | 87.17 | 86 | 88.59 | 86.04 | 84.58 | 86.94 |
| Sentence Embedding | InferSent-v2 | 88.71 | 86.5 | 85.59 | 88.25 | 85.63 | 84.37 | 86.51 |
| | InferSent-v1 | 89.17 | 87.11 | 86.36 | 88.57 | 86.05 | 84.93 | 87.03 |
| | WikiSentVec | 90.09 | 87.97 | 86.83 | 89.16 | 86.81 | 85.55 | 87.74 |
| | BioSentVec | 91.03 | 89.15 | 88.16 | 89.89 | 87.63 | 86.65 | 88.75 |
| Document Embedding | LDA | 86.22 | 83.7 | 83.43 | 86.46 | 83.51 | 82.86 | 84.36 |
| | Doc2vec | 88.29 | 85.89 | 84.64 | 87.99 | 85.12 | 83.62 | 85.93 |
| BERT | BioBERT | 89.56 | 87.01 | 86.17 | 89.71 | 87.38 | 86.7 | 87.76 |
| | SPECTER | 90.65 | 88.49 | 87.54 | 90.52 | 88.66 | 87.78 | 88.94 |
| BERT with fine-tuning | SPECTER (tuned) | 90.91 | 88.66 | 88.23 | 90.66 | 88.74 | 88.09 | 89.22 |
| | BioBERT (tuned) | 90.81 | 88.59 | 88.04 | 90.81 | 88.88 | 88.2 | 89.22 |

Among the three web-based recommendation systems (MScanner, BioReader, and MedlineRanker), MScanner performed the worst. In addition, the BERT-like models improved on the existing recommendation systems by wide margins on the UO benchmark, which suggests that more advanced techniques can further boost these systems.
