RELISH Preprocessing

The RELISH preprocessing repository manages the retrieval and transformation of Medline articles for use in software pipelines built around dictionary-based Named Entity Recognition (NER), word embeddings, and document-level embeddings, all targeting document-to-document relevance, similarity, and recommendation. At present, 'relish-preprocessing' processes a list of articles that follows the RELISH format.

Table of Contents

  1. About
  2. Data input
  3. Process
    1. Retrieving PMID Articles
    2. Generating Ground Truth Data: PMID Pairs and Relevance Labels
    3. Text preprocessing for Generating Embeddings
    4. Splitting the Data
  4. Getting Started
  5. Code Implementation
  6. Data output
  7. Tutorials

Data input

RELISH is an expert-curated database designed for benchmarking document similarity in biomedical literature. Version 1 of the database was downloaded from its corresponding FigShare record on the 24th of January 2022. It consists of a JSON file with PubMed IDs (PMIDs) and the corresponding document-to-document relevance assessments with respect to other PMIDs. Relevance is categorized as "relevant", "partial" or "irrelevant".

Please be aware that the files might not have been uploaded to the repository on the same date as they were initially downloaded.

Process

The following section outlines the primary processes applied to the input data of the RELISH dataset.

Retrieving PMID Articles

  • Iteration through articles in the RELISH JSON format using the BioC API to obtain XML files containing identifiers (PMIDs), titles, and abstracts. Refer to the provided XML sample files for RELISH. It's also possible to retrieve this information from the bulk download from Medline using the JATS format.
  • Recording missing PMIDs, indicating PMIDs for which the retrieval process failed or whose title/abstract is not available as text. Refer to the list of missing PMIDs for RELISH.
  • Creation of a TSV file with PMID, title and abstract. Review the TSV sample file for RELISH.
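
The XML-to-TSV step above can be sketched as follows. This is a minimal illustration and not the repository's own script: it assumes the BioC XML layout in which each `document` carries an `id` and `passage` elements typed through `<infon key="type">`. The sample XML and the function names `parse_bioc` and `to_tsv` are hypothetical, and treating a PMID as missing when either its title or its abstract is empty is one reading of the description above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the assumed BioC XML layout (not real RELISH data).
SAMPLE_BIOC_XML = """<collection>
  <document>
    <id>12345</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>An example title</text>
    </passage>
    <passage>
      <infon key="type">abstract</infon>
      <offset>17</offset>
      <text>Background: an example abstract.</text>
    </passage>
  </document>
</collection>"""


def parse_bioc(xml_text):
    """Return a list of (pmid, title, abstract) tuples; absent fields become ''."""
    records = []
    root = ET.fromstring(xml_text)
    for doc in root.iter("document"):
        pmid = (doc.findtext("id") or "").strip()
        fields = {"title": [], "abstract": []}
        for passage in doc.iter("passage"):
            kind = passage.findtext("infon[@key='type']")
            text = (passage.findtext("text") or "").strip()
            if kind in fields and text:
                fields[kind].append(text)
        records.append((pmid, " ".join(fields["title"]), " ".join(fields["abstract"])))
    return records


def to_tsv(records):
    """Serialize complete records as TSV text and list PMIDs lacking a title or abstract."""
    missing = [pmid for pmid, title, abstract in records if not (title and abstract)]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for pmid, title, abstract in records:
        if title and abstract:
            writer.writerow([pmid, title, abstract])
    return buffer.getvalue(), missing
```

Calling `to_tsv(parse_bioc(SAMPLE_BIOC_XML))` returns the TSV text together with the list of PMIDs that would go into the missing-PMIDs file.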

Generating Ground Truth Data: PMID Pairs and Relevance Labels

  • Creation of a TSV file that serves as a reference dataset from the RELISH JSON file. It comprises all pairs of PMIDs along with their corresponding relevance, labeled as 0, 1, or 2. These labels represent the levels of relevance, specifically "non-relevant", "partially-relevant", and "relevant" respectively. This structured file provides the ground truth for further analysis and evaluation.
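
As a rough sketch of this mapping, the snippet below flattens RELISH-style entries into (reference PMID, assessed PMID, label) rows. The JSON layout used here (a `pmid` key plus a `response` object holding `relevant`, `partial` and `irrelevant` lists) is an assumption about the input file, and the sample entry and function names are hypothetical.

```python
import json

# Label encoding described above: 0 = non-relevant, 1 = partially relevant, 2 = relevant.
LABELS = {"irrelevant": 0, "partial": 1, "relevant": 2}

# Hypothetical sample entry; the key names are an assumption about the RELISH JSON layout.
SAMPLE_JSON = (
    '[{"pmid": "111", "response": '
    '{"relevant": ["222"], "partial": ["333"], "irrelevant": ["444"]}}]'
)


def build_ground_truth(relish_json):
    """Flatten RELISH-style entries into (reference, assessed, label) rows."""
    rows = []
    for entry in json.loads(relish_json):
        reference = str(entry["pmid"])
        for category, label in LABELS.items():
            for assessed in entry["response"].get(category, []):
                rows.append((reference, str(assessed), label))
    return rows


def rows_to_tsv(rows):
    """Render the rows as tab-separated lines, one pair per line."""
    return "".join(f"{ref}\t{assessed}\t{label}\n" for ref, assessed, label in rows)
```

Each reference PMID yields one row per assessed PMID, so the number of rows equals the total number of assessments in the JSON file.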

At this stage, we have Medline articles in two formats: XML and plain-text TSV. We use XML files for NER and the TSV file for word embedding and document embedding approaches.

Text preprocessing for Generating Embeddings

For the purpose of generating embeddings, several cleaning and pruning steps are undertaken. The following outlines the cleaning processes applied to the XML and TSV files:

  • Removal of "structural expressions" within abstracts. Certain expressions, such as "Results:" and "Methodology:" recommended by journals for structured abstracts, are removed from the text to avoid introducing noise to the embeddings.
    • Initial analysis of the text to identify the most common "structural words", and creation of a plain-text file and a JSON file listing them.
  • Additional common steps for word embeddings include:
    • Converting all text to lowercase.
    • Eliminating punctuation marks (excluding hyphens, as hyphenated words might carry a distinct meaning).
    • Removal of special characters.
    • Tokenization.
    • Stopwords removal.
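
The cleaning steps listed above can be sketched with the standard library alone. This is a simplified illustration and not the repository's implementation: the `STRUCTURAL_WORDS` and `STOPWORDS` values are small placeholder lists, whitespace splitting stands in for whatever tokenizer the project uses, and `clean_text` is a hypothetical name.

```python
import re

# Placeholder lists for illustration; the project derives its own structural words
# from the corpus and may use a different stopword list.
STRUCTURAL_WORDS = ["results", "methodology", "background", "conclusions"]
STOPWORDS = {"the", "of", "a", "an", "in", "and", "were", "was", "to"}

# Matches a structural word immediately followed by a colon, e.g. "results:".
STRUCTURAL_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, STRUCTURAL_WORDS)) + r"):"
)


def clean_text(text):
    """Lowercase, drop structural expressions, strip punctuation (keeping hyphens),
    tokenize on whitespace and remove stopwords."""
    text = text.lower()
    text = STRUCTURAL_PATTERN.sub(" ", text)
    # Replace everything except ASCII letters, digits, whitespace and hyphens.
    text = re.sub(r"[^a-z0-9\s-]", " ", text)
    # Trim stray hyphens at token edges so only word-internal hyphens survive.
    tokens = [token.strip("-") for token in text.split()]
    return [token for token in tokens if token and token not in STOPWORDS]
```

For example, `clean_text("Results: The IL-2 levels were elevated in treated mice.")` returns `['il-2', 'levels', 'elevated', 'treated', 'mice']`, which shows the hyphenated term kept intact.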

After this cleaning, the retrieved articles in TSV format are saved as a NumPy array. A sample of the processed TSV file and the NumPy arrays are available for RELISH.

Splitting the Data

This script is designed to split a dataset into training and testing sets while considering specific criteria. The input data is assumed to be in TSV format and represents pairs of articles with associated relevance scores.

  • The input data is loaded from the file 'RELISH.tsv' using pandas.

  • Initial Data Analysis:

    • The unique reference and assessed articles are identified.
    • Articles that exist as reference but not as assessed are identified and stored in onlyRefDocs.
  • Filtering Data:

    • Rows corresponding to reference articles that do not exist as assessed articles are extracted and stored in onlyRefDocs_data.
    • Rows corresponding to reference articles that exist as assessed articles are stored in refRelMatrix.
  • Save Excluded Pairs:

    • The pairs being removed during the filtering process are saved in a file named 'valid.tsv'.
  • Loop for 1000 Iterations:

    • For each iteration, a different random seed is generated for reproducibility.
    • The onlyRefDocs_data is split into training and testing sets using the train_test_split function, with 80/20 ratio and stratification based on relevance.
    • The refRelMatrix is filtered based on the training set.
    • The error from the 80% target split is calculated, and the best split is updated if the error is smaller.
  • Report Best Results:

    • After the loop, the script reports the details of the best split found, including the sizes of train and test sets and the percentage of pairs in each.
    • The best train and test splits are saved in separate files named 'train_split.tsv' and 'test_split.tsv'.
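
The first two steps (initial analysis and filtering) can be illustrated with plain Python sets. The script itself uses pandas, so this is a dependency-free stand-in and not the actual code; the sample rows and the function name `partition_pairs` are hypothetical. The seeded search over 1000 splits is not reproduced here because its filtering rule and error measure are not spelled out above.

```python
import csv
import io

# Hypothetical ground-truth rows: reference PMID, assessed PMID, relevance label.
SAMPLE_TSV = "101\t102\t2\n101\t103\t0\n102\t101\t1\n104\t101\t2\n"


def partition_pairs(tsv_text):
    """Separate pairs whose reference article is never an assessed article
    (onlyRefDocs_data) from all remaining pairs (refRelMatrix)."""
    rows = [tuple(row) for row in csv.reader(io.StringIO(tsv_text), delimiter="\t")]
    reference_ids = {reference for reference, _, _ in rows}
    assessed_ids = {assessed for _, assessed, _ in rows}
    # Articles that appear as a reference but never as an assessed article.
    only_ref_docs = reference_ids - assessed_ids
    only_ref_rows = [row for row in rows if row[0] in only_ref_docs]
    ref_rel_rows = [row for row in rows if row[0] not in only_ref_docs]
    return only_ref_docs, only_ref_rows, ref_rel_rows
```

In the sample, article `104` is used as a reference but never assessed, so its single pair lands in the first group while the other three pairs form the second group.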

Getting Started

To get started with this project, follow these steps:

Clone the Repository

First, clone the repository to your local machine using the following command:

Using HTTPS:

git clone https://github.com/zbmed-semtec/relish-preprocessing.git

Using SSH:

Ensure you have set up SSH keys in your GitHub account.

git clone git@github.com:zbmed-semtec/relish-preprocessing.git

Create a virtual environment and install dependencies

To create and activate a virtual environment within your repository, run the following commands:

python3 -m venv .venv 
source .venv/bin/activate   # On Windows, use '.venv\Scripts\activate' 

To confirm that the virtual environment is activated and to check the location of your Python interpreter, run the following command:

which python    # On Windows command prompt, use 'where python'
                # On Windows PowerShell, use 'Get-Command python'

The code is stable with Python 3.6 and higher. The required Python packages are listed in the requirements.txt file. To install the required packages, run the following command:

pip install -r requirements.txt

To deactivate the virtual environment after running the project, run the following command:

deactivate

Code Implementation

Code scripts for the following processes can be found here:

Data output

The output files generated by the complete RELISH preprocessing pipeline include:

  • A RELISH TSV file consisting of three columns [PMID | title | abstract] used for downstream preprocessing.
  • Individual RELISH XML files (one per PubMed article).
  • A processed RELISH.npy file where each row is a list of three NumPy arrays representing the PMID, tokenized title, and tokenized abstract of a document.
  • A RELISH ground truth TSV file with three columns [Reference PMID | Assessed PMID | Relevance score (0, 1 or 2)].
  • The RELISH pre-processed tokens.npy file split into Train and Test TSV files using an 80/20 split that maximizes the Train pairs as per the RELISH database.
  • The RELISH ground truth TSV file split into two sets of pairs: Train and Test TSV files. These pairs are those that exist as part of the RELISH database.

Tutorials

Tutorials are accessible in the form of Jupyter notebooks for the following processes: