biohackathon-2022-project-24

Abstract

Last year, during the Biohackathon 2021, under the project "FAIRX: Quantitative bias assessment in ELIXIR biomedical data resources", we assessed the distribution of sex within two databases, EGA and dbGaP. The evolution of the project can be found at https://github.com/elixir-europe/biohackathon-projects-2021/tree/main/projects/35, with an article to be submitted for publication. Rather than analysing the available datasets, this time we concentrate on the scientific literature to uncover sex imbalance in published research. We will leverage the Europe PMC repository (https://europepmc.org/) and its API to access and mine the content of free-text articles published there. The end result would be an automated text parser that extracts mentions of sex, if any, from the reported information of preclinical and clinical studies, and thus provides insight into the current state of sex imbalance in research publications. The project will combine several strategies to ensure access to the data. First, it will concentrate on the specific parts of the articles where the data should be located (Materials and Methods, as well as additional files). We will prioritise article types of interest, such as publications linked to recent clinical trials and preclinical studies. We will also review the status of the policies and guidelines on sex disclosure in scientific publications adopted by the different journals (such as Key Resources Tables, STAR Methods and the SAGER guidelines). Recommendations for fairer reporting of sex in scientific publications will be drawn from the analysis of the results, which will be presented in a form suitable for future publication.

Topics

Data Platform, Federated Human Data, Interoperability Platform, Machine learning

Teams

Lead(s)

Collaborators

Process

Data retrieval

  1. Getting a list of PMCIDs from a specific query submitted directly to Europe PMC
  2. Using that list of PMCIDs to request and download the XML resource for each article
  3. Parsing the XML file to build the different fields
  4. Transforming the XML file into a JSON file
  5. Extracting the different fields
  6. Extracting the relevant data
  7. Feeding that data into the analysis
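
The retrieval steps above could be sketched roughly as follows. This is a minimal illustration, not the code used in this repository. It assumes the public Europe PMC REST endpoints (`search` with `resultType=idlist`, and `{PMCID}/fullTextXML`), a search response shaped like `{"resultList": {"result": [...]}}`, and JATS-style full-text XML. All function names, the `SEX_TERMS` pattern and the sample document are hypothetical and only serve the example.

```python
import json
import re
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Europe PMC REST base URL (assumed public web service endpoint).
BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def build_search_url(query, page_size=25):
    """Step 1: URL for a search that returns identifiers only, as JSON."""
    params = {"query": query, "format": "json",
              "resultType": "idlist", "pageSize": page_size}
    return f"{BASE}/search?{urllib.parse.urlencode(params)}"


def extract_pmcids(search_json):
    """Step 1: collect PMCIDs from a parsed search response.

    Assumes the shape {"resultList": {"result": [{"pmcid": ...}]}};
    records without a PMCID are skipped.
    """
    results = search_json.get("resultList", {}).get("result", [])
    return [r["pmcid"] for r in results if r.get("pmcid")]


def fulltext_url(pmcid):
    """Step 2: URL of the full-text XML for one PMCID."""
    return f"{BASE}/{pmcid}/fullTextXML"


def fetch(url):
    """Download a URL and return its body as text (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def xml_to_record(xml_text):
    """Steps 3-5: parse JATS-style XML into a dict of named fields."""
    root = ET.fromstring(xml_text)
    title = root.find(".//article-title")
    sections = {}
    for sec in root.findall("./body/sec"):  # top-level sections only
        heading = sec.find("title")
        name = "".join(heading.itertext()).strip() if heading is not None else "untitled"
        paragraphs = ["".join(p.itertext()).strip() for p in sec.iter("p")]
        sections[name] = " ".join(paragraphs)
    return {
        "title": "".join(title.itertext()).strip() if title is not None else "",
        "sections": sections,
    }


# Step 6 (hypothetical): a deliberately naive keyword pattern for sex mentions.
SEX_TERMS = re.compile(r"\b(male|female)s?\b", re.IGNORECASE)


def count_sex_terms(text):
    """Count occurrences of 'male(s)' and 'female(s)' in a piece of text."""
    counts = {}
    for match in SEX_TERMS.finditer(text):
        key = match.group(1).lower()
        counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up document so the sketch runs without network access.
SAMPLE_XML = """<article><front><article-meta><title-group>
<article-title>Example study</article-title></title-group></article-meta></front>
<body><sec><title>Methods</title>
<p>We enrolled 12 male and 15 female mice.</p><p>Females were housed separately.</p></sec>
</body></article>"""

if __name__ == "__main__":
    record = xml_to_record(SAMPLE_XML)
    print(json.dumps(record, indent=2))  # step 4: JSON form of the record
    print(count_sex_terms(record["sections"]["Methods"]))
```

With network access, the same functions would chain as `extract_pmcids(json.loads(fetch(build_search_url(query))))` followed by `xml_to_record(fetch(fulltext_url(pmcid)))` for each identifier. The keyword count only stands in for step 6. Real extraction would need to handle negation, species and reported sample sizes.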

Deployment of LLM models on Marenostrum

  1. Loading the required modules to run PyTorch:
     module load mkl/2024.0 nvidia-hpc-sdk/23.11-cuda11.8 openblas/0.3.27-gcc cudnn/9.0.0-cuda11 tensorrt/10.0.0-cuda11 impi/2021.11 hdf5/1.14.1-2-gcc gcc/11.4.0 python/3.11.5-gcc nccl/2.19.4 pytorch ncurses tmux
  2. Installation: create the virtual environment and install the requirements listed in requirements.txt (the file is tailored for Marenostrum; to install the repository on a different machine, first remove the Marenostrum-specific paths from it):
     python -m venv venv
     source venv/bin/activate
     pip install -r requirements.txt
  3. Running
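
Once the modules are loaded and the requirements are installed, a quick sanity check of the environment can save a failed job submission. The snippet below is a hypothetical helper, not part of this repository. It reports the Python version, whether a virtual environment is active, and whether PyTorch is importable and can see a CUDA device.

```python
import importlib.util
import sys


def environment_report():
    """Summarise the Python/PyTorch setup without failing if torch is absent."""
    report = {
        "python": sys.version.split()[0],
        # Inside a venv, sys.prefix differs from the base interpreter prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

On a login node without GPUs, `cuda_available` may be `False` even when the setup is correct, so the check is more informative when run inside a GPU allocation.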

Analysis

Useful links

Biohackathon

EuropePMC