Last year, during the Biohackathon 2021, under the project "FAIRX: Quantitative bias assessment in ELIXIR biomedical data resources", we assessed the distribution of sex within two databases, EGA and dbGaP. The progress of that project is documented at https://github.com/elixir-europe/biohackathon-projects-2021/tree/main/projects/35, and an article is to be submitted for publication. Rather than analysing the available datasets, this time we concentrate on the scientific literature to uncover sex imbalance in published research. We will leverage the EuroPMC repository (https://europepmc.org/) and its API to access and mine the free-text content of articles published there. The end result would be an automated text parser that extracts mentions of sex, if any, from the information reported in preclinical and clinical studies, and thus provides insight into the current state of sex imbalance in research publications. The project will combine several strategies to ensure access to the data. First, it will concentrate on the specific parts of the articles where the data should be located (Materials and Methods, as well as additional files). We will prioritise article types of interest, such as publications linked to recent clinical trials and preclinical studies. We will also review the status of the policies and guidelines on sex disclosure in scientific publications adopted by the different journals (such as Key Resources Tables, STAR methods and SAGER guidelines). Recommendations for fairer reporting of sex in scientific publications will be drawn from the analysis of the results, which will be presented in a form suitable for future publication.
Data Platform, Federated Human Data, Interoperability Platform, Machine learning
- Olivier Philippe - BSC - olivier.philippe@bsc.es - Oliph
- Blanca Calvo - BSC - bcalvo.bsc@gmail.com - BlancaCalvo
- Getting a list of PMCIDs from a specific query sent directly to EuropePMC
- Using that list of PMCIDs to request and download the XML resources
- Parsing the XML file to build the different fields
- Transforming the XML file into a JSON file
- Extracting the different fields
- Extracting the relevant data
- Feeding that data into the analysis
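The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in this repository: the function names, the section-title heuristic and the list of sex terms are all hypothetical, and the XML is assumed to follow the JATS-style layout (`sec`, `title`, `p`) that EuropePMC full-text records generally use. The two URLs follow the public EuropePMC REST service (`search` and `fullTextXML`).

```python
import json
import re
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

REST_ROOT = "https://www.ebi.ac.uk/europepmc/webservices/rest"

# Illustrative heuristics only (assumptions, not the project's actual rules).
METHODS_HINT = re.compile(r"method|material|participant|subject", re.I)
SEX_TERMS = re.compile(r"\b(females?|males?|women|woman|men|man)\b", re.I)


def search_url(query, cursor_mark="*", page_size=100):
    """Build the URL for one page of JSON search results."""
    params = {"query": query, "format": "json", "resultType": "lite",
              "pageSize": page_size, "cursorMark": cursor_mark}
    return f"{REST_ROOT}/search?{urllib.parse.urlencode(params)}"


def pmcids_from_page(page):
    """Collect the PMCIDs found in one decoded search response."""
    results = page.get("resultList", {}).get("result", [])
    return [r["pmcid"] for r in results if "pmcid" in r]


def fulltext_url(pmcid):
    """URL of the full-text XML for one open-access article."""
    return f"{REST_ROOT}/{pmcid}/fullTextXML"


def fetch(url):
    """Download a resource as text (network access required)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def article_to_dict(xml_text):
    """Parse the XML into {"title": str, "sections": {heading: text}}."""
    root = ET.fromstring(xml_text)
    title_el = root.find(".//article-title")
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    sections = {}
    for sec in root.iter("sec"):
        heading = sec.find("title")
        name = "".join(heading.itertext()).strip() if heading is not None else ""
        # Only direct <p> children; nested <sec> elements are visited separately.
        paragraphs = [" ".join("".join(p.itertext()).split())
                      for p in sec.findall("p")]
        # Sections sharing a heading overwrite each other in this sketch.
        sections[name] = " ".join(paragraphs)
    return {"title": title, "sections": sections}


def article_to_json(xml_text):
    """Serialise the parsed article as a JSON string."""
    return json.dumps(article_to_dict(xml_text), indent=2)


def sex_mentions(article):
    """Count sex-related terms inside methods-like sections."""
    counts = {}
    for name, text in article["sections"].items():
        if METHODS_HINT.search(name):
            for match in SEX_TERMS.finditer(text):
                term = match.group(1).lower()
                counts[term] = counts.get(term, 0) + 1
    return counts
```

A typical run would call `fetch(search_url(...))`, decode the response with `json.loads`, pass it to `pmcids_from_page`, and then feed each `fetch(fulltext_url(pmcid))` result to `article_to_dict` and `sex_mentions`. Keeping the network calls separate from the parsing functions makes it possible to check the parsing on local XML files.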
- Loading the required modules to run PyTorch
module load mkl/2024.0 nvidia-hpc-sdk/23.11-cuda11.8 openblas/0.3.27-gcc cudnn/9.0.0-cuda11 tensorrt/10.0.0-cuda11 impi/2021.11 hdf5/1.14.1-2-gcc gcc/11.4.0 python/3.11.5-gcc nccl/2.19.4 pytorch ncurses tmux
- Installation
Create the virtual environment and install the requirements listed in requirements.txt (the file is tailored for Marenostrum; to install the repository on a different computer, the Marenostrum-specific paths must first be removed from it).
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
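When installing outside Marenostrum, the machine-specific paths in requirements.txt could be cleaned with a small script along these lines. This is only a sketch: it assumes the Marenostrum entries take the `name @ file:///...` form that `pip freeze` writes for locally built packages, which is a guess about the file's contents, and the function name `clean_requirements` is hypothetical. Entries rewritten this way lose their pinned version, so the result should be reviewed by hand.

```python
import re

# Assumed shape of a machine-specific entry: "package @ file:///some/local/path"
LOCAL_REF = re.compile(r"^([A-Za-z0-9_.\-\[\],]+)\s*@\s*file://\S+")


def clean_requirements(lines):
    """Replace local-path requirements with the bare package name."""
    cleaned = []
    for line in lines:
        stripped = line.strip()
        match = LOCAL_REF.match(stripped)
        # Keep ordinary lines (pins, comments, blanks) unchanged.
        cleaned.append(match.group(1) if match else stripped)
    return cleaned


# Example usage (writes a new file instead of overwriting the original):
# with open("requirements.txt") as src:
#     portable = clean_requirements(src.readlines())
# with open("requirements-portable.txt", "w") as dst:
#     dst.write("\n".join(portable) + "\n")
```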
- Running
- Biohackathon webpage: https://biohackathon-europe.org/
- Program for the week: https://biohackathon-europe.org/programme.html
- Link to official repository: https://github.com/elixir-europe/biohackathon-projects-2022/tree/main/24
- Google slides: https://docs.google.com/presentation/d/1NY7OVvQRV7Xmer_kUMcxED5fQczaYodabIyovSOpm7Q/edit?usp=sharing
- Meeting notes: https://docs.google.com/document/d/1vELBqp-z_Nuc2lNQLFZisapFq_4SMF4Nq14sQGs1suI/edit?usp=sharing
- EuropePMC webpage: https://europepmc.org/
- EuropePMC API doc: https://europepmc.org/RestfulWebService
- EuropePMC Archive: https://europepmc.org/ftp/oa/pmcid.txt.gz
- Star method information: https://star-methods.com/
- Gdrive Project: https://drive.google.com/drive/folders/1IPJtm82BgrglztLarnqosfqLJ-HkFu5o?usp=sharing
- To share big data: https://www.eudat.eu/catalogue/b2drop