biohackathon-2022-project-24

Abstract

Last year, during the Biohackathon 2021, under the project "FAIRX: Quantitative bias assessment in ELIXIR biomedical data resources", we assessed the distribution of sex within two databases, EGA and dbGaP. The progress of that project is documented at https://github.com/elixir-europe/biohackathon-projects-2021/tree/main/projects/35, and an article is to be submitted for publication. Rather than analysing the available datasets, this time we concentrate on the scientific literature to uncover sex imbalance in published research. We will leverage the EuropePMC repository (https://europepmc.org/) and its API to access and mine the free-text content of the articles published there. The end result will be an automated text parser that extracts mentions of sex, if any, from the information reported in preclinical and clinical studies, and thus provides insight into the current state of sex imbalance in research publications. The project will combine several strategies to ensure access to the data. First, it will concentrate on the parts of the articles where this information should be located (Materials and Methods, as well as additional files). We will prioritise article types of interest, such as publications linked to recent clinical trials and preclinical studies. We will also review the status of the policies and guidelines on sex disclosure in scientific publications adopted by the different journals (such as Key Resources Tables, STAR Methods and SAGER guidelines). Recommendations for fairer reporting of sex in scientific publications will be drawn from the analysis of the results, which will be presented in a form suitable for future publication.

Topics

Data Platform
Federated Human Data
Interoperability Platform
Machine learning

Teams

Lead(s)

Olivier Philippe - BSC - olivier.philippe@bsc.es - Oliph
Blanca Calvo - BSC - bcalvo.bsc@gmail.com - BlancaCalvo

Collaborators

Process

Data retrieval

Getting a list of PMCIDs from a specific query sent directly to EuropePMC
Using that list of PMCIDs to request and download the XML resource for each article
Parsing the XML file to build the different fields
Transforming the XML file into a JSON file
Extracting the different fields
Extracting the relevant data
Feeding that data into the analysis
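The steps above can be sketched roughly as follows. This is a minimal illustration and not the code in the pipeline folder: the function names, the sample article and the "method" keyword filter are hypothetical, and the URL patterns and query parameters (resultType, cursorMark, pageSize) should be checked against the EuropePMC API doc linked below. The sketch only builds the request URLs and parses an inline XML sample, so it runs without network access.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL of the EuropePMC RESTful web service (assumption: verify in the API doc).
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def build_search_url(query, page_size=100, cursor_mark="*"):
    """Build a search URL asking only for the identifiers matching `query`."""
    params = {
        "query": query,
        "resultType": "idlist",
        "format": "json",
        "pageSize": page_size,
        "cursorMark": cursor_mark,
    }
    return f"{BASE_URL}/search?{urlencode(params)}"


def build_fulltext_url(pmcid):
    """Build the URL of the full-text XML resource for one PMCID."""
    return f"{BASE_URL}/{pmcid}/fullTextXML"


def article_xml_to_dict(xml_text):
    """Flatten a JATS-style article into a JSON-serialisable dictionary."""
    root = ET.fromstring(xml_text)
    title = root.find(".//article-title")
    sections = {}
    for sec in root.iterfind(".//body/sec"):
        heading = sec.findtext("title", default="untitled").strip()
        # Join the text of every paragraph, collapsing runs of whitespace.
        paragraphs = [" ".join("".join(p.itertext()).split()) for p in sec.iter("p")]
        sections[heading] = " ".join(paragraphs)
    return {
        "title": "".join(title.itertext()).strip() if title is not None else "",
        "sections": sections,
    }


def extract_methods(article):
    """Keep only the sections whose heading mentions 'method' (hypothetical filter)."""
    return {
        name: text
        for name, text in article["sections"].items()
        if "method" in name.lower()
    }


# Hypothetical article, used here instead of a real download.
SAMPLE_XML = """<article>
  <front><article-meta><title-group>
    <article-title>A hypothetical study</article-title>
  </title-group></article-meta></front>
  <body>
    <sec><title>Introduction</title><p>Background text.</p></sec>
    <sec><title>Materials and Methods</title>
      <p>We used 12 <italic>female</italic> and 10 male mice.</p>
    </sec>
  </body>
</article>"""

article = article_xml_to_dict(SAMPLE_XML)
print(json.dumps(extract_methods(article), indent=2))
```

In a real run, the URLs returned by build_search_url and build_fulltext_url would be fetched with an HTTP client, and the extracted sections passed on to the analysis step.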

Deployment of LLM models on Marenostrum

Loading the required modules to run pytorch

module load mkl/2024.0 nvidia-hpc-sdk/23.11-cuda11.8 openblas/0.3.27-gcc cudnn/9.0.0-cuda11 tensorrt/10.0.0-cuda11 impi/2021.11 hdf5/1.14.1-2-gcc gcc/11.4.0 python/3.11.5-gcc nccl/2.19.4 pytorch ncurses tmux

Installation

Create the virtual environment and install the requirements listed in requirements.txt. The file is tailored for Marenostrum; to install the repository on a different machine, the Marenostrum-specific paths must first be removed from it.

python -m venv venv 
source venv/bin/activate
pip install -r requirements.txt
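Once the modules are loaded and the requirements installed, it can help to confirm that the active interpreter sees PyTorch and a GPU before submitting a job. The script below is a small sketch for that purpose and is not part of the repository; the check_environment name is hypothetical. It only reports what it finds and still runs if PyTorch is not importable.

```python
import importlib.util
import sys


def check_environment():
    """Report the Python version and, if PyTorch is importable, its CUDA status."""
    report = {
        "python": ".".join(map(str, sys.version_info[:3])),
        # True when running inside a virtual environment such as venv/.
        "in_venv": sys.prefix != sys.base_prefix,
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
        report["gpu_count"] = torch.cuda.device_count()
    return report


for key, value in check_environment().items():
    print(f"{key}: {value}")
```

On a login node without GPUs, cuda_available may be reported as False even when the installation is correct, so the check is more informative when run on a compute node.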

Running

Analysis

Useful links

Biohackathon

Biohackathon webpage: https://biohackathon-europe.org/
Program for the week: https://biohackathon-europe.org/programme.html
Link to official repository: https://github.com/elixir-europe/biohackathon-projects-2022/tree/main/24
Google slides: https://docs.google.com/presentation/d/1NY7OVvQRV7Xmer_kUMcxED5fQczaYodabIyovSOpm7Q/edit?usp=sharing
Meeting notes: https://docs.google.com/document/d/1vELBqp-z_Nuc2lNQLFZisapFq_4SMF4Nq14sQGs1suI/edit?usp=sharing

EuropePMC

EuropePMC webpage: https://europepmc.org/
EuropePMC API doc: https://europepmc.org/RestfulWebService
EuropePMC Archive: https://europepmc.org/ftp/oa/pmcid.txt.gz
STAR Methods information: https://star-methods.com/
Google Drive project folder: https://drive.google.com/drive/folders/1IPJtm82BgrglztLarnqosfqLJ-HkFu5o?usp=sharing
To share big data: https://www.eudat.eu/catalogue/b2drop
