This repository contains the reproducibility study conducted for our submitted publication "Bridging the Reproducibility Divide: Open Source Software's Role in Standardizing Healthcare AI4H". We provide a step-by-step guide for those who wish to reproduce our data collection and analysis procedures as outlined in the figure below.
The complete dataset used in our analysis can be found in `data/processed_final.csv`.
Our figures and numerical analyses are implemented in the following Jupyter notebooks:
- `plot_general.ipynb`: contains all code for generating Figure 1
- `plot_code.ipynb`: contains all code for generating Figure 2
- `plot_data.ipynb`: contains all code for generating Figure 3
We suggest creating a fresh environment for this project. Please follow these steps to set up the environment and install the prerequisite packages:
- Create a new conda environment named "reproAI4H": `conda create -n reproAI4H python=3.10`
- Activate the environment: `conda activate reproAI4H`
- Install the required packages from requirements.txt: `pip install -r requirements.txt`
Note: Make sure you have Anaconda or Miniconda installed on your system before running these commands.
Note that you will need enough GPU memory to run open-source 70B LLMs for topic classification and for extracting conference paper titles and authors. We ran all of our experiments on a single A6000 GPU with 48GB of VRAM. Smaller 7B or 8B models can potentially be substituted, but their performance is typically poor.
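If you are unsure whether your hardware qualifies, the sketch below shows one way to load a 70B model in 4-bit precision so that it fits on a single 48GB GPU. The model name and quantization settings are illustrative, not necessarily our exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Example model; not necessarily the exact one we used.
model_name = "meta-llama/Llama-2-70b-chat-hf"

# 4-bit quantization keeps a 70B model within ~48GB of VRAM.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # place layers automatically on the available GPU
)
```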
Note that getting access to each service's API may not be trivial. For the Semantic Scholar API, you will have to apply for access here. For Medline's API, you will need a PubMed API key, available here. Finally, SerpAPI can be accessed simply by signing up on their website here.
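Wherever you store the keys, they need to be visible to the scripts at runtime. One common pattern is environment variables, as in the illustrative sketch below; the variable names are hypothetical, so check each script for what it actually reads.

```python
import os

# Hypothetical variable names; check each script for the name it expects.
S2_KEY = os.environ.get("SEMANTIC_SCHOLAR_API_KEY")
PUBMED_KEY = os.environ.get("PUBMED_API_KEY")
SERPAPI_KEY = os.environ.get("SERPAPI_API_KEY")
```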
One can retrace our steps for scraping conference papers by running `python3 conf.py`. However, please note that you will have to manually download two years of the CHIL papers, as they are hosted on ACM's website and could not be scraped.
We share our web-scraping code in `src/conf_proc/scrape_conf.py` and our LLM title and author extraction code in `src/conf_proc/clean_conf.py`.
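To give a sense of the extraction step, here is a simplified sketch; the prompt wording and JSON format are illustrative rather than the exact prompt in `clean_conf.py`.

```python
import json

def extract_title_authors(raw_entry: str, generate) -> dict:
    """Simplified sketch of LLM-based title/author extraction.

    `generate` stands in for any callable that sends a prompt to the
    LLM and returns its text completion.
    """
    prompt = (
        "Extract the paper title and author list from the text below.\n"
        'Respond with JSON of the form {"title": "...", "authors": ["..."]}.\n\n'
        + raw_entry
    )
    return json.loads(generate(prompt))
```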
All code for measuring code sharing and public dataset usage is in `src/conf_proc/measure_conf.py`.
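Conceptually, the code-sharing check reduces to scanning each paper's text for links to code-hosting sites. The regex below is an illustrative sketch, not the exact logic in `measure_conf.py`.

```python
import re

# Flag a paper as sharing code if its text links to a code-hosting site.
CODE_HOST_PATTERN = re.compile(
    r"https?://(?:www\.)?(?:github\.com|gitlab\.com|bitbucket\.org)/\S+",
    re.IGNORECASE,
)

def shares_code(paper_text: str) -> bool:
    return CODE_HOST_PATTERN.search(paper_text) is not None
```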
Note that we primarily use Semantic Scholar and SerpAPI to retrieve conference paper statistics, as PubMed does not actively index conference papers. The Semantic Scholar querying code is in `src/citation/semantic_scholar.py`.
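For orientation, a title search against the Semantic Scholar Graph API looks roughly like the sketch below. The endpoint and fields follow Semantic Scholar's public documentation, but the exact call in `semantic_scholar.py` may differ, and the environment variable name is hypothetical.

```python
import os
import requests

def search_paper(title: str) -> dict:
    """Sketch of a Semantic Scholar Graph API title search."""
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": title, "fields": "title,authors,year,externalIds"},
        # Hypothetical environment variable name for the API key.
        headers={"x-api-key": os.environ["SEMANTIC_SCHOLAR_API_KEY"]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```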
SerpAPI was queried with a free account through the Jupyter notebook `serpapi_conference_papers.ipynb`, which uses their provided API to look up all papers missed by the initial Semantic Scholar check. Please make sure you run this notebook after the step above.
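A lookup through SerpAPI's Python client (`pip install google-search-results`) looks roughly like this sketch; the query construction is illustrative, and the notebook may structure it differently.

```python
from serpapi import GoogleSearch

def scholar_lookup(title: str, api_key: str) -> dict:
    """Sketch: look up a paper on Google Scholar via SerpAPI."""
    search = GoogleSearch({
        "engine": "google_scholar",
        "q": title,
        "api_key": api_key,
    })
    return search.get_dict()
```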
One can retrace our steps for scraping PubMed papers by running `python3 pmc.py`.
All code for querying PubMed's AI4H papers is in `src/pubmed/query_pmid.py`.
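PubMed queries go through NCBI's E-utilities; the sketch below shows the shape of such a query. The search term and environment variable name are placeholders, not our actual AI4H query.

```python
import os
import requests

def search_pubmed(term: str, retmax: int = 100) -> list[str]:
    """Sketch of a PubMed search via NCBI E-utilities; returns PMIDs."""
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={
            "db": "pubmed",
            "term": term,      # placeholder; our real query is in query_pmid.py
            "retmax": retmax,
            "retmode": "json",
            "api_key": os.environ["PUBMED_API_KEY"],  # hypothetical variable name
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]
```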
We query all MEDLINE affiliations using the code defined in `src/pubmed/medline.py`.
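As a sketch of what this step involves, the example below fetches MEDLINE-format records via E-utilities and reads the affiliation ("AD") field, assuming Biopython's `Bio.Medline` parser; `medline.py` may fetch and parse differently.

```python
import io
import requests
from Bio import Medline  # pip install biopython

def fetch_affiliations(pmids: list[str]) -> dict:
    """Sketch: fetch MEDLINE records for PMIDs and read affiliations."""
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
        params={
            "db": "pubmed",
            "id": ",".join(pmids),
            "rettype": "medline",
            "retmode": "text",
        },
        timeout=60,
    )
    resp.raise_for_status()
    records = Medline.parse(io.StringIO(resp.text))
    # "AD" is the MEDLINE affiliation field.
    return {rec.get("PMID"): rec.get("AD") for rec in records}
```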
All processing and analysis code is defined in `src/pubmed/pmc_scrape.py`.
Our combined analysis, including topic classification and cross-checking against Papers with Code, can be run with `python3 combine_classify.py`.
We cross-check public dataset usage against Papers with Code using the code in `src/citation/papers_with_code.py`.
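For reference, Papers with Code exposes a public REST API; the sketch below shows one plausible title lookup. The `title` filter follows the public API docs, but the exact query in `papers_with_code.py` may differ.

```python
import requests

PWC_API = "https://paperswithcode.com/api/v1"

def find_pwc_paper(title: str) -> dict | None:
    """Sketch: look a paper up on Papers with Code by title."""
    resp = requests.get(f"{PWC_API}/papers/", params={"title": title}, timeout=30)
    resp.raise_for_status()
    results = resp.json().get("results", [])
    return results[0] if results else None
```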
We showcase our topic classification code in `src/topic/classification.py`.
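At its core, the classifier asks the LLM to pick one label from a fixed topic list; the sketch below is simplified, and the labels and prompt wording are illustrative.

```python
# Example labels; the real topic set is defined in classification.py.
TOPICS = ["medical imaging", "clinical NLP", "EHR", "genomics", "other"]

def classify_topic(title: str, abstract: str, generate) -> str:
    """Simplified sketch of LLM topic classification.

    `generate` stands in for any callable that prompts the LLM
    and returns its text completion.
    """
    prompt = (
        f"Classify this paper into exactly one topic from {TOPICS}.\n"
        f"Title: {title}\nAbstract: {abstract}\nTopic:"
    )
    answer = generate(prompt).strip()
    return answer if answer in TOPICS else "other"
```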
We also include manual evaluation results in `data/validation.csv`, which can be used to compute the final validation results in the Appendix table via `validation.ipynb`.
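The computation is essentially an agreement check between the manual and model labels; a minimal sketch is below (the column names are hypothetical; see the notebook for the real ones).

```python
import pandas as pd

df = pd.read_csv("data/validation.csv")
# Hypothetical column names; see validation.ipynb for the real ones.
accuracy = (df["manual_label"] == df["model_label"]).mean()
print(f"Validation accuracy: {accuracy:.2%}")
```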
Please open a GitHub issue and/or email johnwu3@illinois.edu if something doesn't run properly. This code has not been fully debugged since being cleaned up from the original scripts used to produce the drafted analysis.