This README accompanies the paper “Privacy-preserving patient clustering for personalized federated learning”. The main contribution of the paper is to provide a method to calculate individual-level patient similarity scores without leaking patient information. The patient similarity scores can be used to cluster patients into clinically meaningful groups for downstream analysis. The README is a step-by-step guide to replicate the main findings of the paper.
The figure above highlights the main steps of PCBFL:
- A federated autoencoder is trained to embed patient clinical records into 30-dimensional vectors.
- Patient similarity is estimated with Secure Multi-party Computation (SMPC), using the protocol of Du et al. (2004).
- Patients are clustered into groups by applying spectral clustering to the resulting similarity matrix.
- A separate prediction model is trained on each cluster.
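
To make the SMPC step above more concrete, here is a minimal sketch of a commodity-server style secure scalar product, in the spirit of the Du et al. (2004) protocol. It is an illustration under stated assumptions, not the repository's implementation: the function names, the modulus, and the fixed-point scale are all hypothetical, and both sites are simulated in a single process. Each site ends up with one additive share of the inner product of two embeddings, so neither site sees the other's raw vector.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus; all arithmetic is done modulo this value
SCALE = 1000         # assumed fixed-point scale for encoding floats as integers


def dot(u, v):
    """Inner product modulo MODULUS."""
    return sum(a * b for a, b in zip(u, v)) % MODULUS


def random_vector(n):
    """Vector of n uniformly random elements in [0, MODULUS)."""
    return [secrets.randbelow(MODULUS) for _ in range(n)]


def commodity_server(n):
    """Helper party that deals correlated randomness and never sees patient data."""
    r_vec_a, r_vec_b = random_vector(n), random_vector(n)
    r_a = secrets.randbelow(MODULUS)
    r_b = (dot(r_vec_a, r_vec_b) - r_a) % MODULUS  # r_a + r_b == r_vec_a . r_vec_b
    return (r_vec_a, r_a), (r_vec_b, r_b)


def secure_dot_shares(x_a, x_b):
    """Return additive shares (v_a, v_b) with v_a + v_b == x_a . x_b (mod MODULUS)."""
    (r_vec_a, r_a), (r_vec_b, r_b) = commodity_server(len(x_a))
    # Site A sends its masked vector to site B, and vice versa.
    masked_a = [(a + r) % MODULUS for a, r in zip(x_a, r_vec_a)]
    masked_b = [(b + r) % MODULUS for b, r in zip(x_b, r_vec_b)]
    # Site B picks its own random share and sends u to site A.
    v_b = secrets.randbelow(MODULUS)
    u = (dot(masked_a, x_b) + r_b - v_b) % MODULUS
    # Site A removes the masks it knows about to obtain its share.
    v_a = (u - dot(r_vec_a, masked_b) + r_a) % MODULUS
    return v_a, v_b


def encode(vec):
    """Encode floats as fixed-point integers."""
    return [round(x * SCALE) for x in vec]


def decode_product(v_a, v_b):
    """Combine both shares and map the result back to a signed float."""
    total = (v_a + v_b) % MODULUS
    if total > MODULUS // 2:  # values above the midpoint represent negatives
        total -= MODULUS
    return total / SCALE**2


if __name__ == "__main__":
    shares = secure_dot_shares(encode([1.5, -2.0, 0.25]), encode([2.0, 0.5, 4.0]))
    print(decode_product(*shares))
```

Working modulo a large prime keeps every message uniformly random from the receiver's point of view; the result decodes correctly as long as the scaled inner product stays below half the modulus in absolute value.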
We compare the performance of our protocol (PCBFL) against:
- CBFL (Huang et al., 2019) (federated)
- Traditional FedAvg (federated)
- Single site training (not federated)
- Centralized training (not federated)
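
For readers unfamiliar with the FedAvg baseline, the aggregation step can be sketched as a sample-size-weighted average of client parameters. This is only an illustration: the function name `fedavg` and the use of flat lists of floats in place of model tensors are assumptions made here for brevity, not the repository's code.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighting each client by its sample count.

    client_weights: list of equal-length lists of floats (one list per client).
    client_sizes:   number of training samples held by each client.
    """
    if len(client_weights) != len(client_sizes):
        raise ValueError("need exactly one sample count per client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]
```

In a clustered setting such as the one described above, the same averaging would be applied separately within each patient cluster rather than once across all sites.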
All data can be downloaded from the publicly available eICU dataset (https://physionet.org/content/eicu-crd/2.0/, credentialed user access is required).
To replicate the findings, process the downloaded data with the code/fl_task_data_processing.ipynb notebook.
The scripts directory can be used to run the full pipeline for each protocol. Each protocol has its own script, which can be launched with bash or, if SLURM is available, with sbatch. Results for each run are saved automatically and are available for downstream analysis.
For example, to run PCBFL, simply run the command:
bash scripts/script_pcbfl
Note that the scripts must first be updated to point to the code directory on your machine.
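
If you prefer to launch runs from Python, the choice between bash and sbatch can be sketched as below. The helper names `build_command` and `run_protocol` are hypothetical and not part of the repository; the only path taken from this README is scripts/script_pcbfl.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(script, use_slurm=None):
    """Return the launch command, preferring sbatch when SLURM is on the PATH."""
    if use_slurm is None:
        use_slurm = shutil.which("sbatch") is not None
    launcher = "sbatch" if use_slurm else "bash"
    return [launcher, str(script)]


def run_protocol(script):
    """Run one protocol script and raise if it is missing or exits with an error."""
    script = Path(script)
    if not script.is_file():
        raise FileNotFoundError(f"script not found: {script}")
    return subprocess.run(build_command(script), check=True)
```

For example, `run_protocol("scripts/script_pcbfl")` submits the PCBFL script through sbatch on a SLURM cluster and falls back to bash elsewhere.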
All code was written in Python 3.9.7 and PyTorch 1.12.1.