git clone https://github.com/louisblankemeier/STARR-Labeler
cd STARR-Labeler
conda create --name starr_env python=3.9
conda active starr_env
pip install -e .
First, create a config file in starr_labeler/features/configs. You can use the existing config files as templates. Use the following to generate features:
cd starr_labeler/features/
python main.py --config-name <name_of_config_file>
To run with slurm, edit the command in starr_labeler/features/slurm.py. Then, run:
python slurm.py
The generated features can be used to train a classifier for a specified disease (label). The output file is called features.csv.
DATA_PATH: /clinical_data/
SAVE_DIR: /STARR-Labeler/starr_labeler/features/results
NUM_PATIENTS: 23547
PREDICTION_DATES: 'CT'
DAYS_AFTER_PREDICTION_DATES: 14
EHR_TYPES:
DEMOGRAPHICS:
USE: True
FILE_NAME: "demographics.csv"
USE_COLS: ['Patient Id', 'Gender', 'Race', 'Ethnicity', 'Date of Birth']
REGEX_TO_FEATURE_NAME: {'Gender': None, 'Race': None, 'Ethnicity': None, 'Date of Birth': None} # Date of Birth here gets converted to age.
TIME_BINS: 1
TIME_BIN_DURATION: 1 # in years
AGGREGATE_ACROSS_TIME: 'mean'
FILL_NA: 'median'
SAVE: True
LOAD: False # whether to load from
LABS:
USE: True
FILE_NAME: "labs.csv"
USE_COLS: ['Patient Id', 'Value', 'Taken Date', 'Result', 'Reference High', 'Reference Low', 'Units']
TYPE: 'Result'
Value: 'Value'
REGEX_TO_FEATURE_NAME: {'^HDL Cholesterol': 'HDL Cholesterol', 'Cholesterol, Total': 'Cholesterol, Total', 'Creatinine, Ser/Plas': None}
-
A patient can have multiple CT scans over the course of the requested time, but
NUM_PATIENTS
corresponds to the number of unique patients. -
If the EHR data was not logged into the system on the same date as the CT Scan,
DAYS_AFTER_PREDICTION_DATES
defines how many days after the CT exam the EHR data is still usable/acceptable. -
With
USE_COLS
, load only specified columns of the csv for improved speed. -
Time resolution (in years) can be increased by getting multiple entries from one subject according to the specified
TIME_BINS
. If the value is 1, there will be 1 feature vector extracted for the time window specified above. A larger number of bins corresponds to multiple feature vectors extracted from multiple time windows according to the number of bins and the specified bin duration. The number of bins is also shown in the variable name of the output CSV with '_1' for bin 1, '_2' for bin 2 etc. -
If multiple entries are present for one subject specify aggregation strategy with
AGGREGATE_ACROSS_TIME
. -
The keys in
REGEX_TO_FEATURE_NAME
define regex expressions and all values that match the regex are mapped to a single feature with a name given by the value in the dictionary. In the example shown above all variables with names starting fromHDL Cholesterol
will be matched.
First, create a config file in starr_labeler/labels/disease_configs. You can use the existing config files as templates. Use the following to generate outcome labels:
cd starr_labeler/labels/
python main.py --config-name <name_of_config_file>
To run with slurm, edit the command in starr_labeler/labels/slurm.py. Then, run:
python slurm.py
The generated labels can be used to train a classifier for a specified outcome (label) using the EHR features discussed above. The output file is called labels.csv.
DISEASE_NAME: abdominal_aortic_aneurysm
DAYS_AFTER: 1825
DAYS_BEFORE: 0
CONSIDER_ONLY_FIRST_DIAGNOSIS: True
DATA_PATH: /clinical_data/
SAVE_DIR: /STARR-Labeler/starr_labeler/results
NUM_PATIENTS: 23547
PREDICTION_DATES: 'CT'
EHR_TYPES:
DIAGNOSES:
FILE_NAME: "diagnoses.csv"
USE_COLS: ['Patient Id', 'Date', 'ICD10 Code', 'ICD9 Code']
ICD10:
Hierarchical: True
Codes:
- I71.0
- I71.3
- I71.1
- I71.4
- I71.2
ICD9:
Hierarchical: True
Codes:
- 441.0
- 441.3
- 441.1
- 441.4
-
DAYS_AFTER
defines the end of the time window in days after the CT Scan during which evidence of the disease results in a positive label. -
DAYS_BEFORE
defines the start of the time window in days before the CT Scan during which evidence of the disease results in a positive label. - Formally, we may represent relevant points in a patient's timeline / history as the following:
-
$t_0$ : first time point of record in a patient's history. -
$t_d$ : time point marking the onset of disease. -
$t_b$ : time point markingDAYS BEFORE
a scan. -
$t_s$ : time point marking the date of the scan. -
$t_a$ : time point marking theDAYS AFTER
a scan (e.g., window) -
$t_h$ : time point marking the last date of record in a patient's history.
-
Note that
$t_a$ -$t_s$ , which is the window in which we're identifying events specified for our experimental design, may be longer than$t_h$ -$t_0$ , which is the time period a patient has data in the EHR for. This is crucial!
-
Class 0 → Patient not diagnosed with the specified disease before or after the CT Scan within the requested time window or any time in the patient's history.
- IF a patient develops disease, then
$t_d > t_a$ .
- IF a patient develops disease, then
- Class 1 → Patient diagnosed with specified disease during the time window between days before and days after.
- Commontly, only patients from Class 0 and 1 are included in the studies as controls/positive classes.
$t_b ≤ t_d ≤ t_a$
-
Class 2 → Patient diagnosed with specified disease earlier than the 'days before' window (e.g. patient is already "diseased" entering the cohort).
$t_d < t_b$
-
Class 3 → Patient has not been monitored long enough to definitively rule-out disease.
$t_h < t_a$
- In the
ICD10
andICD9
, the codes associated with the specified disease need to be defined.
Please cite this work if you use it for your research.