This repository provides
- a set of scripts to perform common operations and analyses on the Human Connectome Project's neuroimaging data with a focus on task-based fMRI
- a Python-based framework for connectome-based predictive modeling (CPM), parallelized with Ray, which has become the main focus of this project
Other than cloning this repository, you need to have Bash installed (which is most likely the case if you use Linux, *BSD, or even macOS). For the Python code, the arguably easiest and cleanest way is to set up a Python virtual environment and install the dependencies there:
$ python3 -m venv hcp-suite # Set up the virtual environment
$ source ./hcp-suite/bin/activate # Activate the virtual environment
$ pip install pandas pingouin nilearn nibabel ray # Install Python dependencies within the virtual environment
$ pip install ipython jupyterlab # Install interactive Python shells to run hcp-suite code in
The following tutorial uses the gambling task as an example; the variable to be predicted is BMI.
Python code in this repository is written to be used in an interactive Python shell. Either Jupyter Lab or Jupyter Notebook is recommended as plots and images are conveniently displayed inline, but any Python shell (e.g. IPython) should work.
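Assuming Jupyter Lab was installed into the virtual environment as shown above, it can be started from there:
$ source ./hcp-suite/bin/activate # Activate the virtual environment if it is not active already
$ jupyter lab # Start Jupyter Lab; it will open in your browser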
Generally speaking, the procedure is as follows:
- Creation of a list of subject IDs to be included in the analysis (see the sketch after this list)
- Parcellation of CIFTI files with task.sh prepare (in its current state, the code assumes the HCP's folder structure)
- Extraction of fMRI time series based on EV files provided by the HCP with get_ev_timeseries() from hcpsuite.py
- Computing of correlation matrices via compute_correlations() from hcpsuite.py
- Performing of CPM; the functions are provided by cpm_ray.py
- Establishing of statistical significance by way of permutation testing (also cpm_ray.py)
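For the first step, the list of subject IDs is simply a text file with one ID per line (as suggested by the use of split -l below). A minimal sketch for creating it, assuming the IDs are taken from the "Subject" column of the unrestricted HCP CSV and using the file name gambling_subjects.ids from the examples below:
import pandas as pd

subjects = pd.read_csv("/path/to/unrestricted.csv")["Subject"] # Read the subject IDs
subjects.to_csv("./gambling_subjects.ids", index=False, header=False) # Write one ID per line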
The following line is an example of preparing the gambling task (to be run in the system shell, e.g. on Linux):
$ ./task.sh prepare -s /path/to/gambling_subjects.ids -t GAMBLING -p RenTianGlasser -e
Calling task.sh without arguments will provide an explanation of parameters (the output is also provided at the end of this README).
To accelerate the procedure, ID files can be split and task.sh can be run in parallel:
$ split -l 270 /path/to/gambling_subjects.ids /tmp/gambling_ # Creates /tmp/gambling_aa, /tmp/gambling_ab etc. with 270 lines each
$ for id in /tmp/gambling_a?; do test -e $id.lock && continue; touch $id.lock; /path/to/hcp-suite/bin/task.sh prepare -s $id -t GAMBLING -p RenTianGlasser -e; done # Runs one task.sh process for each /tmp/gambling_a? file; the .lock files allow running this loop in several shells at once
Run like this, task.sh will output a list of subject IDs for which preparation failed.
The following is run in an interactive Python shell (e.g. the "Console" of Jupyter Lab):
# We need to append Python's path variable as hcp-suite is not part of the standard path
import sys
sys.path.append("/path/to/hcp-suite/lib")
# ... so we can import the functions from hcpsuite.py
from hcpsuite import *
# First we will load the subject IDs into a Python list structure
ids = load_list_from_file("./path/to/gambling_subjects.ids")
# get_ev_timeseries() reads our parcellated CIFTI files and extracts information from the
# HCP-provided EV files (in this example, win.txt for the win condition of the gambling task)
# regarding the specific sections of the task time series we need for that condition
ev_data_dict = get_ev_timeseries(ids, ["win.txt"], task=GAMBLING, runs=('LR', 'RL'), parcellation=RenTianGlasser, data_dir='/home/HCP/S1200/', prewhiten=True) # Be sure to modify data_dir to match your HCP imaging data directory
# Now we save the extracted time series as text files in a directory of our choice
# (in this case: ./GAMBLING_win)
save_data_dict(ev_data_dict, path="./GAMBLING_win")
We continue in our Python shell to compute correlation matrices:
# We load the saved time series from the previous step
cd GAMBLING_win # save_data_dict() writes file names into file "ts_files" but without paths,
# thus the easiest way is to change into the directory containing the files
time_series, time_series_files = load_time_series("./ts_files") # ... and read them from there
correlation_measure, correlation_matrices = compute_correlations(time_series, kind='tangent')
# Tangent in our experience provides the best results, but there are alternatives:
# https://nilearn.github.io/dev/modules/generated/nilearn.connectome.ConnectivityMeasure.html
# We then save the matrices into a single file in the NIfTI format for downstream processing
save_matrix(cifti_dim_to_nifti_dim(correlation_matrices), "./GAMBLING_win-tangent.nii.gz")
For actual CPM, we need to install and run Ray (run this e.g. in your Python virtual environment as described in Installation):
$ pip install ray
$ ray start --head --num-cpus=16 # Run this on your main node.
# Processes take up a lot of RAM, be careful not to use too many CPUs
Optionally add more Ray nodes to form a cluster, see the Ray documentation for details.
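A worker node could, for example, be joined to the cluster along these lines (a sketch; replace the address with that of your head node and consult the Ray documentation for the correct port and further options):
$ ray start --address=<head-node-ip>:6379 # Run this on each additional node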
For our analysis, we need values from both the unrestricted and restricted HCP data, which are available as separate CSV files. For easier handling, we merge them into a single CSV file:
import pandas as pd
unrestricted = pd.read_csv("/path/to/unrestricted.csv")
restricted = pd.read_csv("/path/to/restricted.csv")
merged = pd.merge(restricted, unrestricted, on="Subject")
merged.to_csv("./merged.csv") # Save the merged DataFrame as a CSV file in the current directory
import sys
sys.path.append("/path/to/hcp-suite/lib")
from cpm_ray import *
ids = load_list_from_file("./path/to/gambling_subjects.ids") # Load list of subject IDs
behav = 'BMI' # Which variable do we want to predict?
covars = ['Age_in_Yrs', 'Gender', 'Race'] # What variables do we want to correct for?
behav_data = get_behav_data("./path/to/merged.csv", ids) # Loading of behavioral and biometrical data as a Pandas DataFrame from a CSV file
# We need to binarize gender and race (here: White vs. non-White) for our analysis
behav_data["Gender"] = behav_data["Gender"].replace({"F": 0, "M": 1})
behav_data["Race"] = behav_data["Race"].replace({
    "White": 0,
    "Black or African Am.": 1,
    "Asian/Nat. Hawaiian/Othr Pacific Is.": 1,
    "Am. Indian/Alaskan Nat.": 1,
    "Unknown or Not Reported": 1,
    "More than one": 1,
})
functional_data = convert_matrices_to_dataframe(get_nimg_data("./path/to/GAMBLING_win-tangent.nii.gz"), ids) # Loading of correlation matrices as Pandas DataFrame
RayHandler is a Python class through which data management and the starting and coordination of Ray Actors (i.e. the processes working in parallel) are handled:
n_folds = 128 # In this example, we use 128 folds, which is a good starting point
n_perm = 1000 # How many permutations are we planning to do later on?
ray_handler = RayHandler(
functional_data.copy(),
behav_data.copy(),
behav,
covars,
address="auto", # We assume that the main Ray process runs on the same host
n_perm=n_perm,
)
ray_handler.add_kfold_indices(n_folds, clean=True) # By setting "clean" to True, we remove twins from the fold so they don't predict each other
ray_handler.upload_data() # Functional and behavioral data are uploaded into the shared storage to which Ray Actors have access
First we define the jobs for the actors; a job is a Python list object consisting of the following items: "job type", "fold number", "permutation number". The permutation number for the actual, i.e. unpermuted, data is "-1".
job_list = [["fselection_and_prediction", fold, perm] for perm in [-1] for fold in range(n_folds)] # This is the job list without permutations
# If we wanted to also compute the permutations (which takes a very long time), the job list can be created as follows:
#job_list = [["fselection_and_prediction", fold, perm] for perm in [-1, *range(n_perm)] for fold in range(n_folds)]
ray_handler.start_actors(job_list) # Start computing
n_done, n_remaining = ray_handler.status() # Prints a status report
results = ray_handler.get_results(n=100) # Retrieving a huge number of results (e.g. when
# performing permutation analysis), especially from distributed Ray actors, can take a long
# time. Specifying the number of results to be retrieved from the results store at once
# (e.g. n=100) allows for a display of progress
# Optional: Save results into a file (NumPy format)
np.save("./GAMBLING_win-results.npy", results)
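# A sketch for loading the saved results back in a later session (assuming, as the indexing
# below suggests, that results is a Python dict, which np.save() stores via pickling):
#results = np.load("./GAMBLING_win-results.npy", allow_pickle=True).item()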
perm = -1 # -1 refers to the unpermuted data (see above)
g = plot_predictions(results['prediction_results'][perm], tail="glm", color="green")
For more presentation and plotting examples, see save.py.
Usage: task.sh <action> [action-specific arguments] [common options]
Actions:
prepare
-s <arg>: specify file with subject IDs or a space-separated list of subjects
-t <arg>: specify a task
-p <arg>: specify a parcellation
You might want to consider using -e to prevent exiting when one subject fails.
design
-s <arg>: specify file with subject IDs or a space-separated list of subjects
-o <arg>: specify an output file (e.g. design.mat)
-V <arg>: specify variables (e.g. "-V Age_in_Yrs -V Gender -V BMI"; repeatable)
-f <arg>: specify CSV file containing the specified variables (repeatable; e.g.
"-f restricted.csv -f unrestricted.csv")
eb
-s <arg>: specify file with subject IDs or a space-separated list of subjects
-o <arg>: specify an output file (e.g. eb.csv)
-f <arg>: specify CSV file containing the specified variables (repeatable; e.g.
"-f restricted.csv -f unrestricted.csv")
analyse
-s <arg>: specify file with subject IDs or a space-separated list of subjects
-t <arg>: specify a task
-c <arg>: specify a COPE number
-p <arg>: specify a parcellation
-d <arg>: specify target directory
[-n <arg>]: number of permutations (default: 5000)
[-o <arg>]: additional PALM arguments
Specify design, contrast and EB via additional PALM arguments or place them in
<target directory>/../ as design_<task>.mat, design.con and eb_<task>.csv
Common options:
[-D <arg>]: specify the directory containing HCP subject data (overrides HCP variable in ../lib/common.sh)
[-l <log file>]: specify log file instead of default /tmp/${0}-${RANDOM}.log
[-e]: turn off error handling (do not exit on error)
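As an example, design and EB files for the gambling task could be generated and a PALM analysis started along the following lines (a sketch using only the options documented above; paths, variable choices and the COPE number are merely illustrative):
$ ./task.sh design -s /path/to/gambling_subjects.ids -o design_GAMBLING.mat -V Age_in_Yrs -V Gender -V BMI -f /path/to/restricted.csv -f /path/to/unrestricted.csv
$ ./task.sh eb -s /path/to/gambling_subjects.ids -o eb_GAMBLING.csv -f /path/to/restricted.csv -f /path/to/unrestricted.csv
$ ./task.sh analyse -s /path/to/gambling_subjects.ids -t GAMBLING -c 1 -p RenTianGlasser -d /path/to/target_dir
Note that a contrast file (design.con) still has to be supplied as described under analyse above.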