HAPNEST enables you to:
- Efficiently generate large-scale, diverse and realistic datasets for genotypes and phenotypes
- Easily analyse data quality with an extensive workflow for evaluating synthetic data reliability and generalisability
- Examine the posterior distributions of model parameters to aid with model selection
We have used HAPNEST to simulate genetic data and corresponding phenotypes for 1 million individuals across 6 superpopulations and over 6.8 million variants in less than 12 hours (32 GB RAM, 8 threads per machine). We have made this synthetic dataset (and a smaller example dataset) publicly available for download at https://www.ebi.ac.uk/biostudies/studies/S-BSST936.
HAPNEST has a modular architecture, in which each software module can be used independently with a single command line call. This means, for example, that you can use the evaluation workflow to assess the quality of any synthetic dataset, not just those generated by HAPNEST.
This software tool was developed by members of INTERVENE (INTERnational consortium of integratiVE geNomics prEdiction). If you would like to cite this software, please cite our preprint https://www.biorxiv.org/content/10.1101/2022.12.22.521552v1.
- Quickstart tutorial
- Customising your synthetic datasets
- Understanding your synthetic datasets
- Large scale synthetic data generation
Steps 1-4 only need to be run once, the first time you install HAPNEST. Step 5 describes how to generate and then evaluate a synthetic dataset. If you encounter any errors when running these steps, please see the more detailed documentation below.
- Download the Singularity container by running the following command:
singularity pull docker://sophiewharrie/intervene-synthetic-data
- Set up the directory structure illustrated below, with the container you just downloaded and a copy of the `config.yaml` file from this repository:
.
├── data
| └── config.yaml
└── containers
└── intervene-synthetic-data_latest.sif
- Initialise the software dependencies from the root directory you set up in the previous step:
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif init
- Fetch the reference dataset:
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif fetch
- Generate synthetic genotypes and phenotypes, and (optionally) evaluate the data quality:
Synthetic data generation:
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_geno 1 data/config.yaml
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_pheno data/config.yaml
Evaluation (optional - please note this can take a while to execute for larger synthetic datasets):
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif validate data/config.yaml
This quickstart tutorial shows the simplest approach to generating and evaluating synthetic datasets. The example generates a small dataset of 1000 samples for 6 ancestry groups and HapMap3 variants on chromosome 1. Later we discuss how to customise the synthetic datasets.
- Download the HAPNEST container.
For ease of portability and reproducibility, we've made this software available as Docker and Singularity containers. Containerisation streamlines software dependency management by creating a standardised environment, to make it easier for you to get started with generating synthetic datasets.
The HAPNEST container is available at https://hub.docker.com/r/sophiewharrie/intervene-synthetic-data. The instructions in this documentation use the Singularity container, but can be adapted to work directly with the Docker container, as sketched below.
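For instance, a rough Docker equivalent of the Singularity `init` call used later in this tutorial might look like the following (a sketch only; the bind mount and working directory may need adjusting for your setup):

```bash
# hedged Docker equivalent of the Singularity `init` command below;
# mounts the local data/ directory at /data/ inside the container
docker run --rm -v "$(pwd)/data":/data sophiewharrie/intervene-synthetic-data init
```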
Download the Singularity container by running the following command:
singularity pull docker://sophiewharrie/intervene-synthetic-data
Alternatively, you can run this software without a container by manually installing the software dependencies. If you prefer this approach, you can view the list of required dependencies in the Dockerfile of this repository.
- Choose a location where you want to generate synthetic data, and create a `data` directory containing a copy of the `config.yaml` file downloaded from this repository and a `containers` directory containing the downloaded container file.
.
├── data
| └── config.yaml
└── containers
└── intervene-synthetic-data_latest.sif
If your container version has a different name, you should update this in any subsequent commands. The `config.yaml` file contains the parameter values used for synthetic data generation. In this tutorial we will use the default configuration.
- From the root directory you set up in the previous step, run the `init` command to complete the setup of software dependencies:
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif init
This command binds the `data` directory you created in the previous step, which HAPNEST uses to access input reference files and store output data files.
By default, the above command will store Julia package files in the `~/.julia` directory. If you experience an error related to updating the registry at `~/.julia/...`, you may need to give an explicit depot path by adding `--env JULIA_DEPOT_PATH={path-to-your-preferred-location}` to the previous command and all subsequent commands. For example:
singularity exec --bind data/:/data/ --env JULIA_DEPOT_PATH=/data/.julia containers/intervene-synthetic-data_latest.sif init
- The first time you use the container, you need to fetch the reference dataset using the `fetch` command:
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif fetch
This will download the reference files used as input to the synthetic data generation program into the `data/inputs/processed` directory. See preprocessing/README.md if you would like to know more about how this data was created and for information about creating your own reference datasets.
- Now generate a synthetic dataset using the `generate_geno` and `generate_pheno` commands:
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_geno 1 data/config.yaml
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_pheno data/config.yaml
The number given after `generate_geno` in the first command is the number of computing threads the software will use; we recommend increasing it for faster synthetic data generation, as in the example below. Running the above commands should generate a small synthetic dataset in the `data/outputs/test` directory.
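For example, to use 8 threads instead of 1 (assuming the same directory layout as above):

```bash
# identical to the quickstart command, but with 8 computing threads
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_geno 8 data/config.yaml
```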
Now it would be useful to evaluate the quality of this data.
- Evaluate synthetic data quality using the `validate` command (optional; please note this can take a while to execute for larger synthetic datasets):
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif validate data/config.yaml
This will store visualisations in the `data/outputs/test/evaluation` directory and print results for quantitative metrics. See our manuscript for details about the evaluation workflow.
Now that you understand the basics, you can read about how to customise your synthetic datasets. You may also be interested in how to generate very large datasets.
You can customise the specifications for synthetic data generation by modifying the `config.yaml` file.
The global parameters are used by all workflows:
Parameter name | Possible values | Description |
---|---|---|
`random_seed` | Integer value, e.g. 123 | A random seed used for reproducibility. |
`chromosome` | `all` or an integer value between 1 and 22 | Set to a numerical value between 1 and 22 if running the pipeline for a specific chromosome, or set to `all` for all 22 chromosomes. The `all` option runs synthetic data generation for each chromosome sequentially, so we recommend the approach described in the large scale synthetic data generation section for efficiently generating multi-chromosome datasets. |
`superpopulation` | `none` or one of `AFR` (African), `AMR` (Admixed American), `EAS` (East Asian), `EUR` (European), `CSA` (Central or South Asian), `MID` (Middle Eastern) | A superpopulation for (all) synthetic samples. This parameter is ignored if using a custom population structure. |
`memory` | Integer value | Amount of memory available (in MB) for memory-intensive commands. |
`batchsize` | Integer value | The number of synthetic samples to write per batch, used during synthetic genotype generation. |
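As an illustration, a `global_parameters` section combining these options might look like the following (a sketch; the values shown are examples, and the exact layout should be checked against the `config.yaml` shipped with the repository):

```yaml
global_parameters:
  random_seed: 123        # any integer, for reproducibility
  chromosome: 1           # 1-22, or "all" for every chromosome sequentially
  superpopulation: EUR    # or none, AFR, AMR, EAS, CSA, MID
  memory: 4096            # MB available for memory-intensive commands (example value)
  batchsize: 1000         # synthetic samples written per batch (example value)
```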
The most important filepaths to check before generating synthetic datasets are the general filepaths. Most other filepaths can be left at their default values.
The general filepaths are used by all workflows:
Parameter name | Possible values | Description |
---|---|---|
`output_dir` | String | Directory for synthetic data generation outputs, e.g. `data/outputs/test` |
`output_prefix` | String | Prefix for synthetic data generation outputs, using the `{chromosome}` wildcard, e.g. `test_chr-{chromosome}`. Please note that the program expects input files with the naming convention `{prefix}-{chromosome}` |
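For example, a minimal sketch of the general filepaths (the surrounding section name is assumed to follow the repository's `config.yaml`; the values are illustrative):

```yaml
filepaths:
  output_dir: data/outputs/test           # where outputs are written
  output_prefix: test_chr-{chromosome}    # {chromosome} is expanded per chromosome
```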
Filepaths for genotype data generation. You do not need to change these if you are using the reference data downloaded by the `fetch` command.
Parameter name | Possible values | Description |
---|---|---|
`vcf_input_raw` | String | VCF files for the (real) reference dataset, before pre-processing |
`vcf_input_processed` | String | VCF files for the (real) reference dataset, created by pre-processing |
`vcf_metadata` | String | Text file describing metadata for the VCF reference, such as names of SNPs |
`popfile_raw` | String | Population file for the reference VCF, before pre-processing |
`popfile_processed` | String | Population file for the reference VCF, created by pre-processing |
`variant_list` | String | List of variants to include in the synthetic dataset |
`remove_list` | String | List of samples to remove from the reference dataset |
`rsid_list` | String | Map of variant names to rsID format |
`genetic_mapfile` | String | Genetic maps for converting basepair to centimorgan distances |
`genetic_distfile` | String | A genetic map created by the pre-processing code |
`mutation_mapfile` | String | Age of mutation for each variant |
`mutation_agefile` | String | A mutation age map created by the pre-processing code |
`hap1_matrix` | String | A data structure for the reference data haplotypes created by the pre-processing code |
`hap2_matrix` | String | A data structure for the reference data haplotypes created by the pre-processing code |
Filepaths for phenotype data generation. See the phenotype data parameters section for other parameters that can be specified to customise the phenotype generation.
Parameter name | Possible values | Description |
---|---|---|
`causal_list` | String | Filepath for a list of predefined SNPs to be used as causal; overrides the polygenicity parameter if specified. Each column contains causal SNPs for one trait, columns separated by comma. |
`reference_list` | String | Filepath for a reference file for LD. By default, uses the reference file downloaded by the `fetch` command. |
`plink_override` | String | Optional argument which can be used if you would like to generate phenotypes for a genetics dataset that is in a different location to the one given by `output_dir` and `output_prefix`. Specify the location of the PLINK files with the naming convention `{prefix}-{chromosome}` |
Do not change these if you are using the Docker/Singularity containers. However, if you manually installed the software dependencies, you should check and update these filepaths:
Parameter name | Possible values | Description |
---|---|---|
`plink` | String | Software path for the plink software |
`plink2` | String | Software path for the plink2 software |
`king` | String | Software path for the king software |
`vcftools` | String | Software path for the vcftools software |
`mapthin` | String | Software path for the mapthin software |
`phenoalg` | String | Software path for our own phenotype software |
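If you installed the dependencies manually, the software paths might be set along these lines (a hypothetical sketch; the section name and exact keys should be checked against `config.yaml`, and the paths depend on your installation):

```yaml
software_paths:                    # hypothetical section name; check config.yaml
  plink: /usr/local/bin/plink
  plink2: /usr/local/bin/plink2
  king: /usr/local/bin/king
  vcftools: /usr/local/bin/vcftools
  mapthin: /usr/local/bin/mapthin
  phenoalg: /usr/local/bin/phenoalg
```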
There are three approaches for generating synthetic genotype data:
Approach | Result | Instructions |
---|---|---|
Same superpopulation for all samples | Generates synthetic samples for a specific superpopulation group | 1. Set `global_parameters` > `superpopulation` to your selected superpopulation, 2. Set `genotype_data` > `samples` > `default` to `true`, 3. Set `genotype_data` > `default` > `nsamples` to the number of samples you want to generate |
Equal ratio of samples for all six superpopulations | Generates synthetic samples for all six superpopulation groups in equal ratio, e.g. 1000 samples for each superpopulation if `nsamples` is 6000 | 1. Set `global_parameters` > `superpopulation` to `none`, 2. Set `genotype_data` > `samples` > `default` to `true`, 3. Set `genotype_data` > `default` > `nsamples` to the number of samples you want to generate |
Custom population structure | Generates synthetic samples according to your own custom population structure | 1. Set `genotype_data` > `samples` > `default` to `false` (this will cause the algorithm to ignore the value of the `global_parameters` > `superpopulation` parameter), 2. See the section on configuring custom population structure for how to define your own population structure |
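For instance, to generate 6000 samples split equally across all six superpopulations, the configuration might look like the following (a sketch following the `genotype_data` structure shown in the optimisation tutorial later in this document, which uses `use_default` under `samples`):

```yaml
global_parameters:
  superpopulation: none   # equal ratio across all six superpopulations
genotype_data:
  samples:
    use_default: true     # use the default sample specification below
  default:
    nsamples: 6000        # 1000 samples per superpopulation
```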
The genotype algorithm also uses a recombination rate `rho` and an effective population size `Ne`. The default values for each superpopulation were selected as the means of the posterior distributions derived using the optimisation workflow, as described in our manuscript. You can override these values in the configuration file.
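A hypothetical sketch of such an override is shown below (the exact key names and nesting for `rho` and `Ne` should be checked against the `config.yaml` in the repository; the values are illustrative, not recommendations):

```yaml
# hypothetical sketch; check config.yaml for the real key names and nesting
genotype_data:
  rho: 1.2        # recombination rate (illustrative value)
  Ne: 18000       # effective population size (illustrative value)
```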
Customise the population structure of the synthetic dataset by specifying your own population groups. You can add as many population groups as you want.
A population group consists of:
- `id`: a name for your population group, e.g. `pop1`
- `nsamples`: the number of samples you want to generate for this population group, e.g. 100
- `populations`: the population structure for the group, given as a list of superpopulations you want to include and the proportion of segments from each superpopulation for each synthetic genotype (the proportions must sum to 100)
Example 1: 100 genotypes with EUR ancestry and 200 genotypes with AFR ancestry
- id: EUR_pop
nsamples: 100
populations:
- EUR: 100
- id: AFR_pop
nsamples: 200
populations:
- AFR: 100
Example 2: 100 genotypes where each genotype has 50% of segments sampled from EUR reference samples and 50% of segments sampled from AFR reference samples. Please note that this does not produce truly admixed genotypes, because the current implementation of the algorithm does not account for the timing of admixture events.
- id: admix_pop
nsamples: 100
populations:
- EUR: 50
- AFR: 50
Parameters for phenotype data generation:
Parameter name | Possible values | Description |
---|---|---|
`nPopulation` | Integer | The number of populations (nPop) |
`nTrait` | Integer | The number of traits |
`a` | Float | Suggested value is -0.4 |
`b` | Float | Suggested value is -1 |
`c` | Float | Suggested value is 0.5 |
`nComponent` | Integer | Number of Gaussian mixture components |
`ProportionGeno` | Comma-separated list of floats | The observed causal SNP heritability in each population, for each trait. Flattened nPop x nTrait matrix, entries separated by comma. |
`ProportionCovar` | Comma-separated list of floats | The observed proportion of variance contributed by the covariate (input in the SampleList file) in each population, for each trait. Flattened nPop x nTrait matrix, entries separated by comma. |
`Prevalence` | Comma-separated list of floats | Disease prevalence in each population, for each trait. Flattened nPop x nTrait matrix, entries separated by comma. If prevalence is specified, the output will include a column for binary case/control statuses. |
`TraitCorr` | Comma-separated list of floats | A flattened correlation matrix for trait genetic correlation (symmetric positive definite). nTrait x nTrait entries separated by comma. |
`PopulationCorr` | Comma-separated list of floats | A flattened correlation matrix for population genetic correlation (symmetric positive definite). nPop x nPop entries separated by comma. |
`CompWeight` | Comma-separated list of floats | Gaussian mixture component weights |
`UseCausalList` | Boolean | True if using a list of predefined SNPs as causal; overrides the Polygenicity parameter if specified. Each column contains causal SNPs for one trait, columns separated by comma. See phenotype filepaths for how to specify a causal list as input. |
`Polygenicity` | Float | nTrait vector of trait polygenicity, measured by the proportion of total SNPs that are causal, e.g. Polygenicity = 0.05 means around 5% of SNPs will be causal. |
`Pleiotropy` | Float | nTrait vector of each trait's pleiotropy relationship relative to trait 1, i.e. if trait 2 has Pleiotropy = 0.9, then 90% of causal SNPs in trait 1 are also causal in trait 2. The first entry of the Pleiotropy vector is therefore always 1. Entries separated by comma. |
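To make the matrix-flattening conventions concrete, here is a hypothetical sketch for 2 populations and 2 traits (the section name and exact formatting should be checked against `config.yaml`; all values are illustrative):

```yaml
phenotype_data:                     # hypothetical section name; check config.yaml
  nPopulation: 2
  nTrait: 2
  a: -0.4                           # suggested value
  b: -1                             # suggested value
  c: 0.5                            # suggested value
  nComponent: 2
  CompWeight: 0.7,0.3               # one weight per Gaussian mixture component
  ProportionGeno: 0.4,0.4,0.3,0.3   # flattened 2x2 (nPop x nTrait) heritability matrix
  Prevalence: 0.1,0.1,0.05,0.05     # flattened 2x2 prevalence matrix (binary traits)
  TraitCorr: 1,0.5,0.5,1            # flattened 2x2 trait correlation matrix
  PopulationCorr: 1,0.8,0.8,1       # flattened 2x2 population correlation matrix
  UseCausalList: false
  Polygenicity: 0.05,0.05           # ~5% of SNPs causal for each trait
  Pleiotropy: 1,0.9                 # first entry always 1; trait 2 shares 90% of trait 1's causal SNPs
```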
The evaluation workflow computes a set of visualisations and quantitative metrics for understanding the reliability and generalisability of a synthetic dataset. This is done by a chromosome-by-chromosome comparison of the synthetic data with a real reference dataset from multiple perspectives (LD structure, kinship relatedness structure, minor allele frequencies, etc.).
The evaluation workflow can be run using the `validate` command, e.g.
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif validate data/config.yaml
Note that you can use the evaluation workflow with any PLINK-formatted synthetic dataset, i.e. it does not have to be generated by this tool. However, please note that the code assumes the IDs in the .fam file begin with the prefix "syn".
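If you are evaluating an external dataset whose sample IDs lack this prefix, one possible workaround (a sketch using standard awk; whether both the family and within-family IDs need the prefix is an assumption here, so back up your files first) is:

```bash
# prepend "syn" to the FID and IID columns of a PLINK .fam file
# (mydata.fam / mydata_syn.fam are placeholder filenames)
awk 'BEGIN { OFS=" " } { $1 = "syn" $1; $2 = "syn" $2; print }' mydata.fam > mydata_syn.fam
```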
You can enable/disable different types of metrics by setting `true`/`false` values in the configuration file:
Parameter name | Possible values | Description | Requires genotype data | Requires phenotype data |
---|---|---|---|---|
`aats` | `true`/`false` | Nearest neighbour adversarial accuracy, which quantifies the extent to which synthetic data can be statistically distinguished from real data | ✔️ | |
`kinship` | `true`/`false` | Kinship-based relatedness analysis, including kinship density and identity by state (IBS) plots | ✔️ | |
`ld_corr` | `true`/`false` | Linkage disequilibrium (LD) correlation matrix | ✔️ | |
`ld_decay` | `true`/`false` | Linkage disequilibrium (LD) decay plot (and distance) | ✔️ | |
`maf` | `true`/`false` | Minor allele frequency divergences | ✔️ | |
`pca` | `true`/`false` | Principal components analysis of population structure | ✔️ | |
`gwas` | `true`/`false` | Run a GWAS and generate Manhattan and Q-Q plots | ✔️ | ✔️ |
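A sketch of the corresponding configuration section (the section name is assumed here; check `config.yaml` for the exact layout):

```yaml
evaluation:          # assumed section name; check config.yaml
  aats: true
  kinship: true
  ld_corr: false
  ld_decay: true
  maf: true
  pca: true
  gwas: false        # requires phenotype data as well as genotype data
```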
The evaluation metrics are printed to standard output, and plots are stored in `output_dir/evaluation`, where `output_dir` is specified in the `config.yaml` file.
Approximate Bayesian computation (ABC) is a likelihood-free inference technique that can be applied to estimate the posterior distributions of parameters for simulation-based models. The key steps in performing an ABC analysis are summarised below:
- Setup: This tutorial assumes that you have made a copy of the `config.yaml` file. You can either update the configuration for your own use case or use the values described in this tutorial. In particular:
  - For `global_parameters`, you should set a `chromosome` (choose a specific value between 1-22) and a `superpopulation` (choose one of `AFR`, `AMR`, `EAS`, `EUR`, `CSA`, `MID`)
  - For `filepaths`, either update the values or use the defaults, which use 1KG+HGDP as the reference dataset
  - For `genotype_data`, set `use_default: true` and choose the size of the synthetic datasets by specifying a value for `nsamples`
Example - copy the full `config.yaml` file and edit the following sections:
global_parameters:
...
chromosome: 1
superpopulation: EUR
...
genotype_data:
samples:
use_default: true # setting this to true will ignore the custom population groups
...
default:
nsamples: 1000
Note: if you want to continue with the default settings, you can skip to step 4
- Define prior distributions for the unknown parameters: First we need to define prior distributions representing our prior beliefs about the values of the unknown parameters: the effective population size `Ne` and the recombination rate `rho`. HAPNEST uses uniform priors for these parameters, for a fixed superpopulation. Update the `config.yaml` file to specify the lower and upper bounds of the uniform distributions for `rho` and `Ne`. For example, `uniform_lower: 0, uniform_upper: 3` will uniformly sample parameter values between 0 and 3.
Example:
optimisation:
# prior distributions - specify lower/upper bounds for uniform priors
priors:
rho:
uniform_lower: 0
uniform_upper: 3
Ne:
uniform_lower: 0
uniform_upper: 50000
- Define a summary statistic or a set of summary statistics that capture the relevant features of the data: The summary statistics quantify the objectives for the ABC simulation. HAPNEST currently supports two options: (1) `ld_decay`, which aims to preserve the LD structure of real genotype data by minimising the Euclidean distance between the LD decay curves of the reference and synthetic datasets, and (2) `kinship`, which aims to preserve the relatedness structure of real genotype data using a kinship-based relatedness analysis. You can use one or both of these summary statistics by setting the corresponding value(s) to `true` or `false` in the `config.yaml` file.
Example:
optimisation:
...
# choice of summary statistic/s
summary_statistics:
ld_decay: true
kinship: true
- Choose a simulation type and specify the simulation parameters: Choose either the `simulation_rejection_ABC` or the `emulation_rejection_ABC` approach by setting `run: true`, and specify a `threshold` for accepting/rejecting parameters based on the summary statistics (if you are unsure of a suitable threshold for your use case, you can select an arbitrarily high value such as 1.0 and see step 5 for post-simulation analysis). Simulation-based rejection sampling (`simulation_rejection_ABC`) repeatedly samples parameters from the prior distributions and calculates summary statistics for data simulated from the model using those parameters, in order to decide whether to accept or reject the candidate samples. Emulation-based rejection sampling (`emulation_rejection_ABC`) is a more efficient variation that learns a surrogate model to estimate the summary statistic values for different parameter values, which can help with computationally expensive simulations of large synthetic datasets. For method-specific parameters, see the documentation for the Julia GpABC package.
Example:
optimisation:
# inference type - simulation-based rejection ABC or emulation-based rejection ABC
simulation_rejection_ABC:
run: true
n_particles: 500
threshold: 0.15
max_iter: 500
write_progress: true
emulation_rejection_ABC:
run: false
n_particles: 500
threshold: 0.15
n_design_points: 50
max_iter: 500
write_progress: true
- Run a simulation to generate samples of parameters from the prior distributions and simulate data from the model using those parameters: The `optimise` command runs the ABC simulation, calculates the summary statistics for the simulated data, and compares them with the summary statistics for the reference data using a distance measure. Parameters are accepted if the distance is below the predefined `threshold` and rejected otherwise; the accepted parameters form an approximation of the posterior distribution.
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif optimise data/config.yaml
- Analyse and use the results for generating synthetic datasets: The optimisation pipeline creates two outputs in the `optimisation` directory: (1) `{prefix}_sim.csv`, the raw table of accepted parameter values and their distance measure `d`; and (2) `{prefix}_sim.png`, a plot of the resulting posterior distributions for each parameter. In the Jupyter notebook `optimisation/example/analysis.ipynb` we provide example code for conducting further analysis of these results. To summarise:
  - Further analyse the posterior distributions for each parameter by plotting histograms or density plots for varying acceptance thresholds (example pictured below)
  - Calculate point estimates, such as the mean, for each parameter. These point estimates can then be used as the values for `rho` and `Ne` in the `config.yaml` file for generating synthetic datasets
  - Perform model validation by simulating datasets with the new parameter values and comparing the simulated data with the observed data using HAPNEST's evaluation pipeline
A summary of options in the `config.yaml` file for optimisation is provided below:
Parameter name | Possible values | Description |
---|---|---|
`priors` | Numerical values for `uniform_lower` and `uniform_upper` of each parameter | The lower and upper bounds for the uniform prior distributions of each parameter. |
`simulation_rejection_ABC` | Set `run` to `true` if using this algorithm. Specify the number of posterior samples with `n_particles`, the maximum number of simulations to run with `max_iter`, the acceptance threshold with `threshold`, and whether the algorithm progress should be printed to standard output with `write_progress`. | Configuration for simulation-based rejection sampling. |
`emulation_rejection_ABC` | Set `run` to `true` if using this algorithm. Specify the number of posterior samples with `n_particles`, the number of samples for training the emulator with `n_design_points`, the maximum number of emulations to run with `max_iter`, the acceptance threshold with `threshold`, and whether the algorithm progress should be printed to standard output with `write_progress`. | Configuration for emulation-based rejection sampling. Recommended for computationally expensive simulations of large synthetic datasets. |
`summary_statistics` | Set `true`/`false` for `ld_decay` and/or `kinship` | The `ld_decay` objective minimises the Euclidean distance between the LD decay curves of the reference/synthetic datasets. The `kinship` objective minimises the difference in genetic relatedness between the reference/synthetic datasets. The objectives can be used separately or combined. |
In this section we provide instructions for using an HPC cluster to efficiently generate very large datasets. These instructions assume that your cluster uses the Slurm job scheduler.
- Create a `config.yaml` file with the parameters you want to use for synthetic data generation. Specify the `chromosome` parameter as a wildcard, as shown below, to enable the software to efficiently generate data for all 22 chromosomes simultaneously.
global_parameters:
chromosome: ${chr}
...
- The script below shows an example of a batch script that can be used to submit a collection of jobs for generating synthetic genotypes. This script will copy the configuration file for each chromosome and submit 22 multi-threaded jobs. You can adjust the computational resources for your use case - the example below shows what we used to generate 1 million genotypes for over 6.8 million variants.
#!/bin/bash
#SBATCH --array=1-22
#SBATCH --mem-per-cpu=32G
#SBATCH --cpus-per-task=8
#SBATCH --time 24:00:00
n=$SLURM_ARRAY_TASK_ID
CONFIG=data/config # prefix for config file
# generate a config for each chromosome
cp ${CONFIG}.yaml ${CONFIG}$n.yaml
sed -i "s/\${chr}/$n/g" ${CONFIG}$n.yaml
# generate data for each chromosome
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_geno 8 ${CONFIG}$n.yaml
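Assuming the script above is saved as generate_geno.sbatch (a filename chosen here purely for illustration), submit the 22-task array with:

```bash
sbatch generate_geno.sbatch
```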
- The script below shows how to generate synthetic phenotypes, using the genotypes generated in the previous step as input. This shouldn't be split into multiple jobs because the phenotype generation algorithm needs to sum genetic effects across all chromosomes included in the dataset.
#!/bin/bash
CONFIG=data/config # prefix for config file
# replace the chromosome wildcard with all
cp ${CONFIG}.yaml ${CONFIG}_pheno.yaml
sed -i "s/\${chr}/all/g" ${CONFIG}_pheno.yaml
# generate phenotype data
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_pheno ${CONFIG}_pheno.yaml
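Because the phenotype step must wait for every genotype array task to finish, one option is to chain the submissions with a standard Slurm dependency (a sketch; the script names match the illustrative ones used above):

```bash
# submit the genotype array job and capture its job ID
jobid=$(sbatch --parsable generate_geno.sbatch)
# run the phenotype script only after all 22 array tasks complete successfully
sbatch --dependency=afterok:$jobid generate_pheno.sbatch
```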