
Software for generating synthetic genotype and phenotype datasets


Synthetic data generation and evaluation

HAPNEST enables you to

  • Efficiently generate large-scale, diverse and realistic datasets for genotypes and phenotypes
  • Easily analyse data quality with an extensive workflow for evaluating synthetic data reliability and generalisability
  • Examine the posterior distributions of model parameters to aid with model selection

We have used HAPNEST to simulate genetic data and corresponding phenotypes for 1 million individuals across 6 superpopulations and over 6.8 million variants in less than 12 hours (32GB RAM, 8 threads per machine). We have made this synthetic dataset (and a smaller example dataset) publicly available for download at https://www.ebi.ac.uk/biostudies/studies/S-BSST936.

architecture diagram

HAPNEST has a modular architecture, in which each software module can be used independently with a single command-line call. This means, for example, that you can use the evaluation workflow to assess synthetic data quality for any synthetic dataset, not just those generated by HAPNEST.

This software tool was developed by members of INTERVENE (INTERnational consortium of integratiVE geNomics prEdiction). If you would like to cite this software, please cite our preprint https://www.biorxiv.org/content/10.1101/2022.12.22.521552v1.

Contents

  • Quickstart tutorial
  • Customising your synthetic datasets
  • Understanding your synthetic datasets
  • Large scale synthetic data generation

Quickstart tutorial

TLDR

Steps 1-4 only need to be run the first time you install HAPNEST. Step 5 describes how you can generate and then evaluate a synthetic dataset. If you encounter any errors when running these steps, please see the more detailed documentation below.

  1. Download the Singularity container by running the following command:
singularity pull docker://sophiewharrie/intervene-synthetic-data
  2. Set up the directory structure illustrated below, with the container you just downloaded and a copy of the config.yaml file from this repository:
.
├── data
|   └── config.yaml
└── containers
    └── intervene-synthetic-data_latest.sif
  3. Initialise the software dependencies from the root directory you set up in the previous step:
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif init
  4. Fetch the reference dataset:
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif fetch
  5. Generate synthetic genotypes and phenotypes, and (optionally) evaluate the data quality:

Synthetic data generation:

singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_geno 1 data/config.yaml
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_pheno data/config.yaml

Evaluation (optional - please note this can take a while to execute for larger synthetic datasets):

singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif validate data/config.yaml

Extended version

This quickstart tutorial will show you the simplest approach for generating and evaluating synthetic datasets. This example will generate a small dataset of 1000 samples for 6 ancestry groups and HapMap3 variants on chromosome 1. Later we discuss how to customise the synthetic datasets.

  1. Download the HAPNEST container.

For ease of portability and reproducibility, we've made this software available as Docker and Singularity containers. Containerisation streamlines software dependency management by creating a standardised environment, making it easier for you to get started with generating synthetic datasets.

The HAPNEST container is available at https://hub.docker.com/r/sophiewharrie/intervene-synthetic-data. The instructions in this documentation use the Singularity container, but they can be adapted to work directly with the Docker container.

Download the Singularity container by running the following command:

singularity pull docker://sophiewharrie/intervene-synthetic-data

Alternatively, you can run this software without a container by manually installing the software dependencies. If you prefer this approach, you can view the list of required dependencies in the Dockerfile of this repository.

  2. Choose a location where you want to generate synthetic data. There, create a data directory containing a copy of the config.yaml file downloaded from this repository, and a containers directory containing the downloaded container file.
.
├── data
|   └── config.yaml
└── containers
    └── intervene-synthetic-data_latest.sif

If your container version has a different name, update it in all subsequent commands. The config.yaml file contains the parameter values used for synthetic data generation. In this tutorial we will use the default configuration.

  3. From the root directory you set up in the previous step, run the init command to complete the setup of software dependencies:
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif init

This command binds the data directory you created in the previous step, which is used by HAPNEST to access input reference files and store output data files.

By default the above command will store Julia package files in the ~/.julia directory. If you experience an error related to updating the registry at ~/.julia/... you may need to give an explicit depot path by adding --env JULIA_DEPOT_PATH={path-to-your-preferred-location} to the previous command and all subsequent commands. For example:

singularity exec --bind data/:/data/ --env JULIA_DEPOT_PATH=/data/.julia containers/intervene-synthetic-data_latest.sif init
  4. The first time you use the container, you need to fetch the reference dataset using the fetch command:
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif fetch

This will download the reference files used as input to the synthetic data generation program to the data/inputs/processed directory. See preprocessing/README.md if you would like to know more about how this data was created and for information about creating your own reference datasets.

  5. Now generate a synthetic dataset using the generate_geno and generate_pheno commands:
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_geno 1 data/config.yaml
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_pheno data/config.yaml

The number given after generate_geno in the first command is the number of computing threads you want the software to use. We recommend increasing this value for faster synthetic data generation. Running the above commands should generate a small synthetic dataset in the data/outputs/test directory.

Now it would be useful to evaluate the quality of this data.

  6. Evaluate synthetic data quality using the validate command (optional - please note this can take a while to execute for larger synthetic datasets):
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif validate data/config.yaml

This will store visualisations in the data/outputs/test/evaluation directory and print results for quantitative metrics. See our manuscript for details about the evaluation workflow.

Now that you understand the basics, you can read about how to customise your synthetic datasets. You may also be interested in how to generate very large datasets.

Customising your synthetic datasets

You can customise the specifications for synthetic data generation by modifying the config.yaml file.

Global parameters

The global parameters are used by all workflows:

| Parameter name | Possible values | Description |
| --- | --- | --- |
| random_seed | Integer value, e.g. 123 | A random seed used for reproducibility. |
| chromosome | all or an integer value between 1 and 22 | Set to a numerical value between 1 and 22 if running the pipeline for a specific chromosome, or set to all for all 22 chromosomes. The all option will run synthetic data generation for each chromosome sequentially, so we recommend using the approach described in the large scale synthetic data generation section for efficiently generating multi-chromosome datasets. |
| superpopulation | none or one of AFR (African), AMR (Admixed American), EAS (East Asian), EUR (European), CSA (Central or South Asian), MID (Middle Eastern) | A superpopulation for (all) synthetic samples. This parameter is ignored if using a custom population structure. |
| memory | Integer value | Amount of memory available (in MB) for memory-intensive commands. |
| batchsize | Integer value | The number of synthetic samples to write per batch, used during synthetic genotype generation. |
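
For example, the global_parameters section of config.yaml might look like the following sketch (values are illustrative only; check your copy of the file for the exact layout):

global_parameters:
  random_seed: 123
  chromosome: 1          # or all to run every chromosome sequentially
  superpopulation: EUR   # or none when using a custom population structure
  memory: 4096           # in MB
  batchsize: 1000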

Input and output filepaths

The most important filepaths to check before generating synthetic datasets are the general filepaths. Most other filepaths can be left with the default values.

General filepaths

The general filepaths are used by all workflows:

| Parameter name | Possible values | Description |
| --- | --- | --- |
| output_dir | String | Directory for synthetic data generation outputs, e.g. data/outputs/test |
| output_prefix | String | Prefix for synthetic data generation outputs, using the {chromosome} wildcard, e.g. test_chr-{chromosome}. Please note that the program expects input files with the naming convention {prefix}-{chromosome} |
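
As a rough sketch, and assuming these keys sit under the filepaths section referenced in the optimisation tutorial below (the exact nesting may differ in your config.yaml):

filepaths:
  output_dir: data/outputs/test
  output_prefix: test_chr-{chromosome}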

Genotype filepaths

Filepaths for genotype data generation. You do not need to change these if you are using the reference data downloaded by the fetch command.

| Parameter name | Possible values | Description |
| --- | --- | --- |
| vcf_input_raw | String | VCF files for the (real) reference dataset, before pre-processing |
| vcf_input_processed | String | VCF files for the (real) reference dataset, created by pre-processing |
| vcf_metadata | String | Text file describing metadata for the VCF reference, such as the names of SNPs |
| popfile_raw | String | Population file for the reference VCF, before pre-processing |
| popfile_processed | String | Population file for the reference VCF, created by pre-processing |
| variant_list | String | List of variants to include in the synthetic dataset |
| remove_list | String | List of samples to remove from the reference dataset |
| rsid_list | String | Map of variant names to rsID format |
| genetic_mapfile | String | Genetic maps for converting basepair to centimorgan distances |
| genetic_distfile | String | A genetic map created by the pre-processing code |
| mutation_mapfile | String | Age of mutation for each variant |
| mutation_agefile | String | A mutation age map created by the pre-processing code |
| hap1_matrix | String | A data structure for the reference data haplotypes, created by the pre-processing code |
| hap2_matrix | String | A data structure for the reference data haplotypes, created by the pre-processing code |

Phenotype filepaths

Filepaths for phenotype data generation. See the phenotype data parameters section for other parameters that can be specified to customise the phenotype generation.

| Parameter name | Possible values | Description |
| --- | --- | --- |
| causal_list | String | Filepath for a list of predefined SNPs to be used as causal, overrides polygenicity parameter if specified. Each column contains causal SNPs for one trait, columns separated by comma. |
| reference_list | String | Filepath for a reference file for LD. By default, uses the reference file downloaded by the fetch command. |
| plink_override | String | Optional argument which can be used if you would like to generate phenotypes for a genetics dataset that is in a different location to what is given by output_dir and output_prefix. Specify the location of the PLINK files with the naming convention {prefix}-{chromosome} |
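
For example, to generate phenotypes for an existing PLINK dataset stored elsewhere, plink_override could be set along the following lines (a sketch only; the path and the surrounding key structure are illustrative assumptions):

plink_override: data/outputs/my_existing_data/mydata_chr-{chromosome}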

Software filepaths

Do not change these if you are using the Docker/Singularity containers. However, if you manually installed the software dependencies then you should check and update these filepaths:

| Parameter name | Possible values | Description |
| --- | --- | --- |
| plink | String | Path to the plink executable |
| plink2 | String | Path to the plink2 executable |
| king | String | Path to the king executable |
| vcftools | String | Path to the vcftools executable |
| mapthin | String | Path to the mapthin executable |
| phenoalg | String | Path to our own phenotype software |
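
If you installed the dependencies manually, the corresponding entries might look like this (a sketch; the paths are placeholders and the surrounding key structure is not shown here - point each entry at the executable on your system):

plink: /path/to/plink
plink2: /path/to/plink2
king: /path/to/king
vcftools: /path/to/vcftools
mapthin: /path/to/mapthin
phenoalg: /path/to/phenoalg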

Genotype data parameters

There are three approaches for generating synthetic genotype data:

| Approach | Result | Instructions |
| --- | --- | --- |
| Same superpopulation for all samples | Generates synthetic samples for a specific superpopulation group | 1. Set global_parameters > superpopulation to your selected superpopulation, 2. Set genotype_data > samples > default to true, 3. Set genotype_data > default > nsamples to the number of samples you want to generate |
| Equal ratio of samples for all six superpopulations | Generates synthetic samples for all six superpopulation groups in equal ratio, e.g. 1000 samples for each superpopulation if nsamples is 6000 | 1. Set global_parameters > superpopulation to none, 2. Set genotype_data > samples > default to true, 3. Set genotype_data > default > nsamples to the number of samples you want to generate |
| Custom population structure | Generates synthetic samples according to your own custom population structure | 1. Set genotype_data > samples > default to false (this will cause the algorithm to ignore the value of the global_parameters > superpopulation parameter), 2. See the section on configuring custom population structure for how to define your own population groups |
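
For instance, the first approach (all samples drawn from a single superpopulation) corresponds to config settings along these lines, following the layout of the config excerpt shown in the optimisation tutorial below (values are illustrative):

global_parameters:
  superpopulation: EUR

genotype_data:
  samples:
    use_default: true
    default:
      nsamples: 10000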

The genotype algorithm also uses a recombination rate rho and effective population size Ne. The default values for each superpopulation were selected as the means of the posterior distributions derived using the optimisation workflow, as described in our manuscript. You can override these values in the configuration file.
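
This README does not show exactly where rho and Ne sit in config.yaml, but as a rough sketch an override might look something like this (key placement is an assumption; check your config file for the real location):

genotype_data:
  rho: 1.0      # recombination rate (illustrative value, assumed key placement)
  Ne: 10000     # effective population size (illustrative value, assumed key placement)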

Configuring custom population structure

Customise the population structure of the synthetic dataset by specifying your own population groups. You can add as many population groups as you want.

A population group consists of:

  • id: a name for your population group, e.g. pop1
  • nsamples: the number of samples you want to generate for this population group, e.g. 100
  • populations: the population structure for the population group, as a list of superpopulations that you want to include, and the proportion of segments from each superpopulation for each synthetic genotype (which must sum to 100)

Example 1: 100 genotypes with EUR ancestry and 200 genotypes with AFR ancestry

- id: EUR_pop
  nsamples: 100
  populations:
    - EUR: 100
- id: AFR_pop
  nsamples: 200
  populations:
    - AFR: 100

Example 2: 100 genotypes where each genotype has 50% of segments sampled from EUR reference samples and 50% of segments sampled from AFR reference samples. Please note that this does not result in true admixed genotypes because the current implementation of the algorithm does not account for the time of admixture events.

- id: admix_pop
  nsamples: 100
  populations:
    - EUR: 50
    - AFR: 50

Phenotype data parameters

Parameters for phenotype data generation:

| Parameter name | Possible values | Description |
| --- | --- | --- |
| nPopulation | Integer | The number of populations (nPop) |
| nTrait | Integer | The number of traits |
| a | Float | Suggested value is -0.4 |
| b | Float | Suggested value is -1 |
| c | Float | Suggested value is 0.5 |
| nComponent | Integer | Number of Gaussian mixture components |
| ProportionGeno | Comma-separated list of floats | The observed causal SNP heritability in each population, for each trait. A flattened nPop x nTrait matrix, entries separated by commas. |
| ProportionCovar | Comma-separated list of floats | The observed proportion of variance contributed by the covariate (input in the SampleList file) in each population, for each trait. A flattened nPop x nTrait matrix, entries separated by commas. |
| Prevalence | Comma-separated list of floats | Disease prevalence in each population, for each trait. A flattened nPop x nTrait matrix, entries separated by commas. If prevalence is specified, the output will include a column for binary case/control statuses. |
| TraitCorr | Comma-separated list of floats | A flattened correlation matrix for the genetic correlation between traits (symmetric positive definite). nTrait x nTrait entries separated by commas. |
| PopulationCorr | Comma-separated list of floats | A flattened correlation matrix for the genetic correlation between populations (symmetric positive definite). nPop x nPop entries separated by commas. |
| CompWeight | Comma-separated list of floats | Gaussian mixture component weights |
| UseCausalList | Boolean | True if using a list of predefined SNPs to be used as causal; overrides the Polygenicity parameter if specified. Each column contains causal SNPs for one trait, columns separated by commas. See phenotype filepaths for how to specify a causal list as input. |
| Polygenicity | Float | nTrait vector of trait polygenicity, measured by the proportion of total SNPs that are causal, e.g. Polygenicity = 0.05 means around 5% of SNPs will be causal. |
| Pleiotropy | Float | nTrait vector of each trait's pleiotropy relationship relative to trait 1, i.e. if trait 2 has Pleiotropy = 0.9, 90% of causal SNPs in trait 1 are also causal in trait 2. The first entry of the Pleiotropy vector is therefore always 1. Entries separated by commas. |
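
As an illustration of how the flattened matrices line up, a hypothetical two-population, two-trait setup might look like the following (the section name phenotype_data, the key placement and all values are illustrative assumptions):

phenotype_data:
  nPopulation: 2
  nTrait: 2
  a: -0.4
  b: -1
  c: 0.5
  nComponent: 3
  ProportionGeno: 0.3,0.3,0.4,0.4    # nPop x nTrait = 4 entries, flattened
  ProportionCovar: 0.1,0.1,0.1,0.1   # nPop x nTrait = 4 entries, flattened
  Prevalence: 0.1,0.1,0.05,0.05      # nPop x nTrait = 4 entries, flattened
  TraitCorr: 1,0.5,0.5,1             # 2 x 2 trait correlation matrix, flattened
  PopulationCorr: 1,0.8,0.8,1        # 2 x 2 population correlation matrix, flattened
  CompWeight: 0.5,0.3,0.2            # one weight per Gaussian mixture component
  UseCausalList: false
  Polygenicity: 0.05,0.05            # one entry per trait
  Pleiotropy: 1,0.9                  # first entry is always 1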

Understanding your synthetic datasets

Evaluating synthetic data quality

The evaluation workflow computes a set of visualisations and quantitative metrics for understanding the reliability and generalisability of a synthetic dataset. This is done by a chromosome-by-chromosome comparison of the synthetic data with a real reference dataset from multiple perspectives (LD structure, kinship relatedness structure, minor allele frequencies, etc.).

The evaluation workflow can be run using the validate command, e.g.

singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif validate data/config.yaml

Note that you can use the evaluation workflow with any PLINK-formatted synthetic dataset, i.e. it doesn't have to be generated using this tool. However, please note that the code assumes that the IDs in the .fam file begin with the prefix "syn".

You can enable/disable different types of metrics by setting true/false values in the configuration file:

| Parameter name | Possible values | Description | Requires genotype data | Requires phenotype data |
| --- | --- | --- | --- | --- |
| aats | true/false | Nearest neighbour adversarial accuracy, which quantifies the extent to which synthetic data can be statistically distinguished from real data | ✔️ | |
| kinship | true/false | Kinship-based relatedness analysis, including kinship density and identity by state (IBS) plots | ✔️ | |
| ld_corr | true/false | Linkage disequilibrium (LD) correlation matrix | ✔️ | |
| ld_decay | true/false | Linkage disequilibrium (LD) decay plot (and distance) | ✔️ | |
| maf | true/false | Minor allele frequency divergences | ✔️ | |
| pca | true/false | Principal components analysis of population structure | ✔️ | |
| gwas | true/false | Run a GWAS and generate Manhattan and QQ plots | ✔️ | ✔️ |
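
For example, to run only the genotype-based metrics, the relevant section of config.yaml might look like this (the surrounding key name, e.g. evaluation, is an assumption; check your copy of the file):

evaluation:
  aats: true
  kinship: true
  ld_corr: true
  ld_decay: true
  maf: true
  pca: true
  gwas: false   # requires phenotype data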

The evaluation metrics are printed to standard output and plots are stored at output_dir/evaluation, where output_dir is specified in the config.yaml file.

Optimising model parameters using ABC

Approximate Bayesian computation (ABC) is a likelihood-free inference technique that can be applied to estimate the posterior distributions of parameters for simulation-based models. The key steps in performing an ABC analysis are summarised below:

  1. Setup: This tutorial assumes that you have made a copy of the config.yaml file. You can either update the configuration for your own use case or use the values described in this tutorial. In particular:
  • For global_parameters, you should set a chromosome (choose a specific value between 1-22) and a superpopulation (choose one of AFR, AMR, EAS, EUR, CSA, MID)
  • For filepaths, either update the values or use the defaults, which will use 1KG+HGDP as the reference dataset
  • For genotype_data, set use_default: true and choose the size of the synthetic datasets by specifying a value for nsamples

Example - copy the full config.yaml file and edit the following sections:

global_parameters:
  ...
  chromosome: 1
  superpopulation: EUR

...

genotype_data:
  samples:
    use_default: true # setting this to true will ignore the custom population groups
    ...
    default:
      nsamples: 1000

Note: if you want to continue with the default settings, you can skip to step 4

  2. Define prior distributions for the unknown parameters: First we need to define prior distributions representing our prior beliefs about the values of the unknown parameters: the effective population size, Ne, and the recombination rate, rho. HAPNEST uses uniform priors for these parameters, for a fixed superpopulation. Update the config.yaml file to specify the upper and lower bounds of the uniform distributions for rho and Ne. For example, uniform_lower: 0, uniform_upper: 3 will uniformly sample parameter values between 0 and 3.

Example:

optimisation:
  # prior distributions - specify lower/upper bounds for uniform priors
  priors:
    rho:
      uniform_lower: 0
      uniform_upper: 3
    Ne:
      uniform_lower: 0
      uniform_upper: 50000
  3. Define a summary statistic or a set of summary statistics that capture the relevant features of the data: The summary statistics are used to quantify the objectives for the ABC simulation. HAPNEST currently supports two options: (1) ld_decay, which aims to preserve LD structure of real genotype data by minimising the Euclidean distance between LD decay curves of the reference and synthetic datasets, and (2) kinship, which aims to preserve relatedness structure of real genotype data, using a kinship-based relatedness analysis. You can choose to use one or both of these summary statistics, by setting the value/s to true or false in the config.yaml file.

Example:

optimisation:
  ...
  # choice of summary statistic/s
  summary_statistics:
    ld_decay: true
    kinship: true
  4. Choose a simulation type and specify the simulation parameters: Choose either the simulation_rejection_ABC or emulation_rejection_ABC approach by setting run: true and specify a threshold for accepting/rejecting parameters based on the summary statistics (if you are unsure of a suitable threshold for your use case, you can select an arbitrarily high value such as 1.0 and see step 6 for post-simulation analysis). Simulation-based rejection sampling (simulation_rejection_ABC) repeatedly samples parameters from the prior distributions and calculates summary statistics for data simulated from the model using those parameters to decide whether to accept or reject the candidate samples. Emulation-based rejection sampling (emulation_rejection_ABC) is a more efficient variation of simulation-based rejection sampling that learns a surrogate model to estimate the summary statistic values for different parameter values, which can help with computationally expensive simulations of large synthetic datasets. For method-specific parameters, see the documentation for the Julia GpABC package.

Example:

optimisation:
  # inference type - simulation-based rejection ABC or emulation-based rejection ABC
  simulation_rejection_ABC:
    run: true
    n_particles: 500
    threshold: 0.15
    max_iter: 500
    write_progress: true
  emulation_rejection_ABC:
    run: false
    n_particles: 500
    threshold: 0.15
    n_design_points: 50
    max_iter: 500
    write_progress: true
  5. Run a simulation to generate samples of parameters from the prior distributions and simulate data from the model using those parameters: The optimise command runs the ABC simulation, calculates the summary statistics for the simulated data, and compares them with the summary statistics for the reference data using a distance measure. Parameters are accepted if the distance is below the predefined threshold and rejected otherwise; the accepted parameters form an approximation of the posterior distribution.
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif optimise data/config.yaml
  6. Analyse and use the results for generating synthetic datasets: The optimisation pipeline creates two outputs in the optimisation directory: (1) {prefix}_sim.csv, the raw table of accepted parameter values and their distance measure d; and (2) {prefix}_sim.png, a plot of the resulting posterior distributions for each parameter. In the Jupyter notebook optimisation/example/analysis.ipynb we provide example code for conducting further analysis of these results. To summarise:
  • Further analyse the posterior distributions for each parameter by plotting histograms or density plots for varying acceptance thresholds (example pictured below)
  • Calculate point estimates, such as the mean, for each parameter. These point estimates can then be used as values for rho and Ne in the config.yaml file for generating synthetic datasets
  • Perform model validation by simulating datasets with the new parameter values and comparing the simulated data with the observed data using HAPNEST's evaluation pipeline

abc diagram


A summary of options in the config.yaml file for optimisation is provided below:

| Parameter name | Possible values | Description |
| --- | --- | --- |
| priors | Numerical values for uniform_lower and uniform_upper of each parameter | The lower and upper bounds for the uniform prior distributions of each parameter. |
| simulation_rejection_ABC | Set run to true if using this algorithm. Specify the number of posterior samples with n_particles, the maximum number of simulations to run with max_iter, the acceptance threshold with threshold, and whether the algorithm progress should be printed to standard output with write_progress. | Configuration for simulation-based rejection sampling. |
| emulation_rejection_ABC | Set run to true if using this algorithm. Specify the number of posterior samples with n_particles, the number of samples for training the emulator with n_design_points, the maximum number of emulations to run with max_iter, the acceptance threshold with threshold, and whether the algorithm progress should be printed to standard output with write_progress. | Configuration for emulation-based rejection sampling. Recommended for computationally expensive simulations of large synthetic datasets. |
| summary_statistics | Set true/false for ld_decay and/or kinship | The ld_decay objective minimises Euclidean distance between LD decay curves of the reference/synthetic datasets. The kinship objective minimises genetic relatedness between the reference/synthetic datasets. The objectives can be used separately or combined together. |

Large scale synthetic data generation

In this section we provide instructions for utilising a HPC cluster to efficiently generate very large datasets. These instructions assume that your cluster uses the Slurm job scheduler.

  1. Create a config.yaml file with the parameters you want to use for synthetic data generation. Specify the chromosome parameter as a wildcard as shown below to enable the software to efficiently generate data for all 22 chromosomes simultaneously.
global_parameters:
  chromosome: ${chr}
  ...
  2. The script below shows an example of a batch script that can be used to submit a collection of jobs for generating synthetic genotypes. This script will copy the configuration file for each chromosome and submit 22 multi-threaded jobs. You can adjust the computational resources for your use case - the example below shows what we used to generate 1 million genotypes for over 6.8 million variants.
#!/bin/bash

#SBATCH --array=1-22
#SBATCH --mem-per-cpu=32G
#SBATCH --cpus-per-task=8
#SBATCH --time 24:00:00

n=$SLURM_ARRAY_TASK_ID

CONFIG=data/config # prefix for config file

# generate a config for each chromosome
cp ${CONFIG}.yaml ${CONFIG}$n.yaml
sed -i "s/\${chr}/$n/g" ${CONFIG}$n.yaml

# generate data for each chromosome
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_geno 8 ${CONFIG}$n.yaml
  3. The script below shows how to generate synthetic phenotypes, using the genotypes generated in the previous step as input. This shouldn't be split into multiple jobs because the phenotype generation algorithm needs to sum genetic effects across all chromosomes included in the dataset.
#!/bin/bash

CONFIG=data/config # prefix for config file

# replace the chromosome wildcard with all
cp ${CONFIG}.yaml ${CONFIG}_pheno.yaml
sed -i "s/\${chr}/all/g" ${CONFIG}_pheno.yaml

# generate phenotype data
singularity exec --bind data/:/data/ containers/intervene-synthetic-data_latest.sif generate_pheno ${CONFIG}_pheno.yaml