HLA-MA matches high-throughput sequencing (HTS) samples based on HLA typing. Given a list of matched tumor/normal samples or a list of pedigree files with samples, HLA-MA validates the matching information.
Note that OptiType is a Python 2 program while HLA-MA is a Python 3 program.
We recommend installing the dependencies OptiType, Yara, and RazerS3 using Bioconda.
Further, you can now also install HLA-MA into the same conda environment.
You can skip this step if you have already installed the prerequisites and placed them in your $PATH.
The following commands will install Miniconda2 (for Python 2) in ~/miniconda2.
# wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
# bash Miniconda2-latest-Linux-x86_64.sh -b -p ~/miniconda2
Activate the Miniconda 2 installation by adding its bin directory to your $PATH, then add the R and Bioconda channels.
# export PATH=~/miniconda2/bin:${PATH}
# conda config --add channels r
# conda config --add channels bioconda
The following commands will create the appropriate conda environment for HLA-MA:
# conda create -y -n hlama python=3.4
# conda install optitype=2015.10.20
# conda install hlama yara=0.9.6 razers3=3.5.0
Note that you have to install OptiType separately first because of a Python 2.7 dependency at the time of writing (OptiType now also works with Python 3).
The following assumes that you are using virtualenv for your Python 3 environment.
# git clone git@github.com:bihealth/hlama.git
# cd hlama
# virtualenv -p python3 .venv
# . .venv/bin/activate
# python setup.py install
Now you have a working hlama installation in your $PATH.
# hlama --help
Create a configuration file with dependencies installed in Bioconda. If you have your dependencies installed in your $PATH or if you are using conda you can skip this step.
cat <<"EOF" >~/.hlama.cfg
[hlama]
# Allowed values for dep_source are
#
# - in_path (all binaries in $PATH, no further configuration
# is required)
# - bioconda (installed using Bioconda (Python 2 for OptiType), see below)
# - environment_modules (available using environment modules see below)
dep_source = bioconda
# If hlama.dep_source is "bioconda" then this section is used for
# further configuration of the Bioconda setup.
[hlama.bioconda]
# Optional, value to prepend to $PATH for activating conda installation
prepend_path = ~/miniconda2/bin
# Name of the Conda environment to use.
env = hlama-0.1
# If hlama.dep_source is "environment_modules" then this section is
# used for further configuration in the Environment Modules setup.
[hlama.environment_modules]
# Lines to prepend to running the optitype command. Note that you can use
# multi-line strings as long as the lines starting from the second line are
# indented.
module_command = # load modules
    module purge # get rid of possible Python 3 module
    module load yara/0.9.4
    module load razers3/3.5.0
    module load optitype/2015.10.20
EOF
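The layout of this file can be illustrated with Python's standard configparser module, which accepts multi-line values when the continuation lines are indented. This is only a sketch: whether HLA-MA itself reads the file with configparser is an assumption, and load_settings and EXAMPLE_CFG are hypothetical names introduced for illustration.

```python
import configparser

# Shortened example configuration in the format shown above.
EXAMPLE_CFG = """\
[hlama]
dep_source = environment_modules

[hlama.environment_modules]
module_command = module purge
    module load yara/0.9.4
    module load razers3/3.5.0
    module load optitype/2015.10.20
"""


def load_settings(text):
    """Return (dep_source, options of the matching 'hlama.<dep_source>' section).

    Hypothetical helper; it only mirrors the structure of the file above.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    source = parser.get("hlama", "dep_source")
    section = "hlama." + source
    extra = dict(parser.items(section)) if parser.has_section(section) else {}
    return source, extra


source, extra = load_settings(EXAMPLE_CFG)
print(source)
for command in extra["module_command"].splitlines():
    print(command)
```

Indented continuation lines are joined into a single value with embedded newlines, which is why the lines after the first one must be indented.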
The input is a TSV file (any whitespace is also accepted as a delimiter) listing the donor/patient name, the sample name, the corresponding reference sample (e.g., the germline sample), the sequence type, and a comma-separated list of FASTQ files.
Files ending in _1.fq.gz and _2.fq.gz are recognized as first and second reads of a paired-end read run, as are files containing _R1_ and _R2_.
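The naming rule can be sketched in a few lines of Python. This is an illustration of the rule as stated here, not HLA-MA's actual implementation; read_number and split_pairs are hypothetical names.

```python
import re

# Patterns taken from the rule above: a suffix for *_1.fq.gz / *_2.fq.gz,
# or the tokens _R1_ / _R2_ anywhere in the file name.
FIRST_PATTERNS = (r"_1\.fq\.gz$", r"_R1_")
SECOND_PATTERNS = (r"_2\.fq\.gz$", r"_R2_")


def read_number(filename):
    """Return 1 or 2 for first/second reads, or None if neither rule applies."""
    if any(re.search(pattern, filename) for pattern in FIRST_PATTERNS):
        return 1
    if any(re.search(pattern, filename) for pattern in SECOND_PATTERNS):
        return 2
    return None


def split_pairs(filenames):
    """Split a list of file names into (first reads, second reads, others)."""
    first = [name for name in filenames if read_number(name) == 1]
    second = [name for name in filenames if read_number(name) == 2]
    other = [name for name in filenames if read_number(name) is None]
    return first, second, other


print(split_pairs(["normal_1.fq.gz", "normal_2.fq.gz"]))
```

Files matching neither rule end up in the third list; how HLA-MA treats such files (for example, as single-end reads) is not specified here.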
# cat <<"EOF" >matched.tsv
donor sample-N sample-N DNA normal_1.fq.gz,normal_2.fq.gz
donor sample-T1 sample-N DNA tumor_1.fq.gz,tumor_2.fq.gz
donor sample-T2 sample-N RNA metastasis_1.fq.gz,metastasis_2.fq.gz
EOF
# hlama --help
# hlama --tumor-normal matched.tsv --read-base-dir path/to/reads
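For readers who want to generate or check such a sheet programmatically, here is a minimal sketch of reading the five columns in Python. The Sample tuple and parse_matched are hypothetical names, and the strict five-column check is an assumption rather than documented HLA-MA behavior.

```python
from collections import namedtuple

# One row of the tumor/normal sheet: five whitespace-separated columns.
Sample = namedtuple("Sample", "donor name reference seq_type reads")


def parse_matched(lines):
    """Parse sheet lines into Sample records, skipping blank lines."""
    samples = []
    for line in lines:
        fields = line.split()  # splits on tabs and on runs of spaces
        if not fields:
            continue
        if len(fields) != 5:
            raise ValueError("expected 5 columns, got %d: %r" % (len(fields), line))
        donor, name, reference, seq_type, reads = fields
        samples.append(Sample(donor, name, reference, seq_type, reads.split(",")))
    return samples


sheet = [
    "donor sample-N sample-N DNA normal_1.fq.gz,normal_2.fq.gz",
    "donor sample-T1 sample-N DNA tumor_1.fq.gz,tumor_2.fq.gz",
]
for sample in parse_matched(sheet):
    print(sample.name, "->", sample.reference, sample.reads)
```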
The input is a PED file with an extra column for the read names.
The columns are family name, donor name, father name, mother name, sex (0 unknown, 1 male, 2 female), disease status (0 unknown, 1 unaffected, 2 affected), read names.
Note that the sex and disease status columns are ignored.
Files ending in _1.fq.gz and _2.fq.gz are recognized as first and second reads of a paired-end read run, as are files containing _R1_ and _R2_.
cat <<"EOF" >pedigree.ped
FAM offspring father mother 1 2 offspring_1.fq.gz,offspring_2.fq.gz
FAM father 0 0 1 1 father_1.fq.gz,father_2.fq.gz
FAM mother 0 0 2 1 mother_1.fq.gz,mother_2.fq.gz
EOF
# hlama --help
# hlama --pedigree pedigree.ped --read-base-dir path/to/reads
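The seven-column layout can be read with a short Python sketch like the one below. Member and parse_pedigree are hypothetical names; treating "0" as "no parent given" follows the usual PED convention and matches the example above.

```python
from collections import namedtuple

# One row of the extended PED file: six PED columns plus the read names.
Member = namedtuple("Member", "family name father mother sex disease reads")


def parse_pedigree(lines):
    """Return a dict mapping donor name to a Member record."""
    members = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 7:
            raise ValueError("expected 7 columns, got %d: %r" % (len(fields), line))
        family, name, father, mother, sex, disease, reads = fields
        members[name] = Member(
            family,
            name,
            None if father == "0" else father,  # "0" means founder / unknown
            None if mother == "0" else mother,
            int(sex),
            int(disease),
            reads.split(","),
        )
    return members


pedigree = parse_pedigree([
    "FAM offspring father mother 1 2 offspring_1.fq.gz,offspring_2.fq.gz",
    "FAM father 0 0 1 1 father_1.fq.gz,father_2.fq.gz",
    "FAM mother 0 0 2 1 mother_1.fq.gz,mother_2.fq.gz",
])
print(pedigree["offspring"].father, pedigree["offspring"].mother)
```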
To test your HLA-MA installation and run a small example, please see First steps.
HLA-MA uses the third-party tool OptiType (together with Yara and RazerS 3) for predicting the HLA types of a sample. The human leukocyte antigen (HLA) system is a gene complex important in the immune system. The HLA loci are highly variable and are thus very useful as a genetic fingerprint for identifying samples.
Three genes are considered: HLA-A, HLA-B, and HLA-C. Thus, a diploid human genome carries six HLA gene copies in total. Most HLA alleles found in the human population are known, and each is assigned an identifier, its HLA type, e.g., HLA-A*02:01. The combination of the six HLA types in a human genome is used as the fingerprint.
In matched tumor/normal samples the HLA type should be the same, or in the case of somatic mutations in the HLA loci at least very similar. Thus, a strong mismatch in HLA types indicates problems with sample matching (e.g., sample swaps).
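As a rough illustration of this comparison, the sketch below counts how many of the normal sample's HLA types have no counterpart in the tumor sample. It is not HLA-MA's actual scoring; count_mismatches and the allele lists are made up for the example.

```python
from collections import Counter


def count_mismatches(expected, observed):
    """Count alleles in `expected` that have no counterpart in `observed`.

    Both arguments are lists of HLA types; multiset semantics are used so
    that homozygous loci (the same type twice) are handled correctly.
    """
    missing = Counter(expected) - Counter(observed)
    return sum(missing.values())


normal = ["A*02:01", "A*24:02", "B*07:02", "B*35:01", "C*04:01", "C*07:02"]
tumor_ok = list(normal)
tumor_one_off = ["A*02:01", "A*24:02", "B*07:02", "B*35:03", "C*04:01", "C*07:02"]
other_donor = ["A*01:01", "A*03:01", "B*08:01", "B*44:02", "C*05:01", "C*07:01"]

print(count_mismatches(normal, tumor_ok))
print(count_mismatches(normal, tumor_one_off))
print(count_mismatches(normal, other_donor))
```

A count of zero or one is what the text describes as identical or "very similar", while a count close to six suggests a sample swap. Where exactly to draw the line is a judgment call that this sketch does not settle.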
In samples derived from related individuals the HLA types should follow the Mendelian inheritance rules. Thus, for each offspring, one copy of HLA-A should come from the biological mother and the other copy should come from the biological father (and similar for HLA-B and HLA-C).
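The inheritance rule can be written down as a small check over a trio. This is a sketch under simplifying assumptions (exactly two typed alleles per gene in the offspring, exact string matches), not HLA-MA's implementation; all function names are hypothetical.

```python
def by_gene(alleles):
    """Group HLA types such as 'A*02:01' by their gene prefix ('A')."""
    grouped = {}
    for allele in alleles:
        grouped.setdefault(allele.split("*")[0], []).append(allele)
    return grouped


def mendelian_consistent(child, father, mother):
    """Return {gene: bool} telling whether one child allele can come from
    each parent. Assumes the child has exactly two alleles per gene."""
    child_g, father_g, mother_g = by_gene(child), by_gene(father), by_gene(mother)
    result = {}
    for gene, (first, second) in child_g.items():
        pat = father_g.get(gene, [])
        mat = mother_g.get(gene, [])
        # Either assignment of the two copies to the two parents is fine.
        result[gene] = (first in pat and second in mat) or (second in pat and first in mat)
    return result


father = ["A*01:01", "A*02:01", "B*07:02", "B*08:01"]
mother = ["A*03:01", "A*24:02", "B*35:01", "B*44:02"]
child = ["A*01:01", "A*24:02", "B*07:02", "B*08:01"]  # both B copies from father

print(mendelian_consistent(child, father, mother))
```

In the example, HLA-A is consistent, while HLA-B is flagged because both copies can only be explained by the father.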
For OptiType, we observe an accuracy of 98% in HLA typing, so its results can be used for easy sanity checking of HTS samples. Generally, one considers the so-called two-digit HLA type (e.g., HLA-A*02) and the so-called four-digit HLA type (e.g., HLA-A*02:01). When there is a single mismatch in the four-digit HLA type between the actual and expected types of a sample, a two-digit match can still indicate a good match.
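The relation between the two resolutions is easy to show in code: the two-digit type is the four-digit type with everything after the first colon removed. The helpers below (two_digit, match_level) are hypothetical names used only for this sketch.

```python
def two_digit(hla_type):
    """Reduce a four-digit type such as 'HLA-A*02:01' to 'HLA-A*02'."""
    return hla_type.split(":")[0]


def match_level(expected, observed):
    """Classify a pair of HLA types by the finest resolution at which they agree."""
    if expected == observed:
        return "four-digit"
    if two_digit(expected) == two_digit(observed):
        return "two-digit"
    return "mismatch"


print(match_level("HLA-A*02:01", "HLA-A*02:01"))
print(match_level("HLA-A*02:01", "HLA-A*02:06"))
print(match_level("HLA-A*02:01", "HLA-A*24:02"))
```

A pair classified as "two-digit" corresponds to the case described above, where a four-digit mismatch may still indicate a good match.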
- Messerschmidt, C., Beule, D., Holtgrewe, M. (2016). Simple yet powerful matching of samples using HLA typing results. To appear.
Q: Should I use single-end data or paired-end data?
A: Both can be used, but we observe much better precision in OptiType with paired-end data.
Q: The image you are using is an alpaca!
A: That's true but lamas and alpacas are closely related.