series.tex

% Options for packages loaded elsewhere
\PassOptionsToPackage{unicode}{hyperref}
\PassOptionsToPackage{hyphens}{url}
%
\documentclass[
]{book}
\usepackage{amsmath,amssymb}
\usepackage{iftex}
\ifPDFTeX
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
  \usepackage{textcomp} % provide euro and other symbols
\else % if luatex or xetex
  \usepackage{unicode-math} % this also loads fontspec
  \defaultfontfeatures{Scale=MatchLowercase}
  \defaultfontfeatures[\rmfamily]{Ligatures=TeX,Scale=1}
\fi
\usepackage{lmodern}
\ifPDFTeX\else
  % xetex/luatex font selection
\fi
% Use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\IfFileExists{microtype.sty}{% use microtype if available
  \usepackage[]{microtype}
  \UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\makeatletter
\@ifundefined{KOMAClassName}{% if non-KOMA class
  \IfFileExists{parskip.sty}{%
    \usepackage{parskip}
  }{% else
    \setlength{\parindent}{0pt}
    \setlength{\parskip}{6pt plus 2pt minus 1pt}}
}{% if KOMA class
  \KOMAoptions{parskip=half}}
\makeatother
\usepackage{xcolor}
\usepackage{longtable,booktabs,array}
\usepackage{calc} % for calculating minipage widths
% Correct order of tables after \paragraph or \subparagraph
\usepackage{etoolbox}
\makeatletter
\patchcmd\longtable{\par}{\if@noskipsec\mbox{}\fi\par}{}{}
\makeatother
% Allow footnotes in longtable head/foot
\IfFileExists{footnotehyper.sty}{\usepackage{footnotehyper}}{\usepackage{footnote}}
\makesavenoteenv{longtable}
\usepackage{graphicx}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
% Set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother
\setlength{\emergencystretch}{3em} % prevent overfull lines
\providecommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{5}
\newlength{\cslhangindent}
\setlength{\cslhangindent}{1.5em}
\newlength{\csllabelwidth}
\setlength{\csllabelwidth}{3em}
\newlength{\cslentryspacingunit} % times entry-spacing
\setlength{\cslentryspacingunit}{\parskip}
\newenvironment{CSLReferences}[2] % #1 hanging-ident, #2 entry spacing
 {% don't indent paragraphs
  \setlength{\parindent}{0pt}
  % turn on hanging indent if param 1 is 1
  \ifodd #1
  \let\oldpar\par
  \def\par{\hangindent=\cslhangindent\oldpar}
  \fi
  % set entry spacing
  \setlength{\parskip}{#2\cslentryspacingunit}
 }%
 {}
\usepackage{calc}
\newcommand{\CSLBlock}[1]{#1\hfill\break}
\newcommand{\CSLLeftMargin}[1]{\parbox[t]{\csllabelwidth}{#1}}
\newcommand{\CSLRightInline}[1]{\parbox[t]{\linewidth - \csllabelwidth}{#1}\break}
\newcommand{\CSLIndent}[1]{\hspace{\cslhangindent}#1}
\ifLuaTeX
  \usepackage{selnolig}  % disable illegal ligatures
\fi
\IfFileExists{bookmark.sty}{\usepackage{bookmark}}{\usepackage{hyperref}}
\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
\urlstyle{same}
\hypersetup{
  pdftitle={BIOL 350: Bioinformatics},
  pdfauthor={William Letsou},
  hidelinks,
  pdfcreator={LaTeX via pandoc}}

\title{BIOL 350: Bioinformatics}
\author{William Letsou}
\date{2024-04-04}

\begin{document}
\maketitle

{
\setcounter{tocdepth}{1}
\tableofcontents
}
\hypertarget{intro}{%
\chapter{Introduction}\label{intro}}

Welcome to BIOL-350 (Spring 2024) at New York Tech!~ For this set of lectures, follow along with the slides \href{https://wletsou.github.io/BIOL-350/Biol\%20350\%20slides.pdf}{here}.~ Each chapter describes the theory and practice of the step in an analysis pipeline for you to conduct your own genetic association study.

\hypertarget{principal-components-analysis}{%
\chapter{Principal Components Analysis}\label{principal-components-analysis}}

This week we'll see how individuals can be separated by genetic ancestry using principal components analysis.~ We'll practice applying PCA on a subset of the 1KGP data.

\hypertarget{pricipal-components-analysis-theory}{%
\section{Pricipal components analysis: theory}\label{pricipal-components-analysis-theory}}

The populations in our dataset can be separated into clusters based on their genotypes.~ The inferred groups help control for confounding due to ancestry, and are also more reliable than self-reported race in association studies.~ To see how it works, suppose \(\mathbf{X}\) is an \(n\times m\) (standardized) genotype matrix with individuals down the rows and SNPs across the columns.~ Principal components analysis (PCA) says we can find an \(m\times n\) matrix \(\mathbf{V}\), a diagonal \(n\times n\) matrix \(\mathbf{\Sigma}\), and an \(n\times n\) matrix \(\mathbf{U}\) satisfying \begin{equation}\mathbf{X}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T.\tag{1}\end{equation} If we think of \(\mathbf{V}\) as the the (standardized) SNP genotypes of an ``ideal'' person of a certain ancestry, then \begin{equation}x_{j\cdot} v_{\cdot i}=u_{ji}\lambda_{jj}\end{equation} represents the amount of idealized person \(i\) in actual person \(j\), up to some proportionality constant \(\lambda_{jj}\) that depends on the ancestry.~ Then rows \(j\) of \(\mathbf{U}\) are the \emph{ancestries} of person \(j\), and columns \(i\) of \(\mathbf{U}\) are the ancestries of each individual on ancestry \(i\).~ It is important to remember that these ``idealized'' ancestries do not necessarily correspond with our preconceived notions of ancestry, so we cannot interpret them as ``European,'' ``African,'' or ``Asian,'' say.~ If we rearrange Eq. (1) and use the fact that the columns of \(\mathbf{U}\) and \(\mathbf{V}\) are \emph{orthonormal}, we can find that \begin{equation}\mathbf{X}\mathbf{X}^T\mathbf{U}=\mathbf{U}\mathbf{\Sigma}^2,\end{equation} meaning that the columns of \(\mathbf{U}\) are the eigenvectors of the matrix \begin{equation}\frac{1}{m}\mathbf{X}\mathbf{X}^T\tag{2}\end{equation} whose \(\left(i,j\right)\) entry is the genetic correlation between individuals \(i\) and \(j\), sometimes known as the \emph{genomic relationship matrix} or GRM.~ PCA works by finding the first several eigenvectors of the GRM and plotting each individual's ancestry along each orthogonal vector in a rectangular grid.~ Clusters of individuals in this grid represent distinct ancestry groups.

\hypertarget{principal-components-analysis-practice}{%
\section{Principal components analysis: practice}\label{principal-components-analysis-practice}}

\hypertarget{importing-the-data}{%
\subsection{Importing the data}\label{importing-the-data}}

To do PCA in R, we'll need to load the libraries SNPRelate and SeqArray:

\begin{verbatim}
library(SNPRelate)
library(SeqArray)
\end{verbatim}

Download the \href{https://github.com/wletsou/BIOL-350/raw/master/docs/CHB\%2BYRI\%2BCEU.chr1.vcf.gz}{chr1 vcf file} containing just the CEU, YRI, and CHB populations.~ Once you have the file, store its name as a variable:

\begin{verbatim}
vcf <- "path/to/file.vcf.gz"
\end{verbatim}

Now we'll convert the vcf format to gds format, retaining the base filename and changing the vcf.gz extension to gds.~ (This may take a minute to complete.)~ Then we'll import the gds file as a gds object.

\begin{verbatim}
seqVCF2GDS(vcf.fn = vcf,"path/to/file.gds") # convert vcf to gds with a new file name
genofile <- seqOpen("path/to/file.gds") # import the gds object
\end{verbatim}

You can can see the various fields under genofile by printing it.~ To access the data in one of the fields, do

\begin{verbatim}
seqGetData(genofile,"sample.id") # view the sample ids
\end{verbatim}

where the name of the field is enclosed in quotes.

\hypertarget{running-pca}{%
\subsection{Running PCA}\label{running-pca}}

To run PCA in R, simply do

\begin{verbatim}
pca <- snpgdsPCA(genofile) # runs PCA
\end{verbatim}

to create an objects with 32 eignevectors of the GRM.~ Make a data frame of the first several eigenvectors along with subject ids:

\begin{verbatim}
df.pca <- data.frame(sample = pca$sample.id,EV1 = pca$eigenvect[,1],EV2 = pca$eigenvect[,2],EV3 = pca$eigenvect[,3],stringsAsFactors = FALSE)
\end{verbatim}

We'll plot individuals along EV1, EV2, and EV3 in several two-dimensional projections.~ But we'll want to see how the clustering done by PCA corresponds to individuals' self-reported race; for that we'll need another column in our data frame.

\hypertarget{getting-population-labels}{%
\subsection{Getting population labels}\label{getting-population-labels}}

The \href{https://raw.githubusercontent.com/wletsou/BIOL-350/master/docs/CHB\%2BYRI\%2BCEU.txt}{indivs file} contains each subject id along with its 1KGP population group.~ Let's import it now:

\begin{verbatim}
indivs <- read.table("path/to/CHB+YRI+CEU.txt",header = FALSE)
colnames(indivs) <- c("id","pop")
\end{verbatim}

The second field of this table is pop, an assignment to each id of one of three 1KG population groups.~ We want to match the right ID in indivs to the right ID in df.pca so that we can color our PCA plots by population.~ If the tables are in the same order, matching will be easy, but it not, we have to use the match(x,y) function, which finds the positions in y corresponding to the same items in x.~ Thus we can make a new column pop in df.pca with the corresponding pop values from indivs by

\begin{verbatim}
df.pca$pop[match(indivs$id,df.pca$sample)] <- indivs$pop # find the population group of each individual in df.pca
\end{verbatim}

\hypertarget{plotting}{%
\subsection{Plotting}\label{plotting}}

Now that we have a column of population labels, we can make a scatter plot colored by treating the pop column as vector of factors; we can get the unique values of a factor vector by applying the function levels() to it.~ A plot of the second principal component vs.~the first can then be generated by

\begin{verbatim}
par(mar = c(5.1,5.1,4.1,2.1) ) 
plot(df.pca$EV1,df.pca$EV2,pch = 19,col = factor(df.pca$pop),xlab = "PC1",ylab = "PC2",cex.lab = 1.5,cex.axis = 1.5,cex.main = 1.5) # plot of PC2 vs. PC1
legend("topright",legend = levels(factor(df.pca$pop)),bty = "n",pch = 19,col = factor(levels(factor(df.pca$pop))),pt.cex = 1,cex = 1.5,x.intersp = 0.2) # with a legend
\end{verbatim}

Move the legend around if it covers any points, and make similar plots for the other two comparisons between the first three PCs.

Finally, close the connection to the gds file when you are done:

\begin{verbatim}
seqClose(genofile)
\end{verbatim}

\hypertarget{to-turn-in}{%
\section{To turn in:}\label{to-turn-in}}

Make three (nicely formatted) plots of:

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
  PC2 vs.~PC1
\item
  PC3 vs.~PC1
\item
  PC3 vs.~PC2
\end{enumerate}

For each plot, discuss:

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
  Whether the populations appear to be well separated in PCA space
\item
  What the gradients of the different PCs represent, that is, what axis of variation each PC appears to explain
\item
  How to subset your df.pca data frame to isolate individuals of each population (i.e., provide code)
\end{enumerate}

\begin{center}\rule{0.5\linewidth}{0.5pt}\end{center}

title: Kinship analysis
output: bookdown::html\_document2
\_\_\_

\hypertarget{kinship-analysis}{%
\chapter{Kinship Analysis}\label{kinship-analysis}}

\hypertarget{kinship-theory}{%
\section{Kinship: theory}\label{kinship-theory}}

Kinship can be defined as the expected fraction of alleles that two individuals got from the same ancestor(s).~ We say that two individuals share an allele of a SNP \emph{identical-by-descent} or \emph{IBD} if they inherited the same copy of the allele from a common ancestor.~ IBD-sharing is different from simply carrying the same allele of a gene (known as \emph{identical-by-state} or \emph{IBS}-sharing), which unrelated individuals may do if the allele is common enough in the population.~ The \emph{degree} \(R\) of relationship may be defined as the effective number of meioses separating the relatives through the equation
\begin{equation}
\frac{1}{2^R}=\frac{1}{2^{R_1}}+\frac{1}{2^{R_2}},
\label{eq:eqn}
\end{equation}
in which \(R_i\) is the number of meioses separating the relatives through the first relative's \(i\)th parent.~ For example, sibs are connected by two meioses through two parents, while a parent and child are connected by one meiosis through one parent: both relationships are degree-1.

The probability that two relatives share an allele IBD is \(\frac{1}{2^R}\), as there is a \(\frac{1}{2}\) probability that an allele is passed on in any meiosis, and \(R\) is the effective number of meioses or steps between the relatives.~ If \(2\times\frac{1}{2^R}\) is the expected number of alleles shared IBD at any given locus, then the fraction of the genome shared by any two relatives is
\begin{equation}
r=\frac{2\times\frac{1}{2^R}}{2}=\frac{1}{2^R}.
\label{eq:2}
\end{equation}

However, genomic sharing can be realized in different ways depending on the probabilities \(\pi_0\), \(\pi_1\), and \(\pi_2\) that individuals share zero, one, or two copies IBD at a locus.~ The probability that the relatives inherit both both copies IBD, viz.,\begin{equation}
\pi_2=P\left(\text{share 2 IBD}\right)=2^2\frac{1}{2^{R_1}2^{R_2}},
\label{eq:3a}
\end{equation}is simply the product of the probabilities of sharing through both parents.~ The probability of sharing exactly one allele IBD,, viz.,\begin{equation}
\pi_1=P\left(\text{share 1 IBD}\right)=2\left(\frac{1}{2^{R_1}}+\frac{1}{2^{R_2}}\right)-2^3\frac{1}{2^{R_1}2^{R_2}},
\label{eq:3b}\end{equation}is the got by finding the probability \(2\left(\frac{1}{2^{R_1}}+\frac{1}{2^{R_2}}\right)-2^2\frac{1}{2^{R_1}2^{R_2}}\) of sharing at least one allele IBD less the probability \(\pi_2\) of sharing two.~ Finally, the probability of sharing at zero alleles IBD, viz.,\begin{equation}
\pi_0=P\left(\text{share 0 IBD}\right)=1-2\frac{1}{2^{R_1}}-2\frac{1}{2^{R_2}}+2^2\frac{1}{2^{R_1}2^{R_2}},
\label{eq:3c}
\end{equation}is got by subtracting the probability \(\pi_1+\pi_2\) of sharing at least one allele IBD from 1.~ The coefficients account for the fact that there are \(2\) alleles at each locus and \(2^2\) that can be shared.

From \eqref{eq:3}--\eqref{eq:3c}, the fraction of the genome shared IBD is\begin{equation}
r=\frac{2\pi_2+1\pi_1}{2}=\frac{1}{2^{R_1}}+\frac{1}{2^{R_2}}=\frac{1}{2^R}.
\label{eq:4}
\end{equation}But the same value of \(r\) can obtain from different values of \(\pi_1\) and \(\pi_2\).~ For example, full sibs have a 25\% probability of sharing two alleles and a 50\% chance of sharing one allele at a locus, for a total fraction \(r=0.5\) shared.~ But a parent and child have 0\% chance of sharing two alleles and a 100\% chance of sharing one, also giving \(r=0.5\).~ Thus, your parent does is not equal your sibling, despite the fact of your sharing equal amounts of your genome with each of them.~ Put another way, parents cannot pass on their genotypes to their offspring.

\hypertarget{king}{%
\subsection{KING}\label{king}}

KING {[}\protect\hyperlink{ref-manichaikul_robust_2010}{1}{]} computes both the probability \(\pi_0\) that two relatives share 0 alleles IBD as well as the coefficient of relatedness \(\phi=\frac{r}{2}\), defined as the probability that two alleles taken one from each relative are IBD at a locus (the maximum probability is \(\frac{1}{2}\) because there is a 50\% chance that the alleles chosen come from different parents).~ The idea is to compare the counts \(X\) and \(Y\) of the alternative alleles which two individuals each have at a genetic locus.~ If the pair are from a single ancestral population, the expected values and variances of the allele counts are \(\mathbb{E}\left(X\right)=\mathbb{E}\left(Y\right)=2p\) and \(\sigma_X^2=\mathbb{E}\left(X^2\right)-\mathbb{E}\left(X\right)^2=\mathbb{E}\left(Y^2\right)-\mathbb{E}\left(Y\right)^2=2p\left(1-p\right)\).~ Thus the expected value of the difference \(\mathbb{E}\left(\left(X-Y\right)^2\right)=\mathbb{E}\left(X^2\right)+\mathbb{E}\left(Y^2\right)-2\mathbb{E}\left(XY\right)\) is\begin{equation}\frac{\mathbb{E}\left(\left(X-Y\right)^2\right)}{\sigma_X^2+\sigma_Y^2}=1-\frac{\sigma_{XY}}{\sigma_X\sigma_Y}=1-r,\label{eq:5}\end{equation}where \(\sigma_{XY}=\mathbb{E}\left(XY\right)-\mathbb{E}\left(X\right)\mathbb{E}\left(Y\right)\) is the covariance of the genotype counts and \(r_{ij}=2\phi_{ij}\) is the genetic correlation between individuals \(i\) and \(j\); the latter can be interpreted as the amount of the genome shared IBD.

KING estimates \(\phi\) from Eq. \eqref{eq:5} by counting the number \(N\) of loci at which two individuals are heterozygous \(Aa,Aa\) or opposite homozygous \(AA,aa\), as well as the total number of alleles at which each individual is heterozygous \(Aa\):\begin{equation}\hat{\phi_{ij}}=\frac{N_{Aa,Aa}-2N_{AA,aa}}{N_{Aa}^{\left(i\right)}+N_{Aa}^{\left(j\right)}}.\label{eq:6}\end{equation}From (6) it can be seen that shared heterozygous sites increase the estimated relatedness, and that unshared homozygous sites decrease relatedness.~ Eq. \eqref{eq:6} is called a ``robust'' estimator because it measures relatedness in a purely pairwise fashion: it does not rely on population estimates of allele frequencies.~ However, if the individuals are not of the same genetic background, the allele frequency \(p\) is not well-defined and Eq. \eqref{eq:5} does not hold, leading to negative estimates of \(\phi\); this feature is not necessarily a problem, as it helps us to distinguish different ancestries within a single population.

\hypertarget{pc-air}{%
\subsection{PC-AiR}\label{pc-air}}

KING provides an estimate of relatedness.~ But as cryptic relatedness can skew ancestry estimates determined by PCA, we should use kinship to correct PCA.~ This is where principal components analysis in related samples, or PC-AiR {[}\protect\hyperlink{ref-conomos_robust_2015}{2}{]} comes in handy.~PC-AiR identifies a subset of \(\mathcal{U}\) of individuals who are unrelated to each other (via negative KING kinship estimates), but maximally related to individuals in a remaining set \(\mathcal{R}\).~Recalling the math in \href{}{Week 1}, the matrix\begin{equation}\frac{1}{m}\mathbf{W}=\mathbf{X}^T\mathbf{U}=\mathbf{V}\mathbf{\Sigma},\label{eq:7}\end{equation}so that \(\frac{1}{m}w_{jk}\lambda_{kk}^{-1}=x_{ij}u_{ik}\lambda_{kk}^{-1}=v_{jk}\) is the genotype at SNP \(k\) in ancestry \(j\) inferred from subjects' genotypes \(\mathbb{X}\) who are in subset \(\mathcal{U}\).~ Then if we have an \(m\times n_u\) genotype matrix \(\mathbf{X}'\) of individuals in the subset \(\mathcal{R}\), we can ``apply'' \(\mathbf{W}\) to get an estimate of the ``ancestry-corrected'' PCs by:\begin{equation}\frac{1}{m}\mathbf{X}'\mathbf{W}\mathbf{\Sigma}^{-1}=\mathbf{X}'\mathbf{V}.\label{eq:8}\end{equation}Eq. \eqref{eq:8} is the matrix \(x'_{i\cdot}v_{\cdot j}\) of similarities between subjects \(i\) in set \(\mathcal{R}\) and ancestry \(j\).~ In this way, we have estimated genotype principal components for all individuals (i.e., those in \(\mathcal{U}\) and \(\mathcal{R}\)) without the problem confounding due to cryptic relatedness.

\hypertarget{pc-relate}{%
\subsection{PC-Relate}\label{pc-relate}}

Once we have ancestry-corrected principal components for all individuals, we can obtain ``corrected'' expected genotypes \(\mathbb{E}\left(X_{ik}\right)=2p_{ik}\) of SNPs \(k\) for populations of admixed individuals \(i\) using a method known as PC-Relate {[}\protect\hyperlink{ref-conomos_model-free_2016}{3}{]}.~ Because in populations with both shared genetic ancestry and cryptic relatedness, we can use the PC estimates \(v_{kj}\) of the allele frequencies in each population \(j\) to express the scaled genotypes \(x_{ij}=\frac{g_{ik}-2p_k}{2p_k\left(1-p_k\right)}\) as\begin{equation}\mathbb{E}\left(x_{ik}\mid u_{ij}\right)=u_{ij}\lambda_{jj}v_{kj},\label{eq:9}\end{equation}or in terms of the measured genotypes:\begin{equation}\mathbb{E}\left(g_{ik}\mid u_{ij}\right)=2p_k+u_{ij}\lambda_{jj}v_{kj}\cdot 2p_k\left(1-p_k\right),\label{eq:10}\end{equation} in which \(p_k\) is the allele frequency estimated from the entire sample.~ Eq. \eqref{eq:10} says that if we make a graph of each individual's genotype \(g_{ik}=0,1,2\) at SNP \(k\) vs.~the (scaled) of the amount the individual's population-\(j\) genetic ancestry, the slope \(\lambda_{jj}v_{kj}2p_k\left(1-p_k\right)\) should be the expected allele frequency in population \(j\).~ In practice, the slopes are obtained from linear regression of the observed genotypes on genetic PCs, and the updated expected allele frequencies \(\mathbb{E}\left(g_{ik}\mid u_{ij}\right)=2p_{ik}\) of SNPs \(k\) for individuals \(i\) are found by the corresponding value on the best-fit hyperplane, given individual \(i\)'s genetic ancestry \(u_{ij}\).~ We can then obtain a new genetic relationship matrix:\begin{equation}2\hat{\varphi}_{ij}=\frac{\sum_k\left(g_{ik}-2p_{ik}\right)\left(g_{jk}-2p_{jk}\right)}{2\sqrt{p_{ik}\left(1-p_{ik}\right)p_{jk}\left(1-p_{jk}\right)}}.\label{eq:11}\end{equation}This new GRM will be used in \href{}{Week 3} when we attempt to fit a linear mixed model for the effect of genotype on disease risk; for now we will use it to make an updated table of the relationship between individuals.

\hypertarget{kinship-practice}{%
\section{Kinship: practice}\label{kinship-practice}}

\hypertarget{importing-data}{%
\subsection{Importing data}\label{importing-data}}

Now we'll try kinship analysis for ourselves using a simulated GWAS dataset.~ This dataset was created using the program bioGWAS {[}\protect\hyperlink{ref-changalidis_biogwas_2023}{4}{]}, which uses other programs to simulate haplotypes {[}\protect\hyperlink{ref-su_hapgen2_2011}{5}{]} and phenotypes {[}\protect\hyperlink{ref-meyer_phenotypesimulator_2018}{6}{]} based on a set of input variants previously associated with breast cancer {[}\protect\hyperlink{ref-michailidou_genome-wide_2015}{7}{]} and made available by the MRC Interactive Epidemiology Unit's Open GWAS Project {[}\protect\hyperlink{ref-elsworth_mrc_2020}{8}{]}.~ This particular \href{https://github.com/wletsou/BIOL-350/raw/master/docs/EUR_BCa.vcf.gz}{vcf file} contains 1000 chromosome 10 genotypes at 345,130 variants (304,770 of which are SNPs) generated from 263 1KGP females of European origin.~ Download and unzip the file:

\begin{verbatim}
gunzip EUR_BCa.vcf.gz
\end{verbatim}

and move it to your desired location.~ The unzipped file is readable in plain text format, although it is very big.

Next load the following R packages:

\begin{verbatim}
library(gdsfmt)
library(GWASTools)
library(SNPRelate)
library(GENESIS)
\end{verbatim}

Our first task will be to import the vcf file and convert it to gds format.~ First specify the directory in which you saved your EUR\_BCa.vcf.~ Then define two file names:

\begin{verbatim}
vcf.fn <- sprintf("%s/pat_filt_sim.vcf",directory) # vcf file name (exists already)
gds.fn <- sprintf("%s/pat_filt_sim.gds",directory) # gds file name (about to be created)
\end{verbatim}

Next perform the conversion and importation into R:

\begin{verbatim}
snpgdsVCF2GDS(vcf.fn,gds.fn) # create gds file from vcf
genofile <- snpgdsOpen(gds.fn,readonly = FALSE) # import gds object
\end{verbatim}

The GDS format is an efficient representation of vcf files which does not require that the entire file be loaded into memory at once {[}\protect\hyperlink{ref-zheng_high-performance_2012}{9}{]}.~ By typing

\begin{verbatim}
genofile
\end{verbatim}

you can see all the fields under your gds object.~ To access one of these nodes, type

\begin{verbatim}
index.gdsn(genofile,"sample.id") # vector of sample names
\end{verbatim}

or

\begin{verbatim}
index.gdsn(genofile,"genotype",start = c(1,1), count = c(10,10)) # 10 × 10 sample of the SNP genotype matrix
\end{verbatim}

This second line uses the start and count options to extract a range of data from the 1000 × 304770 matrix og genotypes.~ If you have a vector of phenotypes (as will will in \href{}{Week 3}), you can add it as a node in your gds:

\begin{verbatim}
add.gdsn(genofile,"phenotype",val = fam$Phenotype) # add vector of phenotypes from a table
\end{verbatim}

\hypertarget{pca-and-ld-pruning}{%
\subsection{PCA and LD-pruning}\label{pca-and-ld-pruning}}

For now, we can run PCA on the genotype matrix by simply typing

\begin{verbatim}
pca <- snpgdsPCA(genofile) # principal components analysis
\end{verbatim}

We do not even have to specify the genotype node of genofile.~ Make a plot of the first few PCs as we did in \href{}{Week 1}.

There are 304770 SNPs in this dataset; many are in LD with their neighbors.~ To find a set of independent SNPs, which are not in LD (\(r>0.10\)) inside a sliding window of size ten million bp, we run:

\begin{verbatim}
set.seed(566) # set random seed to initiate pruning process
snpset <- snpgdsLDpruning(genofile,method = "corr", slide.max.bp = 10e6,ld.threshold = sqrt(0.1), verbose = FALSE) # LD prining
pruned <- unlist(snpset, use.names = FALSE) # set of independent SNPs
\end{verbatim}

\hypertarget{king-ibd-and-kinship-calculation}{%
\subsection{KING IBD and kinship calculation}\label{king-ibd-and-kinship-calculation}}

With our set of pruned SNPs, we can run KING to compute the pairwise kinship coefficient (\(\varphi_{ij}=\frac{1}{2}r_{ij}\)).~ We will extract the kinship matrix from the created ibd object and rename the rows and columns with the ids of the samples:

\begin{verbatim}
ibd <- snpgdsIBDKING(genofile,snp.id = pruned) # KING kinship estimation
colnames(ibd$kinship) <- ibd$sample.id
rownames(ibd$kinship) <- ibd$sample.id
\end{verbatim}

Print a 10 × 10 sample of the kinship matrix using:

\begin{verbatim}
ibd$kinship[1:10,1:10]
\end{verbatim}

Do the results make sense?~ Hint: what should an individual's kinship be with itself?

Finally, we will make a plot of kinship vs.~the fraction of the genome shared IBS = 0.~ This will take a minute or two to plot, because the data consist of all pairwise observations.

\begin{verbatim}
par(mar = c(5.1,5.1,4.1,2.1) ) # left default plus one
plot(ibd$IBS0,ibd$kinship,ylab = "Kinship coeffecient",xlab = "IBS0",main = "KING relatedness estimation", ylim = c(0,1),pch = 19,cex.lab = 1.5,cex.axis = 1.5,cex.main = 1.5) # plot kinship coefficient vs. proportion of SNPs not shared IBS
\end{verbatim}

You should notice that related individuals have a low fraction of IBS = 0, whereas unrelated individuals have a larger proportion.~ The graph should be approximately linear, but with discontinuities as the relationsips become closer.\&nsbp; There should not be many close relatives in this dataset.\&nsbp; What, for example, is the largest inferred value of \(\varphi\)?

In addition, if you remove the ylim = c(0,1) option, you will see a large cluster of negative kinship coefficients.~ These values are expected if the (simulated) individuals are from different genetic backgrounds.

Before proceding to the next step, close your gds file:

\begin{verbatim}
closefn.gds(genofile)
\end{verbatim}

\hypertarget{references}{%
\chapter{References}\label{references}}

\hypertarget{refs}{}
\begin{CSLReferences}{0}{0}
\leavevmode\vadjust pre{\hypertarget{ref-manichaikul_robust_2010}{}}%
1. Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen W-M. \href{https://doi.org/10.1093/bioinformatics/btq559}{Robust relationship inference in genome-wide association studies}. Bioinformatics. 2010;26:2867--73.

\leavevmode\vadjust pre{\hypertarget{ref-conomos_robust_2015}{}}%
2. Conomos MP, Miller MB, Thornton TA. \href{https://doi.org/10.1002/gepi.21896}{Robust {Inference} of {Population} {Structure} for {Ancestry} {Prediction} and {Correction} of {Stratification} in the {Presence} of {Relatedness}}. Genetic Epidemiology. 2015;39:276--93.

\leavevmode\vadjust pre{\hypertarget{ref-conomos_model-free_2016}{}}%
3. Conomos MP, Reiner AP, Weir BS, Thornton TA. \href{https://doi.org/10.1016/j.ajhg.2015.11.022}{Model-free {Estimation} of {Recent} {Genetic} {Relatedness}}. The American Journal of Human Genetics. 2016;98:127--48.

\leavevmode\vadjust pre{\hypertarget{ref-changalidis_biogwas_2023}{}}%
4. Changalidis AI, Alexeev DA, Nasykhova YA, Glotov AS, Barbitoff YA. \href{https://doi.org/10.3390/biology13010010}{{bioGWAS}: {A} {Simple} and {Flexible} {Tool} for {Simulating} {GWAS} {Datasets}}. Biology. 2023;13:10.

\leavevmode\vadjust pre{\hypertarget{ref-su_hapgen2_2011}{}}%
5. Su Z, Marchini J, Donnelly P. \href{https://doi.org/10.1093/bioinformatics/btr341}{{HAPGEN2}: Simulation of multiple disease {SNPs}}. Bioinformatics. 2011;27:2304--5.

\leavevmode\vadjust pre{\hypertarget{ref-meyer_phenotypesimulator_2018}{}}%
6. Meyer HV, Birney E. \href{https://doi.org/10.1093/bioinformatics/bty197}{{PhenotypeSimulator}: {A} comprehensive framework for simulating multi-trait, multi-locus genotype to phenotype relationships}. Bioinformatics. 2018;34:2951--6.

\leavevmode\vadjust pre{\hypertarget{ref-michailidou_genome-wide_2015}{}}%
7. Michailidou K, BOCS, kConFab Investigators, AOCS Group, NBCS, GENICA Network, et al. \href{https://doi.org/10.1038/ng.3242}{Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast cancer}. Nature Genetics. 2015;47:373--80.

\leavevmode\vadjust pre{\hypertarget{ref-elsworth_mrc_2020}{}}%
8. Elsworth B, Lyon M, Alexander T, Liu Y, Matthews P, Hallett J, et al. \href{https://doi.org/10.1101/2020.08.10.244293}{The {MRC} {IEU} {OpenGWAS} data infrastructure}. 2020.

\leavevmode\vadjust pre{\hypertarget{ref-zheng_high-performance_2012}{}}%
9. Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. \href{https://doi.org/10.1093/bioinformatics/bts606}{A high-performance computing toolset for relatedness and principal component analysis of {SNP} data}. Bioinformatics. 2012;28:3326--8.

\end{CSLReferences}

\end{document}