HMM.Rmd

---
title: "Hidden Markov Models"
author: "liuc"
date: "11/7/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Hidden Markov Models

> https://a-little-book-of-r-for-bioinformatics.readthedocs.io/en/latest/src/chapter10.html
> https://setosa.io/ev/markov-chains/
> https://medium.com/analytics-vidhya/hidden-markov-models-for-time-series-in-r-studio-5ae2b9fb0701#:~:text=A%20Hidden%20Markov%20model%20is,is%20hidden%20from%20the%20observer.
> https://www.cnblogs.com/skyme/p/4651331.html


A Hidden Markov model is a tool for representing probability distribution over a sequence of observations. It assumes that the observation at time t was generated by some process whose state is hidden from the observer.
通过观察observation的分布，推测hidden state。Transition probabilities and Emission Probabilities分别表示 observation和hidden的事件发生概率。

隐马尔可夫模型（Hidden Markov Model，HMM）是一种统计模型，用于表示由被假定为具有隐藏或不可观测状态的马尔可夫过程生成的观测序列。该模型假设观测结果由一个潜在的过程生成，该过程在有限的一组状态之间转移。每个状态都有一个关联的概率分布，用于生成观测结果。HMM的目标是估计生成观测序列的隐藏状态序列。

常见的示例比如以小明的心情去推测天气状态，首先需要确定，小明的心情和天气是有关联的。引入序列信息，天气，即是观察，知道今天晴天、明天晴天的概率为0.8，下雨的概率为0.2；今天下雨明天下雨的概率为0.6、晴天的概率为0.4. 依据小明的心情序列去推测天气。P(S)=0.67 and P(R)=0.33. 比如知道小明三天的心情，求对应的天气。We have prior probability P(S)=0.67, Transition probability 0.2, Probability of Bob being Happy given the weather is Sunny P(H|S)=0.8 and Probability of Bob being Grumpy given the weather is Rainy P(G|R)=0.2. 

转换概率就是不同天气转换的情况，输出概率为每一个stat表现出观测值的概率。

*There are 3 types of problem that can be solved using HMM:*
1. Evaluation → Finding the probabilities of an observed sequence given an HMM model. (Forward/Backward Algorithm)
其实是检测观察到的结果和已知的模型是否吻合。如果很多次结果都对应了比较小的概率，那么就说明我们已知的模型很有可能是错的
2. Decoding → Finding the sequence of hidden states that most probably generated an observed sequence.(Viterbi Algorithm)
3. Learning → Generating a Hidden Markov Model given a sequence of Observations. (Bauch-Welch Estimation)


*一个解释对应问题的笔记摘录*
1）知道骰子有几种（隐含状态数量），每种骰子是什么（转换概率），根据掷骰子掷出的结果（可见状态链），我想知道每次掷出来的都是哪种骰子（隐含状态链）。
      这个问题呢，在语音识别领域呢，叫做解码问题。这个问题其实有两种解法，会给出两个不同的答案。每个答案都对，只不过这些答案的意义不一样。第一种解法求最大似然状态路径，说通俗点呢，就是我求一串骰子序列，这串骰子序列产生观测结果的概率最大。第二种解法呢，就不是求一组骰子序列了，而是求每次掷出的骰子分别是某种骰子的概率。比如说我看到结果后，我可以求得第一次掷骰子是D4的概率是0.5，D6的概率是0.3，D8的概率是0.2.第一种解法我会在下面说到，但是第二种解法我就不写在这里了，如果大家有兴趣，我们另开一个问题继续写吧。

2）还是知道骰子有几种（隐含状态数量），每种骰子是什么（转换概率），根据掷骰子掷出的结果（可见状态链），我想知道掷出这个结果的概率。
      看似这个问题意义不大，因为你掷出来的结果很多时候都对应了一个比较大的概率。问这个问题的目的呢，其实是检测观察到的结果和已知的模型是否吻合。如果很多次结果都对应了比较小的概率，那么就说明我们已知的模型很有可能是错的，有人偷偷把我们的骰子給换了。

3）知道骰子有几种（隐含状态数量），不知道每种骰子是什么（转换概率），观测到很多次掷骰子的结果（可见状态链），我想反推出每种骰子是什么（转换概率）。
      这个问题很重要，因为这是最常见的情况。很多时候我们只有可见结果，不知道HMM模型里的参数，我们需要从可见结果估计出这些参数，这是建模的一个必要步骤。


比如，在基因组学领域，我们欲计算基因突变的频率，在测序样本比对后得到的观察突变频率的基础上(可以理解为初始概率)，通过HMM可以推测出潜在序列突变频率。


```{r, include=FALSE}

library(tidyverse)
library(easystats)
library(depmixS4)
# library(HMM)

```


### 一个示例

```{r data}

data(speed)

head(speed)

speed %>% janitor::tabyl(corr)

speed %>% janitor::tabyl(prev)
```

`depmix` creates an object of class depmix, a dependent mixture model, otherwise known as hidden Markov model. 
```{r}

# HMM with depmix
# response 指的是，观测序列 ~ 隐状态
# 建立观测序列和隐状态之间的关系，其中 "~" 表示关系符号，"|" 表示条件符号
# 这里使用 list 函数表示两个观测序列和对应的隐状态，需要注意的是，每个观测序列对应一个隐状态序列

# family,分布族指定每个观测序列的分布族，例如二项分布、高斯分布等,这里使用 list 函数表示两个观测序列对应的分布族，需要注意的是，每个观测序列对应一个分布族

# nstates: 隐状态数目, 指定隐状态的数目，这个数目需要根据实际问题和数据来确定
# instart: 初始隐状态概率,指定每个隐状态的初始概率，这里使用 c 函数表示一个包含两个元素的向量，表示两个隐状态的概率,需要注意的是，初始概率和转移概率一起构成了隐状态序列的生成模型

# transition: 隐状态转移概率, 指定隐状态之间的转移概率，这里使用 ~1 表示每个隐状态之间的转移概率相等
# 需要注意的是，转移概率和初始概率一起构成了隐状态序列的生成模型

# 本模型只考虑rt序列，即是拟合推测本序列数据的隐序列，隐状态假设有两个，算是一个Learning的问题
mod <- depmix(response = rt ~ 1,
              data = speed,
              nstates = 2,
              family = gaussian()  # use gaussian() for normally distributed data
              )

fit_mod <- fit(mod)
fit_mod

```

```{r}
# mod

summary(fit_mod)
```
两个Stat，St1(5.510 +- 0.192), St2(6.385 +- 0.244)


```{r}
# transition 概率
# from 
summary(fit_mod, which = "transition")

```

```{r}
# extract the estimated state sequence
# state_seq <- apply(fit_mod@posterior, 1, which.max)
state_seq <- apply(fit_mod@posterior[,,1], 1, which.max)

# plot the estimated state sequence over time
plot(state_seq, type = "o", col = c("blue", "red"), xlab = "Time", ylab = "State")
legend("topright", c("Wild Type", "Mutant"), col = c("blue", "red"), lty = 1)
```

```{r}
# predict the states by estimating the posterior
est_states <- posterior(fit_mod)

head(est_states)
```
*interpret the result: *`state`列为两个stat，S1为stat1的概率，S2为stat2的概率。


### 分析基因序列的示例

隐马尔可夫在生物信息中的应用主要是用于序列性数据的分析。比如DNA测序后call mutation。

```{r, include=FALSE}
library(Biostrings)
```


```{r}
# seqs <- readDNAStringSet("sequences.fasta")
seqs <- DNAString("ATCGAACGATCG")

ref_seq <- DNAString("ATCGATCGATCG")
aligned_seqs <- pairwiseAlignment(seqs, ref_seq)

# mut_positions <- which(aligned_seqs@subject != aligned_seqs@pattern)

# Extract positions of mismatches
align_mat <- as.character(aligned_seqs)
mut_positions <- which(align_mat[2, ] != "-")
mut_positions

```


For example, you could define a model with two states: a "wild-type" state and a "mutant" state, where the mutant state has a higher probability of emitting a sequence that contains a mutation.
```{r}
model <- depmix(response = list(DNAStringSet(seqs)),
                transition = ~1,
                family = list(multinomial(c(0.9, 0.1)), multinomial(c(0.2, 0.8))),
                nstates = 2)

```


Fit the model to your data using the fit function. You may need to experiment with different starting values or optimization algorithms to ensure convergence.
```{r}
fit <- fit(model, data = list(DNAStringSet(seqs)), verbose = TRUE)

```


Interpret the results of the model fit. This may involve identifying the most likely sequence of states, examining the emission probabilities for each state, or comparing models with different numbers of states.
```{r}
states <- viterbi(fit)
mutant_seqs <- seqs[states == 2]
wildtype_seqs <- seqs[states == 1]

```