Alignment process #896

David8390 · 2023-10-30T00:35:02Z

Hi @rob-p ,
I looked at discussions here and found very useful questions and answers.
I would like to thank you for making such a profound package.
I have a question about the alignment process and I will be very thankful if you could help me.

I understand that Salmon does "lightweight" alignment using probabilities and distributions.

However, the Salmon paper defines a matrix Z, with entries z_ij, where z_ij is 1 if fragment j derives from transcript i. How does Salmon determine, at the first step, whether a fragment belongs to t_i?

I understand that P(f_j | t_i, z_ij=0)=0 uniformly as the paper describes.

How does Salmon determine the entries of Z?

Does it do a matching first (finding the entries of Z) and then a lightweight alignment based on the fragment distributions for that transcript?

I mean, we know that p(f_j | t_i, z_ij=1) is defined based on the distribution of the fragments in t_i (in other words, the probability of drawing the fragment from t_i, given that it is from t_i).

When we say "given that it is from t_i", it sounds to me like there is a "pre-check" for matches, followed by calculating the probabilities under the condition that the fragment belongs to t_i (using the distribution of the fragments that match t_i).

I understand that when f_j belongs to t_i, we then calculate the bias terms to see if it is "really" from t_i.

Or does Salmon pick entries of Z randomly, without actually checking whether fragment f_j matches transcript t_i?

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5600148/

Thanks in advance for your help.
David

rob-p · 2023-10-30T01:41:45Z

Maintainer

Hi David,

Thanks for your question. The Z matrix represents the latent variables of the model. That is, the individual $z_{ij}$ are not observed. The way that salmon (like RSEM or other tools for transcript-level quantification) evaluates the probability of a fragment deriving from a transcript is to treat that probability as 0 if the fragment does not align (via either traditional or lightweight alignment) to the transcript. If the fragment does align to the transcript (in general, the fragment will align to multiple transcripts, which is what results in the observed ambiguity), then salmon will estimate the probability of the read deriving from this transcript given the current estimates of the abundance of all transcripts.

In fact, the way the EM algorithm (and similarly the VBEM algorithm) works is that, since the $z_{ij}$ are unknown, we cannot use them directly. Instead, we use

$$\mathbb{E}\left[z_{ij}\right] = \Pr(z_{ij} = 1 | \mathcal{F}, \eta^{(t)}) = \frac{ \frac{\eta_i^{(t)}}{\hat{\ell_i}} \Pr(f_j | z_{ij} = 1)}{ \sum_{i' \text{ s.t. j maps to }t_{i'}} \frac{\eta_{i'}^{(t)}}{\hat{\ell_{i'}}} \Pr(f_j | z_{i'j} = 1)}$$

Note that this defines the expected value of the $z_{ij}$ in terms of the parameters (estimated transcript abundances) denoted as $\mathbf{\eta}$. This is the so-called E-step of the EM algorithm. The M-step finds the values of the $\mathbf{\eta}$ that maximize the probability of the observed data given the $\mathbb{E}\left[z_{ij}\right]$ that we just computed. Specifically, it computes:

$$\eta_i^{(t+1)} = \frac{ \sum_{j} \mathbb{E}\left[z_{ij}\right] }{ N }$$

where $N$ is the total number of mapped fragments.
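As a small worked illustration of one update (the numbers are invented for this example and do not come from the thread or the paper): assume two transcripts with effective lengths $\hat{\ell_1} = 1000$ and $\hat{\ell_2} = 2000$, a current estimate $\eta^{(t)} = (0.5, 0.5)$, and $N = 2$ mapped fragments. Fragment $f_1$ maps only to $t_1$, while $f_2$ maps to both transcripts with the same conditional probability $c$. The E-step then gives

$$\mathbb{E}\left[z_{11}\right] = 1, \qquad \mathbb{E}\left[z_{12}\right] = \frac{\frac{0.5}{1000}\, c}{\frac{0.5}{1000}\, c + \frac{0.5}{2000}\, c} = \frac{2}{3}, \qquad \mathbb{E}\left[z_{22}\right] = \frac{1}{3}$$

and the M-step gives

$$\eta_1^{(t+1)} = \frac{1 + \frac{2}{3}}{2} = \frac{5}{6}, \qquad \eta_2^{(t+1)} = \frac{\frac{1}{3}}{2} = \frac{1}{6}$$

The shared fragment is split 2:1 in favor of the shorter transcript because, at equal $\eta$, the shorter effective length gives the larger weight $\eta_i / \hat{\ell_i}$.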

So, as to the question of where lightweight mapping comes into play: note that when we evaluate $\mathbb{E}\left[z_{ij}\right]$, we do not consider all transcripts $i$, but rather only those transcripts $i$ to which fragment $j$ aligned / mapped. In salmon, the question of which transcripts those are is answered via selective-alignment.

I note that the above is a slight simplification. First, it describes the basic EM, but salmon can use either this or the Variational Bayesian EM. Second, if you look at the treatment in e.g. the RSEM paper, you'll note that they discuss $z_{nij}$ (with 3 subscripts) rather than $z_{ij}$. This is because if a single fragment aligns to a specific transcript in more than one location (very rare, but possible), RSEM considers each mapping location within the transcript separately. Salmon, on the other hand, assumes a single mapping location for a given fragment against an individual transcript (i.e. arbitrary multimapping is considered between transcripts, but each fragment is assumed to align against a given transcript only once). If a read maps to a single transcript in more than one location, salmon randomly chooses one from among the equally-best mapping positions. Finally, the equation above doesn't spell out exactly which terms are folded into $\Pr(f_j | z_{ij} = 1)$, but these are considered under the fragment-transcript agreement model, and can include things like the probability of observing the inferred fragment insert length, and the quality and constitution of the alignment (i.e. the exact CIGAR string observed).

Anyway, I hope this helps. The TLDR to your question is that you're right, the $z_{ij}$ aren't known. They are the latent variables in our model. Since we don't know them directly, we instead work in terms of their conditional expectation, which we can compute if we know the values of the parameters ($\mathbf{\eta}$). Of course, we don't know these; they are what we are trying to infer! So, this is why we apply an iterative algorithm like the EM. We begin with a "guess" as to what the $\mathbf{\eta}$ might be. From this guess, $\mathbf{\eta}^{(0)}$, we compute

$$\mathbb{E}_{\mathbf{\eta}^{(0)}} \left[ z_{ij} \right]$$

Then, given these expected $z_{ij}$, we can compute $\mathbf{\eta}^{(1)}$. Of course, if $\mathbf{\eta}^{(1)}$ are different from $\mathbf{\eta}^{(0)}$, so too should be our expected $z_{ij}$, so we re-compute them. We repeat this process iteratively until it converges (i.e. until the $\mathbf{\eta}^{(t)}$ stop changing).
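The iteration described above can be sketched in a few lines of Python. This is a minimal illustration of the basic EM only, not salmon's implementation (which is in C++ and also offers the VBEM): the function name `run_em`, the uniform starting guess, the tolerance, and the toy fragments are all assumptions made for this example.

```python
def run_em(compat, eff_len, tol=1e-10, max_iter=10_000):
    """Basic EM for transcript abundances (illustrative sketch only).

    compat[j] maps transcript index i -> Pr(f_j | z_ij = 1) for each
    transcript that fragment j aligned to; transcripts that are absent
    from the dict implicitly have probability 0.
    """
    n_tx = len(eff_len)
    n_frag = len(compat)
    eta = [1.0 / n_tx] * n_tx  # eta^(0): a uniform initial guess (assumption)
    for iteration in range(1, max_iter + 1):
        expected_counts = [0.0] * n_tx
        for frag in compat:
            # E-step: E[z_ij] is proportional to (eta_i / eff_len_i) * Pr(f_j | z_ij = 1),
            # normalized over only the transcripts this fragment maps to.
            weights = {i: eta[i] / eff_len[i] * p for i, p in frag.items()}
            total = sum(weights.values())
            for i, w in weights.items():
                expected_counts[i] += w / total
        # M-step: eta_i = sum_j E[z_ij] / N
        new_eta = [c / n_frag for c in expected_counts]
        delta = max(abs(a - b) for a, b in zip(new_eta, eta))
        eta = new_eta
        if delta < tol:  # the estimates have stopped changing
            return eta, iteration
    return eta, max_iter


# Toy data (hypothetical): fragments 0 and 2 map only to transcript 0,
# fragment 1 maps to transcripts 0 and 1, fragment 3 maps only to transcript 1.
# Transcript 2 receives no fragments at all.
fragments = [{0: 1.0}, {0: 1.0, 1: 1.0}, {0: 1.0}, {1: 1.0}]
eta, n_iter = run_em(fragments, eff_len=[1000.0, 1000.0, 1000.0])
print(eta, n_iter)
```

In this toy case the ambiguous fragment is gradually shifted toward transcript 0, which has more uniquely mapping fragments, and the transcript with no compatible fragments drops to zero after the first iteration.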

Best,
Rob

Ray6283 · 2023-10-30T02:35:41Z

Thank you very much for your comprehensive response. It totally makes sense.

Is it right to say that the main job here is a "good" estimate of $\eta^{(0)}$, and that everything starts from there?
And, to estimate $\eta^{(0)}$, Salmon uses mini-batches, Markov chains and Variational Bayesian methods, as I understood from your paper.

Cheers,
David

1 reply

rob-p Oct 30, 2023
Maintainer

Hi David,

I would say that actually, the main job is to find a "good" estimate of $\eta^{(n)}$, where $n$ is the final/converged iteration of the algorithm. Of course, it helps to have a good $\eta^{(0)}$, both because if it is already a local optimum you're immediately done, and because finding a global optimum is, in general, not possible, so the better your starting position, the better you are likely to do.

You're right that salmon estimates the initial $\eta^{(0)}$ using the online stochastic collapsed variational Bayesian algorithm (over mini-batches). The other benefit of salmon's "dual-phase" approach (there is both a stochastic online inference phase and a collapsed offline inference phase) is that during the first phase we have an estimate of an individual per-fragment probability. Since we model a separate probability per fragment-transcript pair, we can consider things like the individual length of this fragment when aligned to a specific transcript, and account for that in the overall assignment probability. However, once fragments are collapsed into equivalence classes, that per-fragment information is lost. By performing a dual-phase inference, salmon is able to consider that probability during the online phase, as well as determine how best to summarize that probability when fragments are grouped into equivalence classes. The range-factorized equivalence classes paper describes how the resolution of this factorization can be smoothly controlled to trade off between high-fidelity fragment-level modeling and very efficient fragment equivalence-class level modeling.

Salmon uses range-factorized equivalence classes, where the fragments grouped together for the offline phase are determined both by the set of transcripts to which they map and by the discretized conditional probability with which they map to each transcript labeling the equivalence class.
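To make the grouping idea concrete, here is a rough Python sketch. It is not salmon's actual binning scheme: the function name `range_factorized_key`, the choice of four bins, the per-fragment normalization, and the toy fragments are assumptions made purely for illustration.

```python
from collections import Counter


def range_factorized_key(cond_probs, n_bins=4):
    """Build an equivalence-class label for one fragment (illustrative sketch).

    cond_probs maps transcript id -> conditional probability of the fragment
    given that transcript. The label combines the set of transcripts with a
    discretized (binned) version of each normalized conditional probability.
    """
    total = sum(cond_probs.values())
    label = []
    for tx in sorted(cond_probs):
        normalized = cond_probs[tx] / total
        # Map [0, 1] onto bins 0 .. n_bins - 1 (a value of exactly 1.0 goes in the top bin).
        bin_index = min(int(normalized * n_bins), n_bins - 1)
        label.append((tx, bin_index))
    return tuple(label)


# Hypothetical fragments: the first three map to the same transcript set {tA, tB},
# but the third one favors tB instead of tA.
fragments = [
    {"tA": 0.9, "tB": 0.1},
    {"tA": 0.8, "tB": 0.2},
    {"tA": 0.1, "tB": 0.9},
    {"tA": 0.5},
]
classes = Counter(range_factorized_key(f) for f in fragments)
for label, count in classes.items():
    print(label, count)
```

With plain equivalence classes keyed only on the transcript set, the first three fragments would collapse into one class. With the binned label, the two fragments that favor `tA` stay together while the one that favors `tB` is kept apart; increasing `n_bins` retains more per-fragment detail at the cost of more classes.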

Best,
Rob

Ray6283 · 2023-10-30T04:00:16Z

I really appreciate your help in explaining that so comprehensively 🙏.

Sincerely,
David
