- Maximum Likelihood Estimation (MLE)
- Maximum A Posteriori (MAP)
- Naive Bayes
- Logistic Regression
- Bayesian Belief Networks
Q. Explain frequentist vs. Bayesian statistics.
Answer
Frequentist and Bayesian statistics represent two different schools of thought on building probabilistic models from data.
We can understand both approaches from an example:
Suppose you have a coin, and you want to determine the probability of it landing heads (H) when you toss it.
Frequentist Approach:
In the frequentist approach, probabilities are viewed as long-term relative frequencies based on repeated, identical experiments or events. To find the probability of getting a heads (H), you perform a large number of coin flips and calculate the proportion of times it lands heads.
- You flip the coin 100 times.
- It lands heads (H) 53 times.
- The frequentist probability of getting heads is calculated as the relative frequency:
Probability of H = (Number of H outcomes) / (Total number of outcomes) = 53 / 100 = 0.53
In the frequentist approach, probability is objective and based on observable data from repeated experiments.
Bayesian Approach:
In the Bayesian approach, probability is a measure of our uncertainty or belief in an event. You start with a prior belief (prior probability) about the probability of getting heads, and you update that belief with new evidence (likelihood) from your observations.
- Prior Probability: You have a prior belief that the probability of getting heads is uniformly distributed between 0 and 1, i.e., a $Beta(1, 1)$ distribution.
- Likelihood: You flip the coin 10 times, and it lands heads (H) 6 times and tails (T) 4 times. With $\theta$ denoting the unknown probability of heads, the number of heads follows a $Binomial(10, \theta)$ distribution.
- Posterior Probability: You update your prior belief using Bayes' theorem:
  Posterior Probability = (Prior Probability * Likelihood) / Evidence
  which gives the posterior $Beta(1 + 6, 1 + 4) = Beta(7, 5)$.
In the Bayesian approach, you use your prior belief and update it with observed evidence to obtain a posterior probability distribution. This posterior distribution represents your updated belief in the probability of getting heads.
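The update in this coin example can be sketched in a few lines of Python. This is a minimal illustration, assuming the Beta prior and binomial data described above; the function names are hypothetical helpers, not part of any library.

```python
def beta_binomial_update(alpha, beta, heads, tails):
    """Return the posterior Beta parameters after observing coin flips."""
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_mode(alpha, beta):
    """Mode of a Beta(alpha, beta) distribution (valid for alpha > 1 and beta > 1)."""
    return (alpha - 1) / (alpha + beta - 2)


# Uniform Beta(1, 1) prior, then 6 heads and 4 tails are observed.
posterior = beta_binomial_update(1, 1, heads=6, tails=4)
post_mean = beta_mean(*posterior)   # 7 / 12
post_mode = beta_mode(*posterior)   # 6 / 10
frequentist = 6 / 10                # relative frequency of heads

print(posterior, round(post_mean, 3), round(post_mode, 3), frequentist)
```

With a uniform prior, the posterior mode matches the relative frequency, while the posterior mean is pulled slightly toward 0.5.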
Key Differences:
- Frequentist approach treats probability as a relative frequency based on data.
- Bayesian approach treats probability as a measure of belief and updates it using Bayes' theorem.
- Frequentist probabilities are fixed and objective.
- Bayesian probabilities are subjective and represent your current knowledge or belief.
Q. How can we estimate the parameters of a given probability distribution?
Answer
We can use the following methods to estimate parameters:
- Maximum Likelihood Estimation (MLE)
- Maximum A Posteriori (MAP)
Q. What is the main assumption of MAP and MLE?
Answer
Both MLE and MAP assume that the data are independent and identically distributed (i.i.d.).
Q. How is likelihood different than probability?
Answer
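Probability describes how likely the data are for fixed parameter values, whereas likelihood treats the observed data as fixed and is viewed as a function of the parameters. In the case of discrete distributions, the likelihood is the probability mass, or joint probability mass, of the observed data. In the case of continuous distributions, the likelihood is the probability density of the observed data. Unlike a probability distribution, the likelihood does not need to sum or integrate to 1 over the parameters.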
Q. Write the mathematical expression of likelihood?
Answer
It represents the probability of observing the given data as a function of the parameters of the statistical model.
For i.i.d. observations $x_1, x_2, \ldots, x_n$ of a random variable with density (or mass) function $f(x; \theta)$:
$$ L(\theta; x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta) $$
Since we assumed that each data point is independent, the likelihood of all of our data is the product of the likelihood of each data point.
Q. What does Argmax mean?
Answer
Argmax is short for "arguments of the maxima". The argmax of a function is the point (or set of points) in its domain at which the function attains its maximum value.
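As a small illustration, the sketch below finds the argmax of a function over a finite grid of candidate values. The helper `argmax` and the example function are hypothetical, chosen only to show the idea.

```python
def argmax(candidates, f):
    """Return the candidate at which f is largest."""
    return max(candidates, key=f)


# f(t) = -(t - 0.3)^2 has its maximum value 0 at t = 0.3.
grid = [i / 100 for i in range(101)]
best = argmax(grid, lambda t: -(t - 0.3) ** 2)

print(best)
```

Note the difference between the max (the largest function value, here 0) and the argmax (the input that produces it, here 0.3).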
Q. Describe how to analytically find the MLE of a likelihood function?
Answer
To analytically find the Maximum Likelihood Estimator (MLE) of a likelihood function, we can follow the steps below:
Define the likelihood function
Suppose we have a set of independent and identically distributed observations $X_1, X_2, \ldots, X_n$. The likelihood function is:
$$ L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta) $$
- Here $f(X_i \mid \theta)$ is the pdf or pmf of the data given the parameter $\theta$.
Take the log likelihood
To simplify the product into a sum, take the natural logarithm:
$$ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n}\log f(X_i \mid \theta) $$
Take the derivative wrt $\theta$
$$ \frac{d\ell(\theta)}{d\theta} $$
Set the Derivative Equal to Zero
To find the critical points set:
$$ \frac{d\ell(\theta)}{d\theta} = 0 $$
Solve for $\theta$
Find the values of $\theta$ that satisfy this equation; these are the candidate estimates $\hat{\theta}$.
Verify the Maximum (Second Derivative Test)
$$ \frac{d^2\ell(\theta)}{d\theta^2} $$
- If the second derivative is negative at the critical point, it confirms a local maximum.
Q. What is the term used to describe the first derivative of the log-likelihood function?
Answer
Score function: The score function measures the sensitivity of the log-likelihood function to changes in the parameter $\theta$.
Q. What is the relationship between the likelihood function and the log-likelihood function?
Answer
The log-likelihood function is derived by taking the natural logarithm of the likelihood function.
$$ \ell(\theta) = \log{L(\theta)} $$
- Likelihood: $L(\theta)$
- Log-Likelihood: $\ell(\theta)$
Q. What is likelihood function of the independent identically distributed (i.i.d) random variables:
Answer
The likelihood function for discrete random variables is just the joint PMF of the observed data, viewed as a function of the parameter.
For the Binomial distribution with $n$ trials per observation:
$$ P(X_i = x_i) = \binom{n}{x_i} p^{x_i} (1 - p)^{n - x_i} \quad \text{PMF} $$
Since the observations are i.i.d., the likelihood function for $m$ observations $x_1, \ldots, x_m$ is the product of the individual PMFs:
$$ L(p) = \prod_{i=1}^{m} P(X_i = x_i) = \prod_{i=1}^{m} \binom{n}{x_i} p^{x_i} (1 - p)^{n - x_i}. $$
Q. How can we derive the maximum likelihood estimator (MLE) of the i.i.d samples
Answer
Consider a single binomial observation: $x$ successes in $n$ i.i.d. Bernoulli trials, each with success probability $p$. The likelihood function is:
$$ L(p) = \binom{n}{x} p^{x} (1 - p)^{n - x} $$
Log likelihood:
$$ LL(p) = \log{\binom{n}{x}} + x \log{p} + (n-x) \log(1-p) \quad \text{(Log Likelihood)} $$
On taking the derivative wrt $p$:
$$ \frac{dLL(p)}{dp} = 0 + \frac{x}{p} - \frac{(n-x)}{1-p} $$
$$ \frac{dLL(p)}{dp} = \frac{x-pn}{p(1-p)} $$
For maximizing the likelihood:
$$ \frac{dLL(p)}{dp} = 0 $$
$$ \hat{p} = \frac{x}{n} $$
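The closed-form result can be checked numerically. The sketch below evaluates the binomial log-likelihood on a grid of candidate values of $p$ and compares the best grid point with $x / n$. It reuses the earlier coin example (53 heads in 100 flips); the grid resolution is an arbitrary choice.

```python
import math


def binomial_log_likelihood(p, x, n):
    """Log-likelihood of x successes in n trials with success probability p."""
    return math.log(math.comb(n, x)) + x * math.log(p) + (n - x) * math.log(1 - p)


x, n = 53, 100

# Candidate values of p strictly between 0 and 1.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: binomial_log_likelihood(p, x, n))
p_closed = x / n

print(p_grid, p_closed)
```

The grid search is only a sanity check; the analytical estimate is exact and far cheaper.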
Q. Derive the maximum likelihood estimator of an exponential distribution.
Answer
PDF of exponential distribution:
$$ f(x \mid \lambda) = \lambda e^{-\lambda x}, \quad x \geq 0. $$
Likelihood Function:
For i.i.d. observations $X_1, X_2, \ldots, X_n$, the likelihood function is:
$$ L(\lambda) = \prod_{i=1}^{n} f(X_i \mid \lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda X_i}. $$
Simplifying this expression:
$$ L(\lambda) = \lambda^n e^{-\lambda \sum_{i=1}^{n} X_i}. $$
Log-Likelihood Function:
To make the maximization easier, take the natural logarithm of the likelihood function to get the log-likelihood function:
$$ \ell(\lambda) = \log L(\lambda) = \log(\lambda^n) + \log\left(e^{-\lambda \sum_{i=1}^{n} X_i}\right). $$
Simplify:
$$ \ell(\lambda) = n \log(\lambda) - \lambda \sum_{i=1}^{n} X_i. $$
Differentiate the Log-Likelihood Function:
Differentiate $\ell(\lambda)$ with respect to $\lambda$:
$$ \frac{d\ell(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} X_i. $$
Set the Derivative Equal to Zero:
Set the first derivative to zero to find the critical points:
$$ \frac{n}{\lambda} - \sum_{i=1}^{n} X_i = 0. $$
Solve for $\lambda$:
Rearrange the equation to solve for $\lambda$:
$$ \frac{n}{\lambda} = \sum_{i=1}^{n} X_i. $$
Therefore, the MLE of $\lambda$ is:
$$ \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} X_i} = \frac{1}{\bar{X}}, $$
where $\bar{X}$ is the sample mean.
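A quick numerical check of this estimator is sketched below. The sample values are hypothetical and chosen only for illustration. The code computes $n / \sum X_i$ and confirms that the log-likelihood is lower at nearby values of $\lambda$.

```python
import math


def exp_log_likelihood(lam, xs):
    """Log-likelihood n*log(lam) - lam*sum(xs) of an exponential sample."""
    return len(xs) * math.log(lam) - lam * sum(xs)


# Hypothetical observations (e.g., waiting times); not taken from the text.
xs = [0.5, 1.2, 0.3, 2.0, 1.0]

lam_hat = len(xs) / sum(xs)           # MLE: n / sum(X_i) = 1 / sample mean
ll_at_hat = exp_log_likelihood(lam_hat, xs)
ll_below = exp_log_likelihood(0.9 * lam_hat, xs)
ll_above = exp_log_likelihood(1.1 * lam_hat, xs)

print(lam_hat, ll_at_hat, ll_below, ll_above)
```

The second derivative of the log-likelihood is $-n / \lambda^2$, which is negative for every $\lambda > 0$, so the critical point is indeed a maximum.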
Q. A lot of machine learning models aim to approximate probability distributions. Let’s say P is the distribution of the data and Q is the distribution learned by our model. How do we measure how close Q is to P?
Answer
We can use the KL divergence (Kullback-Leibler divergence), which measures how much a probability distribution $Q$ differs from a reference distribution $P$:
$$ D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \quad \text{(for discrete distributions)} $$
$$ D_{KL}(P \| Q) = \int P(x) \log \frac{P(x)}{Q(x)} \, dx \quad \text{(for continuous distributions)} $$
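The discrete formula can be sketched in Python as follows. The two distributions are hypothetical examples. The function assumes both lists are valid probability vectors over the same outcomes and that $Q(x) > 0$ wherever $P(x) > 0$.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats; terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print(round(kl_pq, 4), round(kl_qp, 4))
```

The two directions give different values, which shows that KL divergence is not symmetric and therefore is not a true distance metric.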
Q. What is MAP? How is it different than MLE?
Answer
MAP estimation finds the parameter values that maximize the posterior distribution of the parameters given the data, incorporating prior beliefs about the parameters.
$$ \hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta | X) = \arg\max_{\theta} \frac{P(X | \theta) P(\theta)}{P(X)}. $$
Since $P(X)$ does not depend on $\theta$, it can be dropped from the maximization:
$$ \hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(X | \theta) P(\theta) $$
- MAP incorporates prior knowledge about the parameters through a prior distribution, whereas MLE does not consider any prior information.
- In MLE, parameters are treated as fixed values, while in MAP, they are treated as random variables with a prior distribution, requiring an extra assumption about the prior.
Q. When to use MAP over MLE?
Answer
If a prior distribution is provided in the problem setup, that information should be used (i.e., apply MAP). However, if no prior information is given or assumed, MAP cannot be applied, and MLE is the suitable approach.
Q. When do MAP and MLE yield similar parameter estimates?
Answer
MAP and MLE will yield similar parameter estimates in the following situations:
- Uniform Prior: When the prior assigns equal probability to all parameter values, it adds no additional information, so the MAP estimate coincides with the MLE.
- Non-informative Priors: Priors that are only weakly informative (e.g., with very high variance) have little impact on the posterior.
- Large Data Size: With a large amount of data, the likelihood dominates the posterior, reducing the influence of the prior.
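These situations can be illustrated with the coin example. The sketch below assumes a $Beta(a, b)$ prior on the probability of heads, for which the MAP estimate is the mode of the Beta posterior. The specific prior strengths and sample sizes are hypothetical.

```python
def mle(heads, n):
    """Maximum likelihood estimate of the heads probability."""
    return heads / n


def map_estimate(heads, n, a, b):
    """Mode of the Beta(a + heads, b + n - heads) posterior.

    Assumes both posterior parameters are greater than 1.
    """
    return (heads + a - 1) / (n + a + b - 2)


# Uniform prior Beta(1, 1): MAP equals MLE.
uniform_map = map_estimate(6, 10, 1, 1)

# Strong prior Beta(20, 20) centred on 0.5 with little data: MAP is pulled toward 0.5.
small_data_map = map_estimate(6, 10, 20, 20)

# Same strong prior with much more data: the likelihood dominates.
large_data_map = map_estimate(6000, 10000, 20, 20)

print(mle(6, 10), uniform_map, round(small_data_map, 4), round(large_data_map, 4))
```

With 10 flips the strong prior moves the estimate to about 0.52, while with 10,000 flips the MAP estimate is within 0.001 of the MLE.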
Q.
- Define the term conjugate prior.
- Define the term non-informative prior.
Answer
Conjugate Prior
A conjugate prior is a probability distribution that, when combined with the likelihood and normalized, results in a posterior distribution that belongs to the same family as the prior.
$$ p(\theta | x) = \frac{p(x|\theta)p(\theta)}{p(x)} $$
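The prior $p(\theta)$ and the posterior $p(\theta | x)$ then share the same functional form. For example, a Beta prior combined with a binomial likelihood yields a Beta posterior.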
Non-Informative Prior
Q. MPE (Most Probable Explanation) vs. MAP (Maximum A Posteriori)
- How do MPE and MAP differ?
- Give an example of when they would produce different results.
Answer
Q. Naive Bayes classifier.
- How is Naive Bayes classifier naive?
- Let’s try to construct a Naive Bayes classifier to classify whether a tweet has a positive or negative sentiment. We have four training samples:
According to your classifier, what's sentiment of the sentence The hamster is upset with the puppy?
Answer
- The Naive Bayes classifier is considered "naive" because it makes a strong and often unrealistic assumption: it assumes that all features (or predictors) in the dataset are independent of each other given the class label.
Q. Is Naive bayes a discriminative model?
Answer
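False
The Naive Bayes algorithm is generative. It models the joint distribution $P(X, Y)$ by estimating $P(X | Y)$ and $P(Y)$, whereas a discriminative model learns $P(Y | X)$ directly.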
Q. How does the Naive Bayes algorithm work?
Answer
Naive Bayes Assumption
It assumes that each feature is conditionally independent of every other feature given the class label.
Training Phase
In this phase we estimate the parameters. At its core, Naive Bayes uses Bayes' theorem.
$$ P(\text{Class} | \text{Features}) = \frac{P(\text{Features} | \text{Class}) \cdot P(\text{Class})}{P(\text{Features})} $$
- $P(\text{Features} | \text{Class})$ : Likelihood of the features given the class.
- $P(\text{Class})$ : Prior probability of the class.
- $P(\text{Features})$ : Evidence, the overall probability of the features.
Using Naive Bayes Assumption
$$ P(\text{Features} | \text{Class}) = P(\text{Feature}_1 | \text{Class}) \times P(\text{Feature}_2 | \text{Class}) \times \ldots \times P(\text{Feature}_n | \text{Class}) $$
Here we can calculate all the terms of Bayes' theorem:
- Prior Probability: $P(\text{Class})$ : This is usually estimated from the training data by calculating the frequency of each class.
- Likelihood: $P(\text{Feature}_i | \text{Class})$ : Estimated from the training data by counting how often each feature value appears within each class.
- Evidence: $P(\text{Features})$ : This term is often omitted during classification since it's the same for all classes and does not affect the ranking of probabilities.
Predictions
- For a given set of feature values, the classifier computes the posterior probability for each class.
- The class with the highest posterior probability is chosen as the predicted class.
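The training and prediction phases can be sketched as a small word-count classifier. This is a minimal illustration, not a reference implementation: the training sentences are hypothetical, the function names are made up for this example, and additive smoothing with `alpha=1.0` is an added assumption so that unseen word and class pairs do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, alpha=1.0):
    """Count classes and per-class words from (tokens, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha


def predict(model, tokens):
    """Return the class with the highest log posterior (up to a constant) and all scores."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total_docs)              # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue                                  # ignore words never seen in training
            num = word_counts[label][tok] + alpha
            den = total_words + alpha * len(vocab)
            score += math.log(num / den)                  # log likelihood of one word
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical training data, for illustration only.
docs = [
    ("great fun movie".split(), "pos"),
    ("loved the acting".split(), "pos"),
    ("boring slow plot".split(), "neg"),
    ("hated the ending".split(), "neg"),
]
model = train_naive_bayes(docs)
label, scores = predict(model, "loved the movie".split())
print(label)
```

Working in log space turns the product of per-feature likelihoods into a sum, which avoids numerical underflow when there are many features. The evidence term is omitted because it is the same for every class.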
Q. Why is Naive Bayes still used despite its flawed assumption of feature independence?
Answer
Naive Bayes is beneficial primarily because of its "naive" assumption of feature independence, which, although technically incorrect, offers some practical advantages:
- Scalability: Handles large feature spaces efficiently. It scales linearly with the number of features
- Simplicity: Easy to implement and interpret.
- High-Dimensional Performance: Performs well in high-dimensional datasets.
- Robustness: Yields good results in many practical applications.
Q. What is Laplace smoothing (additive smoothing) in Naive Bayes?
Answer
Laplace smoothing, also known as additive smoothing, is a technique used in Naive Bayes to handle zero probabilities that occur when a feature (e.g., a word in text classification) does not appear in the training data for a given class. Without smoothing, if a word never appears in a class during training, its probability would be zero, which could incorrectly influence the final prediction.
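As a sketch of the formula, assume a word-count (multinomial) Naive Bayes model where $\text{count}(w, c)$ is the number of times word $w$ occurs in class $c$, $V$ is the vocabulary, and $\alpha > 0$ is the smoothing strength ($\alpha = 1$ gives Laplace smoothing). These symbols are introduced here for illustration:

$$ P(w \mid c) = \frac{\text{count}(w, c) + \alpha}{\sum_{w' \in V} \text{count}(w', c) + \alpha |V|} $$

Every word now receives a small non-zero probability in every class, and the probabilities still sum to 1 over the vocabulary.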
Q. Can Naive Bayes handle continuous and categorical features?
Answer
Yes, Naive Bayes can handle both categorical and continuous features:
- Categorical Features: Can be handled with multinomial or Bernoulli distributions.
- Continuous Features: Can be handled using a Gaussian assumption for each feature within each class.
- Mixed Data: We can either convert continuous values into bins (categorization) and treat everything as categorical features, or fit separate models on the categorical and numeric data and then combine them to make a prediction.
Q. Can Naive Bayes handle missing data?
Answer
Naive Bayes does not directly handle missing data, but several practical strategies, such as ignoring missing features, imputing missing values, or creating indicator variables, can be employed to manage it effectively.
Q. What is the difference between Naive Bayes and other classification algorithms like Logistic Regression or Decision Trees?
Answer
Q. Define logistic regression?
Answer
Logistic Regression is a discriminative classifier that works by trying to learn a function that approximates $P(Y | X)$.
Q. What is the main assumption of logistic regression?
Answer
The central assumption is that $P(Y | X)$ can be approximated as a logistic (sigmoid) function applied to a linear function of the data:
$$ P(Y=1 | X) = \frac{1}{1+\exp(-w_0 - \sum_i w_i X_i)} $$
Q. Write the expression of sigmoid or logistic function?
Answer
$$ \sigma(z) = \frac{1}{1+\exp(-z)} $$
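A direct translation of this formula can overflow for large negative inputs, so the sketch below uses a common numerically stable variant that branches on the sign of `z`. The function name is a hypothetical helper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # safe: z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)


print(sigmoid(0), sigmoid(2.0), sigmoid(-2.0))
```

The output always lies between 0 and 1, equals 0.5 at $z = 0$, and satisfies $\sigma(-z) = 1 - \sigma(z)$. In floating point, very large positive or negative inputs round to exactly 1.0 or 0.0.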
Q. Prove that logistic regression is a linear classifier?
Answer
At the decision boundary:
$$ P(Y=1|X) = \frac{1}{2} $$
We can express this as:
$$ P(Y=1|X) = \frac{1}{1 + \exp(-w_0 - \sum_i w_i X_i)} = \frac{1}{2} $$
Solving this equation gives:
$$ \exp(-w_0 - \sum_i w_i X_i) = 1 $$
This occurs only if:
$$ -w_0 - \sum_i w_i X_i = 0 $$
This equation defines the decision boundary of logistic regression. Since it is linear in $X$ (a straight line in two dimensions, a hyperplane in general), logistic regression is a linear classifier.
Q. Does closed-form solution exists for logistic regression?
Answer
No closed-form solution exists for the maximum likelihood parameters. That is why we use iterative methods such as gradient descent to estimate them.
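A minimal sketch of gradient descent for a one-feature logistic regression follows. It minimizes the average negative log-likelihood, whose gradient with respect to the weight is the average of $(\hat{y}_i - y_i) x_i$. The data, learning rate, and step count are hypothetical choices for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1 | x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical, non-separable toy data: larger x tends to mean y = 1.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]

w, b = fit_logistic(xs, ys)
boundary = -b / w            # the x value where P(y=1 | x) = 0.5
print(round(w, 3), round(b, 3), round(boundary, 3))
```

The toy data are deliberately not perfectly separable. With separable data the maximum likelihood weights grow without bound, which is one motivation for regularization.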
Q. How can we learn the parameters of logistic regression model?
Answer
Q. State the difference between Naive bayes and Logistic regression model?
Answer
Q. What is the range of logistic(sigmoid function)?
Answer
Q. What is the difference between Conditional MLE and standard MLE, and how does it relate to logistic regression?
Answer
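Conditional Maximum Likelihood Estimation (Conditional MLE) refers to MLE applied within a conditional model, where the parameters only influence the conditional probability $P(Y | X)$ rather than the joint distribution $P(X, Y)$.
Logistic regression is an example of a conditional model because the parameters $w$ only define $P(Y | X)$; the distribution of the inputs $X$ is not modeled.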
Q. What is the issue with using squared losses(MSE) or absolute losses(MAE) for logistic regression model?
Answer
Squared and absolute losses are not typically used in logistic regression because they are not well-suited to the characteristics of the logistic function and can lead to significant issues during optimization.
- Non-Convexity Issues: Combining squared loss with the sigmoid output makes the objective non-convex in the weights. This non-convexity can introduce multiple local minima, complicating optimization and often leading to suboptimal solutions.
- Gradient Behavior: Squared loss flattens gradients, especially when predictions are close to 0 or 1, which is common in logistic regression. The derivative of the squared loss is proportional to the error $(y_i - \hat{y}_i)$ and, through the chain rule, to the sigmoid derivative $\hat{y}_i (1 - \hat{y}_i)$. In logistic regression, where $\hat{y}_i$ is bounded between 0 and 1, the gradients can become very small, slowing down convergence.
- Poor Fit for Probabilistic Outputs: These losses are better suited to regression tasks than to modeling class probabilities.
Q. Can we use logistic regression for multiclass classification problem?
Answer
Yes, logistic regression can be extended to handle multiclass classification problems through approaches like One-vs-Rest (OvR) and Softmax Regression (Multinomial Logistic Regression).
Q. Write the expression of softmax function?
Answer
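For a set of scores/logits $z_1, z_2, \ldots, z_K$ over $K$ classes, the softmax function gives the probability of class $j$ as: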
$$ P(y = j | \mathbf{x}) = \frac{\exp(z_j)}{\sum_{k=1}^{K} \exp(z_k)} $$
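The formula can be sketched in Python as below. Subtracting the maximum logit before exponentiating is a standard stabilization trick that does not change the result; the function name and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(logits)                               # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
two_class = softmax([2.0, 0.0])                   # first entry equals sigmoid(2.0)
print([round(p, 3) for p in probs], round(two_class[0], 3))
```

With two classes and one logit fixed at 0, softmax reduces to the sigmoid function, which is why softmax regression generalizes binary logistic regression.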
Q. State one issue with softmax function over sigmoid?
Answer
The softmax function is computationally more intensive than the sigmoid used in binary logistic regression, because every output is normalized over all classes. This matters most when the number of classes is large.
Q. How is Maximum Likelihood Estimation (MLE) used in logistic regression, and why is it preferred over other estimation methods like least squares?
Answer
Logistic Regression Model
Logistic regression models the probability that a binary outcome $y$ equals 1 given the features $\mathbf{x}$:
$$ P(y = 1 | \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}} $$
Likelihood Function
For a dataset with $n$ independent observations $(\mathbf{x}_i, y_i)$, the likelihood function is:
$$ L(\mathbf{w}, b) = \prod_{i=1}^{n} P(y_i | \mathbf{x}_i) $$
Since logistic regression deals with binary outcomes, this can be rewritten as:
$$ L(\mathbf{w}, b) = \prod_{i=1}^{n} \left(\frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x}_i + b)}}\right)^{y_i} \left(\frac{e^{-(\mathbf{w}^T \mathbf{x}_i + b)}}{1 + e^{-(\mathbf{w}^T \mathbf{x}_i + b)}}\right)^{1 - y_i} $$
Log-Likelihood Function
$$ \text{Log-Likelihood} = \sum_{i=1}^{n} \left( y_i \log P(y = 1 | \mathbf{x}_i) + (1 - y_i) \log (1 - P(y = 1 | \mathbf{x}_i)) \right) $$
Maximizing the Log-Likelihood
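MLE estimates the parameters $\mathbf{w}$ and $b$ by maximizing the log-likelihood. Since no closed-form solution exists, this is done with iterative methods such as gradient descent on the negative log-likelihood.
Why MLE is Preferred Over Least Squares in Logistic Regression
- Appropriate Loss Function: The negative log-likelihood (log loss) follows directly from the Bernoulli model of the binary outcome.
- Convex Optimization: The negative log-likelihood is convex in the parameters, so any local minimum is a global minimum, which makes the optimization stable and reliable.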
Q. What is Maximum A Posteriori (MAP) Estimation in logistic regression?
Answer
MAP estimation in logistic regression is a Bayesian approach that estimates model parameters by maximizing the posterior probability, which combines the likelihood of the observed data with a prior distribution over the parameters.
$$ \hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} \, P(\mathbf{w} | \text{data}) = \arg\max_{\mathbf{w}} \, P(\text{data} | \mathbf{w}) \, P(\mathbf{w}) $$
Writing the likelihood as a product over the i.i.d. observations, this becomes:
$$ \hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} \, \left(\prod_{i=1}^{n} P(y_i | \mathbf{x}_i; \mathbf{w})\right) P(\mathbf{w}) $$
Q. How does MAP differ from Maximum Likelihood Estimation (MLE) in logistic regression?
Answer
MLE maximizes the likelihood of the data given the parameters, relying solely on observed data. MAP, on the other hand, maximizes the posterior probability by incorporating a prior distribution, which acts as a regularization term.
Q. What role do priors play in MAP estimation?
Answer
Priors in MAP estimation incorporate external knowledge or beliefs about the parameters, adding a regularization effect. Common priors include Gaussian (L2 regularization) and Laplace (L1 regularization), which help control model complexity and prevent overfitting.
Q. Why might MAP be preferred over MLE in logistic regression?
Answer
MAP is preferred over MLE in scenarios where there is a risk of overfitting, when data is sparse, or when domain knowledge is important. The inclusion of priors in MAP acts as regularization, making the model more robust to noise and improving generalization.
Q. How does MAP help in small datasets compared to MLE?
Answer
In small datasets, MLE may overfit because it relies only on the observed data. MAP’s use of priors helps stabilize parameter estimates, providing more reliable results when the data alone is insufficient.
Q. What type of priors are commonly used in MAP for logistic regression?
Answer
Common priors used in MAP for logistic regression include:
- Gaussian Prior (L2 Regularization): Penalizes large weights and prevents overfitting.
- Laplace Prior (L1 Regularization): Encourages sparsity, leading to simpler models by driving some coefficients to zero.
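To see why these priors act as regularizers, here is a short derivation sketch. It assumes a zero-mean Gaussian prior with variance $\sigma^2$ on each weight, or a zero-mean Laplace prior with scale $s$; both hyperparameters are introduced here for illustration. Taking the logarithm of the MAP objective gives:

$$ \hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ \sum_{i=1}^{n} \log P(y_i | \mathbf{x}_i; \mathbf{w}) + \log P(\mathbf{w}) \right] $$
$$ \text{Gaussian prior: } \log P(\mathbf{w}) = -\frac{1}{2\sigma^2} \|\mathbf{w}\|_2^2 + \text{const} $$
$$ \text{Laplace prior: } \log P(\mathbf{w}) = -\frac{1}{s} \|\mathbf{w}\|_1 + \text{const} $$

Maximizing the posterior is therefore equivalent to minimizing the negative log-likelihood plus an L2 (or L1) penalty, where a smaller $\sigma^2$ (or $s$) corresponds to stronger regularization.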
Q. How does MAP provide flexibility compared to MLE?
Answer
MAP allows the use of different priors based on the problem context, providing flexibility in how the model is regularized or adjusted. MLE lacks this capability as it does not incorporate any prior information.
Q. What is the main advantage of using MAP in logistic regression?
Answer
The main advantage of using MAP in logistic regression is its ability to combine observed data with prior information, enhancing the model’s robustness against overfitting and making it better suited for small or noisy datasets.
Q. Can you explain a situation where using MAP estimation could lead to worse results than MLE?
Answer
MAP estimation could lead to worse results if the prior is incorrect or misaligned with the actual data distribution. For example, if a strong prior incorrectly penalizes certain parameter values, the resulting estimates could be biased, leading to poor predictive performance compared to MLE, which only relies on the observed data.