Add Basic Concepts in Information Theory
victorliu5296 committed Sep 13, 2024
1 parent b08da00 commit 5182dc6
Showing 1 changed file with 168 additions and 0 deletions.
168 changes: 168 additions & 0 deletions source/content/posts/Basic Concepts in Information Theory.md
@@ -0,0 +1,168 @@
---
title: 'Basic Concepts in Information Theory'
date: 2024-09-13T17:58:35-04:00
math: katex
categories: [Statistics, Information Theory]
tags: [Information Theory, Statistics, Probability, Expectation, Variance, Covariance, Entropy, Mutual Information, Conditional Entropy, Fisher Information, Jensen's Inequality, Cramér-Rao Bound, Bias-Variance Tradeoff, Likelihood Function, Bayes' Theorem]
weight: 100
---

Here are some important concepts from both statistics and information theory, presented with compact notation, an expectation form where applicable, and an intuitive description alongside examples.

### 1. **Mutual Information (MI)**:
**Notation**:
\[
I(X; Y) = \mathbb{E}_{p(x, y)} \left[ \log \frac{p(x, y)}{p(x) p(y)} \right]
\]
**Intuition**:
Mutual Information quantifies the amount of information that knowing one random variable \( X \) provides about another random variable \( Y \). In other words, it measures the reduction in uncertainty about \( X \) given knowledge of \( Y \), and vice versa. MI is symmetric, meaning \( I(X; Y) = I(Y; X) \).

**Example**:
In predicting weather conditions, if you know the season (variable \( X \)), you gain information about the likelihood of certain weather patterns (variable \( Y \)), such as the probability of snow in winter vs. summer.

**Application: Feature Selection in Machine Learning**
Mutual Information is used to measure the dependency between a feature and the target variable in classification problems. Features that provide the most information about the target variable are selected, improving model performance while reducing complexity. For instance, in a spam email classifier, MI helps in selecting the most informative words (features) that determine whether an email is spam or not.
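
**Sketch (Python)**:
A minimal NumPy sketch that evaluates \( I(X; Y) \) directly from a small joint distribution table; the season/snow probabilities below are made up purely for illustration.

```python
import numpy as np

# Hypothetical joint distribution p(x, y) for season X (winter, summer)
# and snow Y (snow, no snow); entries sum to 1.
p_xy = np.array([
    [0.30, 0.20],   # winter: snow, no snow
    [0.01, 0.49],   # summer: snow, no snow
])

p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, 2)

# I(X; Y) = sum_{x, y} p(x, y) * log( p(x, y) / (p(x) p(y)) )
mask = p_xy > 0                          # skip impossible pairs to avoid log(0)
mi = np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask]))
print(f"I(X; Y) = {mi:.4f} nats")        # > 0: the season does tell us about snow
```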

---

### 2. **Fisher Information**:
**Notation**:
\[
\mathcal{I}(\theta) = \mathbb{E} \left[ \left( \frac{\partial}{\partial \theta} \log p(X \mid \theta) \right)^2 \right]
\]
**Intuition**:
Fisher Information measures the amount of information that an observable random variable \( X \) carries about an unknown parameter \( \theta \). It's used to gauge the precision of estimators of \( \theta \). The higher the Fisher Information, the more accurately we can estimate \( \theta \) from the data.

**Example**:
In estimating the mean of a normal distribution from a sample, Fisher Information helps to understand how much the sample data can tell us about the true population mean. A larger sample would result in higher Fisher Information, leading to a more accurate estimate.

**Application: Sensor Design in Engineering**
In designing a sensor for temperature measurements, Fisher Information can be used to assess how much information the sensor’s output gives about the true temperature. This helps optimize the sensor's accuracy by selecting designs that maximize information about the parameter of interest (temperature).
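
**Sketch (Python)**:
A Monte Carlo check, under an assumed Normal model with known \( \sigma \), that the variance of the score matches the closed-form Fisher Information \( 1/\sigma^2 \) for a single observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model: X ~ Normal(mu, sigma^2) with sigma known. The score is
# d/dmu log p(X | mu) = (X - mu) / sigma^2, and I(mu) = E[score^2] = 1 / sigma^2.
mu_true, sigma, n_samples = 5.0, 2.0, 200_000

x = rng.normal(mu_true, sigma, size=n_samples)
score = (x - mu_true) / sigma**2         # per-observation score at the true mu

fisher_mc = np.mean(score**2)            # Monte Carlo estimate of E[score^2]
fisher_exact = 1.0 / sigma**2            # closed form for a single observation
print(f"Monte Carlo: {fisher_mc:.4f}   exact 1/sigma^2: {fisher_exact:.4f}")
```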

---

### 3. **Conditional Entropy**:
**Notation**:
\[
H(Y \mid X) = \mathbb{E}_{p(x, y)} \left[ - \log p(y \mid x) \right]
\]
**Intuition**:
Conditional Entropy measures the uncertainty of a random variable \( Y \) given that we know the value of another random variable \( X \). It’s the expected remaining uncertainty in \( Y \) after observing \( X \). Lower conditional entropy indicates that knowing \( X \) reduces the uncertainty in predicting \( Y \).

**Example**:
If \( X \) is the result of rolling a fair die and \( Y \) indicates whether that result is odd or even, then the uncertainty in \( Y \) vanishes once we know \( X \): \( H(Y \mid X) = 0 \), whereas \( H(Y) \) alone is one bit.

**Application: Data Compression**
In lossless data compression, knowing the structure of data can reduce the amount of space needed to store it. For instance, in video compression (like H.264), if you know the values of nearby pixels in a frame, the uncertainty about the value of a given pixel reduces (i.e., conditional entropy decreases), allowing for more efficient encoding and smaller file sizes.
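
**Sketch (Python)**:
The fair-die example worked out numerically: before observing \( X \), the parity \( Y \) carries one bit of uncertainty; after observing \( X \), none.

```python
import numpy as np

# X = fair die roll in {1, ..., 6}, Y = parity of X (even or odd).
p_x = np.full(6, 1 / 6)
p_y_given_x = np.array([[0, 1],   # 1 is odd
                        [1, 0],   # 2 is even
                        [0, 1],
                        [1, 0],
                        [0, 1],
                        [1, 0]])  # columns: P(even | x), P(odd | x)

def entropy(p):
    p = p[p > 0]                  # drop zero-probability outcomes
    return -(p * np.log2(p)).sum()

p_y = p_x @ p_y_given_x           # marginal over parity: [0.5, 0.5]
h_y = entropy(p_y)                                                        # 1 bit before seeing X
h_y_given_x = sum(px * entropy(py) for px, py in zip(p_x, p_y_given_x))   # 0 bits afterwards
print(f"H(Y) = {h_y:.3f} bits, H(Y | X) = {h_y_given_x:.3f} bits")
```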

---

### 4. **Variance**:
**Notation**:
\[
\text{Var}(X) = \mathbb{E} \left[ (X - \mathbb{E}[X])^2 \right]
\]
**Intuition**:
Variance measures the spread or dispersion of a random variable \( X \) around its mean. A higher variance means the values of \( X \) are more spread out, while a lower variance indicates the values are closer to the mean.

**Example**:
In a classroom, the variance in students' test scores quantifies how much individual scores deviate from the class average. If all students score very similarly, the variance will be low, while a wide range of scores will result in high variance.

**Application: Portfolio Risk in Finance**
In portfolio theory, variance is used to measure the risk of an asset or portfolio. Higher variance indicates higher volatility, which equates to higher risk. Investors use variance to assess how much the returns on an investment are expected to fluctuate, allowing them to make informed decisions about balancing risk and return.
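
**Sketch (Python)**:
Two hypothetical classes with the same average score but very different spread, showing what the variance formula picks up.

```python
import numpy as np

# Two made-up classes with the same mean (80) but different spread.
tight  = np.array([78, 80, 79, 81, 82, 80])
spread = np.array([55, 95, 70, 90, 60, 110])

for name, scores in [("tight", tight), ("spread", spread)]:
    mean = scores.mean()
    var = ((scores - mean) ** 2).mean()   # Var(X) = E[(X - E[X])^2]
    print(f"{name:>6}: mean = {mean:.1f}, variance = {var:.1f}")
```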

---

### 5. **Covariance**:
**Notation**:
\[
\text{Cov}(X, Y) = \mathbb{E} \left[ (X - \mathbb{E}[X])(Y - \mathbb{E}[Y]) \right]
\]
**Intuition**:
Covariance measures the relationship between two random variables \( X \) and \( Y \), specifically how much they vary together. A positive covariance indicates that \( X \) and \( Y \) tend to increase together, while a negative covariance means that as one increases, the other tends to decrease.

**Example**:
In finance, the covariance between the returns of two stocks can help assess whether they move in the same direction. Positive covariance means when one stock’s price increases, the other tends to increase as well.

**Application: Portfolio Diversification in Finance**
Covariance is used to determine how the returns of two assets are related. Investors seek assets with low or negative covariance to diversify their portfolios, thus reducing overall risk. For instance, if stock X and stock Y have negative covariance, investing in both may stabilize the portfolio as gains in one can offset losses in the other.
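
**Sketch (Python)**:
Simulated daily returns for three hypothetical stocks: two driven by a shared market factor (positive covariance) and one constructed to move against the first (negative covariance).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily returns: A and B share a common market factor,
# while C is built to move against A.
market  = rng.normal(0, 0.010, size=250)
stock_a = market + rng.normal(0, 0.005, size=250)
stock_b = market + rng.normal(0, 0.005, size=250)
stock_c = -stock_a + rng.normal(0, 0.005, size=250)

def cov(x, y):
    return ((x - x.mean()) * (y - y.mean())).mean()   # E[(X - E[X])(Y - E[Y])]

print(f"Cov(A, B) = {cov(stock_a, stock_b):+.6f}  (positive: tend to move together)")
print(f"Cov(A, C) = {cov(stock_a, stock_c):+.6f}  (negative: tend to move oppositely)")
```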

---

### 6. **Jensen’s Inequality**:
**Notation**:
\[
f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)] \quad \text{for convex } f
\]
**Intuition**:
Jensen’s Inequality relates a convex (or concave) function of an expectation to the expectation of the function. For convex \( f \), the function of the expected value of a random variable is less than or equal to the expected value of the function of the random variable; for concave \( f \), the inequality reverses: \( f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)] \).

**Example**:
In economics, consider a utility function \( u(X) \) that is concave (due to risk aversion). Jensen’s inequality shows that the utility of the expected income is greater than or equal to the expected utility of income, explaining why people prefer certain outcomes over gambles with the same expected payoff.

**Application: Risk Aversion in Economics**
In economics, the utility function of a risk-averse person is typically concave, meaning they prefer the expected value of wealth over a gamble with the same expected wealth. This explains why people buy insurance: paying a small, certain premium is preferred over facing the uncertainty of large, rare losses.
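
**Sketch (Python)**:
The risk-aversion example with a concave utility \( u(w) = \sqrt{w} \): a sure 50 is preferred to a 50/50 gamble between 0 and 100, even though both have the same expected wealth (the numbers are illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)

# 50/50 gamble between 0 and 100 (expected wealth 50) versus a certain 50,
# under the concave utility u(w) = sqrt(w).
wealth = rng.choice([0.0, 100.0], size=100_000)
u = np.sqrt

print(f"u(E[W]) = {u(wealth.mean()):.3f}")   # utility of the sure thing (~7.07)
print(f"E[u(W)] = {u(wealth).mean():.3f}")   # expected utility of the gamble (~5.00)
```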

---

### 7. **Cramér-Rao Bound**:
**Notation**:
\[
\text{Var}(\hat{\theta}) \geq \frac{1}{\mathcal{I}(\theta)}
\]
**Intuition**:
The Cramér-Rao Bound provides a lower bound on the variance of any unbiased estimator \( \hat{\theta} \) of a parameter \( \theta \). It shows the best possible precision you can achieve in estimating \( \theta \) given the available data.

**Example**:
In estimating the mean of a normal distribution, the Cramér-Rao Bound helps us understand the best possible accuracy that any unbiased estimator of the mean could achieve, ensuring we can’t do better than a certain limit in terms of precision.

**Application: Estimation in Signal Processing**
In signal processing, estimating the frequency of a noisy signal involves applying estimators to noisy measurements. The Cramér-Rao Bound tells engineers the best possible accuracy they can achieve in estimating the signal frequency, allowing them to benchmark their algorithms against an optimal bound.
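
**Sketch (Python)**:
A simulation of the Gaussian-mean example: the sample mean is unbiased, and its variance sits essentially at the bound \( \sigma^2 / n = 1 / (n \, \mathcal{I}(\mu)) \).

```python
import numpy as np

rng = np.random.default_rng(3)

# Estimate mu from n Normal(mu, sigma^2) observations, many times over,
# and compare the estimator's variance with the Cramér-Rao bound sigma^2 / n.
mu, sigma, n, trials = 0.0, 2.0, 50, 20_000

samples = rng.normal(mu, sigma, size=(trials, n))
mean_estimates = samples.mean(axis=1)            # unbiased estimator of mu

crb = sigma**2 / n                               # = 1 / (n * I(mu)), since I(mu) = 1 / sigma^2
print(f"Var(estimator) = {mean_estimates.var():.5f}, Cramér-Rao bound = {crb:.5f}")
```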

---

### 8. **Bias-Variance Tradeoff**:
**Notation**:
\[
\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2
\]
**Intuition**:
The Bias-Variance Tradeoff shows that in model selection, there is often a tradeoff between how biased an estimator is and how much it varies. Lower bias usually leads to higher variance, and vice versa. The Mean Squared Error (MSE) decomposes into variance and squared bias, revealing the total error of an estimator.

**Example**:
In machine learning, a simple linear regression model might have high bias (underfitting), while a highly flexible model might have low bias but high variance (overfitting). The goal is to balance the two to minimize overall prediction error.

**Application: Model Selection in Machine Learning**
In machine learning, simpler models (like linear regression) have high bias and low variance, while more complex models (like deep neural networks) have low bias but high variance. To prevent overfitting or underfitting, practitioners balance bias and variance to minimize overall error. Regularization techniques (such as an L2 penalty) are often used to control model variance.
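
**Sketch (Python)**:
A toy decomposition at a single test point: a stiff degree-1 fit versus a flexible degree-9 fit to noisy samples of \( \sin(2\pi x) \), refit over many independent training sets (the data-generating setup is made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(4)

def true_fn(x):
    return np.sin(2 * np.pi * x)

# Refit each model on many independent training sets and record its prediction
# at one test point x0, so bias^2 and variance can be estimated separately.
x0, n_train, trials, noise = 0.3, 20, 500, 0.3
predictions = {1: [], 9: []}                     # degree 1 (stiff) vs degree 9 (flexible)

for _ in range(trials):
    x = rng.uniform(0, 1, n_train)
    y = true_fn(x) + rng.normal(0, noise, n_train)
    for degree in predictions:
        coeffs = np.polyfit(x, y, deg=degree)
        predictions[degree].append(np.polyval(coeffs, x0))

for degree, preds in predictions.items():
    preds = np.array(preds)
    bias2 = (preds.mean() - true_fn(x0)) ** 2    # squared bias at x0
    var = preds.var()                            # variance across training sets
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}, sum = {bias2 + var:.4f}")
```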

---

### 9. **Likelihood Function**:
**Notation**:
\[
\mathcal{L}(\theta \mid X) = p(X \mid \theta)
\]
**Intuition**:
The likelihood function gives the probability (or density) of the observed data \( X \) viewed as a function of the parameter \( \theta \), with the data held fixed. It is the central concept in maximum likelihood estimation (MLE), where the goal is to find the parameter \( \theta \) that maximizes the likelihood of observing the given data.

**Example**:
In estimating the parameters of a normal distribution, the likelihood function is constructed using the observed data, and we find the values of \( \mu \) and \( \sigma^2 \) that maximize this likelihood.

**Application: Maximum Likelihood Estimation (MLE) in Statistical Modeling**
Likelihood functions are used in Maximum Likelihood Estimation (MLE) to find the parameter values that maximize the probability of observing the given data. For example, in logistic regression, MLE is used to estimate the weights of the model that best predict binary outcomes (like whether a patient has a disease).
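
**Sketch (Python)**:
The Gaussian example on simulated data: the log-likelihood is maximized at the sample mean and the (biased, divide-by-\( n \)) sample variance, so moving \( \mu \) away from the MLE lowers it.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated data from Normal(mu = 10, sigma = 3); the MLEs are the sample mean
# and the divide-by-n sample variance, so we can check the log-likelihood there.
data = rng.normal(loc=10.0, scale=3.0, size=1_000)

def log_likelihood(mu, sigma2, x):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

mu_hat = data.mean()                 # closed-form MLE for mu
sigma2_hat = data.var()              # closed-form MLE for sigma^2

print(f"mu_hat = {mu_hat:.3f}, sigma2_hat = {sigma2_hat:.3f}")
print(f"log L at the MLE     : {log_likelihood(mu_hat, sigma2_hat, data):.2f}")
print(f"log L at mu_hat + 0.5: {log_likelihood(mu_hat + 0.5, sigma2_hat, data):.2f}")  # lower
```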

---

### 10. **Bayes’ Theorem**:
**Notation**:
\[
p(\theta \mid X) = \frac{p(X \mid \theta) p(\theta)}{p(X)}
\]
**Intuition**:
Bayes’ Theorem updates our prior beliefs about a parameter \( \theta \) after observing data \( X \). The posterior distribution \( p(\theta \mid X) \) combines the likelihood of the data given \( \theta \) and the prior belief about \( \theta \) to form an updated belief.

**Example**:
In medical testing, if a patient tests positive for a disease, Bayes’ Theorem helps update the probability that they actually have the disease by considering both the test’s reliability (likelihood) and the general prevalence of the disease (prior).

**Application: Spam Filtering**
In spam filtering (e.g., Naive Bayes spam classifier), Bayes’ Theorem is used to update the probability that an email is spam based on the words present in the email. Starting with a prior probability (based on historical data), the classifier adjusts this probability given the likelihood of each word occurring in spam vs. non-spam emails.
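
**Sketch (Python)**:
The medical-test example with illustrative numbers: even a fairly accurate test yields a modest posterior when the disease is rare.

```python
# Bayes' theorem for the medical-test example (all numbers are illustrative).
prevalence = 0.01          # prior P(disease)
sensitivity = 0.95         # likelihood P(positive | disease)
false_positive = 0.05      # P(positive | no disease)

# Evidence p(X): total probability of a positive test.
p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)

# Posterior P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
posterior = sensitivity * prevalence / p_positive
print(f"P(disease | positive test) = {posterior:.3f}")   # ~0.161, far below the test's accuracy
```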
