Stanford Machine Learning course
Supervised Learning
The Support Vector Machine allows the computer to deal with an infinite number of features.
Supervised Learning works with labelled examples: a regression problem predicts a continuous-valued output, and a classification problem predicts a discrete-valued output.
Unsupervised Learning
Google uses a clustering algorithm to group articles with the same or similar content.
Unsupervised Learning is like giving the computer a set of data and asking it to cluster the examples based on similarities, without giving it much more information.
Social network analysis and astronomical data analysis both use clustering, i.e. Unsupervised Learning.
Octave one-liner shown in the lecture for separating mixed audio sources (the cocktail party problem):
[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
Model Representation
The training set feeds data to the learning algorithm.
The learning algorithm outputs a function h, called the hypothesis.
h takes an input x and maps it to an estimated output y.
This model is called linear regression with one variable (univariate linear regression).
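As a minimal sketch (the parameter values and inputs below are made up for illustration), the hypothesis h_theta(x) = theta_0 + theta_1 * x can be written in Octave as:
theta = [0.5; 1.2];            % example parameters [theta_0; theta_1]
x = [1; 2; 3];                 % example input values
h = theta(1) + theta(2) .* x   % predicted y for each x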
Cost Function
theta_0 is the intercept of the line (where it starts on the y-axis) and theta_1 is its slope.
The cost function is also called the squared error function.
Difference between hypothesis and cost function
The hypothesis is a function of x; in contrast, the cost function J is a function of the parameter theta_1.
The objective is to choose the value of theta_1 that minimizes J(theta_1).
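A rough sketch of the squared error cost J(theta) = (1/2m) * sum((h_theta(x) - y)^2) in Octave, assuming X already contains a leading column of ones (the function name is illustrative):
function J = computeCost(X, y, theta)
  m = length(y);                                     % number of training examples
  predictions = X * theta;                           % h_theta(x) for every example
  J = (1 / (2 * m)) * sum((predictions - y) .^ 2);   % squared error cost
end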
Gradient descent repeats until convergence
:= means assigning a value to something
= means assertion
In the gradient descent algorithm:
Alpha is the learning rate; it determines how big a step we take on each gradient descent update.
The next term is a derivative term.
We update theta_0 and theta_1 simultaneously: compute both right-hand sides first, then assign both new values.
Learning Rate: if alpha is too small, gradient descent is slow; if it is too large, it can overshoot the minimum and may fail to converge.
The derivative term is the slope of the cost function at the current point.
If theta_1 is already at the minimum, the derivative term equals zero and theta_1 remains unchanged.
As we approach the local minimum, the derivative becomes smaller, so gradient descent automatically takes smaller steps; at the local minimum the derivative is zero.
We apply the gradient descent algorithm to the linear regression model to minimize its cost function.
As gradient descent takes further steps, the hypothesis changes, fitting the data better and better.
"Batch" gradient descent means that every step uses the entire training set.
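A minimal batch gradient descent sketch for linear regression (the function and variable names are illustrative; X is assumed to contain a leading column of ones):
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    errors = X * theta - y;                       % h_theta(x) - y for every example
    theta = theta - (alpha / m) * (X' * errors);  % simultaneous update of all parameters
  end
end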
Matrices and Vectors
A vector is a matrix with one column
Matrix multiplication is:
- Not commutative
- Associative
(Both properties are checked in the quick snippet below.)
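A quick Octave check of both properties with small made-up matrices:
A = [1 2; 3 4];  B = [0 1; 1 0];  C = [2 0; 0 2];
isequal(A * B, B * A)              % 0: multiplication is not commutative
isequal((A * B) * C, A * (B * C))  % 1: multiplication is associative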
Mean normalization means making the features have approximately zero mean (subtract each feature's mean); it is usually combined with feature scaling so the features also have a similar range.
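A short sketch of mean normalization combined with feature scaling (dividing by the standard deviation here; dividing by the range of values is another common choice):
mu = mean(X);                                              % row vector of feature means
sigma = std(X);                                            % row vector of feature standard deviations
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);   % each feature now has ~zero mean and unit scale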
If you plot J(theta) against the number of iterations and J is increasing, that is a sign that gradient descent is not working and a smaller learning rate should be used; the increase is caused by a learning rate large enough to overshoot the minimum.
J(theta) should decrease after every iteration when an appropriate learning rate is used.
With Gradient Descent you need to choose a learning rate alpha and many iterations are required, but it works well even when n (the number of features) is large.
With the Normal Equation no learning rate has to be chosen and no iterations are necessary, but it is slow when n is large because it needs the inverse of X' * X (X transpose times X).
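A one-line sketch of the normal equation in Octave (pinv is used rather than inv so the computation still works if X' * X happens to be non-invertible):
theta = pinv(X' * X) * X' * y;   % theta = (X^T X)^(-1) X^T y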
Classification
Applying linear regression to a classification problem is usually not a good idea.
The decision boundary is a property not of the training set but of the hypothesis and the parameters
The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.
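A small sketch of the logistic regression hypothesis and the decision rule it induces (variable names are illustrative):
sigmoid = @(z) 1 ./ (1 + exp(-z));   % g(z) = 1 / (1 + e^(-z))
h = sigmoid(X * theta);              % h_theta(x) = g(theta' * x) for every example
predictions = (h >= 0.5);            % predict y = 1 exactly when theta' * x >= 0
% The set of points where theta' * x = 0 is the decision boundary.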
Advanced optimization algorithms are quite a bit more complex than gradient descent, but they often converge faster and do not require manually choosing a learning rate.
Multiclass Classification
This is a classification problem where examples have to be assigned to more than two classes.
With the idea of one-versus-all classification, we can accomplish multiclass classification.
We convert the multiclass classification problem into several binary classification problems, one per class.
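A rough one-vs-all prediction sketch (all_theta is assumed to hold one row of fitted logistic regression parameters per class, and sigmoid is the function from the sketch above):
probabilities = sigmoid(X * all_theta');              % one column of h_theta(x) values per class
[max_prob, predictions] = max(probabilities, [], 2);  % pick the most confident class for each example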
Cost Function
There are two cases: binary classification and multiclass classification.
The Neural Network cost function is a generalization of the logistic regression cost function, summed over all K output units.
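A minimal sketch of the (unregularized) logistic regression cost J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)) in Octave:
m = length(y);
h = 1 ./ (1 + exp(-(X * theta)));                         % h_theta(x) via the sigmoid
J = (-1 / m) * sum(y .* log(h) + (1 - y) .* log(1 - h));  % cross-entropy / logistic cost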
The backpropagation algorithm is an algorithm for minimizing the cost function: it computes the partial derivatives of the cost with respect to the parameters.
Using this algorithm we compute the error of each node in each layer.
We can run forward propagation and backpropagation on one training example at a time.
When implementing this with advanced optimization routines, we unroll the parameter matrices into a single vector.
We use the reshape function in Octave to restore the parameter matrices from the unrolled vector.
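A small sketch of unrolling the parameter matrices and restoring them with reshape (the layer sizes 400/25/10 are just example numbers):
Theta1 = rand(25, 401);                                  % weights mapping layer 1 (400 units + bias) to layer 2
Theta2 = rand(10, 26);                                   % weights mapping layer 2 (25 units + bias) to layer 3
nn_params = [Theta1(:); Theta2(:)];                      % unroll both matrices into one long vector
Theta1_back = reshape(nn_params(1:25*401), 25, 401);     % restore the original shapes
Theta2_back = reshape(nn_params(25*401+1:end), 10, 26);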
Debugging a learning algorithm
Get more training examples
Try smaller sets of features
Try getting additional features
Try adding polynomial features
Try decreasing lambda
Try increasing lambda
For hypothesis evaluation, it is better to divide a dataset into 3 parts (a quick split sketch follows this list):
Training Set (60%)
Cross-Validation Set (20%)
Testing Set (20%)
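A minimal sketch of such a split in Octave (randperm shuffles the example indices; the proportions follow the 60/20/20 rule above):
m = size(X, 1);
idx = randperm(m);                               % random ordering of the m examples
n_train = round(0.6 * m);
n_cv = round(0.2 * m);
X_train = X(idx(1:n_train), :);             y_train = y(idx(1:n_train));
X_cv = X(idx(n_train+1:n_train+n_cv), :);   y_cv = y(idx(n_train+1:n_train+n_cv));
X_test = X(idx(n_train+n_cv+1:end), :);     y_test = y(idx(n_train+n_cv+1:end));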
For high bias (underfit), the training error will be high and the cross-validation error will be approximately equal to the training error.
For high variance (overfit), the training error will be low and the cross-validation error will be much greater than the training error.
In a high bias problem, getting more training data is unlikely to help.
In a high variance problem, getting more training data is likely to help.
Decreasing lambda (the regularization parameter) helps fix high bias, and increasing lambda helps fix high variance.