Stanford Machine Learning Notes #1

Stanford Machine Learning course

		Introduction

Supervised Learning
The Support Vector Machine allows the computer to deal with an infinite number of features.
Supervised Learning works with labeled examples: a regression problem predicts a continuous-valued output, while a classification problem predicts a discrete-valued output.

Unsupervised Learning
Google News uses a clustering algorithm to group articles with the same or similar content.
Unsupervised Learning is like giving the computer a set of data and asking it to cluster the examples based on similarities, without giving it much more information.
Social network analysis and astronomical data analysis both use clustering, i.e. Unsupervised Learning.

Code for separating out audio
[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');  % one-line separation demo from the lecture; x holds the mixed audio data

	Model Representation

The training set feeds the learning algorithm with data.
The learning algorithm outputs a function h, called the hypothesis.
h takes an input x and maps it to a predicted value of y.
The model is called linear regression with one variable x.
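
As a small illustration (not part of the original notes), the one-variable hypothesis can be written in Octave; theta is assumed to be the parameter vector [theta0; theta1] and x a column vector of inputs:

h = @(theta, x) theta(1) + theta(2) .* x;   % h_theta(x) = theta0 + theta1 * x
predictions = h([1; 2], [0; 1; 2]);         % example: line with intercept 1 and slope 2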
Cost Function
θ0 is the intercept of the line (where it crosses the vertical axis) and θ1 is its slope.
The cost function is also called the squared error function.

Difference between hypothesis and cost function
The hypothesis h is a function of x; in contrast, the cost function J is a function of the parameter θ1.

The objective is to choose the value of θ1 that minimizes J(θ1).
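
A minimal sketch of the squared error cost function, assuming X is the m x 2 design matrix [ones(m,1) x], y the m x 1 vector of targets, and theta = [theta0; theta1]:

function J = computeCost(X, y, theta)
  m = length(y);                        % number of training examples
  errors = X * theta - y;               % h_theta(x) - y for every example
  J = (1 / (2 * m)) * sum(errors .^ 2); % J(theta) = 1/(2m) * sum of squared errors
end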
Gradient descent repeats until convergence
:= means assignment (giving a variable a new value)
= means assertion (a claim that two values are equal)

In the gradient descent algorithm:

Alpha is the learning rate; it determines how big a step we take on each update.
The next term is a derivative term.
We simultaneously update θ0 and θ1 by computing the right-hand sides of both updates before assigning either.
Learning Rate: if alpha is too small, gradient descent can be slow; if it is too large, it tends to overshoot the minimum and may fail to converge.
The derivative term is the slope of the cost function at the current point.
If θ1 is already at the minimum, the derivative term is zero and θ1 remains unchanged.
For each step of gradient descent the derivative becomes smaller, so the steps shrink until we converge to the local minimum, where the derivative is zero.
We apply the gradient descent algorithm to the linear regression model to minimize its cost function.
Taking further steps of gradient descent, the hypothesis changes (the fitted line moves closer to the data).
Batch means that each step of gradient descent uses the entire training data.
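
A sketch of batch gradient descent for the linear regression cost above; alpha (the learning rate) and num_iters are assumed inputs, and the vectorized update performs the simultaneous update of θ0 and θ1:

function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    errors = X * theta - y;                       % derivative term uses the current theta
    theta = theta - (alpha / m) * (X' * errors);  % update theta0 and theta1 together
  end
end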

		Matrices and Vectors

A vector is a matrix with one column
Matrix multiplication is (see the example below):
  • not commutative (A*B ≠ B*A in general)
  • associative ((A*B)*C = A*(B*C))
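
A quick Octave check of these two properties (the matrices here are arbitrary examples):

A = [1 2; 3 4];  B = [0 1; 1 0];  C = [2 0; 0 2];
A * B                % not equal to B * A in general (not commutative)
B * A
(A * B) * C          % equal to A * (B * C) (associative)
A * (B * C)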

Mean normalization is making the features have approximately zero mean.
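
A sketch of mean normalization, assuming X holds one feature per column; dividing by the standard deviation (feature scaling) is one common choice, the feature's range is another:

mu     = mean(X);               % per-feature mean
sigma  = std(X);                % per-feature standard deviation
X_norm = (X - mu) ./ sigma;     % features now have approximately zero mean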

If you plot J(θ) against the number of iterations and J is increasing, that is a sign that gradient descent is not working and a smaller learning rate should be used. The increase is caused by overshooting when the learning rate is too large.
J(θ) should decrease after every iteration if an appropriate learning rate is used.
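
A sketch of that convergence check, assuming gradient descent also recorded J(θ) after each iteration in a vector J_history:

plot(1:numel(J_history), J_history);   % should be monotonically decreasing
xlabel('number of iterations');
ylabel('J(theta)');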
With Gradient Descent you need to choose a learning rate alpha and many iterations are required, but it works well even when the number of features n is large.
With the Normal Equation no learning rate has to be chosen and no iterations are necessary, but it is slow if n is large because you need to compute the inverse of XᵀX.
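
A sketch of the normal equation, assuming X already includes the leading column of ones and y is the target vector:

theta = pinv(X' * X) * X' * y;   % no learning rate and no iterations needed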

			Classification

Applying linear regression to a classification problem is often not a great idea.
The decision boundary is a property not of the training set but of the hypothesis and its parameters.
The decision boundary is the line that separates the region where y = 0 from the region where y = 1; it is created by our hypothesis function.
Advanced optimization algorithms are quite a bit more complex than gradient descent.
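
A sketch of the logistic regression hypothesis; X and theta are assumed as before, and the decision boundary is where θᵀx = 0, i.e. where h_θ(x) crosses 0.5:

g = @(z) 1 ./ (1 + exp(-z));       % sigmoid function
h = @(theta, X) g(X * theta);      % h_theta(x) = g(theta' * x) for each example
predict_one = h(theta, X) >= 0.5;  % predict y = 1 on that side of the boundary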
Multiclass Classification
This is a classification problem with more than two classes.
With the idea of one-versus-all classification, we can accomplish multiclass classification.
We convert the multiclass classification problem into several binary classification problems, one per class.
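
A sketch of one-versus-all prediction, assuming all_theta holds one trained logistic regression parameter row per class:

g = @(z) 1 ./ (1 + exp(-z));
probs = g(X * all_theta');                  % one probability per class, per example
[~, predicted_class] = max(probs, [], 2);   % pick the most confident classifier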

				Cost Function

There are binary classification and multiclass classification problems.
The logistic regression cost function is different from the neural network cost function: the network has K output units, so its cost sums over all of them.

The backpropagation algorithm is an algorithm for minimizing the cost function: it computes the gradient of the cost with respect to every parameter.
Using this algorithm we compute the error term of each node in each layer.
We can apply forward propagation and backpropagation to one training example at a time.
When implementing, for advanced optimization, we unroll the parameter matrices into vectors.
We use the reshape function in Octave to restore the unrolled parameters back into matrices.
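
A sketch of the unroll/reshape round trip; the sizes 10x11 for Theta1 and 1x11 for Theta2 are just assumed for illustration:

thetaVec = [Theta1(:); Theta2(:)];            % unroll the matrices into one long vector
Theta1   = reshape(thetaVec(1:110), 10, 11);  % restore the original shapes
Theta2   = reshape(thetaVec(111:121), 1, 11);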

		Debugging a learning algorithm
  • Get more training examples
  • Try smaller sets of features
  • Try getting additional features
  • Try adding polynomial features
  • Try decreasing lambda
  • Try increasing lambda

For hypothesis evaluation, it is better to divide a dataset into 3 parts:

  • Training Set (60%)
  • Cross-Validation Set (20%)
  • Testing Set (20%)
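
A sketch of that 60/20/20 split, assuming X (m x n) and y (m x 1) and shuffling the examples first:

m   = size(X, 1);
idx = randperm(m);                            % shuffle the example indices
n_train = round(0.6 * m);
n_cv    = round(0.2 * m);
X_train = X(idx(1:n_train), :);               y_train = y(idx(1:n_train));
X_cv    = X(idx(n_train+1:n_train+n_cv), :);  y_cv    = y(idx(n_train+1:n_train+n_cv));
X_test  = X(idx(n_train+n_cv+1:end), :);      y_test  = y(idx(n_train+n_cv+1:end));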

For high bias (underfitting), the training error will be high and the cross-validation error will be approximately equal to the training error.
For high variance (overfitting), the training error will be low and the cross-validation error will be much greater than the training error.
In a high bias problem, getting more training data is unlikely to help.
In a high variance problem, getting more training data is likely to help.
Decreasing lambda helps to fix high bias and increasing lambda helps to fix high variance.
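
A sketch of that diagnosis, assuming J_train and J_cv are the training and cross-validation errors; desired_error and gap are hypothetical thresholds used only for this sketch:

high_bias     = (J_train > desired_error) && (J_cv - J_train < gap);    % underfitting: both errors high and close together
high_variance = (J_train <= desired_error) && (J_cv - J_train >= gap);  % overfitting: large gap between the two errors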
