
Study-09-MachineLearning-[Supervised Learning]

  • Let's find the Decision Surface!

(A1) Linear Regression (when 'y' follows Normal-Dist)

  • For numeric data point
Single value
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(df[ ['predictor'] ], df[ ['response'] ])

model.predict([ [127], [248] ])

array([[ 438.94308857],
       [ 127.14839521]])

  • (The reason for predicting on an array like [127] rather than a bare 127 is that a model may make predictions from multiple features; each inner list holds one observation's feature values.)
Multiple values
  • (The dataset consists of 13 features of 506 houses and their median value in $1000's. We fit a model on the 13 features to predict the value of the houses, i.e., 'x' has 506 rows.)
from sklearn.datasets import load_boston
boston_data = load_boston()
x = boston_data['data']
y = boston_data['target']

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)

sample_house = [[2.29690000e-01, 0.00000000e+00, 1.05900000e+01, 0.00000000e+00, 4.89000000e-01,
                6.32600000e+00, 5.25000000e+01, 4.35490000e+00, 4.00000000e+00, 2.77000000e+02,
                1.86000000e+01, 3.94870000e+02, 1.09700000e+01]]

model.predict(sample_house)

array([ 23.68420569])
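Note: load_boston was removed in scikit-learn 1.2 (the dataset has an ethically problematic feature). A minimal sketch of the same workflow on the California housing data, assuming a scikit-learn version that ships fetch_california_housing:

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

housing = fetch_california_housing()
x = housing['data']      # 8 features per district
y = housing['target']    # median house value, in $100,000's

model = LinearRegression()
model.fit(x, y)
model.predict(x[:1])     # predict on the first district as a sample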

(A2) Logistic Regression

Linear models are the most useful applied statistical technique. However, they are not without their limitations. Additive response models don't make much sense if the response is discrete or strictly positive. Additive error models often don't make sense, for example, if the outcome has to be positive. Transformations, such as taking a cube root of a count outcome, are often hard to interpret. In addition, there's value in modeling the data on the scale that it was collected. The particularly interpretable transformations, natural logarithms in particular, aren't applicable for negative or zero values.

The generalized linear model is a family of models that includes linear models. By extending the family, it handles many of the issues with linear models, but at the expense of some complexity and the loss of some mathematical tidiness. A GLM involves three components.

  • An exponential family model for the response.
  • A systematic component via a linear predictor.
  • A link function that connects the means of the response to the linear predictor.

The three most famous cases of GLMs are: linear models, binomial and binary regression and Poisson regression. We’ll go through the GLM model specification and likelihood for all three.

Typical Approach

  • Fitting a logistic regression to a dataset where we would like to predict if a transaction is fraud or not.

As we can see, there are two columns that need to be changed to dummy variables. Use 1 for weekday and for True, and 0 otherwise.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# df: the fraud-transactions DataFrame loaded earlier
df['weekday'] = pd.get_dummies(df['day'])['weekday']
df[['not_fraud','fraud']] = pd.get_dummies(df['fraud'])

df = df.drop('not_fraud', axis=1)
df.head(2)

What proportion of transactions are fraudulent? What proportion occur on a weekday? And how does duration differ by fraud status?

print(df['fraud'].mean())
print(df['weekday'].mean())
print(df.groupby('fraud').mean()['duration'])

Fit a logistic regression model to predict if a transaction is fraud using both weekday and duration. Don't forget an intercept! Instead of 'OLS', we use 'Logit'.

df['intercept'] = 1

log_model = sm.Logit(df['fraud'], df[['intercept', 'weekday', 'duration']])
result = log_model.fit()
result.summary()

Coeff-interpret: we need to exponentiate our coefficients before interpreting them.

# np.exp(result.params)
np.exp(2.5465)
np.exp(-1.4637), 100/23.14

12.762357271496972, (0.23137858821179411, 4.32152117545376)

On weekdays, fraud is 12.76 (e^2.5465) times as likely as on weekends, holding 'duration' constant.

For each minute less spent on the transaction, fraud is 4.32 (1/e^-1.4637) times as likely, holding 'weekday' constant.

*Note: when an ordinal variable is encoded as numbers, convert it to a categorical (string) type before dummying:

df['column'].astype(str).value_counts()

Diagnostics

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
  • Confusion Matrix
    • Recall ('reality': out of all the items that are truly positive, how many did we catch?): TP / (TP+FN)
    • Precision ('argued': out of all the items labeled positive, how many really are?): TP / (TP+FP)

  • Next, it is useful to split your data into training and testing sets, to assure your model predicts well not only on the data it was fit to but also on data it has never seen before. Showing that the model performs well on test data gives confidence that it will do well in future use cases. Let's pull off X and y first. Create your test set as 10% of the data, and use a random state of 0.
X = df[['intercept', 'weekday', 'duration']]
y = df['fraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0)

The usual steps are:

  • Instantiate
  • Fit (on train)
  • Predict (on test)
  • Score (compare predict to test)
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
pred = log_model.predict(X_test)

print(accuracy_score(y_test, pred))
print(recall_score(y_test, pred))
print(precision_score(y_test, pred))
confusion_matrix(y_test, pred)

ROC Curve: the ideal case is for the curve to shoot all the way to the upper left-hand corner.

from ggplot import *
from sklearn.metrics import roc_curve, auc

preds = log_model.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, preds)

df = pd.DataFrame(dict(fpr=fpr, tpr=tpr))
ggplot(df, aes(x='fpr', y='tpr')) + geom_line() + geom_abline(linetype='dashed')
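The 'ggplot' Python port used above is abandoned; here is a minimal matplotlib sketch of the same ROC plot, assuming the fitted log_model and the test split from above:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

preds = log_model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, preds)

plt.plot(fpr, tpr, label='ROC (AUC = %.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle='--')   # the chance line
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.legend()
plt.show()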


(B) DecisionTree

  • For continuous, categoric data
  • For non-binary classification

PREDICTION: based on the features, we can guess the apps that the future users would download.

  • Unlike an SVM using a kernel trick, DecisionTrees use a trick that lets a linear DecisionSurf do non-linear decision making.
  • When making Decision Trees, we ask questions: on which features do we base our decisions? What is the threshold for classifying each question into a yes or no answer? By adding an additional question, we can define the Yes and No classes more sharply!

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
import sys
from class_vis import prettyPicture, output_image
from prep_terrain_data import makeTerrainData

features_train, labels_train, features_test, labels_test = makeTerrainData()

We build two DecisionTree classifiers: one with min_samples_split=2, and the other with min_samples_split=50. What's the difference in accuracy? And how do we prevent overfitting?

Store your predictions in a list named 'pred_2', 'pred_50'.

from sklearn import tree

clf_2 = tree.DecisionTreeClassifier(min_samples_split=2)
clf_50 = tree.DecisionTreeClassifier(min_samples_split=50)

X = features_train
y = labels_train

clf_2.fit(X, y)
clf_50.fit(X, y)

pred_2 = clf_2.predict(features_test)
pred_50 = clf_50.predict(features_test)

Accuracy? Which is better, clf_2 or clf_50? Well, min_samples_split=2 splits too aggressively: it overfits, giving lower test accuracy.

from sklearn.metrics import accuracy_score

acc_min_samples_split_2 = accuracy_score(pred_2, labels_test)
acc_min_samples_split_50 = accuracy_score(pred_50, labels_test)

def submitAccuracies():
  return {"acc_min_samples_split_2":round(acc_min_samples_split_2, 3),
          "acc_min_samples_split_50":round(acc_min_samples_split_50, 3)}

DecisionTree & Entropy

  • Entropy: a measure of [impurity] in a bunch of examples... let's say it's the opposite of purity.
  • Entropy controls how a DecisionTree decides where to split the data, to make the subsets as pure as possible...
  • Entropy describes what's going on here: - p_a * log2(p_a) - p_b * log2(p_b), where p_a and p_b are the class proportions.

Entropy measures variance in categorical variables.

If we have a categorical variable consisting of entries (a, b), with p(a)=0.5 and p(b)=0.5, then our entropy is

import math

-0.5*math.log(0.5, 2) -0.5*math.log(0.5, 2)

Which is 1, the maximum possible entropy for two classes, i.e., maximum impurity.

Our DecisionTree picks the split with the maximum Information Gain (see the sketch below).

  • First, calculate the "Parent's Entropy".
  • Second, look at the possible splits that each column gives, and calculate each "Child's Entropy".
  • Third, calculate each column's "Information Gain" (the parent's entropy minus the weighted average of the children's entropies) and pick the largest.
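A minimal from-scratch sketch of this criterion, assuming a parent node split into two children:

import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # parent entropy minus the size-weighted average of the children's entropies
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = ['a', 'a', 'b', 'b']
print(information_gain(parent, ['a', 'a'], ['b', 'b']))   # 1.0 -> a perfect split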

Hyperparameters for Decision Trees

  • max_depth
    • the largest number of steps from the root to a leaf. A tree of maximum depth k can have at most 2^k leaves (the very ends).
    • Of course, too large a depth very often causes overfitting.
  • min_samples_leaf(the very end)
    • a minimum for the number of samples we allow on each individual leaf.
    • This number can be specified as an integer or as a float. If it's an integer, it's the minimum number of samples in a leaf. If it's a float, it's interpreted as the minimum fraction of samples per leaf. For example, 0.1 (10%) means a split is not allowed if it leaves fewer than 10% of that node's samples in one of the leaves.
    • Of course, too small minimum samples per leaf results in overfitting.
  • min_samples_split
    • This is the same as the minimum number of samples per leaf, but applied on any split of a node.
  • max_features
    • Oftentimes we have too many features to build a tree on. To speed things up, we limit the number of features that are considered in each split.
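These hyperparameters are passed straight to the sklearn constructor; a minimal sketch on the terrain data from above (the values are illustrative, not recommendations):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=7,
                             min_samples_leaf=10,    # or a float such as 0.1 for 10%
                             min_samples_split=20,
                             max_features=None)      # None = consider all features
clf.fit(features_train, labels_train)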

RandomForests >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

What if we have so many columns? ---- Warning of Overfitting !!! How to solve ?

    1. Pick some of the columns randomly
    2. Build a DecisionTree on those columns
    3. Repeat
    4. Let the trees vote!
    • When we have a new data-pt, let all the trees make a prediction and pick the one that appears the most.
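sklearn bundles this whole recipe into RandomForestClassifier; a minimal sketch on the terrain data from above:

from sklearn.ensemble import RandomForestClassifier

# each tree is fit on a bootstrap sample and considers a random subset of features
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt')
forest.fit(features_train, labels_train)
pred = forest.predict(features_test)   # each tree votes; the majority wins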

EnsembleMethods >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Take a bunch of models and join them together to get a better model

  • Bagging (Bootstrap Aggregating): check each result and combine (average out, or vote).
  • Boosting: a bit more elaborate than Bagging. It tries hard to exploit the strength of each model, then combines them.

Bagging

  • As our data is huge, we don't want to train many models on the same data, so we take random subsets of the data and train a weak learner (a one-node DecisionTree) on each of these subsets.
  • Impose each result over the data and vote (if two or more of them say blue, it's blue).
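A minimal sketch with sklearn's BaggingClassifier, using a one-node (depth-1) DecisionTree as the weak learner (scikit-learn ≥ 1.2 names the parameter estimator; older versions call it base_estimator):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                        n_estimators=50,
                        max_samples=0.5)   # each weak learner sees a random half of the data
bag.fit(features_train, labels_train)
pred = bag.predict(features_test)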

AdaBoost

  • First, fit our first weak learner to maximize accuracy (or, equivalently, minimize the number of errors): do no better than 3 errors! Then the misclassified points get their weights increased (punish them).
  • Our second learner needs to fix the mistakes the first one made, correctly classifying those points at any expense, and then the points it misclassifies itself get punished.
  • Our third learner needs to fix the mistakes the second one made, in the same way... we could go on and on, but let's say 3 learners are enough, and we combine them.
  • OVERALL

  • DETAIL
    • Assign an initial weight of 1 to every point. Fit the first learner to minimize the number of errors, then re-weight the misclassified points by multiplying their weights by correct_sum/incorrect_sum, which makes the total weight of the correct and the incorrect points equal.
    • Keep this going for each new learner.
    • Here, notice that correct_sum/incorrect_sum = accuracy/(1-accuracy), and we put it into ln( ), which gives each learner its final weight.
    • To classify a new point, our concern is whether the sum of these final weights (of the learners voting each way) is + or -.

from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier()
model.fit(x_train, y_train)
model.predict(x_test)

When we define the model, we can specify the hyperparameters. In practice, the most common ones are:

  • base_estimator: the model used for the weak learners (here, a DecisionTree; note that scikit-learn ≥ 1.2 renames this parameter to estimator)
  • n_estimators: the maximum number of weak learners used.
from sklearn.tree import DecisionTreeClassifier
model = AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth=2), n_estimators = 4)

Decision Tree Regression

  • Predicted output = the average of the training examples in the leaf's subset
  • It requires a different impurity measure (sklearn uses the squared error instead of entropy)
  • We can even use linear regression at the leaves!! (A minimal sketch follows.)
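A minimal sklearn sketch, on hypothetical 1-D data:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 50).reshape(-1, 1)   # hypothetical 1-D feature
y = np.sin(X).ravel()                       # hypothetical target

reg = DecisionTreeRegressor(max_depth=3)    # each leaf predicts the mean of its samples
reg.fit(X, y)
reg.predict([[2.5]])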


(C) Naive Bayes

  • For categoric, continuous data
    • For continuous features, you need to build a probability distribution from the data and plug the new data point in.
  • In a contingency table, we have rows (features) and columns (event/non-event). First, we need to know whether the features are independent of each other. Rows and columns are of course not independent; they are tied by the conditional probability P(E|A). But with many features, P(E|A,B,C,D,...) becomes intractable, and this is where Naive Bayes comes into play: assuming the features are independent, multiplication gives the joint probability of multiple features.

PREDICTION: when future emails come, we can combine these features to guess if they are spam or not.

  • Naive Bayes is an extension of the Bayes Theorem to more than one feature, with the (naive) assumption that each feature is independent of the others.
  • Library: sklearn.naive_bayes (Gaussian)
  • Example: Compute the accuracy of your Naive Bayes classifier. Accuracy is defined as the number of test points that are classified correctly divided by the total number of test points.
def NBAccuracy(features_train, labels_train, features_test, labels_test):
    from sklearn.naive_bayes import GaussianNB
    clf = GaussianNB()    ### create classifier ###
    clf.fit(features_train, labels_train)    ### fit the classifier on the training features and labels ###
    pred = clf.predict(features_test)    ### use the trained classifier to predict labels for the test features ###

    ### calculate and return the accuracy on the test data ###
    accuracy = clf.score(features_test, labels_test)

    ### or, equivalently, with sklearn's accuracy_score ###
    # from sklearn.metrics import accuracy_score
    # accuracy = accuracy_score(labels_test, pred)

    return(accuracy)

It throws an accuracy of 88.4%, which means 88.4% of the test points are correctly labelled by our classifier 'clf'!

Bayes Rule:

*Semantically, what Bayes rule does is incorporate some evidence from a test into our prior to arrive at a posterior.

  • Prior: the probability before running the test, P(A).
  • Test evidence: the result of the test, with likelihood P(E|A).
  • Posterior: the probability after incorporating the evidence, P(A|E) = P(E|A)·P(A) / P(E).
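A minimal numeric sketch with hypothetical numbers for a disease test (1% prior, 90% sensitivity, 10% false-positive rate):

prior = 0.01            # P(D): probability of disease before the test
sensitivity = 0.9       # P(+|D)
false_positive = 0.1    # P(+|no D)

# joint probabilities of testing positive
joint_d = prior * sensitivity
joint_no_d = (1 - prior) * false_positive

# normalize to get the posterior P(D|+)
posterior = joint_d / (joint_d + joint_no_d)
print(posterior)        # ~0.083: the evidence updates the 1% prior to ~8.3%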

*Algorithm of Naive Bayes

Ex) Text Forensic and Learning (ex. Whose email would it be ?)

Ex) Multiple Evidences(test results)

Spam detection is one of the major applications of Machine Learning in the interwebs today. Pretty much all of the major email service providers have spam detection systems built in and automatically classify such mail as 'Junk Mail'.

What are spammy messages? Usually they have words like 'free', 'win', 'winner', 'cash', 'prize' and the like in them, as these texts are designed to catch your eye and in some sense tempt you to open them. Also, spam messages tend to have words written in all capitals and also tend to use a lot of exclamation marks. To the recipient, it is usually pretty straightforward to identify a spam text, and our objective here is to train a model to do that for us! Being able to identify spam messages is a binary classification problem, as messages are classified as either 'Spam' or 'Not Spam' and nothing else. This project has been broken down into the following steps:

  • Step 0: Introduction to the Naive Bayes Theorem
  • Step 1.1: Understanding our dataset
  • Step 1.2: Data Preprocessing
  • Step 2.1: Bag of Words(BoW)
  • Step 2.2: Implementing BoW from scratch
  • Step 2.3: Implementing Bag of Words in scikit-learn
  • Step 3.1: Training and testing sets
  • Step 3.2: Applying Bag of Words processing to our dataset.
  • Step 4.1: Bayes Theorem implementation from scratch
  • Step 4.2: Naive Bayes implementation from scratch
  • Step 5: Naive Bayes implementation using scikit-learn
  • Step 6: Evaluating our model
  • Step 7: Conclusion
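A minimal end-to-end sketch of Steps 2-6, on hypothetical messages (the real project uses the SMS Spam Collection dataset):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

messages = ['WINNER!! Claim your FREE prize now!!!',   # hypothetical spam
            'Are we still on for lunch today?',        # hypothetical ham
            'Free cash! Click to win!!!',
            'See you at the meeting tomorrow.']
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam

X_train, X_test, y_train, y_test = train_test_split(messages, labels,
                                                    test_size=0.5, random_state=1)

vectorizer = CountVectorizer()                  # Bag of Words: token counts
train_counts = vectorizer.fit_transform(X_train)
test_counts = vectorizer.transform(X_test)

nb = MultinomialNB()                            # multinomial NB suits count features
nb.fit(train_counts, y_train)
print(nb.predict(test_counts))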

Curse of Dimensionality

  • As the number of descriptive features grows, the number of potential conditioning events grows. Consequently, an exponential increase is required in the size of the dataset as each new descriptive feature is added to ensure that for any conditional probability there are enough instances in the training dataset matching the conditions so that the resulting probability is reasonable.


(D) Support Vector Machine

  • For categoric data
  • For numeric data

PREDICTION: face detection, handwriting recognition, time series, stock value prediction

SVM is a set of supervised learning methods used for

  • classification
  • regression
  • outlier detection. SVMs don't work well with lots and lots of noise, so when the classes are heavily overlapping, you have to count independent evidence (and something like Naive Bayes may serve better).

It is not a probabilistic model; i.e., it does not postulate a probability distribution and thus does not assume any randomness. It merely tries to draw a simple line(or plane or hyperplane in higher dimensions) to separate the data points into two parts. That's all. Note that the dataset contains labeled data.

One difficulty was that oftentimes the classifier(the separating 'line' or 'hyperplane') cannot be defined linearly, which means it's not actually a straight line or plane that separates the two sets. It should rather be a wavy curve or surface. So what do we do? We lift the feature space to a higher or possibly an infinite dimensional space so that a linear classifier is possible. This is called the kernel trick. This is what the support vector machine does.

Now applying this to a regression problem, linear regression could be described as an attempt to draw a line (or similarly plane or hyperplane in higher dimensions) that minimizes the error(or the loss function). Therefore, if we choose different loss functions, the regression line(or plane, hyperplane) changes. When the feature space seemingly isn't best served by a simple line or plane but rather calls for something wavy as seen in the classification problem, instead of approximating the wavy object, we again use the kernel trick to lift the feature space into a higher dimension. In this task, the output is a real value.

In SVM, tuning the parameters can be a lot of work, but GridSearchCV, a great sklearn tool, can find an optimal parameter combination almost automatically (a sketch follows).
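A minimal GridSearchCV sketch for an SVC on the terrain data (features_train etc., as in the other examples; the parameter grid is illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'kernel': ['linear', 'rbf'],
              'C': [0.1, 1, 10],
              'gamma': [0.01, 0.1, 1]}

grid = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold CV over every combination
grid.fit(features_train, labels_train)
print(grid.best_params_)
pred = grid.best_estimator_.predict(features_test)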

Naive Bayes is great for 'text'. It’s faster and generally gives better performance than an SVM. Of course, there are plenty of other problems where an SVM might work better. Knowing which one to try when you’re tackling a problem for the first time is part of the art of ML.

Pros & Cons

  • The advantages of support vector machines are:

    • Effective in cases where number of dimensions is greater than the number of samples.
    • Uses a subset of training points in the decision function called support vectors, so it is also memory efficient.
    • Versatile: different Kernel functions can be specified for the decision function(Common kernels are provided, but it is also possible to specify custom kernels).
    • Using a kernel trick, Linear DecisionSurf -> NonLinear DecisionSurf
  • The disadvantages of support vector machines include:

    • If the number of features is much greater than the number of samples, avoiding over-fitting when choosing the kernel function and the regularization term is crucial.
    • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

The margin is the distance to the nearest points. The separating line should be the one most robust to classification errors: maximizing the margin maximizes the robustness of the result... as much separation between the two classes as possible.

  • The perceptron algorithm is a trick in which we started with a random line, and iterated on a step in order to slowly walk the line towards the misclassified points, so we can classify them correctly. However, we can also see this algorithm as an algorithm which minimizes an error function.

Error (Margin Error + Classification Error)

  • We punish the smaller margin..(just like punishing the model complexity in the L2_regularization of LinearModel).
  • We want to minimize the total error (or error function)

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
import copy

import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

features_train, labels_train, features_test, labels_test = makeTerrainData()

In sklearn.svm, SVC(), NuSVC(), LinearSVC() accept slightly different sets of parameters and have different mathematical formulations, but take as input two arrays:

  • an array X of size [n_samples, n_features] holding the training samples
  • an array y of class labels (strings or integers), size [n_samples]
  • Library: sklearn.svm
  • Example:
from sklearn.svm import SVC
clf = SVC(kernel="linear")
# clf = SVC(kernel='poly', degree=4, C=0.1) #
# clf = SVC(kernel='rbf', gamma=1.0) #  (gamma value illustrative)

X = features_train
y = labels_train
clf.fit(X, y)

pred = clf.predict(features_test)

Accuracy?

from sklearn.metrics import accuracy_score
acc = accuracy_score(pred, labels_test)

def submitAccuracy():
    return acc

Non-Linear SVM

Introducing a new feature 'Z' (a transformation of X and Y) makes a separating 'hyperplane' possible. Z is non-negative because it is a distance from the origin.

Introducing the polynomial toolbox: select terms to build the decision surface.

Kernel Trick: a kernel function (linear, poly, rbf, sigmoid) takes the low-dimensional 'input space' plus the added 'feature space' and maps them to a very high-dimensional space, finds the separation there, then takes the solution back to the original space. The result sets the dataset apart with a division line that is non-linear.

rbf (radial basis func) kernel:

  • hill & valley
  • find a place where a line intersects the mountain range and project every point down; the vertical cut gives us the boundary. But how do we build the mountain range, and how do we put the red points in the highlands and the blue points in the lowlands?

parameters (degree, C, Gamma)

  • C: The 'gamma' parameter actually has no effect on the 'linear' kernel for SVMs. The key parameter for 'linear kernel function' is "C". The C parameter trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly by giving the model freedom to select more samples as support vectors - wiggling around individual data pt...
  • C is a constant that attaches itself to the classification error. If we have large C, then the error is mostly the classification error so we focus more on correctly classifying all the points than in finding a good margin. When C is small, the error is mostly a margin error.

  • Gamma: this parameter in rbf defines how far the influence of a single data point reaches, with low values (wide mountains) meaning 'far' and high values (pointy mountains) meaning 'close'. The gamma parameter can be seen as the inverse of the radius of influence of the samples selected by the model as support vectors. High gamma is like only thinking about what's right in front of you.
  • When gamma is very small, the model is too constrained and cannot capture the complexity or “shape” of the data. The region of influence of any selected support vector would include the whole training set. The resulting model will behave similarly to a linear model with a set of hyperplanes that separate the centers of high density of any pair of two classes. If gamma is too large, the radius of the area of influence of the support vectors only includes the support vector itself and no amount of regularization with C will be able to prevent overfitting.

SV-Regression

Support Vector Machines are very specific class of algorithms, characterized by

  • usage of kernels,
  • absence of local minima,
  • sparseness of the solution
  • capacity control obtained by acting on the margin, or on number of support vectors, etc.

Intuitively, as all regressors, it tries to fit a line to data by minimising a cost function. However, the interesting part about SVR is that you can deploy a non-linear kernel, making non-linear regression. The model is represented as combinations of the training points rather than a function of the features and some weights.

It can be applied not only to classification problems but also to the case of regression. Still it contains all the main features that characterize maximum margin algorithm:

  • A non-linear function is learned by a linear learning machine mapping into a high-dimensional, kernel-induced feature space.
  • The capacity of the system is controlled by parameters that do not depend on the dimensionality of feature space.
  • In the same way as with the classification approach, there is motivation to seek and optimize the generalization bounds given for regression. These rely on defining a loss function that ignores errors situated within a certain distance of the true value. This type of function is often called an epsilon-insensitive loss function. Slack variables measure the cost of the errors on the training points; these are zero for all points inside the band (margin).

One of the most important ideas in the Support Vector Classification and Regression cases is that presenting the solution by means of a small subset of training points gives enormous computational advantages. Using the epsilon-insensitive loss function, we ensure the existence of a global minimum and, at the same time, the optimization of a reliable generalization bound.

Under the condition that:

  • All examples are classified correctly...(for Classification)
  • The value y of all examples deviates less than ϵ from f(x)...(for Regression)
  • Support vectors: these are the data points closest to the margin; their distance to the boundary is the minimum.

Classification

  • Goal: Find a function f(x)=wx+b where f(x)≥1 for positive examples and f(x)≤−1 for negative examples.
  • HOW: Maximise the margin, which is nothing more than minimising the slope w = f′(x). Kill the slope.
  • Intuition: maximising the margin gives us a unique solution to the problem of finding f(x)=wx+b. In SVM we try to separate the classes as far as possible from the decision surf, and unlike logistic regression we create a safety boundary (margin) on both sides of the decision surf (the difference between logistic regression and SVM classification lies in their loss functions).

In simple regression we try to minimise the error rate while in SVR we try to fit the error within a certain threshold.

Regression

  • Goal: Find a function f(x)=wx+b such that f(x) stays within a required ϵ deviation, i.e., |y−f(x)|≤ϵ, where ϵ is the half-width of the margin. We draw boundary lines at distance ϵ from the regression surface so that the data points closest to it (the support vectors) fall within that band.
  • HOW: Maximise the margin, which again means minimising f′(x)=w, both for regularisation and to obtain a unique solution from the convex optimisation problem. One can see how minimising w yields the more general model: the extreme value w=0 would mean no functional relation at all, which is the most general result one can obtain from the data.
  • Intuition: We want a model that predicts a quantity for the future, so unlike SVM classification we want the data points (observations) to be as close as possible to the hyperplane. SVR inherits from simple OLS regression, with the difference that we define an ϵ band on both sides of the hyperplane to make the regression function insensitive to errors within it (unlike SVM classification, where we define a boundary to be safe when making predictions). Eventually, SVM in regression has a boundary like SVM in classification, but the regression boundary makes the function insensitive to error, whereas the classification boundary exists only to stay far from the hyperplane when distinguishing classes (that is why we call it a safety margin).
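A minimal sklearn SVR sketch on hypothetical 1-D data, with epsilon as the half-width of the insensitive band:

import numpy as np
from sklearn.svm import SVR

X = np.sort(np.random.rand(40, 1) * 5, axis=0)   # hypothetical 1-D feature
y = np.sin(X).ravel()                            # hypothetical target

# errors smaller than epsilon are ignored; C trades the margin off against training error
svr = SVR(kernel='rbf', C=10, epsilon=0.1)
svr.fit(X, y)
svr.predict([[2.0]])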




(E) Perceptron Algorithm

Why do we need NN? Because SO MANY FEATURESSSSS !!!

Each NODE of an NN is a linear model (classifier) over many features, whose weights (parameters) are updated by an optimization algorithm such as Gradient Descent that minimizes a cost function ("MSE" for the regression form, "CrossEntropy" for the logistic-regression form). This linear model then feeds into an activation function (step for discrete output, sigmoid for continuous), which returns a classification or probability for each data point.

  • For classification (Y/N)
  • For regression (numbers)
  • This is the two model example. Here we simply go with linear rather than quadratic.

  • The model has 'input data-pt', 'weights', 'bias'

As the simplest Neural Network, a Perceptron is a combination of nodes (here, Step_func(LinearModel), since it simply outputs 1 / 0).

What does the perceptron look like?

A single node(except the input_node) made out of

  • Multiple Linear Model
  • Activation function
  • the single output

Two Algorithms

  • Both describe an automatic process for improving the model by adjusting the parameters (W, b) of the multiple linear model:
    • Take your data
    • Pick a random decision surface
    • Calculate the error
    • Minimize the Error-Function, and obtain a better surface

  • Perceptron Algorithm:
    • Focus on the misclassified points (they give the direction).
    • Because of the Step function, y and y_hat take the values 0 / 1.
    • Thus, the difference between y and y_hat is 1, -1, or 0. We want it as close to "0" as possible.
  • Gradient_Descent Algorithm:
    • Focus on every point: each one contributes to the update, not just the misclassified ones.
    • Because of the Sigmoid function, y_hat is a probability.
    • Thus, the difference between y and y_hat lies between -1 and 1. We want it as close to "0" as possible.
    • Or... instead of y - y_hat, use the derivative of the cost function (such as MSE), because at the end of the day both have the goal of minimization!

We want to improve our model!

1> Perceptron Trick(with Step function)

  • Now we've learned that the misclassified points want the line to move closer to them. How do we modify the equation of the line so that it comes closer to a particular point?
  • Here is the example. We need to repeat this until the point becomes well-classified (for the blue point, we need to repeat 10 times).

Example

Recall that the perceptron step works as follows. For a point with coordinates (p,q), label y, and prediction given by the equation y_hat = step(w1*p + w2*q + b), update the weights as in perceptronStep below whenever the point is misclassified.

import numpy as np
np.random.seed(42)

def stepFunction(t):
    if t >= 0:
        return 1
    return 0

def prediction(X, W, b):
    return stepFunction((np.matmul(X,W)+b)[0])    

The function should receive as inputs the data X, the labels y, the weights W (as an array), and the bias b. Update W, b, according to the perceptron algorithm, and return W and b.

def perceptronStep(X, y, W, b, learn_rate = 0.01):
    for i in range(len(X)):
        y_hat = prediction(X[i], W, b)
        if y[i]-y_hat == 1:  ## FalseNegative
            W[0] += learn_rate*X[i][0]
            W[1] += learn_rate*X[i][1]
            b += learn_rate
        elif y[i]-y_hat == -1:  ## FalsePositive
            W[0] -= learn_rate*X[i][0]
            W[1] -= learn_rate*X[i][1]
            b -= learn_rate
    return W, b

This function runs the perceptron algorithm repeatedly on the dataset, and returns a few of the boundary lines obtained in the iterations for plotting purposes.

Play with the learning rate and the num_epochs.

  • 'boundary_lines' are the solution lines that get plotted below.
  • In each epoch, we apply the perceptron step.

def trainPerceptronAlgorithm(X, y, learn_rate = 0.01, num_epochs = 25):
    x_min, x_max = min(X.T[0]), max(X.T[0])
    y_min, y_max = min(X.T[1]), max(X.T[1])
    W = np.array(np.random.rand(2,1))
    b = np.random.rand(1)[0] + x_max
    boundary_lines = []
    for i in range(num_epochs):
        W, b = perceptronStep(X, y, W, b, learn_rate)
        boundary_lines.append((-W[0]/W[1], -b/W[1]))
    return boundary_lines

We want to improve our model!

2> Gradient Descent Trick(with Sigmoid function)

Move from the discrete to the continuous!

> MaximumLikelihood and Error Function

Let's say we want to calculate the probability that the four points are of the colors they actually are. We assume the colors of the points are independent events; then the probability of the whole arrangement is the product of the probabilities of the four points. If a model assigns these probability spaces, then the probability that the points have these colors tells us which model is better.

Q. So how to maximize the probability?

Q. So how to minimize the Error-Function?

Q. Can we obtain an Error-Function from the probability? Can the maximized probability yield the minimized Error-Function?

In general, an Error-Function tells us how far we are from the solution(it's a distance).

  • It should be continuous!
  • It should be differentiable! (just like minimizing SSE in linear model.)

Q. What if the number of data points is astronomical? Then taking a product is not a good idea. We need the log function, which turns products into sums... and remember, when the input ranges from 0 to 1 (because they are probabilities), the logarithm is negative, so we take the negative log. And this is the entropy function.

> Cross Entropy

As always, we don't like a huge entropy value..

We here, define Error-Function, using Cross Entropy. If I have a bunch of events and probabilities, Cross-Entropy says how likely those events happen based on the probabilities. If it's highly likely, then we have a small Cross-Entropy. If it's unlikely, we have a large Cross-Entropy.

  • A good model gives a low cross-entropy and a bad model gives a high cross-entropy. So our goal has changed:
    • Minimize the Cross Entropy = Maximize the probability

def cross_entropy(Y, P):
    Y = np.asarray(Y, dtype=float)   # (np.float_ was removed in NumPy 2.0)
    P = np.asarray(P, dtype=float)
    return(-np.sum(Y*np.log(P) + (1-Y)*np.log(1-P)))
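For example, a likely arrangement gives a small cross-entropy and an unlikely one a large cross-entropy:

Y = [1, 0, 1]                              # the actual outcomes
print(cross_entropy(Y, [0.9, 0.1, 0.9]))   # likely under the model: ~0.32
print(cross_entropy(Y, [0.1, 0.9, 0.1]))   # unlikely under the model: ~6.91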

> Multiclass Cross-Entropy

    1. Softmax Function: what if the problem has 3 or more classes? How do we turn all the scores (WX+b) into probabilities?
    • Note that scores can often be negative, while probabilities cannot; 'exp()' turns them into positive numbers.

Let's say we have 'n' classes and our linear model (WX+b) gives us the scores Z_1...Z_n, one score per class. Let's turn them into probabilities: take as input a list of numbers (scores), and return the list of probabilities given by the softmax function.

def softmax(L):
    expL = np.exp(L)
    S_expL = sum(expL)
    result=[]
    for i in expL:
        result.append(i/S_expL)
    return(result)
    
def softmax(L):
    expL = np.exp(L)
    return(np.divide(expL, sum(expL)))    
    2. One-hot encoding: what if some input data is not numerical? Encode each category as its own 0/1 column.

    3. Formula: the multiclass cross-entropy is CE = -Σ_i Σ_j y_ij * ln(p_ij), where y_ij is 1 if point i is in class j, and p_ij is the predicted probability that it is.

Again, Cross Entropy is a connection between probabilities and error functions. We define our Error Function, using 'Cross Entropy'.

  • Error = each element of Cross Entropy -ln(p) or -ln(q)
  • What we need is just to minimize the Error-Function (minimize Cross_Entropy / n) !

> Minimization of our Error-Function

How to minimize the Error?

  • knowing the current inputs into the model (the current weights and bias), the derivatives of the loss function tell us which direction to nudge W and b in order to minimize the error.

Example(Binary Classification)

  • Implementing the functions that build the gradient descent algorithm:
    • sigmoid: The sigmoid activation function.
    • output_formula: The formula for the prediction.
    • error_formula: The formula for the error at a point.
    • update_weights: The function that updates the parameters with one gradient descent step.

  • Some helper functions for plotting and drawing lines.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_points(X, y):
    admitted = X[np.argwhere(y==1)]
    rejected = X[np.argwhere(y==0)]
    plt.scatter([s[0][0] for s in rejected], [s[0][1] for s in rejected], s = 25, color = 'blue', edgecolor = 'k')
    plt.scatter([s[0][0] for s in admitted], [s[0][1] for s in admitted], s = 25, color = 'red', edgecolor = 'k')

def display(m, b, color='g--'):
    plt.xlim(-0.05,1.05)
    plt.ylim(-0.05,1.05)
    x = np.arange(-10, 10, 0.1)
    plt.plot(x, m*x+b, color)

# Activation (sigmoid) function
def sigmoid(x):
    return(1 / (1 + np.exp(-x)))

# Output (prediction) formula
def output_formula(features, weights, bias):
    return(sigmoid(np.dot(features, weights) + bias))

# Error (log-loss) formula
def error_formula(y, output):
    return(- y*np.log(output) - (1-y)*np.log(1-output))

# Gradient descent step
def update_weights(x, y, weights, bias, learnrate):
    output = output_formula(x, weights, bias)
    d_error = -(y - output)
    weights -= learnrate * d_error * x
    bias -= learnrate * d_error
    return(weights, bias)
  • Iterate the gradient descent algorithm through all the data, for a number of epochs. Plot the data, and obtain some of the boundary lines as we run the algorithm.
np.random.seed(44)

epochs = 100
learnrate = 0.01

def train(features, targets, epochs, learnrate, graph_lines=False):
    
    errors = []
    n_records, n_features = features.shape
    last_loss = None
    weights = np.random.normal(scale=1 / n_features**.5, size=n_features)
    bias = 0
    for e in range(epochs):
        del_w = np.zeros(weights.shape)
        for x, y in zip(features, targets):
            output = output_formula(x, weights, bias)
            error = error_formula(y, output)
            weights, bias = update_weights(x, y, weights, bias, learnrate)
        
        # Printing out the log-loss error on the training set
        out = output_formula(features, weights, bias)
        loss = np.mean(error_formula(targets, out))
        errors.append(loss)
        if e % (epochs / 10) == 0:
            print("\n========== Epoch", e,"==========")
            if last_loss and last_loss < loss:
                print("Train loss: ", loss, "  WARNING - Loss Increasing")
            else:
                print("Train loss: ", loss)
            last_loss = loss
            predictions = out > 0.5
            accuracy = np.mean(predictions == targets)
            print("Accuracy: ", accuracy)
        if graph_lines and e % (epochs / 100) == 0:
            display(-weights[0]/weights[1], -bias/weights[1])
            

    # Plotting the solution boundary
    plt.title("Solution boundary")
    display(-weights[0]/weights[1], -bias/weights[1], 'black')

    # Plotting the data
    plot_points(features, targets)
    plt.show()

    # Plotting the error
    plt.title("Error Plot")
    plt.xlabel('Number of epochs')
    plt.ylabel('Error')
    plt.plot(errors)
    plt.show()
  • Train the algorithm!
    • Ten progress printouts with the current training loss and accuracy.
    • A plot of the data and some of the boundary lines obtained. The final one is in black. The lines get closer and closer to the best fit, as we go through more epochs.
    • A plot of the error function, which decreases as we go through more epochs.