drnitinmalik/multiple-linear-regression

Pre-regression steps, Regression model, Post-regression analysis

In multiple linear regression (regression is also known as function approximation), we are interested in predicting one (dependent) variable from two or more (independent) variables, e.g. predicting height from weight and age. Regression posits a direction of influence: changes in the dependent variable are modelled as arising from changes in the independent variables (though a fitted regression by itself does not prove causation). Linear regression assumes that the relationship between the dependent variable and the independent variables is linear, so it can be described by a flat surface known as the regression plane. We are in the process of finding the regression plane that fits the data points as closely as possible, i.e. with the smallest overall prediction error (no. of data points = no. of records in the dataset).

MLR MODEL

Objective: Predict the dependent variable Yi

Model structure: Yi = β0 + β1X1 + β2X2 + ... + βkXk + εi, where β0 (the y-intercept/constant) and β1...βk (the slopes) are population regression coefficients and εi is the prediction error, the amount by which the prediction misses the actual value (the error is also termed the residual)
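
A minimal sketch of this structure in NumPy; the coefficients and data below are made up for illustration, not taken from this repo:

```python
import numpy as np

# Hypothetical population coefficients and data, for illustration only.
rng = np.random.default_rng(0)

n, k = 50, 3                        # 50 records, 3 independent variables
X = rng.uniform(0, 100, size=(n, k))

beta0 = 5.0                         # population y-intercept (constant)
beta = np.array([1.5, -0.8, 2.3])   # population slopes beta_1 .. beta_k
eps = rng.normal(0.0, 1.0, size=n)  # prediction errors (residuals)

Y = beta0 + X @ beta + eps          # Yi = beta0 + beta1*X1 + ... + betak*Xk + eps_i
```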

Model parameters: sample regression coefficients b0, b1, ..., bk [determined by solving the normal equations]
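
A sketch of solving the normal equations (XᵀX)b = XᵀY directly, assuming X holds the independent variables row-wise:

```python
import numpy as np

def fit_ols(X, Y):
    """Solve the normal equations (X'X) b = X'Y for b0, b1, ..., bk."""
    Xd = np.column_stack([np.ones(len(X)), X])  # design matrix with an intercept column
    b = np.linalg.solve(Xd.T @ Xd, Xd.T @ Y)    # numerically safer than inverting X'X
    return b[0], b[1:]                          # (b0, array of b1..bk)

b0, bs = fit_ols(X, Y)  # on the sketch above, this recovers beta0 and beta closely
```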

Model hyperparameters: choice of error function, e.g. summed squared error or mean squared error
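
Both error functions are one-liners; this sketch simply restates their definitions:

```python
import numpy as np

def summed_squared_error(y_true, y_pred):
    """SSE: the sum of squared prediction errors over all records."""
    return np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def mean_squared_error(y_true, y_pred):
    """MSE: the SSE divided by the number of records."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
```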

Methods to estimate model parameters: ordinary least squares (minimise the error function), maximum likelihood (the likelihood is the probability of observing the dependent variable given the independent variables and the coefficients; estimation picks the coefficients that maximise it)

Model assumptions: the relationship between the dependent and independent variables is linear; the dependent variable is continuous (interval or ratio scale); there is no correlation (a measure of the relationship between two variables, derived from covariance) between the errors, or between the errors and the independent variables; all errors have equal variance (homoscedasticity); the errors follow a normal probability distribution; the independent variables are not collinear (no multicollinearity). One check for the last assumption is sketched below.
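
The variance inflation factor (VIF) is a standard multicollinearity check; a sketch using statsmodels (the threshold of about 10 is a common rule of thumb, not a hard rule):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(predictors: pd.DataFrame) -> pd.Series:
    """VIF per independent variable; values above ~10 suggest multicollinearity."""
    X = np.column_stack([np.ones(len(predictors)), predictors.to_numpy(dtype=float)])
    vifs = [variance_inflation_factor(X, i + 1) for i in range(predictors.shape[1])]
    return pd.Series(vifs, index=predictors.columns)
```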

USE CASE

Let's say we want to predict the profits (dependent variable) of 50 US startup companies from data that contains three types of expenditure (R&D spend, administrative spend and marketing spend, all numerical) and each company's location (categorical). The Python code and the data file are available in this repository.
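
A minimal end-to-end sketch of this use case with pandas and scikit-learn; the file name 50_Startups.csv and the column names (Profit, State) are assumptions and may differ from the actual files in this repo:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("50_Startups.csv")  # hypothetical file name

# One-hot encode the categorical location column; the spends stay numeric.
X = pd.get_dummies(df.drop(columns="Profit"), columns=["State"], dtype=float)
y = df["Profit"]

# A 75/25 split of 50 records leaves 13 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
print("y-intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("test predictions:", model.predict(X_test))
```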

RESULTS

Duplicate values: none

Missing values: none

Multicollinearity: yes

Regression coefficients: array([ 5.69631699e-16, 4.99600361e-16, -3.05311332e-16, 1.17961196e-15, 4.05057932e-16, -8.32667268e-17, 1.00000000e+00, 2.20489350e-13, 2.81742200e-12])

y-intercept: -2.6193447411060333e-10

Predictions on test data: array([ 96778.92, 35673.41, 191792.06, 192261.83, 132602.65, 108552.04, 69758.98, 110352.25, 146121.95, 125370.37, 182901.99, 77798.83, 96712.8 ])

Levene test for homoscedasticity: test statistic = 5.351611538586933e-29, p-value = 1.0
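
The Levene statistic could be produced along these lines, continuing the use-case sketch above (grouping residuals by the median fitted value is one common choice and an assumption here, not necessarily what this repo's code does):

```python
import numpy as np
from scipy.stats import levene

# Split test-set residuals into low- and high-prediction groups and test
# whether the two groups share the same variance (homoscedasticity).
fitted = model.predict(X_test)
residuals = np.asarray(y_test) - fitted
low_group = residuals[fitted < np.median(fitted)]
high_group = residuals[fitted >= np.median(fitted)]

stat, p = levene(low_group, high_group)
print(f"test statistic={stat}, p-value={p}")  # a large p-value is consistent with equal variances
```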

DISCUSSION

The correlation between marketing spend and R&D spend is on the higher side, which confirms multicollinearity. After encoding the categorical variable (location) with one-hot encoding, the dataset has 10 columns.
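
That correlation can be read off the pairwise correlation matrix of the numeric predictors (again with hypothetical file and column names):

```python
import pandas as pd

df = pd.read_csv("50_Startups.csv")  # hypothetical names, as in the sketch above
corr = df[["R&D Spend", "Administration", "Marketing Spend"]].corr()
print(corr)  # a high R&D / Marketing entry indicates multicollinearity
```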
