Dataset - https://www.kaggle.com/code/ashydv/sales-prediction-simple-linear-regression/input
This project involves analyzing Kaggle's Advertising Dataset to build a simple linear regression model that predicts sales based on advertising spend. The dataset includes columns for TV, Radio, Newspaper advertising budgets, and Sales.
- Name: Himel Sarder
- Email: info.himelcse@gmail.com
- GitHub: Himel-Sarder
The goal of this project is to:
- Explore the relationship between advertising spend on TV and sales.
- Build a linear regression model to predict sales based on TV advertising.
- Evaluate the model's performance using statistical metrics.
The dataset contains the following columns:
- TV: Advertising budget for TV (in thousands of dollars).
- Radio: Advertising budget for Radio (in thousands of dollars).
- Newspaper: Advertising budget for Newspaper (in thousands of dollars).
- Sales: Sales generated (in thousands of units).
TV | Radio | Newspaper | Sales |
---|---|---|---|
230.1 | 37.8 | 69.2 | 22.1 |
44.5 | 39.3 | 45.1 | 10.4 |
17.2 | 45.9 | 69.3 | 12.0 |
151.5 | 41.3 | 58.5 | 16.5 |
180.8 | 10.8 | 58.4 | 17.9 |
The following Python libraries are used:
warnings
to suppress warnings.matplotlib.pyplot
andseaborn
for data visualization.pandas
andnumpy
for data manipulation.math
for mathematical operations.sklearn
for machine learning tasks.statsmodels
for statistical modeling.
The dataset is loaded using:
import pandas as pd
df = pd.read_csv('advertising.csv')
- Scatter plot to visualize the relationship between TV advertising and Sales:
plt.figure(figsize=(12, 8))
plt.scatter(df['TV'], df['Sales'])
plt.xlabel('TV')
plt.ylabel('Sales')
plt.show()
- Features (
X
) and target (y
) are defined as:
X = df.iloc[:, 0:1] # TV column
y = df.iloc[:, -1] # Sales column
- Split the dataset into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
- A linear regression model is trained using:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
- The model's performance is evaluated using:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared
from sklearn.metrics import mean_squared_error
predictions = lr.predict(X_test)
rmse = sqrt(mean_squared_error(y_test, predictions))
print(f"RMSE: {rmse}")
- The regression equation is printed:
print(f"The linear model is: Y = {lr.intercept_:.5f} + {lr.coef_[0]:.5f}X")
- The model is refitted using
statsmodels
for detailed statistical insights:
import statsmodels.api as sm
X_train_sm = sm.add_constant(X_train)
lr_sm = sm.OLS(y_train, X_train_sm).fit()
print(lr_sm.summary())
- Residuals are analyzed to ensure the model's assumptions are met:
res = y_train - lr_sm.predict(X_train_sm)
sns.distplot(res, bins=15)
plt.title('Error Terms')
plt.xlabel('y_train - predictions')
plt.show()
- The regression equation is:
Y = 6.94868 + 0.05455X
- Key performance metrics:
- RMSE: 2.019
- R-squared: 0.816
- The model explains approximately 81.6% of the variance in sales based on TV advertising.
- TV advertising has a significant positive relationship with sales.
- The linear regression model is effective in predicting sales based on TV advertising spend.
- Kaggle's Advertising Dataset
- Python libraries:
sklearn
,statsmodels
,matplotlib
,seaborn
,pandas
,numpy
For more details, visit GitHub.