
Commit

update
On branch main
 Changes to be committed:
	modified:   Projects/project4.qmd
1Ramirez7 committed Mar 30, 2024
1 parent 0dadca3 commit fd41256
Showing 1 changed file with 1 addition and 386 deletions.
387 changes: 1 addition & 386 deletions Projects/project4.qmd
@@ -25,389 +25,4 @@ execute:

---


# Elevator pitch
_The following report provides a detailed analysis of a machine learning project. It explores relationships between home variables and construction year, using scatter plots and bar charts for insights. The report constructs a classification model aiming for over 90% accuracy, justifying the model's features and parameters through regression analysis. It evaluates the model's quality using metrics like accuracy, precision, recall, F1 score, and confusion matrix, offering valuable insights for stakeholders. Through concise analysis and interpretation, the report delivers actionable recommendations for optimizing the machine learning model._



__Questions and Tasks__

Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
Justify your classification model by discussing the most important features selected by your model. This discussion should include a chart and a description of the features.
Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.

# Q1. Home Variables Pre-1980: Charts and Machine Learning Insights

__Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.__


## Fine-tuning the model with outliers

__Scatter plots and Bar Charts__

I ran several statistical analyses on the dwellings_ml data (referred to as ‘the data’ for the rest of this report). These included OLS regressions, correlation matrices, and Excel IF formulas. The goal was to find the independent variables that best help determine whether a home was built before 1980 (the dependent variable; all other variables are referred to as independent variables). The first step was to visualize the data with scatter plots. Almost all of the variables have a significant correlation with the dependent variable (the ‘before1980’ column), and only a few outliers stood out. The ‘Price vs Year Built’ graph shows data points both before and after 1980. It also shows that, in this data, homes sold for more than 10 million were built after 1979 100% of the time.

```{python}
# Average sale price vs year built (line plot)
import pandas as pd
import plotly.express as px

file_path = "C://Users//eduar//Downloads//dwellings_ml.csv"
df = pd.read_csv(file_path)
# Group by year built and calculate the average sale price
df_avg = df.groupby('yrbuilt')['sprice'].mean().reset_index()
# Line plot (sprice is not rescaled here, so the axis label carries no unit suffix)
fig = px.line(df_avg, x='yrbuilt', y='sprice',
              title='Average Sale Price vs Year Built',
              labels={'yrbuilt': 'Year Built', 'sprice': 'Average Sale Price'},
              color_discrete_sequence=['skyblue'])
fig.update_layout(xaxis_tickangle=45)
# Remove hover information
fig.update_traces(hoverinfo='skip')
fig.show()
```


```{python}
# Bar plot: percentage of homes that fall in each sale price range
import pandas as pd
import matplotlib.pyplot as plt
# Load data
file_path = "C://Users//eduar//Downloads//dwellings_ml.csv"
df = pd.read_csv(file_path)
# Convert sprice to millions
df['sprice'] = df['sprice'] / 1_000_000
# Bin edges, in millions
bin_edges = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 10, 20, 22]
# Calculate the percentage of homes in each bin (lower bound inclusive, upper bound exclusive)
percentage_in_bins = []
for i in range(len(bin_edges) - 1):
    lower_bound = bin_edges[i]
    upper_bound = bin_edges[i + 1]
    percentage = ((df['sprice'] >= lower_bound) & (df['sprice'] < upper_bound)).mean() * 100
    percentage_in_bins.append(percentage)
# Plotting
plt.figure(figsize=(10, 6))
bars = plt.bar(range(len(percentage_in_bins)), percentage_in_bins, align='center', color='skyblue')
# Title and labels
plt.title('Percentage of Homes by Sale Price Range')
plt.xlabel('Price in millions')
plt.ylabel('Percentage of Homes')
# Tick labels
plt.xticks(range(len(percentage_in_bins)), [f"{bin_edges[i]}-{bin_edges[i+1]}" for i in range(len(bin_edges)-1)], rotation=45)
# Value labels above each bar
for bar, percentage in zip(bars, percentage_in_bins):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, height, f'{percentage:.2f}%', ha='center', va='bottom')
# Show plot
plt.show()
```

In regard to fine-tuning the model, only 1.44% of the total sample falls under that criterion, and that assumes this data represents the population as a whole. The purpose of this model is to classify houses as being built ‘before 1980’ or ‘during or after 1980’, and the instructions do not say that this data represents the population as a whole, since the given data is not raw but was prepared for the model. I found that many independent variables have similar outliers that can explain the dependent variable 100% of the time, assuming the data represents the population. I did not continue down this path to find every such variable because of my limitations with models, and I decided to assume this data does not represent the population as a whole. Nonetheless, I think such data would be great for fine-tuning a model. It was also hard to determine how these outliers would affect the model: the samples that fully explain the dependent variable were a small percentage of the total sample, so it was hard to give them any weight in the determination of the model.
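The threshold check described above can be sketched as follows. This is only an illustration on made-up rows, not the actual dwellings_ml data; the helper name `rule_summary` and the toy values are assumptions introduced here.

```python
import pandas as pd

# Hypothetical toy rows standing in for dwellings_ml.csv; the values are made up.
toy = pd.DataFrame({
    "sprice": [150_000, 320_000, 95_000, 12_500_000,
               480_000, 21_000_000, 275_000, 610_000],
    "before1980": [1, 0, 1, 0, 1, 0, 0, 1],
})


def rule_summary(df, column, threshold):
    """Return (percent of rows above threshold, before1980 rate among those rows)."""
    above = df[column] > threshold
    share_pct = above.mean() * 100
    # This is NaN when no row is above the threshold
    rate_before1980 = df.loc[above, "before1980"].mean()
    return share_pct, rate_before1980


share_pct, rate = rule_summary(toy, "sprice", 10_000_000)
print(f"{share_pct:.2f}% of rows are above the threshold")
print(f"Share of those rows built before 1980: {rate:.2f}")
```

A rate of exactly 0 or 1 on a very small share of rows is the situation described in the paragraph: the rule is perfect within the sample but covers too few homes to carry much weight.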





```{python}
# Number of baths vs year built (scatter plot)
import pandas as pd
import matplotlib.pyplot as plt
# Load
df = pd.read_csv("C://Users//eduar//Downloads//dwellings_ml.csv")
# scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['yrbuilt'], df['numbaths'], alpha=0.5, color='skyblue')
# vertical line at year 1980
plt.axvline(x=1980, color='red', linestyle='--', linewidth=1)
# title and labels
plt.title('Number of Baths vs Year Built')
plt.xlabel('Year Built')
plt.ylabel('Number of Baths')
plt.xticks(rotation=45)
# plot
plt.show()
```

## Correlation and Regressions & Best Fit for Model

__Correlation Matrix__


I built a few correlation matrices in Excel to determine any noticeable correlation between the variables in the sample data. The correlation matrix did a great job of identifying closely correlated variables; for example, price and live area had a correlation of 0.67. It is reasonable that homes within an area tend to fall within a similar price range, and different areas could vary by price. These results did not really help me determine the variables that best fit the model, since the closely correlated variables were just about the same for homes built before 1980 and after 1979, so I ended the correlation matrix analysis.

![Correlation Matrix for all Variables](C:/Users/eduar/OneDrive - BYU-Idaho/2024 Winter/250 DS/project 4/corr.png)
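The matrices in this report were built in Excel. As a rough equivalent, the same kind of table could be computed in pandas. The sketch below uses made-up rows, and the column name `livearea` is an assumption about how live area is stored.

```python
import pandas as pd

# Hypothetical toy rows; values are invented for illustration only.
toy = pd.DataFrame({
    "sprice":     [150_000, 320_000, 95_000, 480_000, 275_000, 610_000],
    "livearea":   [1_100, 2_300, 900, 2_900, 1_800, 3_400],
    "before1980": [1, 0, 1, 0, 1, 0],
})

# Pearson correlation between every pair of columns
corr = toy.corr()
print(corr.round(2))

# Correlation of each variable with the target, strongest first
target_corr = corr["before1980"].drop("before1980")
order = target_corr.abs().sort_values(ascending=False).index
print(target_corr.reindex(order).round(2))

# Correlation between price and live area within each before1980 group
by_group = toy.groupby("before1980")[["sprice", "livearea"]].corr()
print(by_group.round(2))
```

The per-group table is one way to check the observation above, that closely correlated variables look about the same on both sides of 1980.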



# Q2. Classification Model
__Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.__

## Model decision

I made my final choice for this model by running several OLS regressions and finding the independent variables with the best fit for the dependent variable. This also included finding independent variables with good coefficients between them, to help choose variables that are statistically significant with each other apart from the dependent variable. By determining the coefficients between the independent variables and the dependent variable, I was able to arrive at this model.

```{python}
# model -------------------------------------------------------
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Load
file_path = "C://Users//eduar//Downloads//dwellings_ml.csv"
data = pd.read_csv(file_path)
# Features and target variable
features = data.drop(['yrbuilt', 'before1980', 'parcel'], axis=1)  # Exclude the target, yrbuilt (it would leak the answer), and the parcel ID
target = data['before1980']
# training vs testing
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.34, random_state=76)
# Creating and training the Decision Tree model
decision_tree_model = DecisionTreeClassifier(random_state=42)
decision_tree_model.fit(X_train, y_train)
# predictions
predictions = decision_tree_model.predict(X_test)
# Calculating evaluation metrics
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
# evaluation metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")
```



```{python}
# --------------------- question 1 -----------
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Load data
file_path = "C://Users//eduar//Downloads//dwellings_ml.csv"
data = pd.read_csv(file_path)
# Define features and target variable
features = data.drop(['yrbuilt', 'before1980', 'parcel'], axis=1)  # Exclude the target, yrbuilt (it would leak the answer), and the parcel ID
target = data['before1980']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.34, random_state=76)
# Creating and training the Decision Tree model
decision_tree_model = DecisionTreeClassifier(random_state=42)
decision_tree_model.fit(X_train, y_train)
# Predictions
predictions = decision_tree_model.predict(X_test)
# Calculating evaluation metrics
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
# Print evaluation metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")
# Calculate the average of the first 10 values in testing y values
average_first_10 = y_test[:10].mean()
print(f"Average of the first 10 values in testing y values: {average_first_10}")
```

```{python}
# question 2
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Load data
file_path = "C://Users//eduar//Downloads//dwellings_ml.csv"
data = pd.read_csv(file_path)
# Define features and target variable
# Ensure 'sprice' is included in the features, and remove irrelevant columns
features = data.drop(['yrbuilt', 'before1980', 'parcel'], axis=1)
target = data['before1980']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.34, random_state=76)
# Creating and training the Decision Tree model
decision_tree_model = DecisionTreeClassifier(random_state=42)
decision_tree_model.fit(X_train, y_train)
# Predictions
predictions = decision_tree_model.predict(X_test)
# Calculating evaluation metrics
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
# Print evaluation metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")
# Calculate the average of the first 10 values in training X values for sprice
average_first_10_sprice = X_train['sprice'][:10].mean()
print(f"Average of the first 10 values in training X values for sprice: {average_first_10_sprice}")
```




## Visual of Decision Tree

```{python}
# Visualize the decision tree as a graph (run the model code above first)
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz
# Limit the depth of the tree for visualization purposes
decision_tree_model_vis = DecisionTreeClassifier(random_state=42, max_depth=3)
decision_tree_model_vis.fit(X_train, y_train)
# Export the tree to DOT format; class 0 is "1980 or later", class 1 is "Before 1980"
dot_data = export_graphviz(decision_tree_model_vis, out_file=None,
                           feature_names=list(X_train.columns),
                           class_names=['1980 or later', 'Before 1980'],
                           filled=True, rounded=True,
                           special_characters=True)
# Visualize the graph
graph = graphviz.Source(dot_data)
graph
```


# Q3. Model Validation: Essential Feature Discussion


__Justify your classification model by discussing the most important features selected by your model. This discussion should include a chart and a description of the features.__

## Explanation of the model variables

__Regression model that explains 66% of the dependent variable__

The main justification for the classification model was the result of the regression analysis which resulted in the following regression.


![Regression model](C:/Users/eduar/OneDrive - BYU-Idaho/2024 Winter/250 DS/project 4/decision_tree_full.png)


From the regression output, I chose to include quality_C, status_V, gartype_Att, arcstyle_ONE-STORY, stories, and numbaths in my decision tree model because they are statistically significant predictors of the dependent variable, as evidenced by their p-values. The p-values are very close to zero, which strongly rejects the null hypothesis of no effect and indicates that these variables have a significant relationship with the dependent variable.

I decided to use arcstyle_ONE-STORY as the first variable in my tree because it has a high coefficient and a significant t-statistic, implying a strong and consistent impact on the dependent variable. The decision tree algorithm often chooses the feature that provides the most significant split at each node, and based on the regression results, arcstyle_ONE-STORY fits this criterion well. Its positive coefficient (0.3355) indicates that one-story style has a substantial and positive effect on the predicted value.

quality_C is also included as an important variable; it has a positive coefficient (0.2053), suggesting that the quality of a dwelling, represented by this dummy variable, is associated with an increase in the predicted value.

status_V and gartype_Att both have negative coefficients, indicating that these features contribute to a decrease in the predicted value when they are present. Specifically, status_V (with a coefficient of -0.2979) shows a substantial decrease, which is why it's crucial to include it in the decision tree to capture this negative relationship.

The number of stories (stories) and the number of bathrooms (numbaths) also have negative coefficients, and even though their absolute impact might be smaller compared to the other variables, they are still significant. Therefore, they are part of the decision tree to capture the full complexity of the relationships in the data.

The decision tree visualization, which I included in my analysis, places these variables accordingly, with arcstyle_ONE-STORY at the top due to its strong predictive power followed by the other variables based on their contribution to reducing the model's impurity, which aligns with the coefficients' magnitude and significance in the regression output.
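One way to check how a fitted tree ranks its variables is scikit-learn's `feature_importances_` attribute, which reports each feature's share of the total impurity reduction. The sketch below runs on synthetic data, not the dwellings data; the `noise` column and the rule used to build the target are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; none of these rows come from dwellings_ml.csv.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "arcstyle_ONE-STORY": rng.integers(0, 2, n),  # 0/1 dummy
    "numbaths": rng.integers(1, 5, n),            # 1 to 4 baths
    "noise": rng.normal(size=n),                  # unrelated column
})
# Hypothetical target rule: one-story homes with at most 3 baths are labeled 1
y = ((X["arcstyle_ONE-STORY"] == 1) & (X["numbaths"] <= 3)).astype(int)

tree = DecisionTreeClassifier(random_state=42, max_depth=3).fit(X, y)

# Impurity-based importances; they sum to 1 across all features
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

On the real model, the same two lines applied to `decision_tree_model` and `X_train.columns` would give a ranking that could be drawn as a bar chart and compared against the regression coefficients.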




# Q4. Evaluating Model Quality: Metrics & Interpretation

__Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.__


## Metrics in current Classification Model

The model has an accuracy of 90.25%, meaning it labels about 9 out of 10 homes in the test set correctly. Its precision is 92.34%: of the homes it predicts as positive (built before 1980), about 92.34% actually are. Its recall is 92.08%: it captures around 92.08% of all actual positive cases. The F1 score, the harmonic mean of precision and recall, is 92.21%. In the confusion matrix, 2240 instances were correctly predicted as negative, 3964 instances were correctly predicted as positive, 329 instances were falsely predicted as positive, and 341 instances were falsely predicted as negative.
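As a check on how these metrics relate to each other, they can be recomputed from the four confusion matrix counts. The sketch assumes that 329 is the number of false positives and 341 the number of false negatives, which is the assignment that reproduces the reported precision and recall.

```python
# Counts from the confusion matrix; the FP/FN assignment is an assumption (see above)
tn, fp, fn, tp = 2240, 329, 341, 3964

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1 score:  {f1:.2%}")
```

Precision answers "when the model says before 1980, how often is it right?", while recall answers "of the homes truly built before 1980, how many did the model find?".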





## Regression results explaining why I chose arcstyle_ONE-STORY as the first independent variable in the tree

__This regression uses only three independent variables__



![Decision Tree](C:/Users/eduar/OneDrive - BYU-Idaho/2024 Winter/250 DS/project 4/Screenshot 2024-03-06 200926.png)

Here, the R-squared value is approximately 0.338, which means that about 33.8% of the variability in the dependent variable can be explained by the model.
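For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below shows that calculation on made-up numbers with a single predictor; it is not the regression from the screenshot.

```python
import numpy as np

# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

# Ordinary least squares fit of a straight line
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained by the fit
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
r_squared = 1 - ss_res / ss_tot

print(f"R-squared: {r_squared:.3f}")
```

An R-squared of 0.338 therefore means the residual sum of squares is still about 66% of the total variation.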

Coefficients: The coefficients represent the relationship between each independent variable and the dependent variable. For instance, the positive coefficient for arcstyle_ONE-STORY suggests that being a one-story building is associated with an increase in the dependent variable, while negative coefficients for stories and numbaths indicate a decrease.

Confidence Intervals: The 95% confidence intervals provide a range of plausible values for the coefficients. If the interval does not include zero, it indicates that there is a significant effect at the 95% confidence level. This can help in feature selection when building machine learning models, as it indicates which features are likely to be important predictors.
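A rough version of this check can be written with the normal approximation, where the interval is the coefficient plus or minus 1.96 standard errors. The coefficient below is the arcstyle_ONE-STORY value quoted earlier in this report; the standard error is a hypothetical number chosen only to make the example run.

```python
def confidence_interval(coef, std_err, z=1.96):
    """Approximate 95% confidence interval using the normal critical value."""
    margin = z * std_err
    return coef - margin, coef + margin


coef = 0.3355    # coefficient quoted earlier for arcstyle_ONE-STORY
std_err = 0.01   # hypothetical standard error, for illustration only

lower, upper = confidence_interval(coef, std_err)
excludes_zero = lower > 0 or upper < 0

print(f"95% CI: ({lower:.4f}, {upper:.4f})")
print(f"Interval excludes zero: {excludes_zero}")
```

Regression software typically uses the t distribution rather than 1.96, so the reported interval can be slightly wider for small samples.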

Model Validation: The standard error and significance testing of the regression can be conceptually linked to model validation techniques for classification models. The standard error indicates how much the estimated coefficients 'jump around' when the model is estimated on different subsets of the data.



update. current file is rendering errors
