update

On branch main Changes to be committed: modified: Projects/project4.qmd
1Ramirez7 · Mar 30, 2024 · fd41256 · fd41256
1 parent 0dadca3
commit fd41256
Showing 1 changed file with 1 addition and 386 deletions.
diff --git a/Projects/project4.qmd b/Projects/project4.qmd
@@ -25,389 +25,4 @@ execute:
 
 ---
 
-
-# Elevator pitch
-_The following report provides a detailed analysis of a machine learning project. It explores relationships between home variables and construction year, using scatter plots and bar charts for insights. The report constructs a classification model aiming for over 90% accuracy, justifying the model's features and parameters through regression analysis. It evaluates the model's quality using metrics like accuracy, precision, recall, F1 score, and confusion matrix, offering valuable insights for stakeholders. Through concise analysis and interpretation, the report delivers actionable recommendations for optimizing the machine learning model._
-
-
-
-__Questions and Tasks__
-
-Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
-Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
-Justify your classification model by discussing the most important features selected by your model. This discussion should include a chart and a description of the features.
-Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
-
-# Q1. Home Variables Pre-1980: Charts and Machine Learning Insights
-
-__Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.__
-
-
-## Fine-tuning the model with outliers
-
-__Scatter plots and Bar Charts__
-I ran several statistical analyses on the dwellings_ml data (referred to as 'data' for the rest of the report), including OLS regressions, correlation matrices, and Excel IF formulas. The goal was to find the independent variables that best help determine whether a home was built before 1980 (the dependent variable; all other variables are referred to as independent variables). The first step was to visualize the data with scatter plots. Almost all of the variables show a significant correlation with the dependent variable (the 'before1980' column), and only a few outliers stood out. The 'Price vs Year Built' graph has data points both before and after 1980. It also shows that, in this sample, every home sold for more than 10 million was built after 1979.
-
-```{python}
-# Average sale price vs year built: line plot -----------------------------------------------------------
-import pandas as pd
-import plotly.express as px
-file_path = "C://Users//eduar//Downloads//dwellings_ml.csv"
-df = pd.read_csv(file_path)
-
-# Group by year and calculate the average price
-df_avg = df.groupby('yrbuilt')['sprice'].mean().reset_index()
-
-# line plot
-fig = px.line(df_avg, x='yrbuilt', y='sprice', 
-              title='Average Sale Price vs Year Built',
-              labels={'yrbuilt': 'Year Built', 'sprice': 'Average Sale Price'}, 
-              color_discrete_sequence=['skyblue'])
-
-fig.update_layout(xaxis_tickangle=45)
-
-# Remove hover information
-fig.update_traces(hoverinfo='skip')
-
-fig.show()
-
-
-
-```
-
-
-```{python}
-# Sale price distribution: bar plot of the percentage of homes in each price bin ----------------------
-import pandas as pd
-import matplotlib.pyplot as plt
-
-# load data
-file_path = "C://Users//eduar//Downloads//dwellings_ml.csv"
-df = pd.read_csv(file_path)
-
-# convert sprice to millions
-df['sprice'] = df['sprice'] / 1_000_000  
-
-# Adjust bins
-bin_edges = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 10, 20, 22]   
-
-# Calculate the percentage
-percentage_in_bins = []
-for i in range(len(bin_edges) - 1):
-    lower_bound = bin_edges[i]
-    upper_bound = bin_edges[i + 1]
-    percentage = ((df['sprice'] >= lower_bound) & (df['sprice'] < upper_bound)).mean() * 100
-    percentage_in_bins.append(percentage)
-
-# Plotting
-plt.figure(figsize=(10, 6))
-bars = plt.bar(range(len(percentage_in_bins)), percentage_in_bins, align='center', color='skyblue')
-
-# Title and labels
-plt.title('Percentage of Homes by Sale Price')
-plt.xlabel('Price in millions')
-plt.ylabel('Percentage of Homes')
-
-# Tick labels
-plt.xticks(range(len(percentage_in_bins)), [f"{bin_edges[i]}-{bin_edges[i+1]}" for i in range(len(bin_edges)-1)], rotation=45)
-
-# Labels
-for bar, percentage in zip(bars, percentage_in_bins):
-    height = bar.get_height()
-    plt.text(bar.get_x() + bar.get_width()/2, height, f'{percentage:.2f}%', ha='center', va='bottom')
-
-# Show plot
-plt.show()
-
-
-
-
-```
-
-In regard to fine-tuning the model, only 1.44% of the total sample falls under that criterion, and using it assumes that this data represents the population as a whole. The purpose of this model is to classify houses as built 'before 1980' or 'during or after 1980', and the instructions do not state that the data represents the population, since the given data is not raw but was prepared for the model. I found many independent variables with similar outliers that explain the dependent variable 100% of the time, again assuming the data represents the population. I did not pursue this path to find every such variable because of my limited experience with models, and I decided to assume that this data does not represent the population as a whole. Nonetheless, I think such data would be useful for fine-tuning a model. It was also hard to determine how these cases would affect the model: the samples that fully explain the dependent variable are a small percentage of the total sample, so it was hard to give them any weight in the model.
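The idea of a small "pure" segment can be checked directly. The following is a minimal sketch, assuming a toy DataFrame with made-up values in place of dwellings_ml.csv; the helper `pure_segment_summary` and the 10 million threshold are illustrative choices, not part of the original analysis. It reports what share of the sample lies above a threshold and the `before1980` rate inside that segment.

```python
import pandas as pd

# Toy stand-in for dwellings_ml.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "sprice": [150_000, 320_000, 12_000_000, 95_000, 15_500_000, 480_000],
    "before1980": [1, 0, 0, 1, 0, 1],
})


def pure_segment_summary(df, column, threshold, target="before1980"):
    """Return (share of rows above threshold in percent, target rate in that segment)."""
    segment = df[df[column] > threshold]
    share_of_sample = len(segment) / len(df) * 100
    # A target rate of exactly 0.0 or 1.0 means the segment fully determines the class.
    target_rate = segment[target].mean() if len(segment) else float("nan")
    return share_of_sample, target_rate


share, rate = pure_segment_summary(toy, "sprice", 10_000_000)
print(f"{share:.2f}% of rows exceed the threshold; before1980 rate there: {rate:.2f}")
```

A segment that is pure but tiny changes few predictions, which is one way to see why such outliers were hard to weight in the model.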
-
-
-
-
-
-```{python}
-
-# Number of baths vs year built: scatter plot -----------------------------------------------------------
-import pandas as pd
-import matplotlib.pyplot as plt
-
-# Load
-df = pd.read_csv("C://Users//eduar//Downloads//dwellings_ml.csv")
-
-# scatter plot
-plt.figure(figsize=(10, 6))
-plt.scatter(df['yrbuilt'], df['numbaths'], alpha=0.5, color='skyblue')
-
-# vertical line at year 1980
-plt.axvline(x=1980, color='red', linestyle='--', linewidth=1)
-
-# title and labels
-plt.title('Number of Baths vs Year Built')
-plt.xlabel('Year Built')
-plt.ylabel('Number of Baths')
-plt.xticks(rotation=45)
-
-# plot
-plt.show()
-
-
-
-
-```
-
-## Correlation and Regressions & Best Fit for Model
-
-__Correlation Matrix__ 
-
-
-I built a few correlation matrices in Excel to look for any noticeable correlation between the variables in the sample data. The correlation matrix did a good job of identifying closely correlated variables; for example, price and live area had a correlation of 0.67. It is reasonable that homes within an area tend to fall within a similar price range, and that different areas vary by price. These results did not really help me determine which variables best fit the model, since the closely correlated variables behaved about the same for homes built before 1980 and after 1979, so I ended the correlation matrix analysis.
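The same kind of matrix can be produced in pandas rather than Excel. This is only a sketch under stated assumptions: the toy values are invented, and the column names `sprice`, `livearea`, and `numbaths` are assumed to match the dataset. It computes the full correlation matrix and then ranks the features by the strength of their correlation with `before1980`.

```python
import pandas as pd

# Invented values; column names are assumed to mirror dwellings_ml.csv.
toy = pd.DataFrame({
    "sprice":     [100, 200, 300, 400, 500, 600],
    "livearea":   [900, 1100, 1500, 1600, 2100, 2300],
    "numbaths":   [1, 1, 2, 2, 3, 3],
    "before1980": [1, 1, 1, 0, 0, 0],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

# Correlation of each feature with the target, strongest (by absolute value) first.
target_corr = (
    corr["before1980"]
    .drop("before1980")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strongly negative relationships near the top, which a plain descending sort would push to the bottom.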
-
-![Correlation Matrix for all Variables](C:/Users/eduar/OneDrive - BYU-Idaho/2024 Winter/250 DS/project 4/corr.png)
-
-
-
-# Q2. Classification Model 
-__Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.__
-
-## Model decision
-
-I arrived at my final model by running several OLS regressions and finding the independent variables with the best fit for the dependent variable. This also included looking at the coefficients between independent variables, to help choose variables that are statistically significant with each other as well as with the dependent variable. By examining the coefficients between the independent variables and the dependent variable, I was able to arrive at this model.
-
-```{python}
-
-# model  ------------------------------------------------------- 
-
-
-import pandas as pd
-from sklearn.model_selection import train_test_split
-from sklearn.tree import DecisionTreeClassifier
-from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
-
-# Load
-file_path = "C://Users//eduar//Downloads//dwellings_ml.csv"  
-data = pd.read_csv(file_path)
-
-# target variable
-features = data.drop(['yrbuilt', 'before1980', 'parcel'], axis=1)  # Exclude unique columns
-target = data['before1980']
-
-# training vs testing
-X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.34, random_state=76)
-
-# Creating and training the Decision Tree model
-decision_tree_model = DecisionTreeClassifier(random_state=42)
-decision_tree_model.fit(X_train, y_train)
-
-# predictions
-predictions = decision_tree_model.predict(X_test)
-
-# Calculating evaluation metrics
-accuracy = accuracy_score(y_test, predictions)
-precision = precision_score(y_test, predictions)
-recall = recall_score(y_test, predictions)
-f1 = f1_score(y_test, predictions)
-conf_matrix = confusion_matrix(y_test, predictions)
-
-# evaluation metrics
-print(f"Accuracy: {accuracy}")
-print(f"Precision: {precision}")
-print(f"Recall: {recall}")
-print(f"F1 Score: {f1}")
-print(f"Confusion Matrix:\n{conf_matrix}")
-
-```
-
-
-
-```{python}
-# --------------------- question 1  -----------
-import pandas as pd
-from sklearn.model_selection import train_test_split
-from sklearn.tree import DecisionTreeClassifier
-from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
-
-# Load data
-file_path = "C://Users//eduar//Downloads//dwellings_ml.csv"  
-data = pd.read_csv(file_path)
-
-# Define features and target variable
-features = data.drop(['yrbuilt', 'before1980', 'parcel'], axis=1)  # Exclude unique columns
-target = data['before1980']
-
-# Split data into training and testing sets
-X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.34, random_state=76)
-
-# Creating and training the Decision Tree model
-decision_tree_model = DecisionTreeClassifier(random_state=42)
-decision_tree_model.fit(X_train, y_train)
-
-# Predictions
-predictions = decision_tree_model.predict(X_test)
-
-# Calculating evaluation metrics
-accuracy = accuracy_score(y_test, predictions)
-precision = precision_score(y_test, predictions)
-recall = recall_score(y_test, predictions)
-f1 = f1_score(y_test, predictions)
-conf_matrix = confusion_matrix(y_test, predictions)
-
-# Print evaluation metrics
-print(f"Accuracy: {accuracy}")
-print(f"Precision: {precision}")
-print(f"Recall: {recall}")
-print(f"F1 Score: {f1}")
-print(f"Confusion Matrix:\n{conf_matrix}")
-
-# Calculate the average of the first 10 values in testing y values
-average_first_10 = y_test[:10].mean()
-print(f"Average of the first 10 values in testing y values: {average_first_10}")
-
-
-```
-
-```{python}
-# question 2
-import pandas as pd
-from sklearn.model_selection import train_test_split
-from sklearn.tree import DecisionTreeClassifier
-from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
-
-# Load data
-file_path = "C://Users//eduar//Downloads//dwellings_ml.csv"  
-data = pd.read_csv(file_path)
-
-# Define features and target variable
-# Ensure 'sprice' is included in the features, and remove irrelevant columns
-features = data.drop(['yrbuilt', 'before1980', 'parcel'], axis=1)
-target = data['before1980']
-
-# Split data into training and testing sets
-X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.34, random_state=76)
-
-# Creating and training the Decision Tree model
-decision_tree_model = DecisionTreeClassifier(random_state=42)
-decision_tree_model.fit(X_train, y_train)
-
-# Predictions
-predictions = decision_tree_model.predict(X_test)
-
-# Calculating evaluation metrics
-accuracy = accuracy_score(y_test, predictions)
-precision = precision_score(y_test, predictions)
-recall = recall_score(y_test, predictions)
-f1 = f1_score(y_test, predictions)
-conf_matrix = confusion_matrix(y_test, predictions)
-
-# Print evaluation metrics
-print(f"Accuracy: {accuracy}")
-print(f"Precision: {precision}")
-print(f"Recall: {recall}")
-print(f"F1 Score: {f1}")
-print(f"Confusion Matrix:\n{conf_matrix}")
-
-# Calculate the average of the first 10 values in training X values for sprice
-average_first_10_sprice = X_train['sprice'][:10].mean()
-print(f"Average of the first 10 values in training X values for sprice: {average_first_10_sprice}")
-
-
-
-
-```
-
-
-
-
-## Visual of Decision Tree
-
-```{python}
-
-# See the decision tree in a graph ---must run code for model first---------------
-from sklearn.tree import export_graphviz
-import graphviz
-
-# Limit the depth of the tree for visualization purposes
-decision_tree_model_vis = DecisionTreeClassifier(random_state=42, max_depth=3)
-decision_tree_model_vis.fit(X_train, y_train)
-
-#  show
-dot_data = export_graphviz(decision_tree_model_vis, out_file=None, 
-                           feature_names=X_train.columns,  
-                           class_names=['After 1980', 'Before 1980'],
-                           filled=True, rounded=True, 
-                           special_characters=True)
-
-# Visualize the graph
-graph = graphviz.Source(dot_data) 
-graph
-
-
-
-```
-
-
-# Q3. Model Validation: Essential Feature Discussion
-
-
-__Justify your classification model by discussing the most important features selected by your model. This discussion should include a chart and a description of the features.__
-
-## Explanation of the model variables
-
-__Regression model that explains 66% of the dependent variable__
-
-The main justification for the classification model was the result of the regression analysis which resulted in the following regression. 
-
-
-![Regression model](C:/Users/eduar/OneDrive - BYU-Idaho/2024 Winter/250 DS/project 4/decision_tree_full.png)
-
-
-From the regression output, I chose to include quality_C, status_V, gartype_Att, arcstyle_ONE-STORY, stories, and numbaths in my decision tree model because they are statistically significant predictors of the dependent variable, as evidenced by their P-values. The P-values are very close to zero, which strongly rejects the null hypothesis of no effect, indicating that these variables have a significant relationship with the dependent variable.
-
-I decided to use arcstyle_ONE-STORY as the first variable in my tree because it has a high coefficient and a significant t-statistic, implying a strong and consistent impact on the dependent variable. The decision tree algorithm often chooses the feature that provides the most significant split at each node, and based on the regression results, arcstyle_ONE-STORY fits this criterion well. Its positive coefficient (0.3355) indicates that the one-story style has a substantial and positive effect on the predicted value.
-
-quality_C is also included as an important variable; it has a positive coefficient (0.2053), suggesting that the quality of a dwelling, represented by this dummy variable, is associated with an increase in the predicted value.
-
-status_V and gartype_Att both have negative coefficients, indicating that these features contribute to a decrease in the predicted value when they are present. Specifically, status_V (with a coefficient of -0.2979) shows a substantial decrease, which is why it's crucial to include it in the decision tree to capture this negative relationship.
-
-The number of stories (stories) and the number of bathrooms (numbaths) also have negative coefficients, and even though their absolute impact might be smaller compared to the other variables, they are still significant. Therefore, they are part of the decision tree to capture the full complexity of the relationships in the data.
-
-The decision tree visualization, which I included in my analysis, places these variables accordingly, with arcstyle_ONE-STORY at the top due to its strong predictive power followed by the other variables based on their contribution to reducing the model's impurity, which aligns with the coefficients' magnitude and significance in the regression output.
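A fitted scikit-learn tree also reports its own ranking, based on how much each feature reduces impurity, which can be compared with the regression-based ordering. The sketch below assumes synthetic data with hypothetical column names (`one_story`, `numbaths`, `noise`) and a target constructed to depend only on `one_story`; it is not the report's model.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; column names are hypothetical stand-ins for the real features.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "one_story": rng.integers(0, 2, n),
    "numbaths": rng.integers(1, 5, n),
    "noise": rng.normal(size=n),
})
# By construction the target depends only on one_story.
toy["before1980"] = toy["one_story"]

feature_names = ["one_story", "numbaths", "noise"]
model = DecisionTreeClassifier(random_state=42, max_depth=3)
model.fit(toy[feature_names], toy["before1980"])

# feature_importances_ holds the normalized impurity reduction per feature.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

On the real data, plotting this Series as a bar chart would give the feature chart that the question asks for.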
-
-
-
-
-# Q4. Evaluating Model Quality: Metrics & Interpretation
-
-__Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.__ 
-
-
-## Metrics in current Classification Model
-
-The model has an accuracy of 90.25%, meaning it classifies about nine out of ten homes in the test set correctly. Precision is 92.34%: of the homes the model predicted as positive (built before 1980), about 92.34% actually were. Recall is 92.08%: of all homes that are actually positive, the model captures about 92.08%. The F1 score, the harmonic mean of precision and recall, is 92.21%. In the confusion matrix, 2240 instances were correctly predicted as negative and 3964 were correctly predicted as positive; the remaining 329 and 341 instances were misclassified. The reported precision and recall are consistent with 329 false positives and 341 false negatives.
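The reported percentages can be recomputed by hand from the four confusion-matrix counts. This is a minimal sketch that assumes the 329 errors are false positives and the 341 are false negatives, since that assignment reproduces the reported precision and recall; the function name `classification_metrics` is illustrative.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total      # share of all predictions that are correct
    precision = tp / (tp + fp)        # share of predicted positives that are correct
    recall = tp / (tp + fn)           # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts from the report; the fp/fn assignment is an assumption (see above).
metrics = classification_metrics(tn=2240, fp=329, fn=341, tp=3964)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Swapping `fp` and `fn` leaves accuracy and F1 unchanged but exchanges precision and recall, so the direction of the two error counts matters when interpreting them.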
-
-
-
-
-
-## Regression results explaining why I chose arcstyle_ONE-STORY as the first independent variable in the tree
-
-__This regression uses only three independent variables__
-
-
-
-![Decision Tree](C:/Users/eduar/OneDrive - BYU-Idaho/2024 Winter/250 DS/project 4/Screenshot 2024-03-06 200926.png)
-
-Here, the R-squared value is approximately 0.338, which means that about 33.8% of the variability in the dependent variable can be explained by the model. 
-
-Coefficients: The coefficients represent the relationship between each independent variable and the dependent variable. For instance, the positive coefficient for arcstyle_ONE-STORY suggests that being a one-story building is associated with an increase in the dependent variable, while negative coefficients for stories and numbaths indicate a decrease.
-
-Confidence Intervals: The 95% confidence intervals provide a range of plausible values for the coefficients. If the interval does not include zero, it indicates that there is a significant effect at the 95% confidence level. This can help in feature selection when building machine learning models, as it indicates which features are likely to be important predictors.
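The interval check described above amounts to simple arithmetic. In this sketch the coefficient 0.3355 comes from the report, while the standard error of 0.02 is a hypothetical value chosen for illustration; the multiplier 1.96 is the large-sample normal approximation, whereas OLS output normally uses a t-distribution.

```python
def confidence_interval_95(coefficient, standard_error, z=1.96):
    """Approximate 95% interval: coefficient plus or minus z standard errors."""
    half_width = z * standard_error
    return coefficient - half_width, coefficient + half_width


def excludes_zero(interval):
    """True when the whole interval lies on one side of zero."""
    low, high = interval
    return low > 0 or high < 0


# 0.3355 is the reported coefficient; 0.02 is a hypothetical standard error.
ci = confidence_interval_95(0.3355, 0.02)
print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f}); excludes zero: {excludes_zero(ci)}")
```

An interval that excludes zero corresponds to a coefficient that is significant at roughly the 5% level.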
-
-Model Validation: The standard error and significance testing of the regression can be conceptually linked to model validation techniques in classification models. The standard error can indicate how much the estimated coefficients 'jump around' when you estimate the model on different subsets of the data.
-
-
-
+update. current file is rendering errors