diff --git a/README.md b/README.md index 2022ee8..0b81cd5 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,8 @@ The first thing we need to do before starting our work, is importing the librari `weather_data = pd.read_csv('data/weatherHistory.csv')`. ## Exploring Data -Now that we have our data and our libraries ready, let's start by taking a look at the data than we have. +Now that we have our data and our libraries ready, let's start by taking a look at the data that we have. + 3. Open the weather_data variable in the Variable Explorer. 4. Verify that the size of the data displayed in the Variable Explorer, corresponds to the result of the following command `len(weather_data)` @@ -18,7 +19,7 @@ Now that we have our data and our libraries ready, let's start by taking a look 6. Now try printing the last 3 rows of the DataFrame. ## Visualisation -A useful tool for exploring data that we are going to work on is plotting it. This is easy to do, using our pandas library which we imported previously. +A useful tool for exploring the data that we are going to work with, is plotting it. This is easy to do, using our pandas library which we imported previously. The first thing we want to do before plotting our data is ordering the rows according to the date. Use the Variable Explorer to verify that our data is not ordered by default. 7. Use the following commands to create a new variable with our data ordered. @@ -57,7 +58,7 @@ Now, we want to evaluate the relationships between the variables in our data set 18. Open the plots pane to visualize the correlations plot. 19. Import the function `plot_color_gradients` which is also in the utils.py file which will help you plot the colormap gradient to be able to interpret your correlations plot. 20. Plot the colormap gradient using the following commands. 
-`cmap_category, cmap_list = ('Plot gradiends convention', ['viridis', ])` +`cmap_category, cmap_list = ('Plot gradients convention', ['viridis', ])` `plot_color_gradients(cmap_category, cmap_list)` 21. Calculate the correlations between the different variables in our data set usgin the following command `weather_correlations = weather_data_ordered.corr()`. 22. Open the variable `weather_correlations`in the Variable Explorer. @@ -65,7 +66,7 @@ Now, we want to evaluate the relationships between the variables in our data set weather_data_ordered['Humidity'])`in the console to get the correlation between the Humidity and Temperature. Verify it has the same value in the correlations DataFrame. 24. Try calculating correlations between different variables and comparing them with the ones in the data frame. -## Data Dodeling and Prediction +## Data Modeling and Prediction Finally, we want to use our data to construct a model that allows us predicting values for some of our variables. In our previous section we realized that humidity and temperature are two of the most correlated variables so we are going to use these two first. We are going to use scikit-learn which is a python library that contains tools to explore data and build different types of predictive models. We will use two functions for this task which need to be imported. @@ -100,8 +101,12 @@ Note that this means our model is a linear function `$$y = beta_0 + beta_1 \time 31. Using the coefficients found in our model, predict the temperature for a given level of humidity using the `predicted_temperature` function available in 'utils'. +Finally, we can numerically evaluate how well our model predicts. For this we will use the `explained_variance_score` metric available in `sklearn.metrics`. This metric is calculated as `1 - Var(Y_real - Y_model) / Var(Y_real)`, which means that the closer the value is to 1, the better our model. +32. 
Use the following command `from sklearn.metrics import explained_variance_score` to import the function that evaluates how good our model is. +33. Calculate the explained variance score and print it using the following commands `ev = explained_variance_score(Y_test, Y_predict)` +`print(ev)`. diff --git a/workshop.py b/workshop.py index 68e4ac5..4e709c3 100644 --- a/workshop.py +++ b/workshop.py @@ -8,70 +8,57 @@ # In[1] Importing Libraries and Data # Third-party imports -import matplotlib.pyplot as plt -import pandas as pd -from sklearn.model_selection import train_test_split -from sklearn import linear_model # Local imports -from utils import ( - plot_correlations, plot_color_gradients, aggregate_by_year, - predicted_temperature) + # In[2] Exploring Data -weather_data = pd.read_csv('data/weatherHistory.csv') -print(len(weather_data)) -print(weather_data.head(3)) +# Reading data + +# Print size of data + +# Print first 3 rows of DataFrame # TO DO: Print the last 3 rows of the DataFrame -print(weather_data.tail(3)) + # In[3] Visualisation -weather_data['Formatted Date'] = pd.to_datetime( - weather_data['Formatted Date']) -weather_data_ordered = weather_data.sort_values(by='Formatted Date') +# Order rows according to date + +# Order Index according to date -weather_data_ordered = weather_data_ordered.reset_index(drop=True) # Drop categorical columns -weather_data_ordered = weather_data_ordered.drop( - columns=['Summary', 'Precip Type', 'Loud Cover', 'Daily Summary']) -weather_data_ordered.plot( - x='Formatted Date', y=['Temperature (C)'], color='red', figsize=(15, 8)) +# Plot Temperature Vs Formatted Date # TO DO: Plot Temperature (C) V.S the Date using only the data from 2006 -weather_data_ordered.head(8759).plot(x='Formatted Date', y=['Temperature (C)'], color='red') # ----------------------------------------------------------------------------- -weather_data_ordered.plot( - subplots=True, x='Formatted Date', y=['Temperature (C)', 'Humidity'], - figsize=(15, 8)) + +# Plot 
Temperature and Humidity in the same plot + # TO DO: Plot different combinations of the variables, for different years -# ----------------------------------------------------------------------------- # In[4] Data summarization and aggregation # Weather data by year -weather_data_by_year = aggregate_by_year( - weather_data_ordered, 'Formatted Date') # TO DO: Create and use a function to get the average # of the weather data by month # In[5] Data Analysis and Interpretation -plot_correlations(weather_data_ordered, size=15) -cmap_category, cmap_list = ('Plot gradiends convention', ['viridis', ]) -plot_color_gradients(cmap_category, cmap_list) -weather_correlations = weather_data_ordered.corr() -weather_data_ordered['Temperature (C)'].corr( - weather_data_ordered['Humidity']) +# Plot Correlations + +# Plot Gradients colormaps + +# Compute Correlations # TO DO: Get the correlation for different combinations of variables. # Contrast them with the weather_correlations dataframe @@ -80,27 +67,16 @@ # In[6] Data Modeling and Prediction # Get data subsets for the model -X_train, X_test, Y_train, Y_test = train_test_split( - weather_data_ordered['Humidity'], weather_data_ordered['Temperature (C)'], - test_size=0.25) # Run regression -regresion = linear_model.LinearRegression() -regresion.fit(X_train.values.reshape(-1, 1), Y_train.values.reshape(-1, 1)) -print(regresion.intercept_, regresion.coef_) # beta_0=intercept, beta_1=coef_ # Get coefficients -beta_0 = regresion.intercept_[0] -beta_1 = regresion.coef_[0, 0] -Y_predict = predicted_temperature(X_test, beta_0, beta_1) -plt.scatter(X_test, Y_test, c='red', label='observation', s=1) -plt.scatter(X_test, Y_predict, c='blue', label='model') -plt.xlabel('Humidity') -plt.ylabel('Temperature (C)') -plt.legend() -plt.show() +# Plot predicted model with test data. 
# TO DO: Using the coefficients predict the temperature for a # given level of humidity using the 'predicted_temperature' function # available in 'utils' + +# Evaluate model numerically + diff --git a/workshop_solutions.py b/workshop_solutions.py new file mode 100644 index 0000000..e31a981 --- /dev/null +++ b/workshop_solutions.py @@ -0,0 +1,122 @@ +# -*- coding: utf-8 -*- +# +# Copyright © Spyder Project Contributors +# Licensed under the terms of the MIT License +"""Workshop main flow.""" + + +# In[1] Importing Libraries and Data + +# Third-party imports +import matplotlib.pyplot as plt +import pandas as pd +from sklearn.model_selection import train_test_split +from sklearn import linear_model +from sklearn.metrics import explained_variance_score + +# Local imports +from utils import ( + plot_correlations, plot_color_gradients, aggregate_by_year, + predicted_temperature) + +# In[2] Exploring Data + +# Reading data +weather_data = pd.read_csv('data/weatherHistory.csv') + +# Print size of data +print(len(weather_data)) +# Print first 3 rows of DataFrame +print(weather_data.head(3)) + +# TO DO: Print the last 3 rows of the DataFrame +print(weather_data.tail(3)) + + +# In[3] Visualisation + +# Order rows according to date +weather_data['Formatted Date'] = pd.to_datetime( + weather_data['Formatted Date']) +weather_data_ordered = weather_data.sort_values(by='Formatted Date') +# Order Index according to date +weather_data_ordered = weather_data_ordered.reset_index(drop=True) +# Drop categorical columns +weather_data_ordered = weather_data_ordered.drop( + columns=['Summary', 'Precip Type', 'Loud Cover', 'Daily Summary']) +# Plot Temperature Vs Formatted Date +weather_data_ordered.plot( + x='Formatted Date', y=['Temperature (C)'], color='red', figsize=(15, 8)) + +# TO DO: Plot Temperature (C) V.S the Date using only the data from 2006 +weather_data_ordered.head(8759).plot(x='Formatted Date', y=['Temperature (C)'], color='red') + +# 
----------------------------------------------------------------------------- +# Plot Temperature and Humidity in the same plot +weather_data_ordered.plot( + subplots=True, x='Formatted Date', y=['Temperature (C)', 'Humidity'], + figsize=(15, 8)) +# TO DO: Plot different combinations of the variables, for different years + + +# ----------------------------------------------------------------------------- + +# In[4] Data summarization and aggregation + +# Weather data by year +weather_data_by_year = aggregate_by_year( + weather_data_ordered, 'Formatted Date') + +# TO DO: Create and use a function to get the average +# of the weather data by month + + +# In[5] Data Analysis and Interpretation + +# Plot Correlations +plot_correlations(weather_data_ordered, size=15) +# Plot Gradients colormaps +cmap_category, cmap_list = ('Plot gradients convention', ['viridis', ]) +plot_color_gradients(cmap_category, cmap_list) + +# Compute Correlations +weather_correlations = weather_data_ordered.corr() +weather_data_ordered['Temperature (C)'].corr( + weather_data_ordered['Humidity']) + +# TO DO: Get the correlation for different combinations of variables. +# Contrast them with the weather_correlations dataframe + + +# In[6] Data Modeling and Prediction + +# Get data subsets for the model +X_train, X_test, Y_train, Y_test = train_test_split( + weather_data_ordered['Humidity'], weather_data_ordered['Temperature (C)'], + test_size=0.25) + +# Run regression +regresion = linear_model.LinearRegression() +regresion.fit(X_train.values.reshape(-1, 1), Y_train.values.reshape(-1, 1)) +print(regresion.intercept_, regresion.coef_) # beta_0=intercept, beta_1=coef_ + +# Get coefficients +beta_0 = regresion.intercept_[0] +beta_1 = regresion.coef_[0, 0] + +# Plot predicted model with test data. 
+Y_predict = predicted_temperature(X_test, beta_0, beta_1) +plt.scatter(X_test, Y_test, c='red', label='observation', s=1) +plt.scatter(X_test, Y_predict, c='blue', label='model') +plt.xlabel('Humidity') +plt.ylabel('Temperature (C)') +plt.legend() +plt.show() + +# TO DO: Using the coefficients predict the temperature for a +# given level of humidity using the 'predicted_temperature' function +# available in 'utils' + +# Evaluate model numerically +ev = explained_variance_score(Y_test, Y_predict) +print(ev)
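A note on the metric added in this change: the README describes the explained variance score as 1 - Var(Y_real - Y_model) / Var(Y_real). The sketch below computes that expression by hand with the standard library so the formula can be checked on a few values. The function name `explained_variance` and the sample lists are hypothetical and are not part of the workshop code; it assumes population variance, which should match the scikit-learn result for a single output.

```python
from statistics import pvariance


def explained_variance(y_real, y_model):
    """Hand-rolled 1 - Var(y_real - y_model) / Var(y_real) (illustrative helper)."""
    # Residuals between the observed values and the model's predictions.
    residuals = [real - model for real, model in zip(y_real, y_model)]
    # Population variance of the residuals relative to the variance of the data.
    return 1 - pvariance(residuals) / pvariance(y_real)


# Made-up temperatures, used only to exercise the formula.
y_real = [10.0, 12.0, 14.0, 16.0]
y_perfect = [10.0, 12.0, 14.0, 16.0]   # predictions equal to the observations
y_constant = [13.0, 13.0, 13.0, 13.0]  # always predicts the mean of y_real

print(explained_variance(y_real, y_perfect))   # 1.0
print(explained_variance(y_real, y_constant))  # 0.0
```

A model that reproduces the data exactly scores 1, while one that always predicts the mean scores 0. Because only the variance of the residuals enters the formula, a prediction that is off by a constant offset would still score 1, so the score is best read alongside the scatter plot of `Y_test` against `Y_predict`.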