In this notebook, we'll look at using linear regression to study changes in temperature.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%config InlineBackend.figure_format ='retina'
We'll be getting data from the North America Land Data Assimilation System (NLDAS), which provides daily average temperatures for the United States from 1979 to 2011.
For the next step, you will need to choose some settings in the data request form. These are:
- GroupBy: Month Day, Year
- Your State
- Export Results (check box)
- Show Zero Values (check box)
- Download the data for your home state (or state of your choosing) and upload it to M2 in your work directory.
df = pd.read_csv('North America Land Data Assimilation System (NLDAS) Daily Air Temperatures and Heat Index (1979-2011).txt',delimiter='\t',skipfooter=14,engine='python')
df
- Drop any rows that have the value "Total" in the Notes column, then drop the Notes column
- Make a column called Date that is in the pandas datetime format
- Make columns for 'Year', 'Month', and 'Day' by splitting the column 'Month Day, Year'
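One possible way to approach the cleaning steps above is sketched here on a small made-up table. The `toy` dataframe, its values, and the `"Jan 01, 1979"` date format are assumptions for illustration; check the format in your own download before reusing the `format=` string.

```python
import pandas as pd

# Toy stand-in for the downloaded file; values and date format are assumptions.
toy = pd.DataFrame({
    "Notes": [None, None, "Total", None],
    "Month Day, Year": ["Jan 01, 1979", "Jan 02, 1979", None, "Feb 01, 1979"],
    "Avg Daily Max Air Temperature (F)": [30.1, 28.4, 29.25, 35.0],
})

# Drop the "Total" summary rows, then the Notes column itself.
toy = toy[toy["Notes"] != "Total"].drop(columns="Notes").reset_index(drop=True)

# Parse the text dates into the pandas datetime format.
toy["Date"] = pd.to_datetime(toy["Month Day, Year"], format="%b %d, %Y")

# Split "Jan 01, 1979" on ", " to get the year, then on " " to get the day.
parts = toy["Month Day, Year"].str.split(", ", expand=True)
toy["Year"] = parts[1].astype(int)
toy["Day"] = parts[0].str.split(" ", expand=True)[1].astype(int)

# The month is kept as a number (1-12) so it can be used as an index later.
toy["Month"] = toy["Date"].dt.month

print(toy[["Date", "Year", "Month", "Day"]])
```

Taking the month number from the parsed `Date` column rather than from the split text avoids having to map month names to numbers by hand.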
df['DateInt'] = df['Date'].astype('int64')/10e10 # Integer timestamp scaled down by 10e10 (= 1e11); used later as the x values for the regression
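To see what the DateInt conversion produces, here is a small sketch on two made-up dates. It assumes the dates are stored at nanosecond resolution (the explicit cast to `datetime64[ns]` forces this), so the integer is nanoseconds since 1970-01-01 and one day corresponds to 864 DateInt units.

```python
import pandas as pd

# Two consecutive example dates (made up for illustration).
dates = pd.Series(pd.to_datetime(["1979-01-01", "1979-01-02"]))

# Force nanosecond resolution, then view the timestamps as integers.
ns = dates.astype("datetime64[ns]").astype("int64")

# Same scaling as the DateInt column: 10e10 is 1e11.
date_int = ns / 10e10

print(date_int.tolist())
print(date_int[1] - date_int[0])  # DateInt units per day
```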
- Use df.plot.scatter to plot 'Date' vs 'Avg Daily Max Air Temperature (F)'. You might want to add figsize=(50,5) as an argument to make it more clear what is happening.
- Describe your plot.
# No need to edit this unless you want to try different colors or a pattern other than colors by month
cmap = plt.get_cmap("nipy_spectral", len(df['Month'].unique())) # Builds a discrete color mapping using a built in matplotlib color map (matplotlib.cm.get_cmap was removed in recent matplotlib versions)
c = []
for i in range(cmap.N): # Converts our discrete map into Hex Values
    rgba = cmap(i)
    c.append(matplotlib.colors.rgb2hex(rgba))
df['color']=[c[int(i-1)] for i in df['Month'].astype(int)] # Adds a column to our dataframe with the color we want for each row
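- Make the same scatter plot as before, but color the points by month by adding the argument c=df['color'] to the plotting command.
- Select a 6-month period from the data and store it in a variable named subset. Hint: use boolean logic on the Date column with pd.Timestamp(YYYY, MM, DD); the older pd.datetime has been removed from recent pandas versions.
- Plot the subset using the same code you used for the colored scatter plot. You can change the figsize if needed.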
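A minimal sketch of selecting a date range with boolean logic, using a made-up dataframe (`toy`) with one row per day of 1990. The year, the bounds, and the constant temperature are placeholders; substitute your own dataframe and dates.

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe: one row per day, made-up temperatures.
toy = pd.DataFrame({"Date": pd.date_range("1990-01-01", "1990-12-31", freq="D")})
toy["Avg Daily Max Air Temperature (F)"] = 50.0

# Keep rows on or after the start date and strictly before the end date.
start = pd.Timestamp(1990, 1, 1)
end = pd.Timestamp(1990, 7, 1)
mask = (toy["Date"] >= start) & (toy["Date"] < end)

# .copy() avoids pandas warnings if columns are added to the subset later.
subset = toy[mask].copy()
print(len(subset), subset["Date"].min(), subset["Date"].max())
```

Each comparison needs its own parentheses because `&` binds more tightly than `>=` and `<` in Python.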
We are going to use a very simple linear regression model. You may implement a more complex model if you wish.
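The method described here is called the least squares method. It fits the line $y = mx + b$, with the slope and intercept defined as:
$$m = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b = \bar{y} - m\bar{x}$$
Where $\bar{x}$ and $\bar{y}$ are the averages of the x and y values, and $n$ is the number of data points.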
First we need to define our X and Y values.
X=subset['DateInt'].values
Y=subset['Avg Daily Max Air Temperature (F)'].values
def lin_reg(x,y):
    # Calculate the average x and y
    x_avg = np.mean(x)
    y_avg = np.mean(y)
    num = 0
    den = 0
    for i in range(len(x)): # This represents our sums
        num = num + (x[i] - x_avg)*(y[i] - y_avg) # Our numerator
        den = den + (x[i] - x_avg)**2 # Our denominator
    # Calculate slope
    m = num / den
    # Calculate intercept
    b = y_avg - m*x_avg
    print(m, b)
    # Calculate our predicted y values
    y_pred = m*x + b
    return y_pred
Y_pred = lin_reg(X,Y)
subset.plot.scatter(x='Date', y='Avg Daily Max Air Temperature (F)',c=subset['color'])
plt.plot(subset['Date'].values, Y_pred, color='red') # best fit line, drawn from the fitted values so a negative slope is displayed correctly
plt.show()
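As a sanity check on the least squares formulas, the sketch below computes the slope and intercept with vectorized NumPy operations and compares them with `np.polyfit`. The helper name `fit_line` and the four sample points are made up for illustration; the points lie exactly on the line y = 2x + 1.

```python
import numpy as np

def fit_line(x, y):
    """Return the least squares slope and intercept for 1-D data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_avg, y_avg = x.mean(), y.mean()
    # Same sums as the loop version, written as array operations.
    m = np.sum((x - x_avg) * (y - y_avg)) / np.sum((x - x_avg) ** 2)
    b = y_avg - m * x_avg
    return m, b

# Sample points that lie exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

m, b = fit_line(x, y)

# np.polyfit with degree 1 returns [slope, intercept].
m_np, b_np = np.polyfit(x, y, 1)

print(m, b)
print(np.isclose(m, m_np), np.isclose(b, b_np))
```

If the two methods disagree on your own data, the most likely cause is a mismatch between the x and y arrays, for example different lengths or missing values.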
- What are the slope and intercept of your best fit line?
- What are the minimum and maximum Y values of your best fit line? Is your slope positive or negative?
- Generate a best fit line for the full data set and plot the line over top of the data.
- Is the slope positive or negative? What do you think that means?
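When interpreting the slope, keep in mind that it is expressed in degrees Fahrenheit per DateInt unit, not per day or per year. The sketch below converts it to degrees per year, assuming the Date column holds nanosecond timestamps so that one DateInt unit is 1e11 nanoseconds (100 seconds). The function name `slope_per_year` and the example slope are hypothetical.

```python
SECONDS_PER_DAY = 86_400
NS_PER_SECOND = 1e9
DATEINT_SCALE = 10e10  # same divisor used for the DateInt column (equal to 1e11)

def slope_per_year(m, days_per_year=365.25):
    """Convert a slope in degrees F per DateInt unit to degrees F per year."""
    units_per_day = SECONDS_PER_DAY * NS_PER_SECOND / DATEINT_SCALE  # 864 units
    return m * units_per_day * days_per_year

# Hypothetical slope, only to show the conversion; use the m printed by lin_reg.
example_m = 1e-6
print(slope_per_year(example_m))
```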