Note: This is a generated markdown export from the Jupyter notebook file statistical_analysis.ipynb. You can also view the notebook with the nbviewer from Jupyter.
In this notebook we use pandas and the stats module from scipy for some basic statistical analysis.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy import stats
plt.style.use("ggplot")
First we need some data. Let's use pandas to load the 'adult' data set from the UC Irvine Machine Learning Repository into a dataframe.
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", names=["Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Martial Status",
"Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
"Hours per week", "Country", "Target"])
# Data cleaning: remove leading and trailing spaces from the 'Sex' column
df['Sex'] = df['Sex'].str.strip()
df.head()
|   | Age | Workclass | fnlwgt | Education | Education-Num | Martial Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Country | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
Let's have a first look at the shape of our dataframe.
df.shape
(32561, 15)
What are the column names?
df.columns
Index(['Age', 'Workclass', 'fnlwgt', 'Education', 'Education-Num',
'Martial Status', 'Occupation', 'Relationship', 'Race', 'Sex',
'Capital Gain', 'Capital Loss', 'Hours per week', 'Country', 'Target'],
dtype='object')
We can calculate the mean, median, standard error of the mean (sem), variance, standard deviation (std) and quantiles for every numeric column in the dataframe.
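# numeric_only=True skips the string columns (recent pandas versions raise a TypeError otherwise)
df.mean(numeric_only=True)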
Age 38.581647
fnlwgt 189778.366512
Education-Num 10.080679
Capital Gain 1077.648844
Capital Loss 87.303830
Hours per week 40.437456
dtype: float64
df.median(numeric_only=True)
Age 37.0
fnlwgt 178356.0
Education-Num 10.0
Capital Gain 0.0
Capital Loss 0.0
Hours per week 40.0
dtype: float64
df.sem(numeric_only=True)
Age 0.075593
fnlwgt 584.937250
Education-Num 0.014258
Capital Gain 40.927838
Capital Loss 2.233126
Hours per week 0.068427
dtype: float64
df.var(numeric_only=True)
Age 1.860614e+02
fnlwgt 1.114080e+10
Education-Num 6.618890e+00
Capital Gain 5.454254e+07
Capital Loss 1.623769e+05
Hours per week 1.524590e+02
dtype: float64
df.std(numeric_only=True)
Age 13.640433
fnlwgt 105549.977697
Education-Num 2.572720
Capital Gain 7385.292085
Capital Loss 402.960219
Hours per week 12.347429
dtype: float64
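The dispersion measures above are related: the variance is the square of the standard deviation, and the standard error of the mean is the standard deviation divided by the square root of the number of observations. The following is a minimal sketch that checks this for the `Age` column, using the rounded values printed above and the row count from `df.shape`. It assumes that `Age` has no missing values, and the variable names are illustrative only.

```python
import math

# Values copied from the outputs above (rounded as printed).
n_rows = 32561          # number of rows, from df.shape
std_age = 13.640433     # standard deviation of Age, from df.std()

# Variance is the squared standard deviation.
var_age = std_age ** 2

# Standard error of the mean: std / sqrt(n).
# Assumption: Age has no missing values, so n equals the row count.
sem_age = std_age / math.sqrt(n_rows)

print("variance: {:.4f}".format(var_age))
print("sem:      {:.6f}".format(sem_age))
```

Both results should agree with the `Age` entries of the variance and sem outputs, up to rounding.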
df.quantile(q=0.5, numeric_only=True)
Age 37.0
fnlwgt 178356.0
Education-Num 10.0
Capital Gain 0.0
Capital Loss 0.0
Hours per week 40.0
Name: 0.5, dtype: float64
df.quantile(q=[0.05, 0.95], numeric_only=True)
|   | Age | fnlwgt | Education-Num | Capital Gain | Capital Loss | Hours per week |
|---|---|---|---|---|---|---|
| 0.05 | 19.0 | 39460.0 | 5.0 | 0.0 | 0.0 | 18.0 |
| 0.95 | 63.0 | 379682.0 | 14.0 | 5013.0 | 0.0 | 60.0 |
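The statistics computed one by one above can also be collected in a single table. The sketch below uses `DataFrame.agg` on a small toy dataframe rather than the full data set, so that it runs without a download. The values are taken from the first five rows shown earlier, and the names `toy`, `numeric`, `summary` and `quantiles` are illustrative.

```python
import pandas as pd

# Toy data: three columns from the first five rows of the adult data set.
toy = pd.DataFrame({
    "Age": [39, 50, 38, 53, 28],
    "Hours per week": [40, 13, 40, 40, 40],
    "Sex": ["Male", "Male", "Male", "Male", "Female"],
})

# Keep only the numeric columns before aggregating.
numeric = toy.select_dtypes(include="number")

# One row per statistic, one column per numeric variable.
summary = numeric.agg(["mean", "median", "sem", "var", "std"])
print(summary)

# Quantiles take an argument, so they are computed separately.
quantiles = numeric.quantile(q=[0.05, 0.95])
print(quantiles)
```

Selecting the numeric columns explicitly has the same effect here as passing `numeric_only=True` to each method.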
In the next sample we replace a value with None so that we can show how to handle missing values in a dataframe.
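As a minimal sketch of that idea, the block below replaces one value in a toy column with None and shows two common ways to deal with the gap. The toy column reuses the ages from the first five rows. Which strategy is appropriate depends on the data, so both `dropna` and `fillna` are shown only as options.

```python
import pandas as pd

# Toy column: the ages from the first five rows, with one value replaced by None.
ages = [39, 50, 38, 53, 28]
ages[1] = None
toy = pd.DataFrame({"Age": ages})

# pandas stores the None as NaN, so the column has a float dtype.
n_missing = toy["Age"].isna().sum()

# Statistics skip missing values by default (skipna=True).
mean_age = toy["Age"].mean()

# Option 1: drop the rows that have a missing Age.
dropped = toy.dropna(subset=["Age"])

# Option 2: fill the gap, here with the median of the remaining values.
filled = toy["Age"].fillna(toy["Age"].median())

print("missing values:", n_missing)
print("mean ignoring NaN:", mean_age)
print("rows after dropna:", len(dropped))
print("filled column:", filled.tolist())
```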
First, let's create a pair plot.
_ = sns.pairplot(df, hue="Target")
_ = sns.displot(df, x="Age", hue="Sex", kind="kde", log_scale=False)
female = df[df.Sex == 'Female']
male = df[df.Sex == 'Male']
T-Test
t, p = stats.ttest_ind(female['Age'], male['Age'])
print("test statistic: {}".format(t))
print("p-value: {}".format(p))
test statistic: -16.092517011911756
p-value: 4.8239930687799265e-58
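By default, `stats.ttest_ind` assumes that both groups have the same population variance. When that assumption is doubtful, passing `equal_var=False` runs Welch's t-test instead. The sketch below compares the two variants on synthetic samples, not on the adult data. The group means, spreads and sizes are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic samples (not the adult data): two groups with different spread and size.
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=37.0, scale=14.0, size=200)
group_b = rng.normal(loc=45.0, scale=10.0, size=300)

# Student's t-test, which assumes equal variances (the default).
t_student, p_student = stats.ttest_ind(group_a, group_b)

# Welch's t-test, which drops the equal-variance assumption.
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print("Student: t = {:.3f}, p = {:.3g}".format(t_student, p_student))
print("Welch:   t = {:.3f}, p = {:.3g}".format(t_welch, p_welch))
```

With large samples of similar spread the two variants usually give close results; they can differ noticeably when both the variances and the group sizes are unequal.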
Wilcoxon rank-sum test
z, p = stats.ranksums(female['Age'], male['Age'])
print("test statistic: {}".format(z))
print("p-value: {}".format(p))
test statistic: -18.107256874221704
p-value: 2.79324734147619e-73
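The statistic returned by `stats.ranksums` is a z-score based on a normal approximation, and the two-sided p-value is the probability mass in both tails of the standard normal distribution beyond that score. A small sketch that recomputes the p-value from the statistic printed above (the variable names are illustrative):

```python
from scipy import stats

# Test statistic copied from the output above.
z = -18.107256874221704

# Two-sided p-value: twice the upper-tail probability of |z|
# under the standard normal distribution.
p_two_sided = 2 * stats.norm.sf(abs(z))

print("p-value: {}".format(p_two_sided))
```

This should reproduce the p-value reported by `stats.ranksums`, up to floating-point rounding.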