statistical_analysis.md


Note: This is a generated markdown export of the Jupyter notebook statistical_analysis.ipynb. You can also view the notebook with nbviewer from Jupyter.

Statistical analysis

In this notebook we use pandas and the stats module from scipy for some basic statistical analysis.

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy import stats

plt.style.use("ggplot")
 

First we need some data. Let's use pandas to load the 'adult' data set from the UC Irvine Machine Learning Repository into a dataframe.

# The raw file has no header row, so the column names are passed explicitly
column_names = ["Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Martial Status",
                "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
                "Hours per week", "Country", "Target"]
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                 names=column_names)

# Data cleaning: remove leading and trailing spaces from the Sex column
df['Sex'] = df['Sex'].str.strip()


df.head()
   Age  Workclass         fnlwgt  Education  Education-Num  Martial Status      Occupation         Relationship   Race   Sex     Capital Gain  Capital Loss  Hours per week  Country        Target
0  39   State-gov         77516   Bachelors  13             Never-married       Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States  <=50K
1  50   Self-emp-not-inc  83311   Bachelors  13             Married-civ-spouse  Exec-managerial    Husband        White  Male    0             0             13              United-States  <=50K
2  38   Private           215646  HS-grad    9              Divorced            Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States  <=50K
3  53   Private           234721  11th       7              Married-civ-spouse  Handlers-cleaners  Husband        Black  Male    0             0             40              United-States  <=50K
4  28   Private           338409  Bachelors  13             Married-civ-spouse  Prof-specialty     Wife           Black  Female  0             0             40              Cuba           <=50K

Descriptive statistics

Let's have a first look at the shape of our dataframe.

df.shape
(32561, 15)

What are the column names?

df.columns
Index(['Age', 'Workclass', 'fnlwgt', 'Education', 'Education-Num',
       'Martial Status', 'Occupation', 'Relationship', 'Race', 'Sex',
       'Capital Gain', 'Capital Loss', 'Hours per week', 'Country', 'Target'],
      dtype='object')

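We can calculate the mean, median, standard error of the mean (sem), variance, standard deviation (std) and quantiles for every numeric column in the dataframe. Recent pandas versions (2.0 and later) no longer drop non-numeric columns silently, so pass numeric_only=True when the dataframe also contains string columns.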

df.mean(numeric_only=True)
Age                   38.581647
fnlwgt            189778.366512
Education-Num         10.080679
Capital Gain        1077.648844
Capital Loss          87.303830
Hours per week        40.437456
dtype: float64
df.median(numeric_only=True)
Age                   37.0
fnlwgt            178356.0
Education-Num         10.0
Capital Gain           0.0
Capital Loss           0.0
Hours per week        40.0
dtype: float64
df.sem(numeric_only=True)
Age                 0.075593
fnlwgt            584.937250
Education-Num       0.014258
Capital Gain       40.927838
Capital Loss        2.233126
Hours per week      0.068427
dtype: float64
df.var(numeric_only=True)
Age               1.860614e+02
fnlwgt            1.114080e+10
Education-Num     6.618890e+00
Capital Gain      5.454254e+07
Capital Loss      1.623769e+05
Hours per week    1.524590e+02
dtype: float64
df.std(numeric_only=True)
Age                   13.640433
fnlwgt            105549.977697
Education-Num          2.572720
Capital Gain        7385.292085
Capital Loss         402.960219
Hours per week        12.347429
dtype: float64
df.quantile(q=0.5, numeric_only=True)
Age                   37.0
fnlwgt            178356.0
Education-Num         10.0
Capital Gain           0.0
Capital Loss           0.0
Hours per week        40.0
Name: 0.5, dtype: float64
df.quantile(q=[0.05, 0.95], numeric_only=True)
       Age    fnlwgt  Education-Num  Capital Gain  Capital Loss  Hours per week
0.05  19.0   39460.0            5.0           0.0           0.0            18.0
0.95  63.0  379682.0           14.0        5013.0           0.0            60.0
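
The statistics above are closely related: the variance is the square of the standard deviation, the standard error of the mean is the standard deviation divided by the square root of the number of observations, and the 0.5 quantile is the median. The following is a minimal sketch that checks these relationships on a small hypothetical column (the name toy and its values are made up for illustration and are not part of the adult data set).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the adult data set
toy = pd.DataFrame({"x": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

n = toy["x"].count()      # number of non-missing observations
std = toy["x"].std()      # sample standard deviation (pandas uses ddof=1)
var = toy["x"].var()      # sample variance (ddof=1)
sem = toy["x"].sem()      # standard error of the mean

# variance = std^2 and sem = std / sqrt(n)
print(np.isclose(var, std ** 2))
print(np.isclose(sem, std / np.sqrt(n)))

# the 0.5 quantile is the median
print(toy["x"].quantile(q=0.5) == toy["x"].median())
```

The same check can be done by hand on the output above: for Age, 13.640433 divided by the square root of 32561 gives roughly 0.0756, which agrees with the sem value.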

In the next example we replace a value with None so that we can show how to handle missing values in a dataframe.
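
A minimal sketch of such an example, using a small hypothetical dataframe (small, with made-up rows) instead of the full adult data. It assumes the usual pandas behavior that None in a float column is stored as NaN, and it shows three common options: counting missing values, dropping incomplete rows, and filling with the median.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the adult data
small = pd.DataFrame({
    "Age": [39.0, 50.0, 38.0, 53.0, 28.0],
    "Sex": ["Male", "Male", "Male", "Male", "Female"],
})

# Replace one value with None; in a float column pandas stores it as NaN
small.loc[1, "Age"] = None

# Count the missing values per column
print(small.isna().sum())

# Aggregations such as mean() skip NaN by default (skipna=True)
mean_age = small["Age"].mean()

# Option 1: drop rows that have a missing Age
dropped = small.dropna(subset=["Age"])

# Option 2: fill the missing Age with the median of the observed values
filled = small["Age"].fillna(small["Age"].median())
```

Which option is appropriate depends on the analysis; dropping rows is simple but discards data, while filling keeps the row count at the cost of inventing a value.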

Basic visualization

First, let's create a pair plot of the numeric columns, colored by the Target column.

_ = sns.pairplot(df, hue="Target")

[Figure: pair plot of the numeric columns, colored by Target]

# Kernel density estimate of Age, one curve per value of Sex
_ = sns.displot(df, x="Age", hue="Sex", kind="kde")

[Figure: kernel density estimate of Age by Sex]

Inferential statistics

# Split the data by sex so that the ages of the two groups can be compared
female = df[df["Sex"] == "Female"]
male = df[df["Sex"] == "Male"]

T-Test

t, p = stats.ttest_ind(female['Age'], male['Age'])
print("test statistic: {}".format(t))
print("p-value: {}".format(p))
test statistic: -16.092517011911756
p-value: 4.8239930687799265e-58
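
By default, stats.ttest_ind assumes equal variances in both groups and uses the pooled-variance (Student) form of the statistic; passing equal_var=False gives Welch's t-test instead. The sketch below reproduces the default statistic by hand on synthetic data. The arrays a and b are hypothetical samples drawn from normal distributions, not the Age columns used above.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic samples (assumption: not the adult data)
rng = np.random.default_rng(0)
a = rng.normal(loc=37.0, scale=14.0, size=200)
b = rng.normal(loc=39.0, scale=13.0, size=300)

n_a, n_b = len(a), len(b)

# Pooled sample variance of the two groups
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)

# Student's t statistic and two-sided p-value with n_a + n_b - 2 degrees of freedom
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_a + n_b - 2)

# SciPy's default (equal_var=True) should agree with the manual computation
t_scipy, p_scipy = stats.ttest_ind(a, b)

# Welch's variant does not assume equal variances
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)

print(t_manual, t_scipy, t_welch)
```

The sign of the statistic follows the argument order: a negative value, as in the output above, means the first sample (female) has the smaller mean.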

Wilcoxon rank-sum test

z, p = stats.ranksums(female['Age'], male['Age'])
print("test statistic: {}".format(z))
print("p-value: {}".format(p))
test statistic: -18.107256874221704
p-value: 2.79324734147619e-73
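
The statistic returned by stats.ranksums is a normal approximation: the ranks of the first sample are summed and compared with the value expected if both samples came from the same distribution. The sketch below recomputes it by hand on hypothetical synthetic samples x and y (not the Age columns). It assumes continuous data without ties; ranksums does not correct for ties, and the SciPy documentation points to stats.mannwhitneyu for tie handling, which may matter for integer-valued data such as Age.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic samples (assumption: continuous, so ties are unlikely)
rng = np.random.default_rng(1)
x = rng.normal(loc=37.0, scale=14.0, size=150)
y = rng.normal(loc=39.0, scale=13.0, size=250)

n_x, n_y = len(x), len(y)

# Rank all observations together, then sum the ranks of the first sample
ranks = stats.rankdata(np.concatenate([x, y]))
rank_sum_x = ranks[:n_x].sum()

# Mean and standard deviation of the rank sum under the null hypothesis
expected = n_x * (n_x + n_y + 1) / 2
sd = np.sqrt(n_x * n_y * (n_x + n_y + 1) / 12)

# z statistic and two-sided p-value from the standard normal distribution
z_manual = (rank_sum_x - expected) / sd
p_manual = 2 * stats.norm.sf(abs(z_manual))

z_scipy, p_scipy = stats.ranksums(x, y)
print(z_manual, z_scipy)
```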