Note: This is a generated markdown export from the Jupyter notebook file statistical_analysis.ipynb. You can also view the notebook with the nbviewer from Jupyter.

Statistical analysis

In this notebook we use pandas and the stats module from scipy for some basic statistical analysis.

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy import stats

plt.style.use("ggplot")

First we need some data. Let's use pandas to load the 'adult' data set from the UC Irvine Machine Learning Repository into a dataframe.

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", names=["Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Martial Status",
        "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
        "Hours per week", "Country", "Target"])

# Data cleaning: remove leading and trailing spaces from the Sex column
df['Sex'] = df['Sex'].str.strip()


df.head()

	Age	Workclass	fnlwgt	Education	Education-Num	Martial Status	Occupation	Relationship	Race	Sex	Capital Gain	Hours per week	Country	Target
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K
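The cleaning step above strips only the Sex column, but every text field in adult.data follows a comma and a space, so the other string columns carry the same leading space. One possible way to handle this at load time is the `skipinitialspace` option of `pd.read_csv`. The sketch below uses a tiny in-memory sample (hypothetical, three columns only) instead of the full download.

```python
import io

import pandas as pd

# Hypothetical three-column excerpt in the same "comma + space" layout as adult.data
raw = "39, State-gov, Male\n50, Self-emp-not-inc, Male\n28, Private, Female\n"
cols = ["Age", "Workclass", "Sex"]

# Default parsing keeps the space that follows each comma
with_spaces = pd.read_csv(io.StringIO(raw), names=cols)

# skipinitialspace=True drops whitespace directly after the delimiter
clean = pd.read_csv(io.StringIO(raw), names=cols, skipinitialspace=True)

print(repr(with_spaces.loc[0, "Workclass"]))
print(repr(clean.loc[0, "Workclass"]))
```

With this option, a filter such as `df[df.Sex == 'Female']` works without a separate `str.strip()` call.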

Descriptive statistics

Let's have a first look at the shape of our dataframe.

df.shape

(32561, 15)

What are the column names?

df.columns

Index(['Age', 'Workclass', 'fnlwgt', 'Education', 'Education-Num',
       'Martial Status', 'Occupation', 'Relationship', 'Race', 'Sex',
       'Capital Gain', 'Capital Loss', 'Hours per week', 'Country', 'Target'],
      dtype='object')

We can calculate the mean, median, standard error of the mean (sem), variance, standard deviation (std), and quantiles for every numeric column in the dataframe.

df.mean(numeric_only=True)

Age                   38.581647
fnlwgt            189778.366512
Education-Num         10.080679
Capital Gain        1077.648844
Capital Loss          87.303830
Hours per week        40.437456
dtype: float64

df.median(numeric_only=True)

Age                   37.0
fnlwgt            178356.0
Education-Num         10.0
Capital Gain           0.0
Capital Loss           0.0
Hours per week        40.0
dtype: float64

df.sem(numeric_only=True)

Age                 0.075593
fnlwgt            584.937250
Education-Num       0.014258
Capital Gain       40.927838
Capital Loss        2.233126
Hours per week      0.068427
dtype: float64

df.var(numeric_only=True)

Age               1.860614e+02
fnlwgt            1.114080e+10
Education-Num     6.618890e+00
Capital Gain      5.454254e+07
Capital Loss      1.623769e+05
Hours per week    1.524590e+02
dtype: float64

df.std(numeric_only=True)

Age                   13.640433
fnlwgt            105549.977697
Education-Num          2.572720
Capital Gain        7385.292085
Capital Loss         402.960219
Hours per week        12.347429
dtype: float64

df.quantile(q=0.5, numeric_only=True)

Age                   37.0
fnlwgt            178356.0
Education-Num         10.0
Capital Gain           0.0
Capital Loss           0.0
Hours per week        40.0
Name: 0.5, dtype: float64

df.quantile(q=[0.05, 0.95], numeric_only=True)

	Age	fnlwgt	Education-Num	Capital Gain	Capital Loss	Hours per week
0.05	19.0	39460.0	5.0	0.0	0.0	18.0
0.95	63.0	379682.0	14.0	5013.0	0.0	60.0
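The statistics above are related to each other: the variance is the square of the standard deviation, the standard error of the mean is the standard deviation divided by the square root of the sample size, and the median is the 0.5 quantile. For Age, 13.640433 / sqrt(32561) is about 0.0756, which agrees with the `sem` output. The sketch below checks these relationships on a small hypothetical series instead of the adult data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample, used only to illustrate the relationships
ages = pd.Series([39, 50, 38, 53, 28], name="Age")

n = ages.count()   # number of non-missing observations
std = ages.std()   # sample standard deviation (ddof=1 by default)
var = ages.var()   # sample variance (ddof=1 by default)
sem = ages.sem()   # standard error of the mean

print(np.isclose(var, std ** 2))                   # variance is std squared
print(np.isclose(sem, std / np.sqrt(n)))           # sem is std / sqrt(n)
print(ages.median() == ages.quantile(0.5))         # median is the 0.5 quantile
```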

In the next sample we replace a value with None so that we can show how to handle missing values in a dataframe.
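A minimal sketch of that idea, using a small hypothetical dataframe and not the adult data set. The column values and the choice between dropping and filling are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data; Age is stored as float so it can hold NaN
toy = pd.DataFrame({
    "Age": [39.0, 50.0, 38.0, 53.0],
    "Hours per week": [40, 13, 40, 40],
})

# Replace one value with None; pandas stores it as NaN in a float column
toy.loc[1, "Age"] = None

# Count the missing values per column
n_missing = toy.isna().sum()

# Aggregations such as mean() skip NaN by default (skipna=True)
mean_age = toy["Age"].mean()

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: fill the gap, here with the median of the remaining ages
filled = toy.fillna({"Age": toy["Age"].median()})

print(n_missing)
print(mean_age)
print(filled)
```

Dropping rows is the simplest choice but shrinks the sample; filling keeps the row count at the cost of introducing an artificial value.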

Basic visualization

First, let's create a pair plot.

_ = sns.pairplot(df, hue="Target")

_ = sns.displot(df, x="Age", hue="Sex", kind="kde", log_scale=False)

Inferential statistics

female = df[df.Sex == 'Female']
male = df[df.Sex == 'Male']

T-Test

t, p = stats.ttest_ind(female['Age'], male['Age'])
print("test statistic: {}".format(t))
print("p-value: {}".format(p))

test statistic: -16.092517011911756
p-value: 4.8239930687799265e-58
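By default, `stats.ttest_ind` assumes that both groups have the same population variance (`equal_var=True`). If that assumption is in doubt, passing `equal_var=False` runs Welch's t-test instead. The sketch below compares both variants on synthetic data; the group sizes, means, and spreads are hypothetical and are not taken from the adult data set.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different sizes and spreads (hypothetical values)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=37.0, scale=14.0, size=200)
group_b = rng.normal(loc=39.0, scale=10.0, size=400)

# Student's t-test: pooled variance, assumes equal variances
t_student, p_student = stats.ttest_ind(group_a, group_b)

# Welch's t-test: no equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Welch statistic computed by hand: mean difference over the unpooled standard error
se = np.sqrt(group_a.var(ddof=1) / group_a.size + group_b.var(ddof=1) / group_b.size)
t_manual = (group_a.mean() - group_b.mean()) / se

print("Student: t={:.4f}, p={:.4g}".format(t_student, p_student))
print("Welch:   t={:.4f}, p={:.4g}".format(t_welch, p_welch))
```

On the notebook's data the same call would be `stats.ttest_ind(female['Age'], male['Age'], equal_var=False)`.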

Wilcoxon rank-sum test

z, p = stats.ranksums(female['Age'], male['Age'])
print("test statistic: {}".format(z))
print("p-value: {}".format(p))

test statistic: -18.107256874221704
p-value: 2.79324734147619e-73
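The rank-sum test compares the two groups through the ranks of their values, not their means, so it does not rely on normally distributed data. `stats.ranksums` uses a normal approximation and does not correct for ties, and integer-valued data such as Age contains many ties. `stats.mannwhitneyu` is the closely related test that handles them. A sketch on synthetic integer data (hypothetical values, not the adult data set):

```python
import numpy as np
from scipy import stats

# Synthetic integer samples with many ties (hypothetical values)
rng = np.random.default_rng(0)
x = rng.integers(18, 70, size=150)
y = rng.integers(20, 75, size=150)

# Wilcoxon rank-sum test: normal approximation, no tie correction
z, p_ranksum = stats.ranksums(x, y)

# Mann-Whitney U test: same hypothesis, with tie handling
u, p_mwu = stats.mannwhitneyu(x, y, alternative="two-sided")

print("ranksums:     z={:.4f}, p={:.4g}".format(z, p_ranksum))
print("mannwhitneyu: U={:.1f}, p={:.4g}".format(u, p_mwu))
```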
