This is the individual project for UC Davis BAX452 Machine Learning.
The objective of the project is to predict whether a person makes over $50K a year given their demographic attributes. To achieve this, several classification techniques are explored. In the end, the random forest model yields the best prediction result.
- The income dataset is available from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
- The data dictionary is available at: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
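As a minimal sketch of the fetching step, the snippet below parses text in the `adult.data` layout using only the standard library. The column order follows the data dictionary, `predclass` is the name this project uses for the income label, and `read_adult` and the inline `SAMPLE` rows are illustrative assumptions rather than the project's actual loading code. In practice the text would come from the URL above.

```python
import csv
import io

# Column order follows the data dictionary (adult.names); "predclass" is the
# name this project uses for the income label.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "predclass",
]


def read_adult(text):
    """Parse adult.data-style text into a list of dicts, one per record."""
    # Fields in the file are separated by ", ", so skip the leading spaces.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(COLUMNS, (f.strip() for f in fields))))
    return rows


# Illustrative rows in the file's format (assumption: stand-in for the download).
SAMPLE = """39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K

"""

records = read_adult(SAMPLE)
print(len(records), records[0]["workclass"])
```

All values are kept as strings here; numeric conversion is left to the cleaning step.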
Below is the table of contents:
- 1 Introduction
- 2 Fetching Data
- 3 Data Cleaning
- 4 Feature Engineering
  - 4.1 Predclass
  - 4.2 Education
  - 4.3 Marital-status
  - 4.4 Occupation
  - 4.5 Workclass
  - 4.6 Age
  - 4.7 Race
  - 4.8 Hours of Work
  - 4.9 Create a crossing feature: Age + Hours of Work
- 5 EDA
- 6 Building Machine Learning Models
- 7 Reflection
Gini coefficient: a measure of statistical dispersion intended to represent the income or wealth distribution of a nation's residents; it is the most commonly used measure of inequality.
Source: Gini coefficient
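To make the definition concrete, here is a small sketch that computes a Gini coefficient for a list of non-negative values using the standard rank-weighted formula. The function name `gini_coefficient` and the example inputs are assumptions for illustration; this is not necessarily how the project computes it.

```python
def gini_coefficient(values):
    """Gini coefficient of non-negative values (0 means perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    # Rank-weighted form of the mean absolute difference; ranks start at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical incomes: everyone equal, then one person holding everything.
print(gini_coefficient([10, 10, 10, 10]))  # 0.0
print(gini_coefficient([0, 0, 0, 40]))     # 0.75
```

With n values the maximum is (n - 1) / n, which is why the second example gives 0.75 rather than 1.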
This violin plot, drawn with matplotlib, shows how salary varies across occupations while controlling for age.
Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them.
Source: Bivariate Analysis
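One common way to quantify the relationship between two continuous variables is the Pearson correlation coefficient. The sketch below is a plain-Python version; `pearson_r` and the `ages` and `hours` values are hypothetical and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values standing in for the age and hours-per-week columns.
ages = [25, 32, 47, 51, 38]
hours = [40, 45, 50, 38, 40]
print(round(pearson_r(ages, hours), 3))
```

A value near 1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear relationship.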
Principal component analysis (PCA): a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
Source:
Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))
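The extraction condition can be read as a simple record filter. The sketch below expresses it as a Python predicate; the field names come from the condition itself, while `is_clean_record` and the `raw` records are hypothetical values made up for illustration.

```python
def is_clean_record(rec):
    """Mirror of ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))."""
    return (
        rec["AAGE"] > 16
        and rec["AGI"] > 100
        and rec["AFNLWGT"] > 1
        and rec["HRSWK"] > 0
    )


# Hypothetical raw census records (values are invented for the example).
raw = [
    {"AAGE": 39, "AGI": 30000, "AFNLWGT": 77516, "HRSWK": 40},
    {"AAGE": 15, "AGI": 2000, "AFNLWGT": 50000, "HRSWK": 10},  # fails AAGE>16
    {"AAGE": 45, "AGI": 50000, "AFNLWGT": 1200, "HRSWK": 0},   # fails HRSWK>0
]

clean = [rec for rec in raw if is_clean_record(rec)]
print(len(clean))  # 1
```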
Target
- Predclass: >50K, <=50K.
  - Categorical: whether the individual's income is above $50K or at most $50K
Categorical Attributes
- workclass: (categorical) Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  - Individual's work category
- education: (categorical) Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  - Individual's highest level of education
- marital-status: (categorical) Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  - Individual's marital status
- occupation: (categorical) Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  - Individual's occupation
- relationship: (categorical) Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  - Individual's relationship within the family
- race: (categorical) White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  - Individual's race
- sex: (categorical) Female, Male.
- native-country: (categorical) United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  - Individual's native country
Continuous Attributes
- age: continuous.
  - Age of the individual
- education-num: number of years of education, continuous.
  - Number of years of education the individual received
- fnlwgt: final weight, continuous.
  - The weights on the CPS files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
  - Number of hours the individual works per week
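Given the attribute types listed above, a record read as text needs its continuous fields converted to numbers and its target turned into a binary label before modeling. The following is a minimal sketch under stated assumptions: `prepare`, the `label` key, and the `example` record are hypothetical, and the project's own cleaning steps may differ.

```python
# Continuous attributes as listed in the data dictionary.
CONTINUOUS = [
    "age", "education-num", "fnlwgt",
    "capital-gain", "capital-loss", "hours-per-week",
]


def prepare(record):
    """Return a copy with numeric continuous fields and a 0/1 label."""
    out = dict(record)
    for name in CONTINUOUS:
        out[name] = int(record[name])
    # Assumption: 1 means income above $50K, 0 means at most $50K.
    out["label"] = 1 if record["predclass"].strip() == ">50K" else 0
    return out


# Hypothetical record with string values, as it would come from a text file.
example = {
    "age": "39", "education-num": "13", "fnlwgt": "77516",
    "capital-gain": "2174", "capital-loss": "0", "hours-per-week": "40",
    "workclass": "State-gov", "predclass": "<=50K",
}

prepared = prepare(example)
print(prepared["age"], prepared["label"])  # 39 0
```

Categorical attributes such as `workclass` are left untouched here; they would still need an encoding (for example one-hot) before being passed to most classifiers.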