Analysis of H1-B disclosure data from the Office of Foreign Labor Certification. This dataset is interesting because it contains repeated observations of the same variables over 11 years.
- Problem Definition
- Data Extraction
- Data Ingestion
- Data Wrangling/Cleaning
- Feature Engineering
- Exploratory Data Analysis
- Model
- Model Deployment
The objective is use the publicly available H1-B disclosure data to build a model that can predict the visa approval status of a H1-B application. The secondary objective is to interpret the results to find which factors influence the approval status.
Assumption
The case status is skewed on certified value, (i.e. approved ), for the sake of simplicity and ease of development case status is assumed to be a binary field, any case status not certified will be grouped into denied.
The OFLC performance disclosure data can be obtained from this link https://www.foreignlaborcert.doleta.gov/performancedata.cfm. Save the 13 xlsx in a raw or a staging area, couple of columns contain embedded commas its recommened that save the xlsx to csv utf-8 format. Save the converted utf-8 csv's in a seperate folder.
The thirteen source files are in well defined csv's with headers, the most straight forward way is create a staging table in a RDBMS, I choose Postgres as my data sink. The source ddl's can be found here metadata_munging/ddl. Ingesting data in Postgres is very straight forward, we can leverage COPY .. FROM ..
construct.
Data cleaning is the most crucial part of any data science project. This data set is no different and is messy. Couple of challenges with this data set
- Normalize field names across all thirteen years
- Impute missing data
- The key observation prevailing wage or salary can be measured in hourly, weekly, biweekly, monthly or yearly format.
- Derive additional fields as necessary
- Create one core table for all data
Metadata Munging
Original Metadata, viewing the metadata of all 13 files in one layout helps in understanding the schema evolution. Not all fields are consistent and we have new and interesting fields not available for history. This new fields unfortunately be backfilled.
Click to expand!
2008 | 2009_1 | 2009_2 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SUBMITTED_DATE | SUBMITTED_DATE | LCA_CASE_NUMBER | LCA_CASE_NUMBER | LCA_CASE_NUMBER | LCA_CASE_NUMBER | LCA_CASE_NUMBER | LCA_CASE_NUMBER | CASE_NUMBER | CASE_NUMBER | CASE_NUMBER | CASE_NUMBER | CASE_NUMBER |
CASE_NO | CASE_NO | STATUS | STATUS | STATUS | STATUS | STATUS | STATUS | CASE_STATUS | CASE_STATUS | CASE_STATUS | CASE_STATUS | CASE_STATUS |
PROGRAM | PROGRAM_DESIGNATION | LCA_CASE_SUBMIT | LCA_CASE_SUBMIT | LCA_CASE_SUBMIT | LCA_CASE_SUBMIT | LCA_CASE_SUBMIT | LCA_CASE_SUBMIT | CASE_SUBMITTED | CASE_SUBMITTED | CASE_SUBMITTED | CASE_SUBMITTED | CASE_SUBMITTED |
NAME | EMPLOYER_NAME | Decision_Date | DECISION_DATE | DECISION_DATE | DECISION_DATE | Decision_Date | DECISION_DATE | DECISION_DATE | DECISION_DATE | DECISION_DATE | DECISION_DATE | DECISION_DATE |
ADDRESS1 | EMPLOYER_ADDRESS1 | VISA_CLASS | LCA_CASE_EMPLOYMENT_START_DATE | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS |
ADDRESS2 | EMPLOYER_ADDRESS2 | LCA_CASE_EMPLOYMENT_START_DATE | LCA_CASE_EMPLOYMENT_END_DATE | LCA_CASE_EMPLOYMENT_START_DATE | LCA_CASE_EMPLOYMENT_START_DATE | LCA_CASE_EMPLOYMENT_START_DATE | LCA_CASE_EMPLOYMENT_START_DATE | EMPLOYMENT_START_DATE | EMPLOYMENT_START_DATE | EMPLOYMENT_START_DATE | EMPLOYMENT_START_DATE | EMPLOYMENT_START_DATE |
CITY | EMPLOYER_CITY | LCA_CASE_EMPLOYMENT_END_DATE | LCA_CASE_EMPLOYER_NAME | LCA_CASE_EMPLOYMENT_END_DATE | LCA_CASE_EMPLOYMENT_END_DATE | LCA_CASE_EMPLOYMENT_END_DATE | LCA_CASE_EMPLOYMENT_END_DATE | EMPLOYMENT_END_DATE | EMPLOYMENT_END_DATE | EMPLOYMENT_END_DATE | EMPLOYMENT_END_DATE | EMPLOYMENT_END_DATE |
STATE | EMPLOYER_STATE | LCA_CASE_EMPLOYER_NAME | LCA_CASE_EMPLOYER_ADDRESS1 | LCA_CASE_EMPLOYER_NAME | LCA_CASE_EMPLOYER_NAME | LCA_CASE_EMPLOYER_NAME | LCA_CASE_EMPLOYER_NAME | EMPLOYER_NAME | EMPLOYER_NAME | EMPLOYER_NAME | EMPLOYER_NAME | EMPLOYER_NAME |
POSTAL_CODE | EMPLOYER_COUNTY | LCA_CASE_EMPLOYER_ADDRESS | LCA_CASE_EMPLOYER_ADDRESS2 | LCA_CASE_EMPLOYER_ADDRESS | LCA_CASE_EMPLOYER_ADDRESS | LCA_CASE_EMPLOYER_ADDRESS | LCA_CASE_EMPLOYER_ADDRESS | EMPLOYER_ADDRESS1 | EMPLOYER_ADDRESS | EMPLOYER_BUSINESS_DBA | EMPLOYER_BUSINESS_DBA | EMPLOYER_BUSINESS_DBA |
NBR_IMMIGRANTS | EMPLOYER_POSTAL_CODE | LCA_CASE_EMPLOYER_CITY | LCA_CASE_EMPLOYER_CITY | LCA_CASE_EMPLOYER_CITY | LCA_CASE_EMPLOYER_CITY | LCA_CASE_EMPLOYER_CITY | LCA_CASE_EMPLOYER_CITY | EMPLOYER_ADDRESS2 | EMPLOYER_CITY | EMPLOYER_ADDRESS | EMPLOYER_ADDRESS | EMPLOYER_ADDRESS |
BEGIN_DATE | NBR_IMMIGRANTS | LCA_CASE_EMPLOYER_STATE | LCA_CASE_EMPLOYER_STATE | LCA_CASE_EMPLOYER_STATE | LCA_CASE_EMPLOYER_STATE | LCA_CASE_EMPLOYER_STATE | LCA_CASE_EMPLOYER_STATE | EMPLOYER_CITY | EMPLOYER_STATE | EMPLOYER_CITY | EMPLOYER_CITY | EMPLOYER_CITY |
END_DATE | BEGIN_DATE | LCA_CASE_EMPLOYER_POSTAL_CODE | LCA_CASE_EMPLOYER_POSTAL_CODE | LCA_CASE_EMPLOYER_POSTAL_CODE | LCA_CASE_EMPLOYER_POSTAL_CODE | LCA_CASE_EMPLOYER_POSTAL_CODE | LCA_CASE_EMPLOYER_POSTAL_CODE | EMPLOYER_STATE | EMPLOYER_POSTAL_CODE | EMPLOYER_STATE | EMPLOYER_STATE | EMPLOYER_STATE |
JOB_TITLE | END_DATE | LCA_CASE_SOC_CODE | LCA_CASE_SOC_CODE | LCA_CASE_SOC_CODE | LCA_CASE_SOC_CODE | LCA_CASE_SOC_CODE | LCA_CASE_SOC_CODE | EMPLOYER_POSTAL_CODE | EMPLOYER_COUNTRY | EMPLOYER_POSTAL_CODE | EMPLOYER_POSTAL_CODE | EMPLOYER_POSTAL_CODE |
DOL_DECISION_DATE | JOB_TITLE | LCA_CASE_SOC_NAME | LCA_CASE_SOC_NAME | LCA_CASE_SOC_NAME | LCA_CASE_SOC_NAME | LCA_CASE_SOC_NAME | LCA_CASE_SOC_NAME | EMPLOYER_COUNTRY | EMPLOYER_PROVINCE | EMPLOYER_COUNTRY | EMPLOYER_COUNTRY | EMPLOYER_COUNTRY |
CERTIFIED_BEGIN_DATE | DOL_DECISION_DATE | LCA_CASE_JOB_TITLE | LCA_CASE_JOB_TITLE | LCA_CASE_JOB_TITLE | LCA_CASE_JOB_TITLE | LCA_CASE_JOB_TITLE | LCA_CASE_JOB_TITLE | EMPLOYER_PROVINCE | EMPLOYER_PHONE | EMPLOYER_PROVINCE | EMPLOYER_PROVINCE | EMPLOYER_PROVINCE |
CERTIFIED_END_DATE | OCCUPATIONAL_CODE | LCA_CASE_WAGE_RATE_FROM | LCA_CASE_WAGE_RATE_FROM | LCA_CASE_WAGE_RATE_FROM | LCA_CASE_WAGE_RATE_FROM | LCA_CASE_WAGE_RATE_FROM | LCA_CASE_WAGE_RATE_FROM | EMPLOYER_PHONE | EMPLOYER_PHONE_EXT | EMPLOYER_PHONE | EMPLOYER_PHONE | EMPLOYER_PHONE |
JOB_CODE | OCCUPATIONAL_TITLE | LCA_CASE_WAGE_RATE_TO | LCA_CASE_WAGE_RATE_TO | LCA_CASE_WAGE_RATE_TO | LCA_CASE_WAGE_RATE_TO | LCA_CASE_WAGE_RATE_TO | LCA_CASE_WAGE_RATE_TO | EMPLOYER_PHONE_EXT | AGENT_ATTORNEY_NAME | EMPLOYER_PHONE_EXT | EMPLOYER_PHONE_EXT | EMPLOYER_PHONE_EXT |
APPROVAL_STATUS | APPROVAL_STATUS | LCA_CASE_WAGE_RATE_UNIT | TOTAL_WORKERS | LCA_CASE_WAGE_RATE_UNIT | LCA_CASE_WAGE_RATE_UNIT | LCA_CASE_WAGE_RATE_UNIT | LCA_CASE_WAGE_RATE_UNIT | AGENT_ATTORNEY_NAME | AGENT_ATTORNEY_CITY | AGENT_REPRESENTING_EMPLOYER | AGENT_REPRESENTING_EMPLOYER | SECONDARY_ENTITY |
WAGE_RATE_1 | WAGE_RATE_1 | FULL_TIME_POS | WORK_LOCATION_CITY1 | FULL_TIME_POS | FULL_TIME_POS | FULL_TIME_POS | FULL_TIME_POS | AGENT_ATTORNEY_CITY | AGENT_ATTORNEY_STATE | AGENT_ATTORNEY_NAME | AGENT_ATTORNEY_NAME | SECONDARY_ENTITY_BUSINESS_NAME |
RATE_PER_1 | RATE_PER_1 | TOTAL_WORKERS | WORK_LOCATION_STATE1 | TOTAL_WORKERS | TOTAL_WORKERS | TOTAL_WORKERS | TOTAL_WORKERS | AGENT_ATTORNEY_STATE | JOB_TITLE | AGENT_ATTORNEY_CITY | AGENT_ATTORNEY_CITY | AGENT_REPRESENTING_EMPLOYER |
MAX_RATE_1 | MAX_RATE_1 | LCA_CASE_WORKLOC1_CITY | PW_1 | LCA_CASE_WORKLOC1_CITY | LCA_CASE_WORKLOC1_CITY | LCA_CASE_WORKLOC1_CITY | LCA_CASE_WORKLOC1_CITY | JOB_TITLE | SOC_CODE | AGENT_ATTORNEY_STATE | AGENT_ATTORNEY_STATE | AGENT_ATTORNEY_NAME |
PART_TIME_1 | PART_TIME_1 | LCA_CASE_WORKLOC1_STATE | PW_UNIT_1 | LCA_CASE_WORKLOC1_STATE | LCA_CASE_WORKLOC1_STATE | LCA_CASE_WORKLOC1_STATE | LCA_CASE_WORKLOC1_STATE | SOC_CODE | SOC_NAME | JOB_TITLE | JOB_TITLE | AGENT_ATTORNEY_CITY |
CITY_1 | CITY_1 | PW_1 | PW_SOURCE_1 | PW_1 | PW_1 | PW_1 | PW_1 | SOC_NAME | NAIC_CODE | SOC_CODE | SOC_CODE | AGENT_ATTORNEY_STATE |
STATE_1 | STATE_1 | PW_UNIT_1 | OTHER_WAGE_SOURCE_1 | PW_UNIT_1 | PW_UNIT_1 | PW_UNIT_1 | PW_UNIT_1 | NAIC_CODE | TOTAL_WORKERS | SOC_NAME | SOC_NAME | JOB_TITLE |
PREVAILING_WAGE_1 | PREVAILING_WAGE_1 | PW_SOURCE_1 | YR_SOURCE_PUB_1 | PW_SOURCE_1 | PW_SOURCE_1 | PW_SOURCE_1 | PW_SOURCE_1 | TOTAL | FULL_TIME_POSITION | NAICS_CODE | NAICS_CODE | SOC_CODE |
WAGE_SOURCE_1 | WAGE_SOURCE_1 | OTHER_WAGE_SOURCE_1 | WORK_LOCATION_CITY2 | OTHER_WAGE_SOURCE_1 | OTHER_WAGE_SOURCE_1 | OTHER_WAGE_SOURCE_1 | OTHER_WAGE_SOURCE_1 | WORKERS | PREVAILING_WAGE | TOTAL_WORKERS | TOTAL_WORKERS | SOC_NAME |
YR_SOURCE_PUB_1 | YR_SOURCE_PUB_1 | YR_SOURCE_PUB_1 | WORK_LOCATION_STATE2 | YR_SOURCE_PUB_1 | YR_SOURCE_PUB_1 | YR_SOURCE_PUB_1 | YR_SOURCE_PUB_1 | FULL_TIME_POSITION | PW_UNIT_OF_PAY | NEW_EMPLOYMENT | NEW_EMPLOYMENT | NAICS_CODE |
OTHER_WAGE_SOURCE_1 | OTHER_WAGE_SOURCE_1 | LCA_CASE_WORKLOC2_CITY | PW_2 | LCA_CASE_WORKLOC2_CITY | LCA_CASE_WORKLOC2_CITY | LCA_CASE_WORKLOC2_CITY | LCA_CASE_WORKLOC2_CITY | PREVAILING_WAGE | PW_WAGE_SOURCE | CONTINUED_EMPLOYMENT | CONTINUED_EMPLOYMENT | TOTAL_WORKERS |
WAGE_RATE_2 | WAGE_RATE__2 | LCA_CASE_WORKLOC2_STATE | PW_UNIT_2 | LCA_CASE_WORKLOC2_STATE | LCA_CASE_WORKLOC2_STATE | LCA_CASE_WORKLOC2_STATE | LCA_CASE_WORKLOC2_STATE | PW_UNIT_OF_PAY | PW_SOURCE_YEAR | CHANGE_PREVIOUS_EMPLOYMENT | CHANGE_PREVIOUS_EMPLOYMENT | NEW_EMPLOYMENT |
RATE_PER_2 | RATE_PER_2 | PW_2 | PW_SOURCE_2 | PW_2 | PW_2 | PW_2 | PW_2 | PW_WAGE_LEVEL | PW_SOURCE_OTHER | NEW_CONCURRENT_EMPLOYMENT | NEW_CONCURRENT_EMP | CONTINUED_EMPLOYMENT |
MAX_RATE_2 | MAX_RATE_2 | PW_UNIT_2 | OTHER_WAGE_SOURCE_2 | PW_UNIT_2 | PW_UNIT_2 | PW_UNIT_2 | PW_UNIT_2 | PW_WAGE_SOURCE | WAGE_RATE_OF_PAY_FROM | CHANGE_EMPLOYER | CHANGE_EMPLOYER | CHANGE_PREVIOUS_EMPLOYMENT |
PART_TIME_2 | PART_TIME_2 | PW_SOURCE_2 | YR_SOURCE_PUB_2 | PW_SOURCE_2 | PW_SOURCE_2 | PW_SOURCE_2 | PW_SOURCE_2 | PW_WAGE_SOURCE_YEAR | WAGE_RATE_OF_PAY_TO | AMENDED_PETITION | AMENDED_PETITION | NEW_CONCURRENT_EMP |
CITY_2 | CITY_2 | OTHER_WAGE_SOURCE_2 | LCA_CASE_NAICS_CODE | OTHER_WAGE_SOURCE_2 | OTHER_WAGE_SOURCE_2 | OTHER_WAGE_SOURCE_2 | OTHER_WAGE_SOURCE_2 | PW_WAGE_SOURCE_OTHER | WAGE_UNIT_OF_PAY | FULL_TIME_POSITION | FULL_TIME_POSITION | CHANGE_EMPLOYER |
STATE_2 | STATE_2 | YR_SOURCE_PUB_2 | YR_SOURCE_PUB_2 | YR_SOURCE_PUB_2 | YR_SOURCE_PUB_2 | YR_SOURCE_PUB_2 | WAGE_RATE_OF_PAY | H-1B_DEPENDENT | PREVAILING_WAGE | PREVAILING_WAGE | AMENDED_PETITION | |
PREVAILING_WAGE_2 | PREVAILING_WAGE_2 | LCA_CASE_NAICS_CODE | LCA_CASE_NAICS_CODE | LCA_CASE_NAICS_CODE | LCA_CASE_NAICS_CODE | LCA_CASE_NAICS_CODE | WAGE_UNIT_OF_PAY | WILLFUL_VIOLATOR | PW_UNIT_OF_PAY | PW_UNIT_OF_PAY | FULL_TIME_POSITION | |
WAGE_SOURCE_2 | WAGE_SOURCE_2 | H-1B_DEPENDENT | WORKSITE_CITY | PW_WAGE_LEVEL | PW_WAGE_LEVEL | PREVAILING_WAGE | ||||||
YR_SOURCE_PUB_2 | YR_SOURCE_PUB_2 | WILLFUL | WORKSITE_COUNTY | PW_SOURCE | PW_SOURCE | PW_UNIT_OF_PAY | ||||||
OTHER_WAGE_SOURCE_2 | OTHER_WAGE_SOURCE_2 | VIOLATOR | WORKSITE_STATE | PW_SOURCE_YEAR | PW_SOURCE_YEAR | PW_WAGE_LEVEL | ||||||
WITHDRAWN | WITHDRAWN | WORKSITE_CITY | WORKSITE_POSTAL_CODE | PW_SOURCE_OTHER | PW_SOURCE_OTHER | PW_SOURCE | ||||||
OCCUPATIONAL_TITLE | WORKSITE_COUNTY | ORIGINAL_CERT_DATE | WAGE_RATE_OF_PAY_FROM | WAGE_RATE_OF_PAY_FROM | PW_SOURCE_YEAR | |||||||
WORKSITE_STATE | WAGE_RATE_OF_PAY_TO | WAGE_RATE_OF_PAY_TO | PW_SOURCE_OTHER | |||||||||
WORKSITE_POSTAL_CODE | WAGE_UNIT_OF_PAY | WAGE_UNIT_OF_PAY | WAGE_RATE_OF_PAY_FROM | |||||||||
H1B_DEPENDENT | H1B_DEPENDENT | WAGE_RATE_OF_PAY_TO | ||||||||||
WILLFUL_VIOLATOR | WILLFUL_VIOLATOR | WAGE_UNIT_OF_PAY | ||||||||||
SUPPORT_H1B | SUPPORT_H1B | H1B_DEPENDENT | ||||||||||
LABOR_CON_AGREE | LABOR_CON_AGREE | WILLFUL_VIOLATOR | ||||||||||
PUBLIC_DISCLOSURE_LOCATION | PUBLIC_DISCLOSURE_LOCATION | SUPPORT_H1B | ||||||||||
WORKSITE_CITY | WORKSITE_CITY | STATUTORY_BASIS | ||||||||||
WORKSITE_COUNTY | WORKSITE_COUNTY | APPENDIX_ATTACHMENT | ||||||||||
WORKSITE_STATE | WORKSITE_STATE | LABOR_CON_AGREE | ||||||||||
WORKSITE_POSTAL_CODE | WORKSITE_POSTAL_CODE | PUBLIC_DISCLOSURE_LOCATION | ||||||||||
ORIGINAL_CERT_DATE | ORIGINAL_CERT_DATE | WORKSITE_CITY | ||||||||||
WORKSITE_COUNTY | ||||||||||||
WORKSITE_STATE | ||||||||||||
WORKSITE_POSTAL_CODE | ||||||||||||
ORIGINAL_CERT_DATE |
After reviewing the data dictionary for all thirteen files, here is a consolidated view
Click to expand!
2008 | 2009_1 | 2009_2 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CASE_NO | CASE_NO | LCA_CASE_NUMBER | LCA_CASE_NUMBER | LCA_CASE_NUMBER | LCA_CASE_NUMBER | LCA_CASE_NUMBER | LCA_CASE_NUMBER | CASE_NUMBER | CASE_NUMBER | CASE_NUMBER | CASE_NUMBER | CASE_NUMBER |
APPROVAL_STATUS | APPROVAL_STATUS | STATUS | STATUS | STATUS | STATUS | STATUS | STATUS | CASE_STATUS | CASE_STATUS | CASE_STATUS | CASE_STATUS | CASE_STATUS |
SUBMITTED_DATE | SUBMITTED_DATE | LCA_CASE_SUBMIT | LCA_CASE_SUBMIT | LCA_CASE_SUBMIT | LCA_CASE_SUBMIT | LCA_CASE_SUBMIT | LCA_CASE_SUBMIT | CASE_SUBMITTED | CASE_SUBMITTED | CASE_SUBMITTED | CASE_SUBMITTED | CASE_SUBMITTED |
DOL_DECISION_DATE | DOL_DECISION_DATE | Decision_Date | DECISION_DATE | DECISION_DATE | DECISION_DATE | Decision_Date | DECISION_DATE | DECISION_DATE | DECISION_DATE | DECISION_DATE | DECISION_DATE | DECISION_DATE |
PROGRAM | PROGRAM_DESIGNATION | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS | VISA_CLASS | |
BEGIN_DATE | BEGIN_DATE | LCA_CASE_EMPLOYMENT_START_DATE | LCA_CASE_EMPLOYMENT_START_DATE | LCA_CASE_EMPLOYMENT_START_DATE | LCA_CASE_EMPLOYMENT_START_DATE | LCA_CASE_EMPLOYMENT_START_DATE | LCA_CASE_EMPLOYMENT_START_DATE | EMPLOYMENT_START_DATE | EMPLOYMENT_START_DATE | EMPLOYMENT_START_DATE | EMPLOYMENT_START_DATE | EMPLOYMENT_START_DATE |
END_DATE | END_DATE | LCA_CASE_EMPLOYMENT_END_DATE | LCA_CASE_EMPLOYMENT_END_DATE | LCA_CASE_EMPLOYMENT_END_DATE | LCA_CASE_EMPLOYMENT_END_DATE | LCA_CASE_EMPLOYMENT_END_DATE | LCA_CASE_EMPLOYMENT_END_DATE | EMPLOYMENT_END_DATE | EMPLOYMENT_END_DATE | EMPLOYMENT_END_DATE | EMPLOYMENT_END_DATE | EMPLOYMENT_END_DATE |
NAME | EMPLOYER_NAME | LCA_CASE_EMPLOYER_NAME | LCA_CASE_EMPLOYER_NAME | LCA_CASE_EMPLOYER_NAME | LCA_CASE_EMPLOYER_NAME | LCA_CASE_EMPLOYER_NAME | LCA_CASE_EMPLOYER_NAME | EMPLOYER_NAME | EMPLOYER_NAME | EMPLOYER_NAME | EMPLOYER_NAME | EMPLOYER_NAME |
EMPLOYER_ADDRESS1 | EMPLOYER_BUSINESS_DBA | EMPLOYER_BUSINESS_DBA | EMPLOYER_BUSINESS_DBA | |||||||||
ADDRESS1 | EMPLOYER_ADDRESS1 | LCA_CASE_EMPLOYER_ADDRESS | LCA_CASE_EMPLOYER_ADDRESS1 | LCA_CASE_EMPLOYER_ADDRESS | LCA_CASE_EMPLOYER_ADDRESS | LCA_CASE_EMPLOYER_ADDRESS | LCA_CASE_EMPLOYER_ADDRESS | EMPLOYER_ADDRESS2 | EMPLOYER_ADDRESS | EMPLOYER_ADDRESS | EMPLOYER_ADDRESS | EMPLOYER_ADDRESS |
CITY | EMPLOYER_CITY | LCA_CASE_EMPLOYER_CITY | LCA_CASE_EMPLOYER_CITY | LCA_CASE_EMPLOYER_CITY | LCA_CASE_EMPLOYER_CITY | LCA_CASE_EMPLOYER_CITY | LCA_CASE_EMPLOYER_CITY | EMPLOYER_CITY | EMPLOYER_CITY | EMPLOYER_CITY | EMPLOYER_CITY | EMPLOYER_CITY |
STATE | EMPLOYER_STATE | LCA_CASE_EMPLOYER_STATE | LCA_CASE_EMPLOYER_STATE | LCA_CASE_EMPLOYER_STATE | LCA_CASE_EMPLOYER_STATE | LCA_CASE_EMPLOYER_STATE | LCA_CASE_EMPLOYER_STATE | EMPLOYER_STATE | EMPLOYER_STATE | EMPLOYER_STATE | EMPLOYER_STATE | EMPLOYER_STATE |
POSTAL_CODE | EMPLOYER_POSTAL_CODE | LCA_CASE_EMPLOYER_POSTAL_CODE | LCA_CASE_EMPLOYER_POSTAL_CODE | LCA_CASE_EMPLOYER_POSTAL_CODE | LCA_CASE_EMPLOYER_POSTAL_CODE | LCA_CASE_EMPLOYER_POSTAL_CODE | LCA_CASE_EMPLOYER_POSTAL_CODE | EMPLOYER_POSTAL_CODE | EMPLOYER_POSTAL_CODE | EMPLOYER_POSTAL_CODE | EMPLOYER_POSTAL_CODE | EMPLOYER_POSTAL_CODE |
EMPLOYER_COUNTRY | EMPLOYER_COUNTRY | EMPLOYER_COUNTRY | EMPLOYER_COUNTRY | EMPLOYER_COUNTRY | ||||||||
EMPLOYER_PROVINCE | EMPLOYER_PROVINCE | EMPLOYER_PROVINCE | EMPLOYER_PROVINCE | EMPLOYER_PROVINCE | ||||||||
EMPLOYER_PHONE | EMPLOYER_PHONE | EMPLOYER_PHONE | EMPLOYER_PHONE | EMPLOYER_PHONE | ||||||||
EMPLOYER_PHONE_EXT | EMPLOYER_PHONE_EXT | EMPLOYER_PHONE_EXT | EMPLOYER_PHONE_EXT | EMPLOYER_PHONE_EXT | ||||||||
SECONDARY_ENTITY | ||||||||||||
SECONDARY_ENTITY_BUSINESS_NAME | ||||||||||||
AGENT_REPRESENTING_EMPLOYER | AGENT_REPRESENTING_EMPLOYER | AGENT_REPRESENTING_EMPLOYER | ||||||||||
WORK_LOCATION_CITY1 | AGENT_ATTORNEY_NAME | AGENT_ATTORNEY_NAME | AGENT_ATTORNEY_NAME | AGENT_ATTORNEY_NAME | AGENT_ATTORNEY_NAME | |||||||
WORK_LOCATION_STATE1 | AGENT_ATTORNEY_CITY | AGENT_ATTORNEY_CITY | AGENT_ATTORNEY_CITY | AGENT_ATTORNEY_CITY | AGENT_ATTORNEY_CITY | |||||||
AGENT_ATTORNEY_STATE | AGENT_ATTORNEY_STATE | AGENT_ATTORNEY_STATE | AGENT_ATTORNEY_STATE | AGENT_ATTORNEY_STATE | ||||||||
OCCUPATIONAL_TITLE | OCCUPATIONAL_TITLE | LCA_CASE_JOB_TITLE | LCA_CASE_JOB_TITLE | LCA_CASE_JOB_TITLE | LCA_CASE_JOB_TITLE | LCA_CASE_JOB_TITLE | LCA_CASE_JOB_TITLE | JOB_TITLE | JOB_TITLE | JOB_TITLE | JOB_TITLE | JOB_TITLE |
JOB_CODE | OCCUPATIONAL_CODE | LCA_CASE_SOC_CODE | LCA_CASE_SOC_CODE | LCA_CASE_SOC_CODE | LCA_CASE_SOC_CODE | LCA_CASE_SOC_CODE | LCA_CASE_SOC_CODE | SOC_CODE | SOC_CODE | SOC_CODE | SOC_CODE | SOC_CODE |
LCA_CASE_SOC_NAME | LCA_CASE_SOC_NAME | LCA_CASE_SOC_NAME | LCA_CASE_SOC_NAME | LCA_CASE_SOC_NAME | LCA_CASE_SOC_NAME | SOC_NAME | SOC_NAME | SOC_NAME | SOC_NAME | SOC_NAME | ||
LCA_CASE_NAICS_CODE | LCA_CASE_NAICS_CODE | LCA_CASE_NAICS_CODE | LCA_CASE_NAICS_CODE | LCA_CASE_NAICS_CODE | LCA_CASE_NAICS_CODE | NAIC_CODE | NAIC_CODE | NAICS_CODE | NAICS_CODE | NAICS_CODE | ||
TOTAL_WORKERS | TOTAL_WORKERS | TOTAL_WORKERS | TOTAL_WORKERS | TOTAL_WORKERS | TOTAL_WORKERS | TOTAL_WORKERS | TOTAL_WORKERS | TOTAL_WORKERS | TOTAL_WORKERS | TOTAL_WORKERS | ||
NEW_EMPLOYMENT | NEW_EMPLOYMENT | NEW_EMPLOYMENT | ||||||||||
CONTINUED_EMPLOYMENT | CONTINUED_EMPLOYMENT | CONTINUED_EMPLOYMENT | ||||||||||
CHANGE_PREVIOUS_EMPLOYMENT | CHANGE_PREVIOUS_EMPLOYMENT | CHANGE_PREVIOUS_EMPLOYMENT | ||||||||||
NEW_CONCURRENT_EMPLOYMENT | NEW_CONCURRENT_EMP | NEW_CONCURRENT_EMP | ||||||||||
CHANGE_EMPLOYER | CHANGE_EMPLOYER | CHANGE_EMPLOYER | ||||||||||
AMENDED_PETITION | AMENDED_PETITION | AMENDED_PETITION | ||||||||||
FULL_TIME_POS | FULL_TIME_POS | FULL_TIME_POS | FULL_TIME_POS | FULL_TIME_POS | FULL_TIME_POSITION | FULL_TIME_POSITION | FULL_TIME_POSITION | FULL_TIME_POSITION | FULL_TIME_POSITION | |||
PREVAILING_WAGE_1 | PREVAILING_WAGE_1 | PW_1 | PW_1 | PW_1 | PW_1 | PW_1 | PREVAILING_WAGE | PREVAILING_WAGE | PREVAILING_WAGE | PREVAILING_WAGE | PREVAILING_WAGE | |
RATE_PER_1/WAGE_SOURCE_1 | RATE_PER_1/WAGE_SOURCE_1 | PW_UNIT_1 | PW_UNIT_1 | PW_UNIT_1 | PW_UNIT_1 | PW_UNIT_1 | PW_UNIT_OF_PAY | PW_UNIT_OF_PAY | PW_UNIT_OF_PAY | PW_UNIT_OF_PAY | PW_UNIT_OF_PAY | |
PW_WAGE_LEVEL | PW_WAGE_LEVEL | PW_WAGE_LEVEL | PW_WAGE_LEVEL | |||||||||
PW_WAGE_SOURCE | PW_WAGE_SOURCE | PW_SOURCE | PW_SOURCE | PW_SOURCE | ||||||||
PW_WAGE_SOURCE_YEAR | PW_SOURCE_YEAR | PW_SOURCE_YEAR | PW_SOURCE_YEAR | PW_SOURCE_YEAR | ||||||||
PW_WAGE_SOURCE_OTHER | PW_SOURCE_OTHER | PW_SOURCE_OTHER | PW_SOURCE_OTHER | PW_SOURCE_OTHER | ||||||||
LCA_CASE_WAGE_RATE_FROM | LCA_CASE_WAGE_RATE_FROM | LCA_CASE_WAGE_RATE_FROM | LCA_CASE_WAGE_RATE_FROM | LCA_CASE_WAGE_RATE_FROM | LCA_CASE_WAGE_RATE_FROM | WAGE_RATE_OF_PAY_FROM | WAGE_RATE_OF_PAY_FROM | WAGE_RATE_OF_PAY_FROM | WAGE_RATE_OF_PAY_FROM | |||
LCA_CASE_WAGE_RATE_TO | LCA_CASE_WAGE_RATE_TO | LCA_CASE_WAGE_RATE_TO | LCA_CASE_WAGE_RATE_TO | LCA_CASE_WAGE_RATE_TO | LCA_CASE_WAGE_RATE_TO | WAGE_RATE_OF_PAY | WAGE_RATE_OF_PAY_TO | WAGE_RATE_OF_PAY_TO | WAGE_RATE_OF_PAY_TO | WAGE_RATE_OF_PAY_TO | ||
LCA_CASE_WAGE_RATE_UNIT | LCA_CASE_WAGE_RATE_UNIT | LCA_CASE_WAGE_RATE_UNIT | LCA_CASE_WAGE_RATE_UNIT | LCA_CASE_WAGE_RATE_UNIT | WAGE_UNIT_OF_PAY | WAGE_UNIT_OF_PAY | WAGE_UNIT_OF_PAY | WAGE_UNIT_OF_PAY | WAGE_UNIT_OF_PAY | |||
H1B_DEPENDENT | H1B_DEPENDENT | H1B_DEPENDENT | H1B_DEPENDENT | H1B_DEPENDENT | ||||||||
WILLFUL_VIOLATOR | WILLFUL_VIOLATOR | WILLFUL_VIOLATOR | WILLFUL_VIOLATOR | WILLFUL_VIOLATOR | ||||||||
SUPPORT_H1B | SUPPORT_H1B | SUPPORT_H1B | ||||||||||
STATUTORY_BASIS | ||||||||||||
APPENDIX_ATTACHMENT | ||||||||||||
LABOR_CON_AGREE | LABOR_CON_AGREE | LABOR_CON_AGREE | ||||||||||
PUBLIC_DISCLOSURE_LOCATION | PUBLIC_DISCLOSURE_LOCATION | PUBLIC_DISCLOSURE_LOCATION | ||||||||||
LCA_CASE_WORKLOC1_CITY | LCA_CASE_WORKLOC1_CITY | LCA_CASE_WORKLOC1_CITY | LCA_CASE_WORKLOC1_CITY | LCA_CASE_WORKLOC1_CITY | WORKSITE_CITY | WORKSITE_CITY | WORKSITE_CITY | WORKSITE_CITY | WORKSITE_CITY | |||
LCA_CASE_WORKLOC1_STATE | LCA_CASE_WORKLOC1_STATE | LCA_CASE_WORKLOC1_STATE | LCA_CASE_WORKLOC1_STATE | LCA_CASE_WORKLOC1_STATE | WORKSITE_COUNTY | WORKSITE_COUNTY | WORKSITE_COUNTY | WORKSITE_COUNTY | WORKSITE_COUNTY | |||
WORKSITE_STATE | WORKSITE_STATE | WORKSITE_STATE | WORKSITE_STATE | WORKSITE_STATE | ||||||||
WORKSITE_POSTAL_CODE | WORKSITE_POSTAL_CODE | WORKSITE_POSTAL_CODE | WORKSITE_POSTAL_CODE | WORKSITE_POSTAL_CODE | ||||||||
ORIGINAL_CERT_DATE | ORIGINAL_CERT_DATE | ORIGINAL_CERT_DATE | ORIGINAL_CERT_DATE | |||||||||
YYY | YYYY | YYYY | YYYY | YYYY | YYYY | YYYY | YYYY | YYYY | YYYY | YYYY | YYYY | YYYY |
FIELDS NOT REQUIRED | ||||||||||||
WAGE_RATE_2 | WITHDRAWN | PW_SOURCE_1 | PW_SOURCE_1 | LCA_CASE_WORKLOC2_CITY | LCA_CASE_WORKLOC2_CITY | LCA_CASE_WORKLOC2_CITY | PW_2 | |||||
RATE_PER_2 | WAGE_RATE__2 | OTHER_WAGE_SOURCE_1 | OTHER_WAGE_SOURCE_1 | LCA_CASE_WORKLOC2_STATE | LCA_CASE_WORKLOC2_STATE | LCA_CASE_WORKLOC2_STATE | PW_UNIT_2 | |||||
MAX_RATE_2 | RATE_PER_2 | YR_SOURCE_PUB_1 | YR_SOURCE_PUB_1 | PW_2 | PW_2 | PW_2 | PW_SOURCE_2 | |||||
PART_TIME_2 | MAX_RATE_2 | LCA_CASE_WORKLOC2_CITY | LCA_CASE_EMPLOYER_ADDRESS2 | PW_UNIT_2 | PW_UNIT_2 | PW_UNIT_2 | OTHER_WAGE_SOURCE_2 | |||||
CITY_2 | PART_TIME_2 | LCA_CASE_WORKLOC2_STATE | PW_2 | PW_SOURCE_2 | PW_SOURCE_2 | PW_SOURCE_2 | YR_SOURCE_PUB_2 | |||||
STATE_2 | CITY_2 | PW_2 | PW_UNIT_2 | OTHER_WAGE_SOURCE_2 | OTHER_WAGE_SOURCE_2 | OTHER_WAGE_SOURCE_2 | LCA_CASE_WORKLOC2_CITY | |||||
PREVAILING_WAGE_2 | STATE_2 | PW_UNIT_2 | PW_SOURCE_2 | YR_SOURCE_PUB_2 | YR_SOURCE_PUB_2 | YR_SOURCE_PUB_2 | LCA_CASE_WORKLOC2_STATE | |||||
WAGE_SOURCE_2 | PREVAILING_WAGE_2 | PW_SOURCE_2 | OTHER_WAGE_SOURCE_2 | PW_SOURCE_1 | PW_SOURCE_1 | PW_SOURCE_1 | ||||||
YR_SOURCE_PUB_2 | WAGE_SOURCE_2 | OTHER_WAGE_SOURCE_2 | YR_SOURCE_PUB_2 | OTHER_WAGE_SOURCE_1 | OTHER_WAGE_SOURCE_1 | OTHER_WAGE_SOURCE_1 | ||||||
OTHER_WAGE_SOURCE_2 | YR_SOURCE_PUB_2 | YR_SOURCE_PUB_2 | PW_1 | YR_SOURCE_PUB_1 | YR_SOURCE_PUB_1 | YR_SOURCE_PUB_1 | ||||||
MAX_RATE_1 | OTHER_WAGE_SOURCE_2 | JOB_TITLE | PW_UNIT_1 | |||||||||
PART_TIME_1 | MAX_RATE_1 | PW_SOURCE_1 | ||||||||||
CITY_1 | PART_TIME_1 | OTHER_WAGE_SOURCE_1 | ||||||||||
STATE_1 | CITY_1 | YR_SOURCE_PUB_1 | ||||||||||
PREVAILING_WAGE_1 | STATE_1 | WORK_LOCATION_CITY2 | ||||||||||
YR_SOURCE_PUB_1 | PREVAILING_WAGE_1 | WORK_LOCATION_STATE2 |
In short the response variable for the model will be the field which has the following aliases APPROVAL_STATUS, STATUS or CASE_STATUS and the other fields describe the applicants profile
After normalizing all columns, the following columns seem to have consistent history across time and are more relevant to the problem definition. This is the final schema used for modeling
FIELD | DESCRIPTION |
---|---|
CASE_NUMBER | Unique row id |
CASE_STATUS | Status associated with the last significant event or decision. Valid values include “Certified,” “Certified-Withdrawn,” Denied,” and “Withdrawn”. |
CASE_SUBMITTED | Date and time the application was submitted. |
EMPLOYMENT_START_DATE | Beginning date of employment. |
EMPLOYMENT_END_DATE | Ending date of employment. |
EMPLOYER_STATE | State of employer requesting LCA/H1B |
JOB_TITLE | title of job |
SOC_CODE | Occupational code associated with the job being requested for temporary labor condition, as classified by the Standard Occupational Classification (SOC) System. |
SOC_NAME | Occupational name associated with the SOC CODE. |
NAICS_CODE | Industry code associated with the employer requesting permanent labor condition, as classified by the North American Industrial Classification System (NAICS). |
NEW_EMPLOYMENT | Indicates requested worker(s) will begin employment for new employer, |
FULL_TIME_POSITION | Y = Full Time Position; N = Part Time Position. |
PREVAILING_WAGE | Prevailing Wage for the job being requested for temporary labor condition. |
PW_UNIT_OF_PAY | Unit of Pay. Valid values include “Daily (DAI),” “Hourly (HR),” “Bi-weekly (BI),” “Weekly (WK),” “Monthly (MTH),” and “Yearly (YR)”. |
PW_WAGE_LEVEL | Variables include "I", "II", "III", "IV" or "N/A." |
Derived Fields
FIELD | DESCRIPTION |
---|---|
YYYY | year of filing |
EMPLOYMENT_DURATION | Employment duration in days EMPLOYEMENT_END_DATE - EMPLOYMENT_START_DATE, |
EMPLOYMENT_DURATION_BAND | |
NAICS_CODE_2 | First two digits of six digits NAICS code |
ANNUAL_SALARY | Normalize prevailing wage to annual salary |
ANNUAL_SALARY_BAND | Split the annual salary into six bands |
US_REGION | Regions of United States |
CORE TABLE FOR MODELING
materialized view : h1b_yearly_performance_model
FIELD | DESCRIPTION |
---|---|
EMPLOYMENT_DURATION_YEARS | Categorical Feature |
US_REGION | Categorical Feature |
FULL_TIME_POSITION | True/False Feature |
NAICS_CODE_2 | Categorical Feature |
ANNUAL_SALARY_BAND | Categorical Feature |
CASE_STATUS | Response Variable |
Sample rows from h1b_yearly_performance_model
Click to expand!
EMPLOYMENT_DURATION_YEARS | US_REGION | ANNUAL_SALARY_BAND | FULL_TIME_POSITION | NAICS_CODE_2 | YYYY | CASE_STATUS |
---|---|---|---|---|---|---|
3 | Midwest | b3 | 1 | 53 | 2019 | 1 |
3 | South | b2 | 1 | 54 | 2019 | 1 |
3 | South | b2 | 1 | 54 | 2019 | 1 |
2 | Northeast | b2 | 1 | 52 | 2019 | 1 |
3 | West | b3 | 1 | 23 | 2019 | 1 |
2 | Northeast | b4 | 1 | 52 | 2019 | 1 |
3 | South | b2 | 1 | 54 | 2019 | 1 |
3 | South | b3 | 1 | 54 | 2019 | 1 |
3 | South | b3 | 1 | 54 | 2019 | 1 |
3 | South | b2 | 1 | 54 | 2019 | 1 |
Final Data Flow
The following transformations were applied to the dataset
Binarize Case Status
when lower(CASE_STATUS) = 'certified' then 1
when lower(CASE_STATUS) = 'certified-withdrawn' then 1
when lower(CASE_STATUS) = 'denied' then 0
when lower(CASE_STATUS) = 'invalidated' then 0
when lower(CASE_STATUS) = 'withdrawn' then 0
when lower(CASE_STATUS) = 'pending quality and compliance review - unassigned' then 0
when lower(CASE_STATUS) = 'rejected' then 0
Normalize Annual Salary
Using unit of pay, converted all prevailing wage information into annual salary
when PREVAILING_WAGE in ('', 'N/A', '.', '20-70') then '0'
when pw_unit_of_pay is null and (cast(REPLACE(REPLACE(PREVAILING_WAGE, '$',''), ',', '') as float) = 0) then 0
when pw_unit_of_pay is null and PREVAILING_WAGE is null then 0
when pw_unit_of_pay is null and PREVAILING_WAGE = '0.00' then 0
when pw_unit_of_pay is null and PREVAILING_WAGE = '0' then 0
when lower(pw_unit_of_pay) = 'year' then cast(REPLACE(REPLACE(PREVAILING_WAGE, '$',''), ',', '') as float)
when lower(pw_unit_of_pay) = 'yr' then cast(REPLACE(REPLACE(PREVAILING_WAGE, '$',''), ',', '') as float)
when lower(pw_unit_of_pay) = 'month' then 12*cast(REPLACE(REPLACE(PREVAILING_WAGE, '$',''), ',', '') as float)
when lower(pw_unit_of_pay) = 'mth' then 12*cast(REPLACE(REPLACE(PREVAILING_WAGE, '$',''), ',', '') as float)
when lower(pw_unit_of_pay) = 'bi-weekly' then 26*cast(REPLACE(REPLACE(PREVAILING_WAGE, '$',''), ',', '') as float)
when lower(pw_unit_of_pay) = 'bi-weekly' and PREVAILING_WAGE = '0' then 0
when lower(pw_unit_of_pay) = 'bi-weekly' and PREVAILING_WAGE = '0.00' then 0
when lower(pw_unit_of_pay) = 'bi' then 26*cast(REPLACE(REPLACE(PREVAILING_WAGE, '$',''), ',', '') as float)
when lower(pw_unit_of_pay) = 'weekly' then 52*cast(REPLACE(REPLACE(PREVAILING_WAGE, '$',''), ',', '') as float)
when lower(pw_unit_of_pay) = 'week' then 52*cast(REPLACE(REPLACE(PREVAILING_WAGE, '$',''), ',', '') as float)
when lower(pw_unit_of_pay) = 'wk' then 52*cast(REPLACE(REPLACE(PREVAILING_WAGE, '$',''), ',', '') as float)
when lower(pw_unit_of_pay) = 'hour' then 2080*cast(REPLACE(REPLACE(PREVAILING_WAGE, '$',''), ',', '') as float)
when lower(pw_unit_of_pay) = 'hr' then 2080*cast(REPLACE(REPLACE(PREVAILING_WAGE, '$',''), ',', '') as float)
Annual Salary Band Range Definition
when annual_salary between 0 and 50000 then 'b1'
when annual_salary between 50001 and 100000 then 'b2'
when annual_salary between 100001 and 150000 then 'b3'
when annual_salary between 150001 and 200000 then 'b4'
when annual_salary between 200000 and 250000 then 'b5'
when annual_salary > 250000 then 'b6'
Employment Duration in years
EMPLOYMENT_END_YEAR - EMPLOYMENT_START_YEAR as EMPLOYMENT_DURATION_YEARS,
cast(split_part(EMPLOYMENT_END_DATE, '/', 3) as integer) as EMPLOYMENT_END_YEAR,
cast(split_part(EMPLOYMENT_START_DATE, '/', 3) as integer) as EMPLOYMENT_START_YEAR,
Binarize Full time position field
when lower(FULL_TIME_POSITION) = 'y' then 1
when lower(FULL_TIME_POSITION) = 'n' then 0
One hot encoding for categorical fields
Generated dummies using one hot encoding
pd.get_dummies(df['us_region'], drop_first=True)
pd.get_dummies(df['annual_salary_band'], drop_first=True)
Used Random Forest Classifier with 100 estimators to fit the training set, model generated an accuracy of 93%. Exported the model using externals lib from scikit learn.
Firefly lightweight function as a service library written in Python is used. We can use POST request to the end point and obtain a prediction.
firefly h1b_model_preds.predict --bind 127.0.0.1:5000