Maria Guideng April 12, 2018
About
Gender is an R package that predicts the gender of a name based on historical data. I'll explore this package as well as apply it to salary data provided by NPRI on Nevada public employees.
The input and output files are located in the data
folder. In its original form, the input file all-nevada-2016.csv is obtained from a separate repository and has 12 columns. The output file all-nevada-2016-with-gender.csv is the same, but has an additional column for employee gender. My little contribution is the R script that adds this feature through the use of this package.
#install.packages("gender")
library(gender)
ls("package:gender")
## [1] "check_genderdata_package" "gender"
## [3] "gender_df" "install_genderdata_package"
help(package = "gender")
After referencing the package's documentation and vignettes, I see that the basic function is gender()
. One of its cool features is that it accounts for the fact that genders associated with names can change over time. Also, loading this package prompts you to install the genderdata
package, which contains several datasets that permits the user to make predictions based on different time periods and geographical regions. As such, gender()
allows you to choose a method to pass a set of names, places and birth years through the function.
Taken directly from the documentation, the basic usage is:
gender(names, years = c(1932, 2012), method = c("ssa", "ipums", "napp","kantrowitz", "genderize",
"demo"),countries = c("United States", "Canada","United Kingdom", "Germany", "Iceland", "Norway",
"Sweden"))
...where method
provides six options:
- ssa: United States from 1930 to 2012. Drawn from Social Security Administration data. Option to select by state.
- ipums: United States from 1789 to 1930. Drawn from Census data.
- napp: Canada, the United Kingdom, Germany, Iceland, Norway, and Sweden for years between 1758 and 1910.
- kantrowitz: 7,579 unique names compiled by Mark Kantrowitz in 1991.
- genderize: based on "user profiles across major social networks."
- demo: Uses the top 100 names in the ssa method; provided for demonstration purposes (and not suitable for research purposes) when the
genderdata
package is not installed.
...and countries
is the countries for which datasets are being used. For the:
- "ssa" and "ipums" methods, the only valid option is United States which will be assumed if no argument is specified;
- "napp" method, you may specify a character vector with: Canada, United Kingdom, Germany, Iceland, Norway, and Sweden; and
- "kantrowitz" and "genderize" methods, no country should be specified.
If no method or years are specified, gender()
will default to the ssa method for years 1932 - 2012. We'll keep things simple and go this route since it's the most suitable method for our salary data.
stringr
and dplyr
will help to manipulate the names and summarize tabular data. RCurl
will bring in the input file.
#install.packages(c("stringr", "dplyr", "RCurl"))
library(stringr)
library(dplyr)
library(RCurl)
Import the input file all-nevada.2016.csv from this repository. This will take a few minutes.
urlfile <- 'https://raw.githubusercontent.com/mguideng/Gender-Pkg-NV-Salary/master/data/all-nevada-2016.csv'
salary <- read.csv(urlfile)
salary$X <- NULL
Let's check it out!
str(salary)
## 'data.frame': 146662 obs. of 12 variables:
## $ Employee.Name : Factor w/ 144296 levels "1 Undercover",..: 58744 122022 116480 32508 75772 55431 16607 64548 85294 140455 ...
## $ Job.Title : Factor w/ 12995 levels ".INTERN I",".INTERN II",..: 2554 4528 4527 2530 5578 5570 4523 8888 2536 5570 ...
## $ Base.Pay : num 130742 123418 99301 126718 129679 ...
## $ Overtime.Pay : num 0 0 0 0 0 ...
## $ Other.Pay : num 31072 34365 47302 19326 15122 ...
## $ Benefits : num 54022 40143 50572 50350 51263 ...
## $ Total.Pay : num 161814 157783 146603 146044 144801 ...
## $ Total.PayBenefits: num 215836 197927 197175 196394 196064 ...
## $ Year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ Notes : Factor w/ 3 levels "","Other Pay includes one-time retroactive payment.",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Agency : Factor w/ 105 levels "Boulder City",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Status : Factor w/ 2 levels "FT","PT": 1 1 1 1 1 1 1 1 1 1 ...
head(salary,5)
## Employee.Name Job.Title Base.Pay Overtime.Pay
## 1 Jay D Fraser CITY MANAGER 130742.05 0
## 2 Scott P Hansen DIRECTOR/PUBLIC WORKS 123417.90 0
## 3 Roger C Hall DIRECTOR/PARKS AND RECREATION 99300.74 0
## 4 David R Olsen CITY ATTORNEY 126717.80 0
## 5 Kevin D Nicholson FIRE CHIEF 129678.90 0
## Other.Pay Benefits Total.Pay Total.PayBenefits Year Notes Agency
## 1 31071.55 54022.18 161813.6 215835.8 2016 <NA> Boulder City
## 2 34365.30 40143.36 157783.2 197926.6 2016 <NA> Boulder City
## 3 47302.25 50571.52 146603.0 197174.5 2016 <NA> Boulder City
## 4 19325.80 50349.90 146043.6 196393.5 2016 <NA> Boulder City
## 5 15122.30 51263.28 144801.2 196064.5 2016 <NA> Boulder City
## Status
## 1 FT
## 2 FT
## 3 FT
## 4 FT
## 5 FT
There's a total of 146,662 employees. We're only interested in using the Employee.Name
variable here. Since the names are combined, we'll need to parse out the vector of character strings.
Before passing the data through the gender function, it has to be prepped. The historical data of names within the package doesn't have any non-character symbols (e.g., digits, periods, hyphens, whitespace). It also uses lower title cases. Then we'll isolate the first names. The expected naming order convention is that the first name will be followed by the the middle and last names ("maria guideng"). It's also common to report the first name following a comma ("guideng, maria"). Our data should mimic these features. So those are three tasks so far.
# Task 1. Remove non-characters. The order these are run is important.
salary$Employee.Name <- str_replace_all(salary$Employee.Name, "\\d |\\d", "") # Remove digits
salary$Employee.Name <- gsub("\\.", " ", salary$Employee.Name) # Replace '.' with a space
salary$Employee.Name <- gsub("-", " ", salary$Employee.Name) # Replace '-' with a space
salary$Employee.Name <- gsub(" ", " ", salary$Employee.Name) # Remove triple spaces
salary$Employee.Name <- gsub(" ", " ", salary$Employee.Name) # Remove double spaces
# Task 2. Isolate first names. If a comma is present ("guideng, maria"), take first element of the string as `First.Name`.
# Otherwise ("maria guideng"), the first element of the string will be taken.
salary$First.Name <- sub(".*\\, ", "", salary$Employee.Name) # If comma, take element following ','
salary$First.Name <- sub(" .*", "", salary$First.Name) # Otherwise, just take first element.
# Task 3. Lower case the first names.
salary$First.Name <- str_to_lower(salary$First.Name)
Alas, let's run the gender function on the newly-created First.Name
column.
salary_gender <- gender(salary$First.Name)
It inferred the gender of 136,306 of the 146,662 total employees. That's pretty good! Let's take a peek.
head(salary_gender,5)
## # A tibble: 5 x 6
## name proportion_male proportion_female gender year_min year_max
## <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 aalya 0. 1.00 female 1932. 2012.
## 2 aamir 1.00 0. male 1932. 2012.
## 3 aarin 0.564 0.436 male 1932. 2012.
## 4 aarin 0.564 0.436 male 1932. 2012.
## 5 aarin 0.564 0.436 male 1932. 2012.
The gender function took a character vector of names and a range of years to predict the gender. Here, "aalya" has been proportionately all female, while "aamir" has undoubtedly been a male name. It's not so clear for the name "aarin" however, which has historically been nearly half and half: 56.4% being males and 43.6% females. However, it favors a male gender and is assigned as such.
We will join this to the salary dataframe, so let's clean it up. Keep just the name
and gender
variables and rename them so they match the salary header names. Removing the duplicate names will also help make joining easier.
# Join to the salary dataframe
salary_gender <- salary_gender[c("name", "gender")]
colnames(salary_gender) <- c("First.Name", "Gender")
salary_gender <- subset(salary_gender, !duplicated(salary_gender$First.Name))
salary <- left_join(salary, salary_gender)
## Joining, by = "First.Name"
What's the distribution of males and females employees?
salary %>%
group_by(Gender) %>%
summarize(employees=n()) %>%
mutate(percent=round((employees/sum(employees)*100))) %>%
arrange(desc(percent))
## # A tibble: 3 x 3
## Gender employees percent
## <chr> <int> <dbl>
## 1 female 77743 53.
## 2 male 58563 40.
## 3 <NA> 10356 7.
Based on our work so far, women dominate the workforce when it comes to public service in Nevada. It's estimated to be at 53%, but it would actually be more since it doesn't capture any from the 7% whose gender is undetermined. Let's set them to "undetm" in our data.
salary$Gender[is.na(salary$Gender)] <- "undetm"
Let's take a closer look and see what's going with the undetermined gender names by checking for patterns in the names and agencies. Can we do a better job?
undetm <- salary %>%
filter(Gender=="undetm") %>%
group_by(First.Name, Employee.Name, Agency) %>%
summarize (n = n()) %>%
arrange(-n)
print(undetm)
## # A tibble: 10,324 x 4
## # Groups: First.Name, Employee.Name [10,292]
## First.Name Employee.Name Agency n
## <chr> <chr> <fct> <int>
## 1 redacted Redacted Undercover Redacted Undercover Henderson 16
## 2 undercover Undercover Sparks 8
## 3 escamilla ESCAMILLA MARTIN University Me~ 2
## 4 gonzalez GONZALEZ ANTHONY Clark County 2
## 5 gonzalez GONZALEZ TOMAS Clark County 2
## 6 grana GRANA JOHN Clark County 2
## 7 morales MORALES JOSE Clark County 2
## 8 noi PAPPAS, NOI Clark County 2
## 9 orozco OROZCO ELIZABETH University Me~ 2
## 10 sandoval SANDOVAL SANDRA University Me~ 2
## # ... with 10,314 more rows
Which agencies have employees with the most undetermined genders?
undetm %>%
group_by(Agency) %>%
summarize (n = n()) %>%
arrange(-n)
## # A tibble: 81 x 2
## Agency n
## <fct> <int>
## 1 Clark County 2906
## 2 University Medical Center 2371
## 3 Clark County School District 1837
## 4 State of Nevada 649
## 5 Washoe County School District 430
## 6 University of Nevada, Las Vegas 303
## 7 University of Nevada, Reno 293
## 8 Elko County 213
## 9 Las Vegas Metro Police Department 191
## 10 Las Vegas 134
## # ... with 71 more rows
After some poking around, here are the issues identified:
-
There are some names that are redacted or classified as being undercover. In these cases, keep them as is. Also, as expected, the package fails to make predictions for quite a bit of ethnic names, particularly for those of Asian descent. We could look for a solution later, but let's move on for now.
-
Another issue is that University Medical Center reports employees using a different naming convention where last name is reported first without a comma preface ("guideng maria"). Clark County also does at times, but much less often than not. And while Clark County has the highest number of ambiguous names by count, it's also one of the largest agencies in that state with nearly 14,000 employees.
-
First names sometimes appear as suffixes ("sr", "iii", "iv") such as for these guys:
- "sr timothy a maleport sr"
- "iii carl william hart"
- "iv joseph m schum"
-
Also, there's numerous instances where the first and last elements of
Employee.Name
are the same. For example:- "abarca daniel gonzalez abarca"
- "kirgan marsha matsunaga kirgan"
- "suarez juan r calvillo suarez"
-
Lastly, single-initial letters are being picked up as first names and that defies inferring a gender. Like for these folks:
- "a elaine renta"
- "m shane leavitt"
- "p d ingram"
Fortunately, the wrangling to fix issues 2 - 5 won't be so bad. It would be reasonable to adopt the second string of characters in Employee.Name
as the first name in these cases, don't you think? It won't resolve them all, like for Mr.? or Ms.? P D Ingram, but it will help.
Ok, let's do it!
# First, some basic string split operations for later pattern matching functions
salary$stringcount <- str_count(salary$Employee.Name," ")+1 # Count of all elements in string
salary$string1 <- sapply(strsplit(salary$Employee.Name, " "), function(x) x[1]) # 1st elem
salary$string1len <- nchar(salary$string1) # Length of 1st elem
salary$string2 <- sapply(strsplit(salary$Employee.Name, " "), function(x) x[2]) # 2nd elem
salary$stringN <- sapply(strsplit(salary$Employee.Name, " "), tail, 1) # Last elem
# Fix issue 2. UMC reports last names first
salary$First.Name <- ifelse(salary$Agency %in% c("University Medical Center"), salary$string2, salary$First.Name)
# Fix issue 3. Suffix
salary$First.Name <- ifelse(salary$First.Name %in% c("jr","sr", "ii", "iii", "iv"),
salary$string2, salary$First.Name)
# Fix issue 4. First string = Last string
salary$First.Name <- ifelse(salary$string1 == salary$stringN & salary$stringcount>1,
salary$string2, salary$First.Name)
# Fix issue 5. Initials. Identified as those with `First.Name` having a length of 1
# (basically, a single letter) AND a count of 3+ elements in `Employee.Name`
salary$First.Name <- ifelse(salary$string1len == 1 & salary$stringcount>=3,
salary$string2, salary$First.Name)
Ready to rerun that gender function on this updated dataframe.
salary_gender <- gender(salary$First.Name)
# Again, just keep and rename the `name` and `gender` variables, and remove the dupes
salary_gender <- salary_gender[c("name", "gender")]
colnames(salary_gender) <- c("First.Name", "Gender")
salary_gender <- subset(salary_gender, !duplicated(salary_gender$First.Name))
# Now we can re-join it to the salary dataframe, but first remove `Gender`
# from the previous run.
salary$Gender <- NULL
salary <- left_join(salary, salary_gender)
## Joining, by = "First.Name"
salary$Gender[is.na(salary$Gender)] <- "undetm"
Any better from that unknown 7%?
table(salary$Gender)
##
## female male undetm
## 80935 57904 7823
prop.table(table(salary$Gender))
##
## female male undetm
## 0.55184710 0.39481256 0.05334033
Yes, that's a bit better! Applying these fixes matched gender for another ~2,500 employees to 138,839 total employees. It's now down to 5% overall. How about on the agency level?
undetm <- salary %>%
filter(Gender=="undetm") %>%
group_by(First.Name, Employee.Name, Agency) %>%
summarize (n = n()) %>%
arrange(-n)
undetm %>%
group_by(Agency) %>%
summarize (n = n()) %>%
arrange(-n)
## # A tibble: 81 x 2
## Agency n
## <fct> <int>
## 1 Clark County 2907
## 2 Clark County School District 1784
## 3 State of Nevada 647
## 4 University Medical Center 444
## 5 University of Nevada, Las Vegas 228
## 6 Elko County 213
## 7 University of Nevada, Reno 198
## 8 Las Vegas Metro Police Department 188
## 9 Washoe County School District 177
## 10 Las Vegas 133
## # ... with 71 more rows
There's improvements pretty much across the board for all agencies, especially with UMC.
Ok, so our last step is to revert back to the original structure of 12 columns, plus another column for Gender
.
names(salary)
## [1] "Employee.Name" "Job.Title" "Base.Pay"
## [4] "Overtime.Pay" "Other.Pay" "Benefits"
## [7] "Total.Pay" "Total.PayBenefits" "Year"
## [10] "Notes" "Agency" "Status"
## [13] "First.Name" "stringcount" "string1"
## [16] "string1len" "string2" "stringN"
## [19] "Gender"
salary <- salary[, -c(13:18)] # delete columns 13 through 18
Success. It's now ready for export.
write.csv(salary, "all-Nevada-2016-with-gender.csv")
Output dimensions
File | Total Rows | Total Columns | Columns |
---|---|---|---|
all-Nevada-2016-with-gender.csv | 146,662 | 13 | Employee.Name, Job.Title, Base.Pay, Overtime.Pay, Other.Pay, Benefits, Total.Pay, Total.PayBenefits, Year, Notes, Agency, Status, Gender |
Exploration ideas
- How many women work in government versus men and for which agencies?
- How is pay different between men and women by job titles?
Project purposes
- Adding a gender column using Gender, an R package.
- Being new to R, suggestions for improvement are appreciated.