Predicting Income and Exploring Gender Wage Gap: Machine Learning using Census Data

Author Generated Image

A gender wage gap refers to men and women earning different salaries despite having the same educational levels, years of experience and qualifications. The gap is often even wider for women of color. Understanding these gaps and how they occur in a population could help in designing government policies and benefiting communities.

The goal is to predict whether a person makes more than $50K annually given the features in the dataset. I used a KNN classifier for this predictive model, though Naïve Bayes, Logistic Regression and Decision Trees could be used as well.

About the Dataset
This project uses the Census Bureau Adult dataset from 1994, compiled by Barry Becker and hosted by the Machine Learning Group at UCI. It has 32,561 observations and 15 columns, as follows: Age, Work, Fnlwgt, Education, EducYears, MaritalStatus, Occupation, Relationship, Race, Sex, CapitalGain, CapitalLoss, HoursPerWeek, NativeCountry, Income.
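The original analysis was done in R. As a rough illustration only, here is a Python sketch of how a file in this format could be parsed with the standard library. It assumes the raw file is comma-separated with no header row, and that the income label reads `<=50K` or `>50K`. The `read_adult` helper, the `high_income` flag and the sample rows are assumptions made for this example, not part of the original project.

```python
import csv
import io

# Column names as listed in the article; the raw file is assumed to have no header row.
COLUMNS = [
    "Age", "Work", "Fnlwgt", "Education", "EducYears", "MaritalStatus",
    "Occupation", "Relationship", "Race", "Sex", "CapitalGain",
    "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income",
]
NUMERIC = {"Age", "Fnlwgt", "EducYears", "CapitalGain", "CapitalLoss", "HoursPerWeek"}


def read_adult(handle):
    """Parse adult-style CSV lines into a list of dicts with typed values."""
    rows = []
    for raw in csv.reader(handle, skipinitialspace=True):
        if len(raw) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, (value.strip() for value in raw)))
        for name in NUMERIC:
            row[name] = int(row[name])
        # Binary target: True when the income label is above $50K.
        row["high_income"] = row["Income"].startswith(">50K")
        rows.append(row)
    return rows


# Illustrative rows in the file's format (made up, not real records).
SAMPLE = """\
28, Private, 100000, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
45, Private, 120000, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 45, United-States, >50K
"""

rows = read_adult(io.StringIO(SAMPLE))
print(len(rows), rows[1]["Sex"], rows[1]["high_income"])
```

To read a real file, the same function could be called with an open file handle in place of the `io.StringIO` sample.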

The attributes of interest are:

  • high_income: target class.
  • Age: continuous.
  • Education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • Education-num: continuous.
  • Sex: Female, Male.

Additional features:

  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • fnlwgt: continuous.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Data Exploration
Approximately 25% of the income observations in the dataset are for women and 75% are for men.
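A split like this can be checked with a simple count. The Python sketch below is an illustration, not the original R code: it assumes each observation is a dict with a `Sex` key and a boolean `high_income` key (both names are assumptions for this example) and uses toy rows rather than census records.

```python
from collections import Counter


def share_by_sex(rows):
    """Return each sex's share of all observations."""
    counts = Counter(row["Sex"] for row in rows)
    total = sum(counts.values())
    return {sex: n / total for sex, n in counts.items()}


def high_income_rate_by_sex(rows):
    """Return, for each sex, the fraction of rows flagged high_income."""
    totals, highs = Counter(), Counter()
    for row in rows:
        totals[row["Sex"]] += 1
        highs[row["Sex"]] += int(row["high_income"])
    return {sex: highs[sex] / totals[sex] for sex in totals}


# Toy rows for illustration only (not taken from the census data).
toy = [
    {"Sex": "Male", "high_income": True},
    {"Sex": "Male", "high_income": False},
    {"Sex": "Male", "high_income": False},
    {"Sex": "Female", "high_income": True},
    {"Sex": "Female", "high_income": False},
]
print(share_by_sex(toy))
print(high_income_rate_by_sex(toy))
```

Looking at the high-income rate within each sex, and not only the overall split, helps separate "fewer women in the sample" from "fewer women above $50K".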

Predictive Model Development
For this project, I used RStudio and R libraries. It is important to understand how the model was developed and why it predicted a certain result.

  • Split the data into train and test sets to check that the model works on unseen data.
  • Analyze which features are significant. This exercise was scoped to the education years, sex and age features as predictors of income. These features were scaled because their standard deviations differ, which would otherwise let the wider-ranging feature dominate the distance calculation in KNN.
  • One-hot encoding was used for the Sex feature so that the character variable could be treated as numeric.
  • A k-nearest neighbor (KNN) classification model with k = 25 was applied.
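The steps above can be sketched in a few lines of Python. This is not the original R code, only a minimal standard-library illustration under several assumptions: an 80/20 split, z-score scaling fitted on the training rows, Sex encoded as two 0/1 columns, Euclidean distance, and a majority vote with k = 25. All function names and the synthetic data are made up for this example.

```python
import math
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split off a held-out test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_scaler(rows, names):
    """Learn the mean and standard deviation of each feature from training rows."""
    stats = {}
    for name in names:
        values = [row[name] for row in rows]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats[name] = (mean, math.sqrt(var) or 1.0)  # avoid dividing by 0
    return stats


def to_vector(row, stats):
    """Z-score the numeric features; one-hot encode Sex as two 0/1 columns."""
    scaled = [(row[n] - mean) / std for n, (mean, std) in stats.items()]
    one_hot = [float(row["Sex"] == "Female"), float(row["Sex"] == "Male")]
    return scaled + one_hot


def knn_predict(train_x, train_y, query, k=25):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Synthetic rows for illustration only (not census records): here income is
# "high" exactly when Age + 2 * EducYears exceeds 60.
rng = random.Random(1)
data = []
for _ in range(400):
    age, educ = rng.randint(18, 70), rng.randint(6, 16)
    data.append({"Age": age, "EducYears": educ,
                 "Sex": rng.choice(["Female", "Male"]),
                 "high_income": age + 2 * educ > 60})

train, test = train_test_split(data)
stats = fit_scaler(train, ["Age", "EducYears"])
train_x = [to_vector(r, stats) for r in train]
train_y = [r["high_income"] for r in train]
hits = sum(knn_predict(train_x, train_y, to_vector(r, stats)) == r["high_income"]
           for r in test)
accuracy = hits / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

The scaler is fitted on the training rows only and then reused for the test rows, so no information from the held-out data leaks into the preprocessing.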

The model achieved 85% accuracy with 10-fold cross-validation.
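For readers unfamiliar with k-fold cross-validation, the idea is to split the data into k folds, hold each fold out in turn, and average the accuracy over the held-out folds. The Python sketch below illustrates this with the standard library. It is not the original R code; the function names are hypothetical, and a one-nearest-neighbor rule on toy data stands in for the real classifier.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def cross_val_accuracy(xs, ys, predict, k=10, seed=0):
    """Average accuracy over k folds.

    `predict(train_x, train_y, query)` is any function returning a label,
    for example a KNN classifier.
    """
    scores = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        correct = sum(predict(train_x, train_y, xs[i]) == ys[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)


def nearest_neighbor(train_x, train_y, query):
    """Stand-in 1-nearest-neighbor classifier on one numeric feature."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[best]


# Toy data for illustration only: the label is True when the value is 50 or more.
xs = list(range(100))
ys = [x >= 50 for x in xs]
cv_accuracy = cross_val_accuracy(xs, ys, nearest_neighbor)
print(f"10-fold accuracy: {cv_accuracy:.2f}")
```

In a real pipeline, any scaling would be refitted on the training part of each fold so that the held-out fold stays unseen.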

Prediction Results

Our hypothesis that income increases with education holds true for both men and women in general.

However, the hypothesis that a combined increase in education and age means higher income only holds true for men. Men with ≤ 16 years of education also have higher incomes if they are older, as shown in the top right corner of the graph, validating the assumption. We can see that with ≥ 12 years of education, income is higher for men even in the 60+ age range, though some could face age-based discrimination.

For women, however, income does not increase with age even with 16+ years of education. Unlike for men, we do not see any women below 30 years of age who make ≥ $50K.

In summary, women need both job experience and a high level of education to earn a good income. They are only predicted to have higher income if they are between 30 and 60 years old and also hold a higher educational degree.
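Age and education patterns like these are read off a plot of model predictions. One way to build that kind of view is to evaluate a classifier on a grid of age and education values for each sex. The Python sketch below shows the idea. The `placeholder_rule` function is a hypothetical stand-in for a fitted classifier and does not reproduce the model or the results described here.

```python
def prediction_grid(predict, ages, educ_years, sexes=("Female", "Male")):
    """Tabulate predict(age, educ, sex) for every combination."""
    return {
        sex: {(age, educ): predict(age, educ, sex)
              for age in ages for educ in educ_years}
        for sex in sexes
    }


def youngest_high_income_age(grid, sex):
    """Smallest age at which any education level is predicted high income."""
    ages = [age for (age, _), high in grid[sex].items() if high]
    return min(ages) if ages else None


def placeholder_rule(age, educ, sex):
    """Hypothetical stand-in for a fitted classifier, not the article's model."""
    return educ >= 13 and age >= 35


grid = prediction_grid(placeholder_rule, range(20, 71, 5), range(8, 17, 2))
print(youngest_high_income_age(grid, "Female"))
```

With a real classifier in place of the placeholder, comparing the two sexes' grids cell by cell would show where the predictions diverge.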

Further Studies
It would be interesting to see how the gender wage gap has changed since 1994 using the latest Census data. Additional exploration can be done in the following areas:

  • How does the wage gap show up for both men and women of color?
  • Are people of certain races experiencing a larger wage gap?
  • Does marital status have any bearing on income?
  • Do native country and native language matter?
  • Which occupations have the largest gender wage gap?
  • What other factors could income depend on, such as educational background, occupation, geography, number of working hours per week, capital gain/loss, etc.?




