Employee Compensation Estimator


Employee compensation covers all the ways an organization rewards team members for their work. The most obvious form is pay, whether salaried, hourly, or sales-based. It is important that an employee's financial compensation is fair, balancing the demands of the job role against the organization's budget.

The compensation paid to an employee depends on factors such as the employee's organization group, department, job, and base salary.

Problem Statement

Imagine you are working as a data scientist at a large organization with thousands of employees. The HR department plans to provide additional compensation to each working employee, calculated from each employee's profile and the benefits they currently receive.

The HR department asks for your help: can you use your data science and machine learning skills to estimate a 'Total Compensation' figure for each employee?

Objective

To build a machine learning model to estimate the total compensation to be provided to an employee.

Code and Resources Used

Python Version: 3.7
Packages: pandas, numpy, sklearn, matplotlib, seaborn, xgboost, lightgbm
Data Source: City and County of San Francisco
Data Link: https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd

Data Dictionary

OGC: Organization Group Code - Org Group is a group of Departments. For example, the Public Protection Org Group includes departments such as the Police, Fire, Adult Probation, District Attorney, and Sheriff.

OG: Organization Group names

DC: Department Code - Departments are the primary organizational unit used by the City and County of San Francisco. Examples include Recreation and Parks, Public Works, and the Police Department.

Dept: Department name

UC: Union Code - Unions represent employees in collective bargaining agreements. A job belongs to one union, although some jobs are unrepresented (usually temporarily).

Union: Union names

JF: Job Family - Job Family combines similar Jobs into meaningful groups.

Job: Job name

EI: Employee Identifier

Salaries: Salary of the employee

Overtime: Amounts paid to City employees working in excess of 40 hours per week.

H/D: Health/Dental - City-paid premiums to health and dental insurance plans covering City employees. To protect confidentiality as legally required, pro-rated citywide averages are presented in lieu of employee-specific health and dental benefits.

YT: Year Type - Fiscal (July through June) or Calendar (January through December)

Total_Compensation: The final compensation, i.e., the sum of all salaries and benefits paid to City employees.

EDA

I looked at the distributions of the data and the value counts for the various categorical variables. Below are a few highlights.

Dataset Shape: (287836, 15)

| # | Name | dtypes | Missing | Uniques |
|---|------|--------|---------|---------|
| 0 | Year | int64 | 0 | 4 |
| 1 | OGC | int64 | 0 | 7 |
| 2 | OG | object | 0 | 7 |
| 3 | DC | object | 0 | 54 |
| 4 | Dept | object | 0 | 54 |
| 5 | UC | int64 | 0 | 789 |
| 6 | Union | object | 36 | 73 |
| 7 | JF | object | 38 | 55 |
| 8 | Job | object | 0 | 1136 |
| 9 | EI | int64 | 0 | 52403 |
| 10 | Salaries | int64 | 0 | 104444 |
| 11 | Overtime | int64 | 0 | 33632 |
| 12 | H/D | float64 | 0 | 113669 |
| 13 | YT | object | 0 | 2 |
| 14 | Total_Compensation | int64 | 0 | 155965 |
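A summary like the one above can be produced with a small pandas helper. This is a minimal sketch; `summarize` is a hypothetical helper name, and the tiny stand-in frame below substitutes for the real dataset.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing-value count, and unique-value count."""
    return pd.DataFrame({
        "Name": df.columns,
        "dtypes": df.dtypes.astype(str).values,
        "Missing": df.isna().sum().values,   # NaN counts per column
        "Uniques": df.nunique().values,      # distinct non-null values per column
    })

# Tiny stand-in for the real Employee Compensation data:
df = pd.DataFrame({"Year": [2016, 2017, 2017], "OG": ["A", "B", None]})
print(summarize(df))
```

On the full dataset, the same call over the 15 columns yields the table shown above.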

(Figures: feature correlation heatmap and compensation by Job Family.)

Data Pre-Processing

  1. Dropped the columns Union, Dept & OG since the columns UC, DC & OGC already encode their values.
  2. Removed the ‘Job’ column because of its very large number of distinct values; the ‘Job Family’ column already groups these jobs into meaningful categories, so keeping both would be redundant.
  3. Given the large number of categories in the JF, UC & DC columns, only the 10 most frequent values in each column were kept, and rows with the remaining categories were dropped. One-hot encoding was then applied to these top 10 values. This sharply reduced dimensionality while retaining the most informative categories.
  4. Implemented the Standard Scaler on the numerical columns.
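Steps 3 and 4 can be sketched as follows. This is one reading of the preprocessing, not the author's exact code: `encode_top_k` is a hypothetical helper, and the small frame stands in for the real data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def encode_top_k(df: pd.DataFrame, col: str, k: int = 10) -> pd.DataFrame:
    """Keep only rows whose value in `col` is among the k most frequent
    categories, then one-hot encode that column."""
    top = df[col].value_counts().nlargest(k).index
    df = df[df[col].isin(top)].copy()
    return pd.get_dummies(df, columns=[col], prefix=col)

# Stand-in data: Job Family plus one numeric column.
df = pd.DataFrame({
    "JF": ["Nursing", "Clerical", "Nursing", "Clerical", "Police"],
    "Salaries": [90000, 52000, 88000, 50000, 101000],
})
df = encode_top_k(df, "JF", k=2)              # drops the rare 'Police' row
df[["Salaries"]] = StandardScaler().fit_transform(df[["Salaries"]])
```

The same `encode_top_k` call would be repeated for the UC and DC columns, and the scaler applied to all numeric features.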

Model Building

Split the data into train and test sets with a test size of 20%.

I tried four different models and evaluated them using Root Mean Squared Error (RMSE). I chose RMSE because it expresses the model’s prediction error in the same units as the target variable and penalizes large errors, making it sensitive to outliers.

Models:

  • Linear Regression
  • Decision Tree Regressor
  • Random Forest Regressor
  • XGBoost Regressor
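The split-and-compare loop might look like the sketch below, assuming a preprocessed feature matrix `X` and target `y` (synthetic stand-ins here). The XGBoost model is omitted to keep the sketch dependency-free; an `xgboost.XGBRegressor()` would slot into the same dictionary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the preprocessed features and Total_Compensation target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)

# 80/20 train/test split, as in the post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: RMSE = {rmse:.2f}")
```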

Model performance

The XGBoost Regressor outperformed the other models on the held-out test set.

| # | Model | RMSE |
|---|-------|------|
| 0 | XGBoost Regressor | 6370.08 |
| 1 | Random Forest Regressor | 6508.09 |
| 2 | Linear Regression | 7406.97 |
| 3 | Decision Tree Regressor | 8411.68 |

This post is licensed under CC BY 4.0 by the author.