Medical Insurance Cost Estimator

Health insurance is medical coverage that helps you meet your medical expenses by offering financial assistance. Because hospitalization is expensive, it is important to have a health insurance plan in place. In the current pandemic situation, health insurance plays a vital role in safeguarding your finances.

Problem Statement & Objective

Imagine you are working as a data scientist at an insurance company. Your manager has asked you to come up with a data science solution to estimate the medical cost of an individual who has bought health insurance from the company. The objective is to build a machine learning model that estimates an individual's medical cost.

Code and Resources Used

Python Version: 3.7
Packages: pandas, numpy, sklearn, matplotlib, seaborn, xgboost, lightgbm
Data Source: Zach Stednick
Data Link: https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv

Data Dictionary

age: age of the primary beneficiary. (If the age is given as a decimal, treat it as the nearest integer. For example, if age = 19.1, its nearest integer is 19; if age = 22.6, its nearest integer is 23.)
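The rounding rule for age can be sketched in pandas as follows. This is a minimal illustration that uses the two example values from the description above, and the variable names are made up for the sketch.

```python
import pandas as pd

# Hypothetical sample: the two fractional ages used as examples in the data dictionary.
ages = pd.Series([19.1, 22.6], name="age")

# Round to the nearest integer, then store the result with an integer dtype.
ages_rounded = ages.round().astype(int)

print(ages_rounded.tolist())  # [19, 23]
```

Note that pandas rounds exact halves (for example 20.5) to the nearest even integer, so a different rule would be needed if halves must always round up.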

sex: gender of the insurance contractor (female or male)

bmi: body mass index, an objective measure of whether body weight is relatively high or low for a given height, computed as weight divided by height squared (kg / m ^ 2); ideally 18.5 to 24.9
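As a quick check of the kg / m ^ 2 definition, here is a minimal sketch. The function name `bmi` and the sample weight and height are assumptions made for illustration; the dataset itself already ships the bmi column.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres, squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg and 1.75 m tall.
value = bmi(70.0, 1.75)
print(round(value, 2))  # 22.86, inside the 18.5 to 24.9 range
```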

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes

region: the beneficiary’s residential area in the US (northeast, southeast, southwest, or northwest)

charges: Individual medical costs billed by health insurance

EDA

I looked at the distributions of the data and the value counts for the various categorical variables. Below are a few highlights.

Dataset Shape: (3630, 7)

   Name      dtypes   Missing  Uniques
0  age       float64  0        1589
1  sex       object   0        2
2  bmi       float64  0        2322
3  smoker    object   0        2
4  region    object   0        4
5  children  int64    0        6
6  charges   float64  0        2951
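A summary like the one above can be produced with a few pandas calls. The sketch below is one possible way to do it, not necessarily the code behind this post: the helper name `summarize` is hypothetical, and the three-row frame is made-up data that only reuses the dataset's column names.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: name, dtype, missing count, unique count."""
    return pd.DataFrame({
        "Name": df.columns,
        "dtypes": df.dtypes.astype(str).values,
        "Missing": df.isnull().sum().values,
        "Uniques": df.nunique().values,
    })

# Tiny made-up frame with the same column names as the insurance data.
sample = pd.DataFrame({
    "age": [19.0, 22.6, 45.3],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "children": [0, 1, 3],
    "charges": [16884.92, 1725.55, 4449.46],
})

print(f"Dataset Shape: {sample.shape}")
summary = summarize(sample)
print(summary)
```

On the real data, `sample` would be replaced by the frame loaded from insurance.csv.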

[Figures: box plot of charges by sex; box plot of charges by smoker]

Model Building

I split the data into training and test sets with a test size of 20%.
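An 80/20 split of this kind can be sketched with scikit-learn's `train_test_split`. The data below is synthetic and the `random_state` value is an assumption, since the post does not state a seed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the insurance data (100 made-up rows).
rng = np.random.default_rng(0)
n_rows = 100
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n_rows),
    "bmi": rng.normal(30.0, 6.0, size=n_rows),
    "children": rng.integers(0, 6, size=n_rows),
    "charges": rng.uniform(1000.0, 50000.0, size=n_rows),
})

X = df.drop(columns="charges")  # features
y = df["charges"]               # target

# test_size=0.2 holds out 20% of the rows; random_state is an assumed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```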

I tried five different models and evaluated them using Root Mean Squared Error (RMSE). I chose RMSE because it measures prediction error in the same units as the target variable and because it is sensitive to outliers: squaring the errors penalizes large misses more heavily than small ones.
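To make the outlier sensitivity concrete, here is a small sketch with made-up numbers. The helper names `rmse` and `mae` are defined here for illustration only. Both prediction sets have the same mean absolute error, but the one with a single large miss has a noticeably higher RMSE.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred) -> float:
    """Mean absolute error, shown only for comparison."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up charges and two hypothetical sets of predictions.
actual = [1000.0, 2000.0, 3000.0]
even_errors = [1100.0, 2100.0, 3100.0]   # every prediction is off by 100
one_big_miss = [1000.0, 2000.0, 3300.0]  # a single prediction is off by 300

print(mae(actual, even_errors), rmse(actual, even_errors))
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))
```

The first line prints an MAE and RMSE of 100 each; the second keeps the MAE at 100 while the RMSE rises to roughly 173.2.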

Models:

  • Linear Regression
  • Decision Tree Regressor
  • Random Forest Regressor
  • XGBoost Regressor
  • LGBM Regressor
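A comparison across these models could be organized roughly as below. This is a sketch under stated assumptions rather than the code used for this post: the data is synthetic, all hyperparameters are library defaults except the assumed seeds and `n_estimators=50`, and the XGBoost and LightGBM models are only added when those packages are installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared insurance features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# The gradient boosting libraries are optional dependencies in this sketch.
try:
    from xgboost import XGBRegressor
    models["XGBoost Regressor"] = XGBRegressor(random_state=42)
except ImportError:
    pass
try:
    from lightgbm import LGBMRegressor
    models["LGBM Regressor"] = LGBMRegressor(random_state=42)
except ImportError:
    pass

# Fit each model on the training set and record its RMSE on the test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

# Print the models from lowest to highest RMSE.
for name, score in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.3f}")
```

Because the synthetic target is linear, the ranking printed here says nothing about how the models compare on the insurance data.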

Model performance

The Random Forest Regressor outperformed the other approaches on both the test and validation sets.

   Model                    RMSE
0  Random Forest Regressor  3331.72
1  XGBoost Regressor        3597.40
2  Decision Tree Regressor  4453.63
3  Linear Regression        5668.19
This post is licensed under CC BY 4.0 by the author.