Medical Insurance Cost Estimator

Posted Jul 22, 2021 Updated Nov 30, 2023

By Aryan Jain

Health insurance is medical coverage that helps you meet your medical expenses by offering financial assistance. Because hospitalization is expensive, it is important to have a health insurance plan in place. During the current pandemic, health insurance plays a vital role in safeguarding your finances.

Problem Statement & Objective

Imagine you are working as a data scientist at an insurance company. Your manager has asked you to come up with a data science solution that estimates the medical cost of an individual who has bought health insurance from the company. The objective is to build a machine learning model that estimates this cost.

Code and Resources Used

Python Version: 3.7
Packages: pandas, numpy, sklearn, matplotlib, seaborn, xgboost, lightgbm
Data Source: Zach Stednick
Data Link: https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv

Data Dictionary

age: age of the primary beneficiary. (If the age is given as a decimal, treat it as the nearest integer. For example, if age = 19.1, its nearest integer is 19; if age = 22.6, its nearest integer is 23.)
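The rounding rule above can be sketched as follows. This is a minimal illustration on made-up values, not the preprocessing code from the project; the variable names are hypothetical.

```python
import pandas as pd

# Made-up ages for illustration, including the two examples from the data dictionary.
ages = pd.Series([19.1, 22.6, 45.0])

# Round to the nearest integer, then store as int.
# Note: pandas rounds exact halves (e.g. 20.5) to the nearest even integer.
rounded_ages = ages.round().astype(int)

print(rounded_ages.tolist())
```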

sex: insurance contractor gender, female, male

bmi: body mass index, an objective index of body weight that shows whether weight is relatively high or low for a given height. It is computed as weight divided by height squared (kg / m ^ 2); the ideal range is 18.5 to 24.9
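For reference, the BMI formula can be written as a small function. The dataset already ships with a bmi column, so this sketch is only an illustration; the function name and the sample weight and height are assumptions.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI in kg / m^2: weight divided by height squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
example_bmi = body_mass_index(70, 1.75)

# 18.5 to 24.9 is the ideal range given in the data dictionary.
in_ideal_range = 18.5 <= example_bmi <= 24.9
print(round(example_bmi, 1), in_ideal_range)
```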

children: Number of children covered by health insurance / Number of dependents

smoker: smoking status of the beneficiary

region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance

EDA

I looked at the distributions of the data and the value counts for the various categorical variables. Below are a few highlights.

Dataset Shape: (3630, 7)

	Name	dtypes	Uniques
0	age	float64	1589
1	sex	object	2
2	bmi	float64	2322
3	smoker	object	2
4	region	object	4
5	children	int64	6
6	charges	float64	2951
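A summary like the one above (name, dtype and number of unique values per column) could be produced with a small helper. This is a sketch on a toy DataFrame, not the original notebook code; `summarize_columns` and the toy values are hypothetical.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and its count of distinct values."""
    return pd.DataFrame({
        "Name": list(df.columns),
        "dtypes": df.dtypes.astype(str).to_numpy(),
        "Uniques": df.nunique().to_numpy(),
    })


# Toy stand-in for the insurance data (made-up values).
toy = pd.DataFrame({
    "age": [19.0, 33.5, 33.5],
    "sex": ["female", "male", "male"],
    "children": [0, 1, 3],
})

print("Dataset Shape:", toy.shape)
summary = summarize_columns(toy)
print(summary)
```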

Model Building

I split the data into train and test sets with a test size of 20%.
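A 20% hold-out split might look like the sketch below, using scikit-learn's `train_test_split`. The ten-row DataFrame is made up, and `random_state=42` is an assumption added only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows standing in for the real data.
df = pd.DataFrame({
    "age": [19, 23, 31, 45, 52, 60, 27, 38, 41, 56],
    "bmi": [27.9, 33.8, 25.7, 30.1, 22.4, 36.0, 29.3, 31.2, 24.8, 28.5],
    "children": [0, 1, 0, 2, 3, 0, 1, 2, 0, 1],
    "charges": [1700.0, 1800.0, 3900.0, 8200.0, 11000.0,
                13500.0, 2900.0, 6100.0, 6800.0, 12000.0],
})

X = df.drop(columns="charges")  # features
y = df["charges"]               # target: billed medical cost

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```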

I tried five different models and evaluated them using Root Mean Squared Error (RMSE). I chose RMSE because it measures prediction error in the same units as the target variable and, because errors are squared before averaging, it is sensitive to outliers.
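As a reminder of what the metric does, here is a minimal RMSE implementation next to mean absolute error (MAE) on made-up numbers. MAE is not part of the project; it is included here only as a contrast to show how squaring lets one large error pull RMSE upward.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error: sqrt of the mean of squared errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred) -> float:
    """Mean absolute error, shown only for comparison."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up charges and predictions; the third prediction is off by much more.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)

# The single larger error makes RMSE exceed MAE.
print(rmse_value, mae_value)
```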

Models:

Linear Regression
Decision Tree Regressor
Random Forest Regressor
XGBoost Regressor
LGBM Regressor
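The comparison of these models could be sketched as below. Several things here are assumptions rather than the post's actual code: the data is synthetic (so the numbers will not match the results reported in this post), categorical columns are one-hot encoded with `pd.get_dummies` because the post does not say how they were encoded, all models use default hyperparameters, and the XGBoost and LightGBM models are added only if those packages are installed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with the same column names (NOT the real dataset).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 6, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(["northeast", "southeast", "southwest", "northwest"], n),
})
df["charges"] = (250 * df["age"] + 300 * df["bmi"]
                 + 20000 * (df["smoker"] == "yes") + rng.normal(0, 2000, n))

# Assumed preprocessing: one-hot encode the categorical columns.
X = pd.get_dummies(df.drop(columns="charges"), drop_first=True, dtype=float)
y = df["charges"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
}
try:  # optional: only if xgboost is installed
    from xgboost import XGBRegressor
    models["XGBoost Regressor"] = XGBRegressor(random_state=0)
except ImportError:
    pass
try:  # optional: only if lightgbm is installed
    from lightgbm import LGBMRegressor
    models["LGBM Regressor"] = LGBMRegressor(random_state=0, verbose=-1)
except ImportError:
    pass

# Fit each model on the training set and score it on the held-out test set.
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rows.append({"Model": name,
                 "RMSE": mean_squared_error(y_test, predictions) ** 0.5})

results = pd.DataFrame(rows).sort_values("RMSE").reset_index(drop=True)
print(results)
```

Sorting by RMSE puts the best-scoring model in the first row, which matches the layout of a ranked results table.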

Model performance

The Random Forest Regressor outperformed the other approaches on both the test and validation sets.

	Model	RMSE
0	Random Forest Regressor	3331.72
1	XGBoost Regressor	3597.40
2	Decision Tree Regressor	4453.63
3	Linear Regression	5668.19

This post is licensed under CC BY 4.0 by the author.