How does the difference in individuals’ characteristics influence the insurance claims?

By

Jedsada Thavornfung

Eric Barragan

Michael Cheng

Mark Miller

University of Texas at Austin

PSY 371M Introduction to Machine Learning

February 2023 - May 2023

Advisor: Chen Yu, Ph.D.


Program:

  • Excel - Organize data

  • Jupyter Notebook - Data visualization

Introduction

What influences how much a person receives in an insurance claim? In this study, we investigated trends in insurance claims by clustering patients into different groups. Clustering makes it possible to identify the trend of claims within each group and to compare the groups with one another. KMeans was chosen as the clustering tool. KMeans clustering is one of the most commonly used clustering techniques in data science and machine learning. It is an unsupervised learning method that divides data points into k distinct clusters based on similarities in their features or characteristics. KMeans clustering is particularly useful for identifying patterns and trends within large and complex datasets, which makes it a common tool in many fields, including healthcare, finance, marketing, and the social sciences.


In this project, KMeans clustering was used to group patients based on certain characteristics, such as age, gender, medical history, and types of insurance policies held. By analyzing the trends of insurance claims within each cluster, researchers can gain insights into factors that contribute to the likelihood and severity of claims. The hypothesis is that there are distinct groups of patients with low, middle, and high-risk insurance claims. By applying KMeans clustering to the data, researchers can identify these groups and analyze the trends and characteristics of each group to better understand the factors that contribute to their risk levels.


To implement KMeans clustering in the analysis, researchers used a combination of Excel and Python. In Excel, researchers preprocessed and cleaned the data and performed some basic exploratory analysis. In Python, researchers used libraries such as Pandas and NumPy, among others, to apply the KMeans clustering algorithm, create visualizations, and perform further analysis.


Overall, by using these methods, insurance companies can improve their risk assessment and pricing models, making it easier to provide affordable and comprehensive coverage to their policyholders. Additionally, by identifying high-risk clusters and offering targeted education and preventative measures, insurance companies can help reduce the number and severity of claims, leading to a more efficient and effective healthcare system.

Data and Data Preprocessing

Before performing KMeans clustering, the data was cleaned and organized. This research uses the Insurance dataset from data.world. The data includes patient ID, age, gender, BMI, blood pressure, diabetic history, number of children, smoking status, region, and insurance claim. An example of the data is shown in Figure 1. Once the data was collected, researchers cleaned and organized it by deleting any patient with missing information. Then, the data was sorted by insurance claim (lowest to highest) and then by smoking status (alphabetical order).
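The cleaning and sorting steps described above could be reproduced in Pandas roughly as follows. This is only a sketch: the rows are made up for illustration, and the column names (age, bmi, smoker, claim) are assumptions rather than the dataset's confirmed headers.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumptions.
raw = pd.DataFrame({
    "age": [39, 24, None, 58, 31],
    "bmi": [23.2, 30.1, 33.3, 27.5, 36.3],
    "smoker": ["No", "Yes", "No", "Yes", "No"],
    "claim": [1121.87, 34303.17, 1135.94, 18955.22, 1131.51],
})

# Delete any patient (row) with missing information.
clean = raw.dropna()

# Sort by claim (lowest to highest), then by smoking status (alphabetical).
clean = clean.sort_values(["claim", "smoker"]).reset_index(drop=True)
print(clean)
```

Calling clean.describe() on the result would produce descriptive statistics comparable in kind to those summarized in Figure 2.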


Figure 1. Overall data of the Insurance dataset where the green highlighted columns were the selected columns to undergo KMeans clustering.


As illustrated in Figure 2, there were 1340 patients in total in the dataset. Patients (male = 678 and female = 662) range from 18 to 60 years old, with a mean of 38 years. BMI ranges from 16 to 53.1 kg/m², with an average of 30.67 kg/m². Blood pressure ranges from 80 to 140 mmHg, with an average of 94.16 mmHg. There are 642 patients with diabetes and 698 patients without diabetes. On average, patients in this dataset have 1 child; however, the number ranges from 0 to 5 children depending on the individual. There are 1066 patients who are not smokers, and the other 274 patients identify as smokers. In addition, patients come from four different regions: Southeast (n = 443), Northwest (n = 352), Southwest (n = 314), and Northeast (n = 231). Lastly, the insurance claims range from $1,121.87 to $63,770.43, with an average of $13,252.75.


Figure 2. Descriptive Statistics of the numerical aspect of the data

Exploratory Data Analysis With Visualization

At the outset, the columns that were already expressed in numerical form (age, BMI, blood pressure, number of children, and insurance claim) were chosen to undergo KMeans clustering because they were already suitable for analysis. The other columns (gender, diabetic status, smoking status, and region) were investigated to see how strongly they related to the insurance claims. If any of those elements showed a high correlation or a clear trend with respect to the insurance claim, it would be transformed to numerical values and used in KMeans clustering. For instance, if gender showed a high correlation with the insurance claim, male and female would be changed to 0 and 1 and added to the existing numerical columns before clustering.



Figure 3. Exploratory visualizations of the medical claim amount by smoking status. (A) Average medical claim amount by region and smoking status. (B) Relationship between BMI and medical claim amount by smoking status. (C) Relationship between diabetic status and medical claim amount by smoking status. (D) Relationship between blood pressure and medical claim amount by smoking status. (E) Relationship between age and medical claim amount by smoking status.


Based on Figure 3, the domains plotted against the claim amount did not, on their own, show a clear impact on it. However, Figures 3A, 3B, 3D, and 3E show that smoking status had an obvious impact on the insurance claims. As illustrated in Figure 4, the smoking status of patients has the highest correlation with the insurance claims they receive, with a correlation coefficient of 0.79. Thus, smoking status was the only categorical column included alongside the numerical variables for KMeans clustering. It was converted from a "No/Yes" variable to 0 and 1, where 0 means not smoking and 1 means smoking.


Figure 4. Overview of all correlation between each domain, particularly focusing on the importance of claim.
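The encoding of smoking status and the correlation check described above might look like the following sketch. The values are made up, and the column names smoker, smoker_num, and claim are assumptions for illustration; the result will not match the 0.79 reported for the real dataset.

```python
import pandas as pd

# Made-up values for illustration; column names are assumptions.
df = pd.DataFrame({
    "smoker": ["No", "Yes", "No", "Yes", "No", "No"],
    "claim": [2000.0, 30000.0, 4000.0, 42000.0, 3500.0, 6000.0],
})

# Convert the "No/Yes" variable to 0 and 1 (0 = not smoking, 1 = smoking).
df["smoker_num"] = df["smoker"].map({"No": 0, "Yes": 1})

# Pearson correlation between the encoded smoking status and the claim.
r = df["smoker_num"].corr(df["claim"])
print(round(r, 2))
```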

Method and Results

In this research, KMeans clustering was performed in both Excel and Python to investigate whether they generate similar results. Both methods were chosen because researchers wanted to combine the lesson from class (KMeans clustering with Excel) with a new challenge (KMeans clustering with Python). To simplify the research, researchers decided to generate 3 clusters, which represent low-risk, middle-risk, and high-risk insurance claims.

Excel Method

The analysis in Excel produced different results from the Python method. The Excel template executed KMeans clustering with a formula that calculates the Euclidean distance between each data point and each cluster center, so that every point can be assigned to its closest cluster. The Euclidean distance formula is:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

p, q = two points in Euclidean n-space

q_i, p_i = the i-th coordinates of q and p, viewed as Euclidean vectors starting from the origin of the space (initial point)

n = number of dimensions of the space (n-space)
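The Euclidean distance described above, and the way it is used to pick the closest cluster, can be sketched in a few lines of NumPy. The point and the three cluster centers below are made-up normalized values, and the function name euclidean is a hypothetical helper, not part of the Excel template.

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((q - p) ** 2)))

# Made-up normalized data point and three made-up cluster centers.
point = [0.2, 0.5]
centroids = [[0.0, 0.0], [0.2, 0.9], [1.0, 1.0]]

# Distance from the point to every center; the smallest one wins.
distances = [euclidean(point, c) for c in centroids]
nearest = int(np.argmin(distances))
print(distances, nearest)
```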

For the preprocessing step, the data was normalized with min-max normalization. As mentioned above, researchers used the columns with numerical data points: Age, BMI, Blood Pressure, Children, and Claim. This was necessary because KMeans only supports clustering observations with numerical values. The Smoker column was also added to the model, with "yes" converted to 1 and "no" converted to 0, since smoking status had a high correlation with the insurance claims in the dataset. Three clusters were generated in the template to categorize the data. The algorithm was run for 11 iterations, at which point it converged with no further change.

As a result, cluster 1 (n = 508), cluster 2 (n = 274), and cluster 3 (n = 558) each had distinct column means, which allowed researchers to infer patterns in the data. Cluster 1 had a high age range, consisted mostly of non-smokers, and had a small claim amount. Cluster 2 had a medium age range, a very high claim amount, and a large population of smokers. Cluster 3 had a lower age range, a low claim amount, and consisted mostly of non-smokers. The overall correlation of insurance claims by cluster group is illustrated in Figure 5. After visualizing the data, smoking appeared to greatly affect the insurance claims. Although the result from Excel differed from the result from Python, both methods led to the conclusion that smoking was strongly correlated with claim amounts and was one of the biggest risk factors for insurance claims.


Figure 5. The correlation of insurance claims of each cluster group from Excel method.
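The Excel workflow described above (min-max normalization, assignment by Euclidean distance, and repeated updates until nothing changes) can be approximated by the following NumPy sketch. It is not the Excel template itself: the six rows of (age, claim) values, the choice of initial centers, and the helper names min_max and kmeans are all assumptions made for illustration.

```python
import numpy as np

def min_max(X):
    """Rescale every column to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X - X.min(axis=0)) / span

def kmeans(X, centroids, max_iter=100):
    """Plain KMeans loop: assign points, update centers, stop when stable."""
    centroids = np.asarray(centroids, dtype=float).copy()
    labels = None
    for iteration in range(1, max_iter + 1):
        # Euclidean distance from every point to every center.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Converged: no point changed cluster since the last iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each center to the mean of its current members.
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return labels, centroids, iteration

# Made-up (age, claim) rows for illustration only.
X_raw = [[20, 2000], [25, 3000], [45, 12000],
         [50, 15000], [40, 40000], [42, 45000]]
X = min_max(X_raw)

# Assumed starting centers: rows 0, 2, and 4 of the normalized data.
labels, centroids, iterations = kmeans(X, X[[0, 2, 4]])
print(labels, iterations)
```

On real data the number of iterations needed depends on the starting centers, which is one reason two implementations can report different clusters.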


Python Method

For the Python method, Jupyter Notebook (.ipynb) was selected as the tool to run KMeans clustering. First, the libraries were imported into the Jupyter Notebook as illustrated below. The most significant libraries are Pandas, which helps import the .csv file into the Jupyter Notebook; scikit-learn (sklearn), which helps perform KMeans clustering on the given dataset; and Seaborn, Matplotlib, and Plotly, which help create the visualizations.

# Import Library
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler     # Scaling Data with Min/Max
from sklearn.preprocessing import StandardScaler   # Scaling Data with Standard Scaler
from sklearn.preprocessing import normalize        # Normalize Data
from matplotlib import pyplot as plt               # Visualization
# Notebook magic on its own line (a trailing comment would be passed to the magic as an argument)
%matplotlib inline
from sklearn.decomposition import PCA              # Generate PCA
import seaborn as sns                              # Visualization
pd.options.mode.chained_assignment = None          # Get rid of 'SettingWithCopyWarning'
import plotly.express as px                        # 3D Plot


Once all of the libraries were imported, the dataset was loaded into the Jupyter Notebook using the Pandas library. As discussed above, only age, BMI, blood pressure, number of children, smoking status, and insurance claim were selected as domains for the KMeans clustering in this research. Thus, a new variable was created with only the selected domains/columns. Next, a KMeans model (km) was created with the number of clusters set to 3 to represent low-risk, middle-risk, and high-risk insurance claims, and km.fit_predict was used to generate the clusters. However, since there were 6 domains, principal component analysis (PCA) was performed to reduce the 6 domains to 3 for visualization. PCA is a statistical technique used to reduce the dimensionality of large datasets while retaining the most important information. It produces principal components that are ranked by the amount of variance they explain in the data, with the first component explaining the most variance.
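The clustering and PCA steps described above could be written with scikit-learn roughly as follows. This is a sketch under stated assumptions, not the original notebook: the data is randomly generated, the column names are hypothetical, and the values are left unscaled, which the PCA 1 range reported for Figure 6A suggests but does not confirm.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Randomly generated stand-in data; column names are assumptions.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "age": rng.integers(18, 61, n),
    "bmi": rng.uniform(16, 53, n),
    "bloodpressure": rng.integers(80, 141, n),
    "children": rng.integers(0, 6, n),
    "smoker": rng.integers(0, 2, n),
})
df["claim"] = 5000 + 25000 * df["smoker"] + rng.normal(0, 2000, n)

# New variable holding only the selected domains/columns.
columns = ["age", "bmi", "bloodpressure", "children", "smoker", "claim"]
features = df[columns]

# KMeans with 3 clusters; random_state is fixed so reruns give the same labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = km.fit_predict(features)

# Reduce the 6 domains to 3 principal components for plotting.
pca = PCA(n_components=3)
components = pca.fit_transform(features)
for i in range(3):
    df[f"PCA{i + 1}"] = components[:, i]

# Share of variance explained by each component, largest first.
print(pca.explained_variance_ratio_)
```

The PCA1, PCA2, and PCA3 columns can then be passed to a scatter plot, colored by the cluster column.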





Figure 6. The visual representation of PCA that was generated by sns.scatterplot after reducing from 6 domains (age, BMI, blood pressure, number of children, smoking status, and insurance claim) to 3 domains. Figures A and B show the same clustering with different representation methods. Figure A represents the 2D version of KMeans clustering. Figure B represents the 3D version of KMeans clustering. Interactive 3D plot via https://nbviewer.org/github/jedsadatha/Insurance_KMeans/blob/main/index.html


Figure 6A represents the 2D version of the KMeans clustering, which was generated using PCA 1 and 2. As illustrated in Figure 6A, the three clusters are clearly distinguished from each other: most of cluster 1 does not exceed 0 on PCA 1, cluster 2 ranges from 0 to approximately 15,000 on PCA 1, and cluster 3 starts at approximately 15,000 on PCA 1. However, Figure 6A alone does not make the meaning of the clusters clear to the audience. Therefore, Figure 6B was created to make the information easier to understand. Figure 6B represents the 3D version of KMeans clustering, which was generated using PCA 1, 2, and 3. In addition, the size of the dot for each data point depends on the insurance claim: a smaller dot represents a smaller insurance claim, and a bigger dot represents a larger insurance claim.

As illustrated in Figure 6B, cluster 1 has patients with smaller insurance claims compared to clusters 2 and 3. Cluster 1 has a total of 890 patients with insurance claims ranging from $1,121.87 to $11,187.66 and an average insurance claim of $6,418.91 (low-risk insurance claims). Cluster 2 has a total of 162 patients with insurance claims ranging from $28,468.92 to $63,770.43 and an average insurance claim of $40,761.31 (high-risk insurance claims). Lastly, cluster 3 has a total of 288 patients with insurance claims ranging from $11,244.38 to $27,641.65 and an average insurance claim of $18,897.64 (middle-risk insurance claims). To investigate further, researchers calculated the correlation of the PCA values with each domain.
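Per-cluster figures like the counts, ranges, and averages reported above can be obtained with a single Pandas group-by. The sketch below uses a handful of made-up rows and assumes the columns are named cluster and claim.

```python
import pandas as pd

# Made-up cluster labels and claims for illustration only.
df = pd.DataFrame({
    "cluster": [1, 1, 1, 2, 2, 3, 3],
    "claim": [1500.0, 6000.0, 9000.0, 30000.0, 45000.0, 12000.0, 20000.0],
})

# Number of patients, claim range, and average claim for each cluster.
summary = df.groupby("cluster")["claim"].agg(["count", "min", "max", "mean"])
print(summary)
```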


Figure 7. The correlation between PCA 1, 2, and 3 with the selected domains: age, BMI, blood pressure, number of children, smoking status, and insurance claim.


As illustrated in Figure 7, PCA 1, which is considered the most important component because it explains the largest amount of variation in the data, has a strong correlation with both smoking status (0.79) and insurance claim (1). Thus, the analysis with scikit-learn suggests, through both the visual representation and the correlations, that the smoking status of patients is strongly correlated with the amount of insurance claim they receive.
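A correlation table like the one in Figure 7 could be computed by correlating each principal component with each original column. The following sketch uses randomly generated data with hypothetical column names and only two components, so its numbers are illustrative and will not match the figure.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Randomly generated stand-in data; column names are assumptions.
rng = np.random.default_rng(1)
n = 200
smoker = rng.integers(0, 2, n)
features = pd.DataFrame({
    "age": rng.integers(18, 61, n),
    "smoker": smoker,
    "claim": 5000 + 25000 * smoker + rng.normal(0, 3000, n),
})

# Project the data onto its first two principal components.
pcs = PCA(n_components=2).fit_transform(features)
pc_df = pd.DataFrame(pcs, columns=["PCA1", "PCA2"], index=features.index)

# Correlation of every original column with every component.
corr = pd.DataFrame({
    pc: features.corrwith(pc_df[pc]) for pc in pc_df.columns
})
print(corr.round(2))
```

The sign of a principal component is arbitrary, so only the magnitude of each correlation is meaningful. On unscaled data, the column with the largest variance (here the claim) tends to dominate the first component.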

Interpretation and Future Work

Although the Excel and Python methods produced different clusters, both methods showed that the smoking status of patients had a stronger association with insurance claims than the other elements in the dataset. Smoking status was also strongly correlated with the PCA 1 value, which means that the variable was strongly associated with the direction of greatest variance in the data. There are several potential reasons for the discrepancy between the Excel and Python results despite the use of the same dataset. First, KMeans clustering in scikit-learn involves randomly initializing the cluster centers, which can lead to different results each time the algorithm is run. Second, different implementations of KMeans may use different convergence criteria, such as the number of iterations or the change in the cluster centers, and different criteria can lead to different results.

In conclusion, the results from both methods provided evidence that the characteristics of patients are related to the insurance claim they receive. However, each characteristic may carry a different weight. For instance, the correlation suggests that the smoking status of patients/customers is a main factor in the insurance claims individuals receive.
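The effect of random initialization mentioned above can be controlled in scikit-learn by fixing the random_state parameter of KMeans. The sketch below uses three made-up, well-separated groups of points and shows that two runs with the same seed return identical labels; the seed value 7 is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three made-up, well-separated groups of 2D points.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(30, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(30, 2)),
])

# Same seed, same data: the random initialization is repeated exactly.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)
same_seed_matches = bool(np.array_equal(labels_a, labels_b))
print(same_seed_matches)
```

Even with matching clusters, the label numbers themselves are arbitrary, so cluster 1 in one tool may correspond to cluster 3 in another.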


For future study, first, researchers could use all the characteristics in the dataset instead of only the selected ones: age, BMI, blood pressure, number of children, smoking status, and insurance claim. Although gender, diabetic status, and region showed low correlation with insurance claims, those characteristics might still have some impact on the clustering groups or influence the PCA values. Second, researchers could try different approaches to group the patients, such as the K-Nearest Neighbors algorithm (a supervised method, so it would require labeled risk groups), or different methods of normalizing the data. Third, the dataset could be improved by collecting another domain that may influence the insurance claims: the income of patients. A patient's income level can determine whether they are eligible for certain insurance plans or government-sponsored programs, such as Medicaid. Patients with lower incomes may qualify for these programs, which can provide more comprehensive coverage at a lower cost. There are many more potential ways to improve this study, but the main improvements that researchers suggest are incorporating more data into the clustering, using different algorithms and methods, and collecting more variables.
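As one illustration of the suggestion to try different normalization methods, the sketch below compares the two scalers that the notebook imports, MinMaxScaler and StandardScaler. The three rows pair age and claim values taken from the ranges reported earlier, but the pairing itself is made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up (age, claim) rows built from the reported minimum, mean, and maximum.
X = np.array([
    [18.0, 1121.87],
    [38.0, 13252.75],
    [60.0, 63770.43],
])

# Min-max scaling maps every column to the range [0, 1].
mm = MinMaxScaler().fit_transform(X)

# Standard scaling gives every column zero mean and unit variance.
ss = StandardScaler().fit_transform(X)

print(mm.round(3))
print(ss.round(3))
```

Because KMeans relies on Euclidean distance, the choice of scaler changes how much each column contributes to the clustering.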

Data and Additional Files

Let’s connect!

jedsada.thavornfung@gmail.com
