Allstate Claims Severity: Insurance claim severity prediction

Purnima Chowrasia
6 min read · Nov 30, 2020
Photo by Scott Graham on Unsplash

Table of Contents:

  1. Introduction
  2. Business Problem
  3. Mapping to ML Problem
  4. About the Data
  5. Evaluation Metrics
  6. Exploratory Data Analysis
  7. Data Preparation and Feature Engineering
  8. Applying ML models
  9. Conclusion & Future Work
  10. References

1. Introduction

It is true that life is unpredictable. Uncertain events can occur at any time: a serious car accident, a financial setback in your business, or home damage caused by a windstorm or hail. When you are already devastated by such an unfortunate event, your only focus will be on your family and other loved ones. Filing insurance claims and submerging yourself in paperwork with your insurance agent is the last place you want your time and energy to be drained. What if someone made this claims process easier for you?

2. Business Problem

Allstate, an insurance company in the United States, is continually seeking innovative ideas to expand its business and to improve the customer experience by making its claims process faster, smoother, and hassle-free. Relying on a customer-centric approach, we will see how Allstate can achieve these business goals.

3. Mapping to ML Problem

In the previous section, we discussed Allstate’s business problem. In this section, we will focus on how to solve it using machine learning techniques. Whenever an insurance claim is filed, the paperwork goes to a human who reviews it thoroughly and then decides the amount that should be paid to the customer; this is a time-consuming process. What if we designed a machine learning model that takes the claim data as input and outputs the loss amount within a fraction of a second rather than many days? Since the quantity to predict is a continuous dollar amount, this business problem can naturally be framed as a regression problem, and with the right form of data we can predict the claim severity.

4. About the Data

To solve this regression problem, data is provided by Allstate in an anonymized form. It can be downloaded from this Kaggle page.

Each row in the given dataset represents an insurance claim. We must predict the value for the ‘loss’ column. Variables prefaced with ‘cat’ are categorical, while those prefaced with ‘cont’ are continuous.

  • Shape of train data: 188318 rows and 132 columns.
    Shape of test data: 125546 rows and 131 columns.
  • There are a total of 132 columns in the training data, with the following details:

id: Unique id assigned to each row

cat1, cat2,…, cat116: These are 116 anonymized categorical features.

cont1, cont2,…,cont14: These are 14 anonymized continuous features.

loss: Target variable that needs to be predicted. Amount of cost for each claim.

5. Evaluation Metrics

To evaluate the performance of the ML models, Mean Absolute Error (MAE) is used as the metric.

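MAE is simply the average absolute difference between the predicted and the actual loss, i.e. MAE = (1/n) Σ |yᵢ − ŷᵢ|. As a tiny illustration (the numbers below are made up just to show the computation), it can be computed with scikit-learn:

from sklearn.metrics import mean_absolute_error
import numpy as np

y_true = np.array([2213.18, 1283.60, 3005.09])   # actual claim losses (toy values)
y_pred = np.array([2100.00, 1500.00, 2800.00])   # model predictions (toy values)

print(mean_absolute_error(y_true, y_pred))       # mean of |y_true - y_pred|, ≈ 178.22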

6. Exploratory Data Analysis

The given data has anonymized attributes. To understand it better, EDA was done in the following ways (a compact code sketch of several of these steps follows the list):

  • Importing the required libraries and loading the data files.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import scipy.stats as ss
import warnings
warnings.filterwarnings('ignore')
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
  • A brief glimpse of training data.
  • Checking out missing values.
No missing values in the given data
  • While exploring the 116 categorical features, I found that the number of unique categories per column ranges from 2 to 326.
Frequency Plot for Categorical attributes with only two categories
Frequency Plot for Categorical attributes with more than two categories
  • Multiple pairs of categorical attributes show a high association with each other; the correlation between several pairs of categorical features goes up to 0.99. These highly associated columns are dropped in a later section.
Correlation value between some pairs
  • While exploring the 14 continuous features, I found that all of them take values in the range 0 to 1. To see their distributions, I used violin plots.
Violin plots for Continuous features
  • The distribution of the target variable ‘loss’ is highly skewed. Hence, I applied a log transformation to the loss column.
Distribution of target variable ‘loss’
Distribution of log-transformed ‘loss’
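
Below is a compact sketch of some of the checks above. The post does not name its association measure for categorical pairs, so Cramér’s V (derived from a chi-square test) is used here purely as one common option; likewise, the exact log variant is not stated, so np.log1p is an assumption, and the columns picked in the example (cat1, cat2, cont1) are arbitrary.

# missing values and per-column category counts
print(train_df.isnull().sum().sum())           # 0 -> no missing values
cat_cols = [c for c in train_df.columns if c.startswith('cat')]
print(train_df[cat_cols].nunique().min(),      # 2
      train_df[cat_cols].nunique().max())      # 326

# association between two categorical columns via Cramér's V (one possible measure)
def cramers_v(x, y):
    confusion = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion)[0]
    n = confusion.values.sum()
    r, k = confusion.shape
    return np.sqrt((chi2 / n) / min(r - 1, k - 1))

print(cramers_v(train_df['cat1'], train_df['cat2']))   # arbitrary example pair

# violin plot of one continuous feature
sn.violinplot(y=train_df['cont1'])
plt.show()

# log transform of the skewed target ('log1p' variant is an assumption)
train_df['log_loss'] = np.log1p(train_df['loss'])
sn.distplot(train_df['log_loss'])
plt.show()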

7. Data Preparation & Feature Engineering

This section discusses the process followed to make the given data readable by the machine learning algorithms that I will talk about in the next section. Numerical features don’t need any special attention, as they can be fed as-is to ML models, whereas categorical attributes need to be converted into numerical form.

  • Before that, I dropped the categorical columns that showed a high correlation during EDA.
#Categorical column names to be dropped
cat_drop = ['cat2','cat3','cat4','cat5','cat6','cat7','cat8','cat9',
'cat50','cat71','cat86','cat95','cat96','cat98','cat104']
train_df.drop(cat_drop,axis=1,inplace=True)
test_df.drop(cat_drop,axis=1,inplace=True)
  • After dropping the highly associated categorical attributes, only 101 categorical features are left. Rather than using label encoding or lexical encoding, I tried a slightly different approach: each row/data point (DP) now has 101 categorical feature values, and I join all of them with a whitespace character. The result is a text/sentence with 101 words in it (thinking of each category name as a word).
Categorical feature converted as a sentence with 101 words
  • Then the above method is applied to all the data points(DPs).
from tqdm import tqdm

def new_feature(row, df):
    # collect every categorical value in this row and join them into one "sentence"
    word_list = np.array(df.loc[row, 'cat1':'cat116'])
    text = ' '.join(word_list)
    return text

train_df['text'] = None
test_df['text'] = None

for i in tqdm(range(train_df.shape[0])):
    train_df.loc[i, 'text'] = new_feature(i, train_df)

for i in tqdm(range(test_df.shape[0])):
    test_df.loc[i, 'text'] = new_feature(i, test_df)
  • To vectorize that text column, I used CountVectorizer and TfidfVectorizer to convert it into numerical form. By tweaking the ngram_range parameter, unigram, bigram, and trigram features are generated (a sketch follows this list).
Trigram features
  • In this way, I created 8 sets of data by stacking either the unigram, bigram, or trigram features with the continuous features. Out of these, I found the bigram and trigram features to perform better.
8 different training dataset created
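
As a rough sketch of the vectorization and stacking steps above (the post does not give the exact vectorizer settings, so the ngram_range, token_pattern, and choice of TfidfVectorizer below are assumptions; CountVectorizer can be swapped in the same way):

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack, csr_matrix

# each "word" is a category label such as 'A' or 'AB';
# token_pattern=r'\S+' keeps the one-letter tokens the default pattern would drop.
# ngram_range=(1, 3) builds unigrams through trigrams together; the post builds them as separate sets.
tfidf = TfidfVectorizer(ngram_range=(1, 3), lowercase=False, token_pattern=r'\S+')
X_train_text = tfidf.fit_transform(train_df['text'])
X_test_text = tfidf.transform(test_df['text'])

# stack the sparse n-gram features with the 14 continuous features
cont_cols = [c for c in train_df.columns if c.startswith('cont')]
X_train = hstack([X_train_text, csr_matrix(train_df[cont_cols].values)]).tocsr()
X_test = hstack([X_test_text, csr_matrix(test_df[cont_cols].values)]).tocsr()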

8. Applying ML models

After choosing to work with the bigram and trigram features, I applied various kinds of models: simple linear models as well as ensemble models. Rather than working with all the features, I selected the 1000 best using SelectKBest. Instead of using GridSearchCV or RandomizedSearchCV for hyperparameter tuning, I chose the Bayesian optimization method for a faster tuning process (a sketch of both steps follows this paragraph).
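
Continuing from the X_train/X_test matrices in the earlier sketch, here is a rough outline of the feature selection and Bayesian search. The post names neither the SelectKBest scoring function nor the Bayesian-optimization library, so f_regression and scikit-optimize's BayesSearchCV are assumptions, as is training on the log-transformed loss:

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import ElasticNet
from skopt import BayesSearchCV          # scikit-optimize; assumed library
from skopt.space import Real

y_log = np.log1p(train_df['loss'])       # log-space target, following the EDA section

# keep the 1000 most informative columns
selector = SelectKBest(f_regression, k=1000)
X_train_sel = selector.fit_transform(X_train, y_log)
X_test_sel = selector.transform(X_test)

# Bayesian hyperparameter search for Elastic Net (search ranges are illustrative)
search = BayesSearchCV(
    ElasticNet(),
    {'alpha': Real(1e-4, 1e1, prior='log-uniform'), 'l1_ratio': Real(0.05, 1.0)},
    n_iter=30, cv=3, scoring='neg_mean_absolute_error', random_state=42)
search.fit(X_train_sel, y_log)
print(search.best_params_)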

Best hyperparameter value for Elastic net
  • XGBoost applied on the trigram feature set gave the best result; the score after submitting on Kaggle is 1228.07 MAE.
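
A minimal sketch of what that final model could look like (the hyperparameters here are illustrative placeholders, not the tuned values; predictions are mapped back from log space with np.expm1 under the same assumption as above):

import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=1000, learning_rate=0.05, max_depth=7,      # illustrative values only
    subsample=0.8, colsample_bytree=0.8, objective='reg:squarederror')
model.fit(X_train_sel, y_log)

# map predictions back to the original loss scale before writing the submission
preds = np.expm1(model.predict(X_test_sel))
submission = pd.DataFrame({'id': test_df['id'], 'loss': preds})
submission.to_csv('submission.csv', index=False)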

9. Conclusion & Future Work

While solving this Kaggle problem, I mostly focused on trying a new feature engineering method that, as far as I can tell, has not been tried before, since I could not find anything similar on the competition page. I tried only a few ML models, and there are multiple ways to improve the MAE further. Future work could include trying a neural network model, using a stacking regressor, or taking a weighted average of more than one model.

For complete code and steps, check out my Github Repo.

10. References
