G26: Fraud Analysis for Vehicle Insurance Claim

There are several insurance-claim frauds being faced by the insurers. It has become quite frequent over the years. These frauds have major repercussions such as higher premiums for honest policy holders, more scrutiny while issuing policy etc. There are several acknowledged types of frauds including but not limited to identity fraud, staged accidents and inflated claims. In a recent research, it was claimed that insurance fraud grows 23% YoY and has a global impact of $6.4 billion. Hence, in order to solve this problem, several fraud detection techniques can be applied through ML algorithms where the goal is to identify patterns and anomalies as quickly and efficiently as possible that could indicate fraudulent claims. This will help the insurance companies to reduce their losses as well as increase people’s trust in the insurance industries.

For vehicle insurance claim fraud, a dataset is chosen from Kaggle having 33 features and 15420 data entries. This dataset has several features that indicate the following:

Type	Features
Vehicle	make, category, price, age
Customer	sex, age, marital-status, driver-rating, number-of-vehicles
Policy	policy-number, fault, type, deductible, days between events, police-report filed, witness present, agent-type, accident-area, address-change, base-policy, supplements
Data labeled with binary classification if claim was fraudulent or not (1 or 0)	fraud-found

The data size available for this use-case is sufficient - more than 10 times degrees of freedom. The other observations ensure acceptable nulls/missing values (2%), the data is imbalanced 6%, however, it is in line with insurance industry trends.

The following research questions may be addressed from the analysis:

Can insurance claim fraud be predicted and what is the performance of the algorithms?
Which features are important and will affect predicting the fraud?
What is the importance of claim data completeness in fraud detection?

This is a supervised ML (classification) problem. Considering the numerical and categorical data in the dataset, the algorithms that could be used are decision-trees, random-forest and logistic regression. The dataset will be undersampled and split into training, validation and test. To account for undersampling in the dataset, stratified k-fold cross validation shall be applied. Decision trees can also be used for feature selection along with Pearson (or Kendall) correlation. While decision trees perform better with categorical data, logistic regression could perform better given that the size of data is less.

The primary KPI for this analysis shall be precision, recall and F1 score over accuracy since it’s more important to predict fraudsters.

References:

Vehicle Insurance Claim Fraud Dataset
Stratified Cross Validator (since Spark doesn't have a built-in method for stratified k-fold cross validation)
Insurance Fraud Detection Global Market Report 2023

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
dataset		dataset
model		model
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

G26: Fraud Analysis for Vehicle Insurance Claim

About

Languages

License

Shefali-Upadhyaya/vehicle-insurance-claim-fraud-detection

Folders and files

Latest commit

History

Repository files navigation

G26: Fraud Analysis for Vehicle Insurance Claim

About

Topics

Resources

License

Stars

Watchers

Forks

Languages