Import all the necessary libraries

# Basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as spy
%matplotlib inline
import copy

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Pandas display settings - columns

# Display all columns
pd.set_option("display.max_columns", None)

Data ingestion

# Load dataset
data = pd.read_excel("InsurancePremiumDefault.xlsx",sheet_name='premium')

Data Inspection

Preview dataset

# Preview the dataset
# View the first 5, last 5 and random 10 rows
print('First five rows', '--'*55)
display(data.head())

print('Last five rows', '--'*55)
display(data.tail())

print('Random ten rows', '--'*55)
np.random.seed(1)
display(data.sample(n=10))

First five rows --------------------------------------------------------------------------------------------------------------

Output:

	id	perc_premium_paid_by_cash_credit	age_in_days	Income	Count_3-6_months_late	Count_6-12_months_late	Count_more_than_12_months_late	Marital Status	Veh_Owned	No_of_dep	Accomodation	risk_score	no_of_premiums_paid	sourcing_channel	residence_area_type	premium	default
0	1	0.317	11330	90050	0	0	0	0	3	3	1	98.810	8	A	Rural	5400	1
1	2	0.000	30309	156080	0	0	0	1	3	1	1	99.066	3	A	Urban	11700	1
2	3	0.015	16069	145020	1	0	0	0	1	1	1	99.170	14	C	Urban	18000	1
3	4	0.000	23733	187560	0	0	0	1	1	1	0	99.370	13	A	Urban	13800	1
4	5	0.888	19360	103050	7	3	4	0	2	1	0	98.800	15	A	Urban	7500	0

Last five rows --------------------------------------------------------------------------------------------------------------

Output:

	id	perc_premium_paid_by_cash_credit	age_in_days	Income	Count_3-6_months_late	Marital Status	Veh_Owned	No_of_dep	risk_score	no_of_premiums_paid	sourcing_channel	residence_area_type	premium	default
79848	79849	0.249	25555	64420	0	1	2	4	99.08	10	A	Urban	5700	1
79849	79850	0.003	16797	660040	1	0	2	1	99.65	9	B	Urban	28500	1
79850	79851	0.012	24835	227760	0	0	2	3	99.66	11	A	Rural	11700	1
79851	79852	0.190	10959	153060	1	0	2	1	99.46	24	A	Urban	11700	1
79852	79853	0.000	19720	324030	0	0	3	3	99.80	7	D	Rural	3300	1

Random ten rows --------------------------------------------------------------------------------------------------------------

Output:

	id	perc_premium_paid_by_cash_credit	age_in_days	Income	Count_3-6_months_late	Marital Status	Veh_Owned	No_of_dep	Accomodation	risk_score	no_of_premiums_paid	sourcing_channel	residence_area_type	premium	default
53658	53659	0.150	22643	171080	0	1	2	2	0	98.78	12	A	Urban	11700	1
25309	25310	0.201	10232	75090	0	1	3	4	0	99.61	7	A	Rural	1200	1
26163	26164	0.010	19719	144120	0	1	3	1	0	99.89	7	A	Urban	11700	1
55134	55135	0.000	23734	350070	0	1	1	1	1	99.51	9	A	Urban	1200	1
29706	29707	0.094	27023	69110	0	1	3	4	0	99.83	8	A	Urban	5400	1
11087	11088	0.123	28844	180030	0	0	2	1	1	99.47	8	A	Rural	9600	1
71267	71268	0.990	12785	55640	1	0	1	2	1	99.66	7	A	Rural	3300	1
4389	4390	0.051	22276	171080	0	0	2	2	0	99.69	11	A	Rural	7500	1
47666	47667	0.000	19717	219430	0	1	2	3	0	99.27	11	D	Urban	11700	1
31240	31241	0.000	15708	150110	0	0	3	2	0	99.71	4	C	Urban	11700	1

Initial observations

id is row identifier, which does not add any value. This variable can be removed later.
perc_premium_paid_by_cash_credit is a continuous,ratio numerical variable.
age_in_days,Income,Veh_Owned, No_of_dep,No_of_dep,no_of_premiums_paid & premium are discrete,interval numerical variables.
Income is a discrete,interval numerical variable.
Count_3-6_months_late, Count_6-12_months_late & Count_more_than_12_months_late are interval and discrete numerical variables.
Marital Status is binary (0 - Unmarried, 1 - Married)
Accomodation is binary (0 - Rented, 1 - Owned)
risk_score is a continuous, numerical variable.
sourcing_channel & residence_area_type are categorical nominal variables.
default is the Target variable. It is a binary (0 - defaulter, 1 - non_defaulter) variable.

Variable List

# Display list of variables in dataset
variable_list = data.columns.tolist()
print(variable_list)

['id', 'perc_premium_paid_by_cash_credit', 'age_in_days', 'Income', 'Count_3-6_months_late', 'Count_6-12_months_late', 'Count_more_than_12_months_late', 'Marital Status', 'Veh_Owned', 'No_of_dep', 'Accomodation', 'risk_score', 'no_of_premiums_paid', 'sourcing_channel', 'residence_area_type', 'premium', 'default']

Let's rename the variables for ease of programming

# Column rename dictionary
renamed_columns = {
    'id': 'ID',
    'perc_premium_paid_by_cash_credit': 'Perc_premium_paid_in_cash',
    'age_in_days': 'Age_in_days',
    'Count_3-6_months_late': 'Late_premium_payment_3-6_months',
    'Count_6-12_months_late': 'Late_premium_payment_6-12_months',
    'Count_more_than_12_months_late': 'Late_premium_payment_>12_months',
    'Marital Status': 'Marital_Status',
    'Veh_Owned': 'Vehicles_Owned',
    'No_of_dep': 'No_of_dependents',
    'risk_score': 'Risk_score',
    'no_of_premiums_paid': 'No_of_premiums_paid',
    'sourcing_channel': 'Sourcing_channel',
    'residence_area_type': 'Customer_demographic',
    'premium': 'Premium_payment',
    'default': 'Default'
}

# Rename dataframe columns names
data = data.rename(columns = renamed_columns)

# Check for updated column names
variable_list = data.columns.tolist()
print(variable_list)

['ID', 'Perc_premium_paid_in_cash', 'Age_in_days', 'Income', 'Late_premium_payment_3-6_months', 'Late_premium_payment_6-12_months', 'Late_premium_payment_>12_months', 'Marital_Status', 'Vehicles_Owned', 'No_of_dependents', 'Accomodation', 'Risk_score', 'No_of_premiums_paid', 'Sourcing_channel', 'Customer_demographic', 'Premium_payment', 'Default']

Dataset shape

shape = data.shape
n_rows = shape[0]
n_cols = shape[1]
print(f"The Dataframe consists of '{n_rows}' rows and '{n_cols}' columns")

The Dataframe consists of '79853' rows and '17' columns

Data types

# Check the data types
data.dtypes

ID                                    int64
Perc_premium_paid_in_cash           float64
Age_in_days                           int64
Income                                int64
Late_premium_payment_3-6_months       int64
Late_premium_payment_6-12_months      int64
Late_premium_payment_>12_months       int64
Marital_Status                        int64
Vehicles_Owned                        int64
No_of_dependents                      int64
Accomodation                          int64
Risk_score                          float64
No_of_premiums_paid                   int64
Sourcing_channel                     object
Customer_demographic                 object
Premium_payment                       int64
Default                               int64
dtype: object

Data info

# Get info of the dataframe columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79853 entries, 0 to 79852
Data columns (total 17 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   ID                                79853 non-null  int64  
 1   Perc_premium_paid_in_cash         79853 non-null  float64
 2   Age_in_days                       79853 non-null  int64  
 3   Income                            79853 non-null  int64  
 4   Late_premium_payment_3-6_months   79853 non-null  int64  
 5   Late_premium_payment_6-12_months  79853 non-null  int64  
 6   Late_premium_payment_>12_months   79853 non-null  int64  
 7   Marital_Status                    79853 non-null  int64  
 8   Vehicles_Owned                    79853 non-null  int64  
 9   No_of_dependents                  79853 non-null  int64  
 10  Accomodation                      79853 non-null  int64  
 11  Risk_score                        79853 non-null  float64
 12  No_of_premiums_paid               79853 non-null  int64  
 13  Sourcing_channel                  79853 non-null  object 
 14  Customer_demographic              79853 non-null  object 
 15  Premium_payment                   79853 non-null  int64  
 16  Default                           79853 non-null  int64  
dtypes: float64(2), int64(13), object(2)
memory usage: 10.4+ MB

Two (2) variables have been identified as Panda object type. These shall be converted to the category type.

Convert Pandas Objects to Category type

# Convert variables with "object" type to "category" type
for i in data.columns:
    if data[i].dtypes == "object":
        data[i] = data[i].astype("category") 

# Confirm if there no variables with "object" type
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79853 entries, 0 to 79852
Data columns (total 17 columns):
 #   Column                            Non-Null Count  Dtype   
---  ------                            --------------  -----   
 0   ID                                79853 non-null  int64   
 1   Perc_premium_paid_in_cash         79853 non-null  float64 
 2   Age_in_days                       79853 non-null  int64   
 3   Income                            79853 non-null  int64   
 4   Late_premium_payment_3-6_months   79853 non-null  int64   
 5   Late_premium_payment_6-12_months  79853 non-null  int64   
 6   Late_premium_payment_>12_months   79853 non-null  int64   
 7   Marital_Status                    79853 non-null  int64   
 8   Vehicles_Owned                    79853 non-null  int64   
 9   No_of_dependents                  79853 non-null  int64   
 10  Accomodation                      79853 non-null  int64   
 11  Risk_score                        79853 non-null  float64 
 12  No_of_premiums_paid               79853 non-null  int64   
 13  Sourcing_channel                  79853 non-null  category
 14  Customer_demographic              79853 non-null  category
 15  Premium_payment                   79853 non-null  int64   
 16  Default                           79853 non-null  int64   
dtypes: category(2), float64(2), int64(13)
memory usage: 9.3 MB

The memory usage has decreased from 10.4+ MB to 9.3 MB

Missing value summary function

def missing_val_chk(data):
    """
    This function to checks for missing values 
    and generates a summary.
    """
    if data.isnull().sum().any() == True:
        # Number of missing in each column
        missing_vals = pd.DataFrame(data.isnull().sum().sort_values(
            ascending=False)).rename(columns={0: '# missing'})

        # Create a percentage missing
        missing_vals['percent'] = ((missing_vals['# missing'] / len(data)) *
                                   100).round(decimals=3)

        # Remove rows with 0
        missing_vals = missing_vals[missing_vals['# missing'] != 0].dropna()

        # display missing value dataframe
        print("The missing values summary")
        display(missing_vals)
    else:
        print("There are NO missing values in the dataset")

Missing Values Check

#Applying the missing value summary function
missing_val_chk(data)

There are NO missing values in the dataset

Before we check the 5 Point numerical summary, let's verify that ID is row identifier.

If this is the case then the number of unique values will equal the number of rows.

# Check to see if ID unique values equal number of rows

if data.ID.nunique() == data.shape[0]:
    print("ID is a row identifier, we shall drop ID variable")
else:
    print("ID is a row identifier")

ID is a row identifier, we shall drop ID variable

Dropping ID variable

data.drop(columns="ID", axis=1, inplace=True)

5 Point Summary

Numerical type Summary

# Five point summary of all numerical type variables in the dataset
data.describe().T

Output:

	count	mean	std	min	25%	50%	75%	max
Perc_premium_paid_in_cash	79853.0	0.314288	0.334915	0.0	0.034	0.167	0.538	1.00
Age_in_days	79853.0	18846.696906	5208.719136	7670.0	14974.000	18625.000	22636.000	37602.00
Income	79853.0	208847.171177	496582.597257	24030.0	108010.000	166560.000	252090.000	90262600.00
Late_premium_payment_3-6_months	79853.0	0.248369	0.691102	0.0	0.000	0.000	0.000	13.00
Late_premium_payment_6-12_months	79853.0	0.078093	0.436251	0.0	0.000	0.000	0.000	17.00
Late_premium_payment_>12_months	79853.0	0.059935	0.311840	0.0	0.000	0.000	0.000	11.00
Marital_Status	79853.0	0.498679	0.500001	0.0	0.000	0.000	1.000	1.00
Vehicles_Owned	79853.0	1.998009	0.817248	1.0	1.000	2.000	3.000	3.00
No_of_dependents	79853.0	2.503012	1.115901	1.0	2.000	3.000	3.000	4.00
Accomodation	79853.0	0.501296	0.500001	0.0	0.000	1.000	1.000	1.00
Risk_score	79853.0	99.067243	0.725892	91.9	98.830	99.180	99.520	99.89
No_of_premiums_paid	79853.0	10.863887	5.170687	2.0	7.000	10.000	14.000	60.00
Premium_payment	79853.0	10924.507533	9401.676542	1200.0	5400.000	7500.000	13800.000	60000.00
Default	79853.0	0.937410	0.242226	0.0	1.000	1.000	1.000	1.00

Perc_premium_paid_in_cash is highly right skewed as the mean is almost twice the median. Also the standard deviation is greater than the mean implying a wide spread.
Age_in_days is fairly symmetrical with mean and median being very close but there is some right skew as the difference between Q3 & Q4 is larger than other quartiles.
Income highly right skewed as the mean is greater than the median. Also the standard deviation is more than twice the mean implying a wide spread.
Late_premium_payment_3-6_months, Late_premium_payment_6-12_months & Late_premium_payment_>12_months are categorical ordinal variables.
Marital_Status is a binary variable with approximately 50% of the rows having a value of 1 (50% of the customers are married).
Vehicles_Owned & No_of_dependents are categorical ordinal variables.
Accomodation is a binary variable with approximately 50% of the rows having a value of 1 (50% of the customers owned their homes).
Risk_score is fairly symmetrical with mean and median being very close.
No_of_premiums_paid is fairly symmetrical with mean and median being very close but there is some right skew as the difference between Q3 & Q4 is larger than other quartiles.
Premium_payment is highly right skewed as the mean is greater than the median. Also the standard deviation is close to the mean implying a wide spread.
Default is a binary variable with approximately 94% of the rows having a value of 1 (94% of the customers are non-defaulters). This target variable is heavily imbalanced as only 6% of the rows are defaulters.

Categorical type Summary

data.describe(include=['category']).T

Output:

	count	unique	top	freq
Sourcing_channel	79853	5	A	43134
Customer_demographic	79853	2	Urban	48183

Sourcing_channel there are 5 different states in which "A" is the most frequent.
Customer_demographic there are 2 different states in which "Urban" is the most frequent.

This dataset will be skewed to policy holders sourced from Channel A and Urban residences

Number of unique states for all variables

# Check the unique values
data.nunique().to_frame()

Output:

	0
Perc_premium_paid_in_cash	1001
Age_in_days	833
Income	24165
Late_premium_payment_3-6_months	14
Late_premium_payment_6-12_months	17
Late_premium_payment_>12_months	10
Marital_Status	2
Vehicles_Owned	3
No_of_dependents	4
Accomodation	2
Risk_score	673
No_of_premiums_paid	57
Sourcing_channel	5
Customer_demographic	2
Premium_payment	30
Default	2

Categorical Variable Identification

Although the following variables are numerical in nature, they represent categorical variables:

Late_premium_payment_3-6_months
Late_premium_payment_6-12_months
Late_premium_payment_>12_months
Vehicles_Owned
No_of_dependents

Create a list of numerical variables

numerical_vars = [
    'Perc_premium_paid_in_cash', 'Age_in_days', 'Income', 'Risk_score',
    'No_of_premiums_paid', 'Premium_payment'
]

Create a list of categorical variables

categorical_vars = [
    'Late_premium_payment_3-6_months', 'Late_premium_payment_6-12_months',
    'Late_premium_payment_>12_months', 'Marital_Status', 'Vehicles_Owned',
    'No_of_dependents', 'Accomodation', 'Sourcing_channel',
    'Customer_demographic', 'Default'
]

Numerical data

data[numerical_vars].describe().T

Output:

	count	mean	std	min	25%	50%	75%	max
Perc_premium_paid_in_cash	79853.0	0.314288	0.334915	0.0	0.034	0.167	0.538	1.00
Age_in_days	79853.0	18846.696906	5208.719136	7670.0	14974.000	18625.000	22636.000	37602.00
Income	79853.0	208847.171177	496582.597257	24030.0	108010.000	166560.000	252090.000	90262600.00
Risk_score	79853.0	99.067243	0.725892	91.9	98.830	99.180	99.520	99.89
No_of_premiums_paid	79853.0	10.863887	5.170687	2.0	7.000	10.000	14.000	60.00
Premium_payment	79853.0	10924.507533	9401.676542	1200.0	5400.000	7500.000	13800.000	60000.00

Skew Summary

# Display the skew summary for the numerical variables
for var in data[numerical_vars].columns:
    var_skew = data[var].skew()
    if var_skew > 1:
        print(f"The '{var}' distribution is highly right skewed.\n")
    elif var_skew < -1:
        print(f"The '{var}' distribution is highly left skewed.\n")
    elif (var_skew > 0.5) & (var_skew < 1):
        print(f"The '{var}' distribution is moderately right skewed.\n")
    elif (var_skew < -0.5) & (var_skew > -1):
        print(f"The '{var}' distribution is moderately left skewed.\n")
    else:
        print(f"The '{var}' distribution is fairly symmetrical.\n")

The 'Perc_premium_paid_in_cash' distribution is moderately right skewed.

The 'Age_in_days' distribution is fairly symmetrical.

The 'Income' distribution is highly right skewed.

The 'Risk_score' distribution is highly left skewed.

The 'No_of_premiums_paid' distribution is highly right skewed.

The 'Premium_payment' distribution is highly right skewed.

Outlier check function

# Outlier check
def outlier_count(data):
    """
    This function checks the lower and upper 
    outliers for all numerical variables.
    
    Outliers are found where data points exists either:
    - Greater than `1.5*IQR` above the 75th percentile
    - Less than `1.5*IQR` below the 25th percentile
    """
    numeric = data.select_dtypes(include=np.number).columns.to_list()
    for i in numeric:
        # Get name of series
        name = data[i].name
        # Calculate the IQR for all values and omit NaNs
        IQR = spy.stats.iqr(data[i], nan_policy="omit")
        # Calculate the boxplot upper fence
        upper_fence = data[i].quantile(0.75) + 1.5 * IQR
        # Calculate the boxplot lower fence
        lower_fence = data[i].quantile(0.25) - 1.5 * IQR
        # Calculate the count of outliers above upper fence
        upper_outliers = data[i][data[i] > upper_fence].count()
        # Calculate the count of outliers below lower fence
        lower_outliers = data[i][data[i] < lower_fence].count()
        # Check if there are no outliers
        if (upper_outliers == 0) & (lower_outliers == 0):
            continue
        print(
            f"The '{name}' distribution has '{lower_outliers}' lower outliers and '{upper_outliers}' upper outliers.\n"
        )

Outlier check

#Applying the Outlier check function for the sub-dataframe of numerical variables
outlier_count(data[numerical_vars])

The 'Age_in_days' distribution has '0' lower outliers and '44' upper outliers.

The 'Income' distribution has '0' lower outliers and '3428' upper outliers.

The 'Risk_score' distribution has '3784' lower outliers and '0' upper outliers.

The 'No_of_premiums_paid' distribution has '0' lower outliers and '1426' upper outliers.

The 'Premium_payment' distribution has '0' lower outliers and '4523' upper outliers.

Numerical Variable Summary

Variable	Skew	Outliers
Perc_premium_paid_in_cash	Moderately right skewed	No Outliers
Age_in_days	Fairly symmetrical	44 Upper Outliers
Income	Highly right skewed	3428 Upper Outliers
Risk_score	Highly left skewed	3784 Lower Outliers
No_of_premiums_paid	Highly right skewed	1426 Upper Outliers
Premium_payment	Highly right skewed	4523 Upper Outliers

Categorical data

Unique states

Detailed investigation of unique values

# Display the unique values for all categorical variables
for i in categorical_vars:
    print('Unique values in',i, 'are :')
    print(data[i].value_counts())
    print('--'*55)

Unique values in Late_premium_payment_3-6_months are :
Late_premium_payment_3-6_months
0     66898
1      8826
2      2519
3       954
4       374
5       168
6        68
7        23
8        15
9         4
13        1
12        1
10        1
11        1
Name: count, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in Late_premium_payment_6-12_months are :
Late_premium_payment_6-12_months
0     75928
1      2680
2       693
3       317
4       130
5        46
6        26
7        11
8         5
9         4
10        4
11        2
14        2
13        2
15        1
17        1
12        1
Name: count, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in Late_premium_payment_>12_months are :
Late_premium_payment_>12_months
0     76135
1      2996
2       498
3       151
4        48
5        13
6         6
7         3
8         2
11        1
Name: count, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in Marital_Status are :
Marital_Status
0    40032
1    39821
Name: count, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in Vehicles_Owned are :
Vehicles_Owned
1    26746
3    26587
2    26520
Name: count, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in No_of_dependents are :
No_of_dependents
3    20215
2    19902
4    19896
1    19840
Name: count, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in Accomodation are :
Accomodation
1    40030
0    39823
Name: count, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in Sourcing_channel are :
Sourcing_channel
A    43134
B    16512
C    12039
D     7559
E      609
Name: count, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in Customer_demographic are :
Customer_demographic
Urban    48183
Rural    31670
Name: count, dtype: int64
--------------------------------------------------------------------------------------------------------------
Unique values in Default are :
Default
1    74855
0     4998
Name: count, dtype: int64
--------------------------------------------------------------------------------------------------------------

Late_premium_payment_6-12_months - there is discontinuity in the unique values as state 16 is missing in the range.
Late_premium_payment_>12_months - there is discontinuity in the unique values as states 9 & 10 are missing in the range.

Categorical Variable Summary

There are categorical variables in the numeric format.

Variable	Type	Range
Late_premium_payment_3-6_months	Ordinal	14 states
Late_premium_payment_6-12_months	Ordinal	17 states
Late_premium_payment_>12_months	Ordinal	10 states
Marital_Status	Nominal	2 states
Vehicles_Owned	Ordinal	3 states
No_of_dependents	Ordinal	4 states
Accomodation	Nominal	2 states
Sourcing_channel	Nominal	5 states
Customer_demographic	Nominal	2 states
Default	Nominal	2 states

Target Variable

Target variable is Default

# Checking the distribution of target variable

# Count the different "Default" states
count = data["Default"].value_counts().T
# Calculate the percentage different "Default" states
percentage = data['Default'].value_counts(normalize=True).T * 100
# Join count and percentage series
target_dist = pd.concat([count, percentage], axis=1)
# Set column names
target_dist.columns = ['count', 'percentage']
# Set Index name
target_dist.index.name = "Default"
# Display target distribution dataframe
target_dist

Output:

	count	percentage
Default
1	74855	93.740999
0	4998	6.259001

Out of the 79854 policy holders, only 6.26% defaulted

The Target variable is Heavily Imbalanced

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Readme.md

Readme.md

Import all the necessary libraries

Data ingestion

Data Inspection

Variable List

Dataset shape

Missing Values Check

5 Point Summary

Numerical data

Skew Summary

Outlier check

Numerical Variable Summary

Categorical data

Unique states

Categorical Variable Summary

Target Variable

Files

Readme.md

Latest commit

History

Readme.md

File metadata and controls

Import all the necessary libraries

Data ingestion

Data Inspection

Variable List

Dataset shape

Missing Values Check

5 Point Summary

Numerical data

Skew Summary

Outlier check

Numerical Variable Summary

Categorical data

Unique states

Categorical Variable Summary

Target Variable