Following the pandemic, the airline industry suffered a massive setback, with ICAO estimating a 371 billion dollar loss in 2020 and a 329 billion dollar loss associated with reduced seat capacity. To revitalise the industry in the face of the current downturn, it is essential to understand customer pain points and improve satisfaction with the services provided.

This dataset contains an air passenger satisfaction survey. The goal is to predict each passenger's satisfaction level: 1. Satisfied, 2. Neutral or dissatisfied.

We want to select the best predictive model for predicting passenger satisfaction.

This is a binary classification problem: we need to predict which of the two satisfaction levels a passenger belongs to, "satisfied" or "neutral or dissatisfied".

Before diving into the data, thinking intuitively and being an avid traveller myself, I would expect the main factors to be:
- Delays in the flight
- Staff efficiency in addressing customer needs
- Services provided during the flight
## Data Analysis packages
import numpy as np
import pandas as pd
## Data Visualization packages
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import matplotlib
%matplotlib inline
from pylab import rcParams
import missingno as msno
## General Tools
import os
import re
import joblib
import json
import warnings
# sklearn library
import sklearn
### sklearn preprocessing tools
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import StratifiedKFold,train_test_split
from sklearn.metrics import classification_report,confusion_matrix,roc_curve,auc,accuracy_score,roc_auc_score
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer, PowerTransformer,FunctionTransformer,OneHotEncoder
# Error Metrics
from sklearn.metrics import r2_score #r2 square
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score, precision_score,recall_score,f1_score
### Machine learning classification Models
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier # stochastic gradient descent classifier
from sklearn.ensemble import RandomForestClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.naive_bayes import GaussianNB
import lightgbm as lgb
from sklearn.ensemble import AdaBoostClassifier
# cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import LeaveOneOut
#from sklearn.metrics import plot_confusion_matrix
# hyperparameter tuning
from sklearn.model_selection import GridSearchCV,cross_val_score,RandomizedSearchCV
- The dataset is from Kaggle, a platform that hosts public and private data science competitions and community datasets, backed by a global pool of data science talent.
- When you execute od.download, you will be asked to provide your Kaggle username and API key. Follow these instructions to create an API key: http://bit.ly/kaggle-creds
- Dataset link: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction (a minimal download sketch follows below)
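For reference, a minimal download sketch using the opendatasets package (an assumption: the package is installed via `pip install opendatasets`; the CSV files are then placed under ./data):
## download the dataset from Kaggle (sketch; od.download prompts for your Kaggle username and API key)
import opendatasets as od
dataset_url = "https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction"
od.download(dataset_url)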
There is the following information about the passengers of some airline:
- Gender: male or female
- Customer type: regular or non-regular airline customer
- Age: the actual age of the passenger
- Type of travel: the purpose of the passenger's flight (personal or business travel)
- Class: business, economy, economy plus
- Flight distance
- Inflight wifi service: satisfaction level with Wi-Fi service on board (0: not rated; 1-5)
- Departure/Arrival time convenient: departure/arrival time satisfaction level (0: not rated; 1-5)
- Ease of Online booking: online booking satisfaction rate (0: not rated; 1-5)
- Gate location: level of satisfaction with the gate location (0: not rated; 1-5)
- Food and drink: food and drink satisfaction level (0: not rated; 1-5)
- Online boarding: satisfaction level with online boarding (0: not rated; 1-5)
- Seat comfort: seat satisfaction level (0: not rated; 1-5)
- Inflight entertainment: satisfaction with inflight entertainment (0: not rated; 1-5)
- On-board service: level of satisfaction with on-board service (0: not rated; 1-5)
- Leg room service: level of satisfaction with leg room service (0: not rated; 1-5)
- Baggage handling: level of satisfaction with baggage handling (0: not rated; 1-5)
- Checkin service: level of satisfaction with checkin service (0: not rated; 1-5)
- Inflight service: level of satisfaction with inflight service (0: not rated; 1-5)
- Cleanliness: level of satisfaction with cleanliness (0: not rated; 1-5)
- Departure delay in minutes: flight departure delay, in minutes
- Arrival delay in minutes: flight arrival delay, in minutes
- Satisfaction: airline satisfaction level (satisfied, neutral or dissatisfied)
Train Dataset
train_df = pd.read_csv("./data/train.csv")
train_df.head()
Unnamed: 0 | id | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 70172 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | ... | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.0 | neutral or dissatisfied |
1 | 1 | 5047 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | ... | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.0 | neutral or dissatisfied |
2 | 2 | 110028 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | ... | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0.0 | satisfied |
3 | 3 | 24026 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | ... | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9.0 | neutral or dissatisfied |
4 | 4 | 119299 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | ... | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0.0 | satisfied |
5 rows × 25 columns
## Initial Statistical description
train_df.describe()
Unnamed: 0 | id | Age | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103904.000000 | 103594.000000 |
mean | 51951.500000 | 64924.210502 | 39.379706 | 1189.448375 | 2.729683 | 3.060296 | 2.756901 | 2.976883 | 3.202129 | 3.250375 | 3.439396 | 3.358158 | 3.382363 | 3.351055 | 3.631833 | 3.304290 | 3.640428 | 3.286351 | 14.815618 | 15.178678 |
std | 29994.645522 | 37463.812252 | 15.114964 | 997.147281 | 1.327829 | 1.525075 | 1.398929 | 1.277621 | 1.329533 | 1.349509 | 1.319088 | 1.332991 | 1.288354 | 1.315605 | 1.180903 | 1.265396 | 1.175663 | 1.312273 | 38.230901 | 38.698682 |
min | 0.000000 | 1.000000 | 7.000000 | 31.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 25975.750000 | 32533.750000 | 27.000000 | 414.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000 | 3.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 |
50% | 51951.500000 | 64856.500000 | 40.000000 | 843.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 3.000000 | 4.000000 | 3.000000 | 0.000000 | 0.000000 |
75% | 77927.250000 | 97368.250000 | 51.000000 | 1743.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 5.000000 | 4.000000 | 4.000000 | 4.000000 | 5.000000 | 4.000000 | 5.000000 | 4.000000 | 12.000000 | 13.000000 |
max | 103903.000000 | 129880.000000 | 85.000000 | 4983.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 1592.000000 | 1584.000000 |
Observations
- The average flight delay is about 15 minutes, with a standard deviation of about 38 minutes
- The median delay is 0, which means at least 50% of the flights in this data were not delayed
## removing the first two columns
train_df.drop(["Unnamed: 0", 'id'], axis=1, inplace=True)
train_df.head(2)
Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | ... | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.0 | neutral or dissatisfied |
1 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | ... | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.0 | neutral or dissatisfied |
2 rows × 23 columns
## shape of the train dataset
train_df.shape
(103904, 23)
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 103904 non-null object
1 Customer Type 103904 non-null object
2 Age 103904 non-null int64
3 Type of Travel 103904 non-null object
4 Class 103904 non-null object
5 Flight Distance 103904 non-null int64
6 Inflight wifi service 103904 non-null int64
7 Departure/Arrival time convenient 103904 non-null int64
8 Ease of Online booking 103904 non-null int64
9 Gate location 103904 non-null int64
10 Food and drink 103904 non-null int64
11 Online boarding 103904 non-null int64
12 Seat comfort 103904 non-null int64
13 Inflight entertainment 103904 non-null int64
14 On-board service 103904 non-null int64
15 Leg room service 103904 non-null int64
16 Baggage handling 103904 non-null int64
17 Checkin service 103904 non-null int64
18 Inflight service 103904 non-null int64
19 Cleanliness 103904 non-null int64
20 Departure Delay in Minutes 103904 non-null int64
21 Arrival Delay in Minutes 103594 non-null float64
22 satisfaction 103904 non-null object
dtypes: float64(1), int64(17), object(5)
memory usage: 18.2+ MB
- Only Arrival Delay in Minutes has null values. Let's visualize the data to check for any patterns in the missing values
msno.matrix(train_df)
Observations
- There are 103904 rows and 23 columns in our data
- The training data contains three dtypes: int64 (17 columns), float64 (1 column) and object (5 columns)
- Only Arrival Delay in Minutes has some null values
# number of null values per column
train_df.isnull().sum()
Gender 0
Customer Type 0
Age 0
Type of Travel 0
Class 0
Flight Distance 0
Inflight wifi service 0
Departure/Arrival time convenient 0
Ease of Online booking 0
Gate location 0
Food and drink 0
Online boarding 0
Seat comfort 0
Inflight entertainment 0
On-board service 0
Leg room service 0
Baggage handling 0
Checkin service 0
Inflight service 0
Cleanliness 0
Departure Delay in Minutes 0
Arrival Delay in Minutes 310
satisfaction 0
dtype: int64
- There are 310 null values, all in the "Arrival Delay in Minutes" column
- That is roughly 0.3% of the rows
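The same percentage can be computed directly; a small sketch:
## percentage of null values per column (sketch)
round(train_df.isnull().mean() * 100, 2)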
round(train_df.describe().T, 2)
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Age | 103904.0 | 39.38 | 15.11 | 7.0 | 27.0 | 40.0 | 51.0 | 85.0 |
Flight Distance | 103904.0 | 1189.45 | 997.15 | 31.0 | 414.0 | 843.0 | 1743.0 | 4983.0 |
Inflight wifi service | 103904.0 | 2.73 | 1.33 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
Departure/Arrival time convenient | 103904.0 | 3.06 | 1.53 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
Ease of Online booking | 103904.0 | 2.76 | 1.40 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
Gate location | 103904.0 | 2.98 | 1.28 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
Food and drink | 103904.0 | 3.20 | 1.33 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
Online boarding | 103904.0 | 3.25 | 1.35 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
Seat comfort | 103904.0 | 3.44 | 1.32 | 0.0 | 2.0 | 4.0 | 5.0 | 5.0 |
Inflight entertainment | 103904.0 | 3.36 | 1.33 | 0.0 | 2.0 | 4.0 | 4.0 | 5.0 |
On-board service | 103904.0 | 3.38 | 1.29 | 0.0 | 2.0 | 4.0 | 4.0 | 5.0 |
Leg room service | 103904.0 | 3.35 | 1.32 | 0.0 | 2.0 | 4.0 | 4.0 | 5.0 |
Baggage handling | 103904.0 | 3.63 | 1.18 | 1.0 | 3.0 | 4.0 | 5.0 | 5.0 |
Checkin service | 103904.0 | 3.30 | 1.27 | 0.0 | 3.0 | 3.0 | 4.0 | 5.0 |
Inflight service | 103904.0 | 3.64 | 1.18 | 0.0 | 3.0 | 4.0 | 5.0 | 5.0 |
Cleanliness | 103904.0 | 3.29 | 1.31 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
Departure Delay in Minutes | 103904.0 | 14.82 | 38.23 | 0.0 | 0.0 | 0.0 | 12.0 | 1592.0 |
Arrival Delay in Minutes | 103594.0 | 15.18 | 38.70 | 0.0 | 0.0 | 0.0 | 13.0 | 1584.0 |
# Duplicate values
train_df.duplicated().sum()
0
# target variable
train_df.satisfaction.value_counts().iloc[1]/len(train_df.satisfaction)*100  # share of the minority ('satisfied') class
43.333269171542966
- This is a binary classification problem with classes 0 and 1 denoting customer satisfaction. Class 1 accounts for 43.33% of the rows, so the classes are reasonably balanced and no resampling techniques are required.
train_df.columns[:-1]
Index(['Gender', 'Customer Type', 'Age', 'Type of Travel', 'Class',
'Flight Distance', 'Inflight wifi service',
'Departure/Arrival time convenient', 'Ease of Online booking',
'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
'Inflight entertainment', 'On-board service', 'Leg room service',
'Baggage handling', 'Checkin service', 'Inflight service',
'Cleanliness', 'Departure Delay in Minutes',
'Arrival Delay in Minutes'],
dtype='object')
Before training a machine learning model, it's always a good idea to explore the distributions of various columns and see how they are related to the target column. Let's explore and visualize the data using the Plotly, Matplotlib and Seaborn libraries.
train_df.corr(numeric_only= True)
Age | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Age | 1.000000 | 0.099461 | 0.017859 | 0.038125 | 0.024842 | -0.001330 | 0.023000 | 0.208939 | 0.160277 | 0.076444 | 0.057594 | 0.040583 | -0.047529 | 0.035482 | -0.049427 | 0.053611 | -0.010152 | -0.012147 |
Flight Distance | 0.099461 | 1.000000 | 0.007131 | -0.020043 | 0.065717 | 0.004793 | 0.056994 | 0.214869 | 0.157333 | 0.128740 | 0.109526 | 0.133916 | 0.063184 | 0.073072 | 0.057540 | 0.093149 | 0.002158 | -0.002426 |
Inflight wifi service | 0.017859 | 0.007131 | 1.000000 | 0.343845 | 0.715856 | 0.336248 | 0.134718 | 0.456970 | 0.122658 | 0.209321 | 0.121500 | 0.160473 | 0.120923 | 0.043193 | 0.110441 | 0.132698 | -0.017402 | -0.019095 |
Departure/Arrival time convenient | 0.038125 | -0.020043 | 0.343845 | 1.000000 | 0.436961 | 0.444757 | 0.004906 | 0.070119 | 0.011344 | -0.004861 | 0.068882 | 0.012441 | 0.072126 | 0.093333 | 0.073318 | 0.014292 | 0.001005 | -0.000864 |
Ease of Online booking | 0.024842 | 0.065717 | 0.715856 | 0.436961 | 1.000000 | 0.458655 | 0.031873 | 0.404074 | 0.030014 | 0.047032 | 0.038833 | 0.107601 | 0.038762 | 0.011081 | 0.035272 | 0.016179 | -0.006371 | -0.007984 |
Gate location | -0.001330 | 0.004793 | 0.336248 | 0.444757 | 0.458655 | 1.000000 | -0.001159 | 0.001688 | 0.003669 | 0.003517 | -0.028373 | -0.005873 | 0.002313 | -0.035427 | 0.001681 | -0.003830 | 0.005467 | 0.005143 |
Food and drink | 0.023000 | 0.056994 | 0.134718 | 0.004906 | 0.031873 | -0.001159 | 1.000000 | 0.234468 | 0.574556 | 0.622512 | 0.059073 | 0.032498 | 0.034746 | 0.087299 | 0.033993 | 0.657760 | -0.029926 | -0.032524 |
Online boarding | 0.208939 | 0.214869 | 0.456970 | 0.070119 | 0.404074 | 0.001688 | 0.234468 | 1.000000 | 0.420211 | 0.285066 | 0.155443 | 0.123950 | 0.083280 | 0.204462 | 0.074573 | 0.331517 | -0.018982 | -0.021949 |
Seat comfort | 0.160277 | 0.157333 | 0.122658 | 0.011344 | 0.030014 | 0.003669 | 0.574556 | 0.420211 | 1.000000 | 0.610590 | 0.131971 | 0.105559 | 0.074542 | 0.191854 | 0.069218 | 0.678534 | -0.027898 | -0.029900 |
Inflight entertainment | 0.076444 | 0.128740 | 0.209321 | -0.004861 | 0.047032 | 0.003517 | 0.622512 | 0.285066 | 0.610590 | 1.000000 | 0.420153 | 0.299692 | 0.378210 | 0.120867 | 0.404855 | 0.691815 | -0.027489 | -0.030703 |
On-board service | 0.057594 | 0.109526 | 0.121500 | 0.068882 | 0.038833 | -0.028373 | 0.059073 | 0.155443 | 0.131971 | 0.420153 | 1.000000 | 0.355495 | 0.519134 | 0.243914 | 0.550782 | 0.123220 | -0.031569 | -0.035227 |
Leg room service | 0.040583 | 0.133916 | 0.160473 | 0.012441 | 0.107601 | -0.005873 | 0.032498 | 0.123950 | 0.105559 | 0.299692 | 0.355495 | 1.000000 | 0.369544 | 0.153137 | 0.368656 | 0.096370 | 0.014363 | 0.011843 |
Baggage handling | -0.047529 | 0.063184 | 0.120923 | 0.072126 | 0.038762 | 0.002313 | 0.034746 | 0.083280 | 0.074542 | 0.378210 | 0.519134 | 0.369544 | 1.000000 | 0.233122 | 0.628561 | 0.095793 | -0.005573 | -0.008542 |
Checkin service | 0.035482 | 0.073072 | 0.043193 | 0.093333 | 0.011081 | -0.035427 | 0.087299 | 0.204462 | 0.191854 | 0.120867 | 0.243914 | 0.153137 | 0.233122 | 1.000000 | 0.237197 | 0.179583 | -0.018453 | -0.020369 |
Inflight service | -0.049427 | 0.057540 | 0.110441 | 0.073318 | 0.035272 | 0.001681 | 0.033993 | 0.074573 | 0.069218 | 0.404855 | 0.550782 | 0.368656 | 0.628561 | 0.237197 | 1.000000 | 0.088779 | -0.054813 | -0.059196 |
Cleanliness | 0.053611 | 0.093149 | 0.132698 | 0.014292 | 0.016179 | -0.003830 | 0.657760 | 0.331517 | 0.678534 | 0.691815 | 0.123220 | 0.096370 | 0.095793 | 0.179583 | 0.088779 | 1.000000 | -0.014093 | -0.015774 |
Departure Delay in Minutes | -0.010152 | 0.002158 | -0.017402 | 0.001005 | -0.006371 | 0.005467 | -0.029926 | -0.018982 | -0.027898 | -0.027489 | -0.031569 | 0.014363 | -0.005573 | -0.018453 | -0.054813 | -0.014093 | 1.000000 | 0.965481 |
Arrival Delay in Minutes | -0.012147 | -0.002426 | -0.019095 | -0.000864 | -0.007984 | 0.005143 | -0.032524 | -0.021949 | -0.029900 | -0.030703 | -0.035227 | 0.011843 | -0.008542 | -0.020369 | -0.059196 | -0.015774 | 0.965481 | 1.000000 |
plt.figure(figsize=(20, 10))
sns.heatmap(train_df.corr(numeric_only= True), annot=True, vmax=1, cmap='coolwarm')
plt.show()
- Departure Delay in Minutes and Arrival Delay in Minutes are highly correlated (about 0.97)! One of them could be dropped; see the sketch below.
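Because the two delay columns are nearly redundant, one option is to keep only one of them. A sketch of that optional step (not applied to the dataframe used in the rest of the notebook):
## optional: drop one of the two highly correlated delay columns (sketch only)
reduced_df = train_df.drop(columns=["Arrival Delay in Minutes"])
reduced_df.shape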
sns.set(rc={
"font.size":15,
"axes.titlesize":10,
"axes.labelsize":15},
style="darkgrid")
fig, axs = plt.subplots(6, 3, figsize=(20,30))
fig.tight_layout(pad=4.0)
for f, ax in zip(train_df, axs.ravel()):
sns.set(font_scale = 2)
ax = sns.histplot(ax=ax, data=train_df, x=train_df[f], kde=True, color='purple')
ax.set_title(f)
new_train_df = train_df.copy()
new_train_df.drop(['Age','Flight Distance','Departure Delay in Minutes', 'Arrival Delay in Minutes','satisfaction'], axis=1, inplace=True)
sns.set(rc={
"font.size":10,
"axes.titlesize":10,
"axes.labelsize":13},
style="darkgrid")
fig, axes = plt.subplots(6, 3, figsize = (20, 30))
for i, col in enumerate(new_train_df):
column_values = new_train_df[col].value_counts()
labels = column_values.index
sizes = column_values.values
axes[i//3, i%3].pie(sizes,labels = labels, colors = sns.color_palette("RdGy_r"),autopct = '%1.0f%%', startangle = 90)
axes[i//3, i%3].axis('equal')
axes[i//3, i%3].set_title(col)
plt.show()
Observations:
- The number of men and women in this sample is approximately the same
- The vast majority of the airline's customers are repeat customers
- Most of the clients flew for business rather than personal reasons
- About half of the passengers were in business class
- More than 60% of passengers were satisfied with the baggage handling service (rated 4 or 5 out of 5)
- More than 50% of passengers were comfortable with their seats (rated 4 or 5 out of 5)
## Satisfaction
train_df.satisfaction.value_counts()
neutral or dissatisfied 58879
satisfied 45025
Name: satisfaction, dtype: int64
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(15,6))
train_df.satisfaction.value_counts().plot.pie(explode=(0, 0.05), colors=sns.color_palette("RdYlBu"),autopct='%1.1f%%',ax=ax1)
ax1.set_title("Percentage of Satisfaction")
sns.countplot(x= "satisfaction", data=train_df, ax=ax2, palette='RdYlBu')
ax2.set_title("Distribution of Satisfaction")
Text(0.5, 1.0, 'Distribution of Satisfaction')
Observation:
- As per the given data, 56.7% of passengers are neutral or dissatisfied
- And 43.3% of passengers are satisfied
To analyse and visualise the data, let's divide the columns into categorical and numerical groups.
# numerical and categorical features
numerical_cols = train_df.select_dtypes(include=np.number).columns.to_list()
categorical_cols = train_df.select_dtypes('object').columns.to_list()
#numerical columns
print("Total number of columns are:",len(numerical_cols))
print(numerical_cols)
Total number of columns are: 18
['Age', 'Flight Distance', 'Inflight wifi service', 'Departure/Arrival time convenient', 'Ease of Online booking', 'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort', 'Inflight entertainment', 'On-board service', 'Leg room service', 'Baggage handling', 'Checkin service', 'Inflight service', 'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes']
#Categorical Columns
print("Total number of columns are:",len(categorical_cols))
print(categorical_cols)
Total number of columns are: 5
['Gender', 'Customer Type', 'Type of Travel', 'Class', 'satisfaction']
categorical_cols.remove('satisfaction')
sns.set(rc={
"font.size":10,
"axes.titlesize":10,
"axes.labelsize":15},
style="darkgrid",
)
fig, axs = plt.subplots(6, 3, figsize=(15, 30))
fig.tight_layout(pad=3.0)
for f, ax in zip(numerical_cols, axs.ravel()):
sns.set(font_scale=2)
ax= sns.boxplot(ax=ax, data=train_df, y=train_df[f], palette='BuGn')
Observations:
Flight Distance, Checkin service, Departure Delay in Minutes and Arrival Delay in Minutes have some outliers
sns.set(rc={'figure.figsize':(8,6),
"font.size":10,
"axes.titlesize":10,
"axes.labelsize":15},
style="darkgrid")
for col in numerical_cols:
sns.barplot(data=train_df, x="satisfaction", y=col, palette='BuGn')
plt.show()
Observations:
- From the above graphs, it is clear that Age and Gate location do not play a huge role in flight satisfaction.
- Gender also does not tell us much, as seen in the earlier plot, so we could drop these columns (see the sketch below)
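A sketch of that optional drop (the modelling below keeps all columns, so this is illustrative only):
## optional: drop features that appear to carry little signal for satisfaction (sketch only)
slim_df = train_df.drop(columns=["Age", "Gate location", "Gender"])
slim_df.shape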
sns.set(rc={'figure.figsize':(11.7,8.27),
"font.size":10,
"axes.titlesize":10,
"axes.labelsize":15},
style="darkgrid",
)
for col in categorical_cols:
plt.figure(figsize=(8, 6))
sns.countplot(data=train_df,x=col, hue='satisfaction', palette='PuRd_r')
plt.legend(loc=(1.05,0.5))
Observations:
- Gender doesn't play an important role in satisfaction, as men and women seem to be equally concerned about the same factors
- The number of loyal customers for this airline is high; however, dissatisfaction is high irrespective of loyalty, so the airline will have to work on retaining its loyal customers
- Business travellers seem to be more satisfied with the flight than personal travellers
- People in business class seem to be the most satisfied, while those in economy class are the least satisfied
train_df.groupby('satisfaction')['Arrival Delay in Minutes'].mean()
satisfaction
neutral or dissatisfied 17.127536
satisfied 12.630799
Name: Arrival Delay in Minutes, dtype: float64
sns.set(rc={
"font.size":10,
"axes.titlesize":10,
"axes.labelsize":13},
style="darkgrid")
plt.figure(figsize=(10, 5), dpi=100)
sns.scatterplot(data=train_df, x="Arrival Delay in Minutes", y= "Departure Delay in Minutes", hue='satisfaction', palette="magma_r",alpha=0.8)
<Axes: xlabel='Arrival Delay in Minutes', ylabel='Departure Delay in Minutes'>
Observations:
The arrival and departure delays have a clear linear relationship, which makes complete sense! And there is one customer who was satisfied even after a delay of about 1300 minutes!
sns.set(rc={
"font.size":10,
"axes.titlesize":10,
"axes.labelsize":13},
style="darkgrid")
plt.figure(figsize=(10, 5), dpi=100)
sns.scatterplot(data=train_df, x="Flight Distance", y= "Departure Delay in Minutes", hue='satisfaction', palette="magma_r",alpha=0.8)
plt.ylim(0,1000)
(0.0, 1000.0)
Observations:
- The most important takeaway here is that the longer the flight distance, the more passengers seem to tolerate a departure delay, which is a strange finding from this plot!
- So a departure delay matters comparatively less on long-distance flights; short-distance travellers, however, do not seem happy about departure delays, which also makes sense
f, ax = plt.subplots(1,2, figsize=(15, 5))
sns.boxplot(data=train_df, x="Customer Type", y= "Age",palette = "gnuplot2_r", ax=ax[0])
sns.histplot(data=train_df, x="Age", hue="Customer Type", multiple="stack", palette = "gnuplot2_r",edgecolor = ".3", linewidth = .5, ax = ax[1])
<Axes: xlabel='Age', ylabel='Count'>
Observations:
- From the above we can conclude that most of the airline's regular customers are between the ages of 30 and 50 (their average age is slightly above 40)
- The age range of non-regular customers is slightly narrower (from 25 to 40 years old, with an average a little below 30)
f, ax =plt.subplots(1,2,figsize=(15,5))
sns.boxplot(data=train_df, x="Class", y="Age",palette = "gnuplot2_r", ax=ax[0])
sns.histplot(data=train_df, x="Age", hue="Class", multiple="stack", palette="gnuplot2_r",edgecolor = ".3", linewidth = .5, ax = ax[1])
<Axes: xlabel='Age', ylabel='Count'>
- It can be seen that, on average, the age range of customers who travel in business class matches (according to the previous box plot) the age range of regular customers. Based on this observation, it can be assumed that regular customers mainly buy business class for themselves.
f, ax = plt.subplots(1, 2, figsize = (15,5))
sns.boxplot(x = "Class", y = "Flight Distance", palette = "gnuplot2_r", data = train_df, ax = ax[0])
sns.histplot(train_df, x = "Flight Distance", hue = "Class", multiple = "stack", palette = "gnuplot2_r", edgecolor = ".3", linewidth = .5, ax = ax[1])
<Axes: xlabel='Flight Distance', ylabel='Count'>
Observations:
- Customers who fly long distances mostly travel in business class.
f,ax = plt.subplots(2,2, figsize=(15, 8))
sns.boxplot(x = "Inflight entertainment", y = "Flight Distance", palette = "gnuplot2_r", data = train_df, ax = ax[0, 0])
sns.histplot(train_df, x = "Flight Distance", hue = "Inflight entertainment", multiple = "stack", palette = "gnuplot2_r", edgecolor = ".3", linewidth = .5, ax = ax[0, 1])
sns.boxplot(x = "Leg room service", y = "Flight Distance", palette = "gnuplot2_r", data = train_df, ax = ax[1, 0])
sns.histplot(train_df, x = "Flight Distance", hue = "Leg room service", multiple = "stack", palette = "gnuplot2_r", edgecolor = ".3", linewidth = .5, ax = ax[1, 1])
<Axes: xlabel='Flight Distance', ylabel='Count'>
Observations:
- The farther a passenger travels (and thus the longer they are in flight), the more satisfied they are, on average, with the inflight entertainment and the extra legroom.
input_cols = list(train_df.iloc[:, :-1])
target_cols = "satisfaction"
pd.options.display.max_columns=30
train_df.head()
Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.0 | neutral or dissatisfied |
1 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.0 | neutral or dissatisfied |
2 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0.0 | satisfied |
3 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9.0 | neutral or dissatisfied |
4 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0.0 | satisfied |
train_df["Gender"] = pd.get_dummies(train_df["Gender"], drop_first=True, dtype="int")
train_df["Customer Type"]= pd.get_dummies(train_df["Customer Type"], drop_first=True, dtype="int")
train_df["Type of Travel"]= pd.get_dummies(train_df["Type of Travel"], drop_first=True, dtype="int")
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_df["Class"]=le.fit_transform(train_df["Class"])
train_df["Class"]
0 2
1 0
2 0
3 0
4 0
..
103899 1
103900 0
103901 0
103902 1
103903 0
Name: Class, Length: 103904, dtype: int32
train_df["Arrival Delay in Minutes"]
0 18.0
1 6.0
2 0.0
3 9.0
4 0.0
...
103899 0.0
103900 0.0
103901 14.0
103902 0.0
103903 0.0
Name: Arrival Delay in Minutes, Length: 103904, dtype: float64
from sklearn.impute import SimpleImputer
median=train_df["Arrival Delay in Minutes"].median()
train_df["Arrival Delay in Minutes"].fillna(median, inplace=True)
train_df.isnull().sum()
Gender 0
Customer Type 0
Age 0
Type of Travel 0
Class 0
Flight Distance 0
Inflight wifi service 0
Departure/Arrival time convenient 0
Ease of Online booking 0
Gate location 0
Food and drink 0
Online boarding 0
Seat comfort 0
Inflight entertainment 0
On-board service 0
Leg room service 0
Baggage handling 0
Checkin service 0
Inflight service 0
Cleanliness 0
Departure Delay in Minutes 0
Arrival Delay in Minutes 0
satisfaction 0
dtype: int64
train_df
Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 13 | 1 | 2 | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.0 | neutral or dissatisfied |
1 | 1 | 1 | 25 | 0 | 0 | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.0 | neutral or dissatisfied |
2 | 0 | 0 | 26 | 0 | 0 | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0.0 | satisfied |
3 | 0 | 0 | 25 | 0 | 0 | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9.0 | neutral or dissatisfied |
4 | 1 | 0 | 61 | 0 | 0 | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0.0 | satisfied |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
103899 | 0 | 1 | 23 | 0 | 1 | 192 | 2 | 1 | 2 | 3 | 2 | 2 | 2 | 2 | 3 | 1 | 4 | 2 | 3 | 2 | 3 | 0.0 | neutral or dissatisfied |
103900 | 1 | 0 | 49 | 0 | 0 | 2347 | 4 | 4 | 4 | 4 | 2 | 4 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 4 | 0 | 0.0 | satisfied |
103901 | 1 | 1 | 30 | 0 | 0 | 1995 | 1 | 1 | 1 | 3 | 4 | 1 | 5 | 4 | 3 | 2 | 4 | 5 | 5 | 4 | 7 | 14.0 | neutral or dissatisfied |
103902 | 0 | 1 | 22 | 0 | 1 | 1000 | 1 | 1 | 1 | 5 | 1 | 1 | 1 | 1 | 4 | 5 | 1 | 5 | 4 | 1 | 0 | 0.0 | neutral or dissatisfied |
103903 | 1 | 0 | 27 | 0 | 0 | 1723 | 1 | 3 | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 4 | 3 | 1 | 0 | 0.0 | neutral or dissatisfied |
103904 rows × 23 columns
train_df["satisfaction"] = le.fit_transform(train_df["satisfaction"])
train_df
Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 13 | 1 | 2 | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.0 | 0 |
1 | 1 | 1 | 25 | 0 | 0 | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.0 | 0 |
2 | 0 | 0 | 26 | 0 | 0 | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0.0 | 1 |
3 | 0 | 0 | 25 | 0 | 0 | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9.0 | 0 |
4 | 1 | 0 | 61 | 0 | 0 | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0.0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
103899 | 0 | 1 | 23 | 0 | 1 | 192 | 2 | 1 | 2 | 3 | 2 | 2 | 2 | 2 | 3 | 1 | 4 | 2 | 3 | 2 | 3 | 0.0 | 0 |
103900 | 1 | 0 | 49 | 0 | 0 | 2347 | 4 | 4 | 4 | 4 | 2 | 4 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 4 | 0 | 0.0 | 1 |
103901 | 1 | 1 | 30 | 0 | 0 | 1995 | 1 | 1 | 1 | 3 | 4 | 1 | 5 | 4 | 3 | 2 | 4 | 5 | 5 | 4 | 7 | 14.0 | 0 |
103902 | 0 | 1 | 22 | 0 | 1 | 1000 | 1 | 1 | 1 | 5 | 1 | 1 | 1 | 1 | 4 | 5 | 1 | 5 | 4 | 1 | 0 | 0.0 | 0 |
103903 | 1 | 0 | 27 | 0 | 0 | 1723 | 1 | 3 | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 4 | 3 | 1 | 0 | 0.0 | 0 |
103904 rows × 23 columns
os.makedirs("processed_data", exist_ok=True)  # ensure the output directory exists
train_df.to_csv(path_or_buf="processed_data/train_df.csv", index=False)
train_df = pd.read_csv('processed_data/train_df.csv')
train_df.head()
Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 13 | 1 | 2 | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.0 | 0 |
1 | 1 | 1 | 25 | 0 | 0 | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.0 | 0 |
2 | 0 | 0 | 26 | 0 | 0 | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0.0 | 1 |
3 | 0 | 0 | 25 | 0 | 0 | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9.0 | 0 |
4 | 1 | 0 | 61 | 0 | 0 | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0.0 | 1 |
from sklearn.model_selection import train_test_split
train_val_df, test_df = train_test_split(train_df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)
print(train_df.shape)
print(val_df.shape)
print(test_df.shape)
(62342, 23)
(20781, 23)
(20781, 23)
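Since the classes are mildly imbalanced (about 57% / 43%), a stratified split is an alternative worth noting; a small sketch that repeats the two-step split above with stratification (not used for the models below):
## sketch: stratified variant of the same two-step split
full_df = pd.read_csv('processed_data/train_df.csv')   # same processed file loaded above
train_val_s, test_s = train_test_split(full_df, test_size=0.2, random_state=42, stratify=full_df["satisfaction"])
train_s, val_s = train_test_split(train_val_s, test_size=0.25, random_state=42, stratify=train_val_s["satisfaction"])
print(train_s.shape, val_s.shape, test_s.shape)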
# train_df['satisfaction'] = train_df['satisfaction'].map({'neutral or dissatisfied':0 , 'satisfied':1})
# val_df['satisfaction'] = val_df['satisfaction'].map({'neutral or dissatisfied':0 , 'satisfied':1})
# test_df['satisfaction'] = test_df['satisfaction'].map({'neutral or dissatisfied':0 , 'satisfied':1})
train_df
Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
83488 | 0 | 0 | 51 | 1 | 0 | 366 | 2 | 1 | 2 | 3 | 3 | 3 | 3 | 4 | 4 | 2 | 1 | 4 | 4 | 4 | 0 | 0.0 | 0 |
31648 | 0 | 0 | 38 | 1 | 1 | 109 | 4 | 3 | 4 | 4 | 5 | 4 | 5 | 5 | 1 | 1 | 4 | 5 | 1 | 5 | 0 | 2.0 | 1 |
22340 | 1 | 0 | 50 | 1 | 1 | 78 | 3 | 5 | 3 | 3 | 5 | 3 | 4 | 5 | 3 | 1 | 3 | 3 | 4 | 5 | 0 | 0.0 | 0 |
68992 | 0 | 0 | 43 | 0 | 0 | 1770 | 5 | 5 | 5 | 5 | 5 | 5 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 3 | 17 | 8.0 | 1 |
100108 | 1 | 0 | 19 | 1 | 1 | 762 | 3 | 5 | 3 | 3 | 2 | 3 | 4 | 2 | 4 | 3 | 5 | 4 | 5 | 2 | 0 | 0.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
44593 | 1 | 0 | 54 | 0 | 2 | 989 | 4 | 3 | 3 | 3 | 4 | 4 | 4 | 4 | 4 | 5 | 4 | 3 | 4 | 4 | 25 | 17.0 | 0 |
59278 | 1 | 0 | 60 | 0 | 0 | 3358 | 0 | 4 | 0 | 2 | 2 | 5 | 4 | 5 | 5 | 5 | 5 | 3 | 5 | 5 | 0 | 0.0 | 1 |
29978 | 1 | 0 | 58 | 1 | 2 | 787 | 3 | 4 | 3 | 2 | 4 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 4 | 4 | 0 | 0.0 | 0 |
92224 | 0 | 0 | 57 | 1 | 1 | 431 | 0 | 5 | 0 | 2 | 2 | 5 | 5 | 5 | 5 | 0 | 5 | 3 | 5 | 5 | 0 | 0.0 | 1 |
67702 | 0 | 0 | 36 | 1 | 1 | 227 | 1 | 5 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 2 | 5 | 5 | 5 | 1 | 0 | 0.0 | 0 |
62342 rows × 23 columns
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# select the columns to be used for training/prediction
# training dataset
X_train_ = train_df.drop("satisfaction", axis=1)
X_train = scaler.fit_transform(X_train_) ##scaled
y_train = train_df.satisfaction
X_val_ = val_df.drop("satisfaction", axis=1)
X_val = scaler.transform(X_val_) ##scaled
y_val = val_df.satisfaction
X_test_ = test_df.drop("satisfaction", axis=1)
X_test = scaler.transform(X_test_) ##scaled
y_test = test_df.satisfaction
def plot_roc_curve(y_true,y_prob_preds,ax):
"""
To plot the ROC curve for the given predictions and model
"""
fpr,tpr,threshold = roc_curve(y_true,y_prob_preds)
roc_auc = auc(fpr,tpr)
ax.plot(fpr,tpr,"b",label="AUC = %0.2f" % roc_auc)
ax.set_title("Receiver Operating Characteristic")
ax.legend(loc='lower right')
ax.plot([0,1],[0,1],'r--')
ax.set_xlim([0,1])
ax.set_ylim([0,1])
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate");
plt.show();
def plot_confustion_matrix(y_true,y_preds,axes,name=''):
"""
To plot the Confusion Matrix for the given predictions
"""
cm = confusion_matrix(y_true, y_preds)
group_names = ['TN','FP','FN','TP']
group_percentages = ["{0:.2%}".format(value) for value in cm.flatten()/np.sum(cm)]
group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cm, annot=labels, fmt='', cmap='Blues',ax=axes)
axes.set_ylim([2,0])
axes.set_xlabel('Prediction')
axes.set_ylabel('Actual')
axes.set_title(f'{name} Confusion Matrix');
def make_classification_report(model,inputs,targets,model_name=None,record=False):
"""
To Generate the classification report with all the metrics of a given model with confusion matrix as well as ROC AUC curve.
"""
### Getting the model name from model object
if model_name is None:
model_name = str(type(model)).split(".")[-1][0:-2]
### Making the predictions for the given model
preds = model.predict(inputs)
if model_name in ["LinearSVC"]:
prob_preds = model.decision_function(inputs)
else:
prob_preds = model.predict_proba(inputs)[:,1]
### printing the ROC AUC score
auc_score = roc_auc_score(targets,prob_preds)
print("ROC AUC Score : {:.2f}%\n".format(auc_score * 100.0))
### Plotting the Confusion Matrix and ROC AUC Curve
fig, axes = plt.subplots(1, 2, figsize=(18,6))
plot_confustion_matrix(targets,preds,axes[0],model_name)
plot_roc_curve(targets,prob_preds,axes[1])
This type of statistical model (also known as logit model) is often used for classification and predictive analytics. Logistic regression estimates the probability of an event occurring, such as voted or didn’t vote, based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
In logistic regression, a logit transformation is applied on the odds—that is, the probability of success divided by the probability of failure. This is also commonly known as the log odds, or the natural logarithm of odds, and this logistic function is represented by the following formulas:
p_i = 1 / (1 + exp(-(Beta_0 + Beta_1*X_1 + … + Beta_k*X_k)))   (the logistic function)
Logit(p_i) = ln(p_i / (1 - p_i)) = Beta_0 + Beta_1*X_1 + … + Beta_k*X_k   (the log odds)
In this logistic regression equation, logit(pi) is the dependent or response variable and x is the independent variable. The beta parameter, or coefficient, in this model is commonly estimated via maximum likelihood estimation (MLE). This method tests different values of beta through multiple iterations to optimize for the best fit of log odds. All of these iterations produce the log likelihood function, and logistic regression seeks to maximize this function to find the best parameter estimate. Once the optimal coefficient (or coefficients if there is more than one independent variable) is found, the conditional probabilities for each observation can be calculated, logged, and summed together to yield a predicted probability.
For binary classification, a probability less than 0.5 predicts class 0, while a probability greater than 0.5 predicts class 1. After the model has been computed, it is best practice to evaluate how well the model predicts the dependent variable, which is called goodness of fit.
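A tiny numpy illustration of the two formulas above (the values of z are arbitrary examples):
## sigmoid maps any real-valued linear combination to a probability in (0, 1)
z = np.array([-2.0, 0.0, 2.0])       # example values of Beta_0 + Beta_1*X_1 + ...
p = 1 / (1 + np.exp(-z))             # logistic function
log_odds = np.log(p / (1 - p))       # logit: recovers z
print(p, log_odds)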
# Import the model
from sklearn.linear_model import LogisticRegression
#fit the model
model = LogisticRegression()
model.fit(X_train,y_train)
# prediction
pred_train = model.predict(X_train)
pred_val = model.predict(X_test)  # note: this predicts on X_test rather than X_val, so the validation report below is misleading
# model name
model_name = str(type(model)).split(".")[-1][0:-2]
print(f"\t\t{model_name.upper()} MODEL\n")
print('Training part:')
print(classification_report(y_train, pred_train,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print('validation part:')
print(classification_report(y_val, pred_val,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print("Accuracy score for training dataset",accuracy_score(y_train, pred_train))
print("Accuracy score for validation dataset",accuracy_score(y_val, pred_val))
make_classification_report(model,X_val,y_val)
LOGISTICREGRESSION MODEL
Training part:
precision recall f1-score support
neutral or dissatisfaction 0.88 0.90 0.89 35308
satisfaction 0.87 0.83 0.85 27034
accuracy 0.87 62342
macro avg 0.87 0.87 0.87 62342
weighted avg 0.87 0.87 0.87 62342
validation part:
precision recall f1-score support
neutral or dissatisfaction 0.57 0.58 0.58 11858
satisfaction 0.43 0.42 0.43 8923
accuracy 0.51 20781
macro avg 0.50 0.50 0.50 20781
weighted avg 0.51 0.51 0.51 20781
Accuracy score for training dataset 0.874161881235764
Accuracy score for validation dataset 0.5129685770655887
ROC AUC Score : 92.65%
Observations
- The ROC AUC score on the validation set is 92.65%
- The much weaker validation classification report above comes from comparing predictions made on X_test against y_val (see the note in the prediction cell); the ROC AUC and confusion matrix produced by make_classification_report use X_val correctly and better reflect the model's validation performance, which appears comparable to the training performance.
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms where all of them share a common principle, i.e. every pair of features being classified is independent of each other.
Note: The assumptions made by Naive Bayes are generally not correct in real-world situations. In fact, the independence assumption is never exactly correct, but it often works well in practice.
Bayes’ Theorem
Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred. It is stated mathematically as:
P(A|B) = P(B|A) * P(A) / P(B)
where A and B are events and P(B) ≠ 0.
Basically, we are trying to find the probability of event A, given the event B is true. Event B is also termed as evidence.
- P(A) is the prior probability of A, i.e. the probability of the event before the evidence is seen. The evidence is an attribute value of an unknown instance (here, event B).
- P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.
Now, with regard to our dataset, we can apply Bayes' theorem in the following way:
P(y | x_1, …, x_n) = P(x_1, …, x_n | y) * P(y) / P(x_1, …, x_n)
where y is the class variable and X = (x_1, …, x_n) is a dependent feature vector of size n.
Substituting the naive independence assumption and dropping the constant denominator, we get:
P(y | x_1, …, x_n) ∝ P(y) * Π_{i=1..n} P(x_i | y)
Now, to create a classifier model, we need to find the probability of a given set of inputs for all possible values of the class variable y and pick the output with maximum probability. This can be expressed mathematically as:
y = argmax_y P(y) * Π_{i=1..n} P(x_i | y)
So, finally, we are left with the task of calculating P(y) and P(xi | y).
Please note that P(y) is also called class probability and P(xi | y) is called conditional probability.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi | y).
Gaussian Naive Bayes classifier
In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution (also called the normal distribution), which gives a bell-shaped curve symmetric about the mean of the feature values.
The likelihood of the features is assumed to be Gaussian, hence the conditional probability is given by:
P(x_i | y) = (1 / sqrt(2π·σ_y²)) · exp(-(x_i - μ_y)² / (2σ_y²))
where μ_y and σ_y² are the mean and variance of feature x_i estimated for class y.
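A small numpy sketch of that Gaussian conditional probability (the mean, standard deviation and feature value below are illustrative, not estimated from the data):
## Gaussian likelihood P(x_i | y) with an assumed class-conditional mean and standard deviation
mu_y, sigma_y = 40.0, 15.0           # illustrative class-conditional mean/std (e.g. for Age)
x_i = 35.0
p_x_given_y = (1 / np.sqrt(2 * np.pi * sigma_y**2)) * np.exp(-(x_i - mu_y)**2 / (2 * sigma_y**2))
print(p_x_given_y)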
# import the model
from sklearn.naive_bayes import GaussianNB
#fit the model
model =GaussianNB()
model.fit(X_train,y_train)
# prediction
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)
# model name
model_name = str(type(model)).split(".")[-1][0:-2]
print(f"\t\t{model_name.upper()} MODEL\n")
print('Training part:')
print(classification_report(y_train, pred_train,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print('validation part:')
print(classification_report(y_val, pred_val,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print("Accuracy score for training dataset",accuracy_score(y_train, pred_train))
print("Accuracy score for validation dataset",accuracy_score(y_val, pred_val))
make_classification_report(model,X_val,y_val)
GAUSSIANNB MODEL
Training part:
precision recall f1-score support
neutral or dissatisfaction 0.87 0.90 0.88 35308
satisfaction 0.86 0.82 0.84 27034
accuracy 0.86 62342
macro avg 0.86 0.86 0.86 62342
weighted avg 0.86 0.86 0.86 62342
validation part:
precision recall f1-score support
neutral or dissatisfaction 0.87 0.90 0.88 11858
satisfaction 0.86 0.82 0.84 8923
accuracy 0.86 20781
macro avg 0.86 0.86 0.86 20781
weighted avg 0.86 0.86 0.86 20781
Accuracy score for training dataset 0.8636874017516282
Accuracy score for validation dataset 0.8641066358693037
ROC AUC Score: 92.33%
Observations
- The ROC AUC score is 92.33%, but the recall and F1 scores for the satisfied class are somewhat lower, so the model still misses some true positives
- The recall and F1 score of GaussianNB are slightly lower than those of Logistic Regression
- Unlike the logistic regression cell, this model's validation metrics match its training metrics, so it generalizes well
Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms, and it can be used both for classification as well as regression problems. However, in machine learning, it is primarily used for classification problems.
- In the SVM algorithm, each data item is plotted as a point in n-dimensional space, where n is the number of features at hand, and the value of each feature is the value of a particular coordinate.
- The goal of the SVM algorithm is to create the best line, or decision boundary, that can segregate the n-dimensional space into distinct classes, so that any new data point can easily be placed in the correct category in the future. This best decision boundary is called a hyperplane.
- The best separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class. Indeed, many hyperplanes might classify the data; a reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes.
The SVM algorithm chooses the extreme points that help in creating the hyperplane. These extreme cases are called support vectors, while the SVM classifier is the frontier, or hyperplane, that best segregates the distinct classes.
The diagram below shows two distinct classes, denoted respectively with blue and green points.
Support Vector Machine can be of two types:
- Linear SVM: A linear SVM is used for linearly separable data, which is the case of a dataset that can be classified into two distinct classes by using a single straight line.
- Non-linear SVM: A non-linear SVM is used for non-linearly separated data, which means that a dataset cannot be classified by using a straight line.
(Figure: linear SVM vs. non-linear SVM)
We need to choose the best kernel according to our needs (a non-linear example is sketched after this list).
- The linear kernel is mostly preferred for text classification problems, as it performs well on large datasets.
- Gaussian kernels tend to give good results when there is no prior information about the data.
- The RBF kernel is a kind of Gaussian kernel; it maps the data into a high-dimensional space and then searches for a linear separation there.
- Polynomial kernels give good results for problems where all the training data is normalized.
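For comparison with the linear variant trained below, here is a hedged sketch of a non-linear (RBF-kernel) SVC fitted on a random subsample, since kernel SVMs become slow on 60k+ rows; the subsample size and settings are assumptions, not part of the original analysis:
## sketch: RBF-kernel SVC on a 5,000-row subsample of the scaled training data
sample_idx = np.random.RandomState(42).choice(len(X_train), size=5000, replace=False)
rbf_svc = SVC(kernel="rbf", probability=True, random_state=42)
rbf_svc.fit(X_train[sample_idx], y_train.iloc[sample_idx])
print("Validation accuracy:", rbf_svc.score(X_val, y_val))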
# import the model
from sklearn.svm import LinearSVC
#fit the model
model =LinearSVC()
model.fit(X_train,y_train)
# prediction
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)
# model name
model_name = str(type(model)).split(".")[-1][0:-2]
print(f"\t\t{model_name.upper()} MODEL\n")
print('Training part:')
print(classification_report(y_train, pred_train,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print('validation part:')
print(classification_report(y_val, pred_val,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print("Accuracy score for training dataset",accuracy_score(y_train, pred_train))
print("Accuracy score for validation dataset",accuracy_score(y_val, pred_val))
make_classification_report(model,X_val,y_val)
C:\Users\prajw\anaconda3\Lib\site-packages\sklearn\svm\_classes.py:32: FutureWarning: The default value of `dual` will change from `True` to `'auto'` in 1.5. Set the value of `dual` explicitly to suppress the warning.
warnings.warn(
LINEARSVC MODEL
Training part:
precision recall f1-score support
neutral or dissatisfaction 0.88 0.91 0.89 35308
satisfaction 0.87 0.83 0.85 27034
accuracy 0.87 62342
macro avg 0.87 0.87 0.87 62342
weighted avg 0.87 0.87 0.87 62342
validation part:
precision recall f1-score support
neutral or dissatisfaction 0.88 0.90 0.89 11858
satisfaction 0.87 0.84 0.85 8923
accuracy 0.87 20781
macro avg 0.87 0.87 0.87 20781
weighted avg 0.87 0.87 0.87 20781
Accuracy score for training dataset 0.8734721375637612
Accuracy score for validation dataset 0.8743082623550359
ROC AUC Score : 92.59%
Observations
- The ROC AUC score is 92.59%.
- The recall and F1 scores for the satisfied class (0.84 and 0.85) are a little lower, so the model still misses some true positives
K-nearest neighbors is a supervised machine learning algorithm for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-nearest neighbors are used for classification or regression.
The main idea behind K-NN is to find the K nearest data points, or neighbors, to a given data point and then predict the label or value of the given data point based on the labels or values of its K nearest neighbors.
K can be any positive integer, but in practice K is often small, such as 3 or 5. The "K" in K-nearest neighbors refers to the number of neighbours the algorithm uses to make its prediction, whether it is a classification or a regression problem.
Once K and a distance metric are selected, the K-NN algorithm goes through the following steps:
- Calculate distance: The K-NN algorithm calculates the distance between a new data point and all training data points. This is done using the selected distance metric.
- Find nearest neighbors: Once distances are calculated, K-nearest neighbors are determined based on a set value of K.
- Predict target class label: After finding out K nearest neighbors, we can then predict the target class label for a new data point by taking majority vote from its K neighbors (in case of classification) or by taking average from its K neighbors (in case of regression).
Common distance functions used to find the nearest neighbours are the Euclidean, Manhattan and Minkowski distances; a small sketch follows below.
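A small numpy sketch of these distance functions for two illustrative feature vectors:
## sketch: Euclidean, Manhattan and Minkowski distances between two points
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))            # sqrt(sum((a_i - b_i)^2))
manhattan = np.sum(np.abs(a - b))                    # sum(|a_i - b_i|)
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)    # generalises both (p=2: Euclidean, p=1: Manhattan)
print(euclidean, manhattan, minkowski)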
# import the model
from sklearn.neighbors import KNeighborsClassifier
#fit the model
model =KNeighborsClassifier()
model.fit(X_train,y_train)
# prediction
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)
# model name
model_name = str(type(model)).split(".")[-1][0:-2]
print(f"\t\t{model_name.upper()} MODEL\n")
print('Training part:')
print(classification_report(y_train, pred_train,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print('validation part:')
print(classification_report(y_val, pred_val,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print("Accuracy score for training dataset",accuracy_score(y_train, pred_train))
print("Accuracy score for validation dataset",accuracy_score(y_val, pred_val))
make_classification_report(model,X_val,y_val)
KNEIGHBORSCLASSIFIER MODEL
Training part:
precision recall f1-score support
neutral or dissatisfaction 0.93 0.98 0.95 35308
satisfaction 0.97 0.90 0.94 27034
accuracy 0.95 62342
macro avg 0.95 0.94 0.94 62342
weighted avg 0.95 0.95 0.95 62342
validation part:
precision recall f1-score support
neutral or dissatisfaction 0.91 0.96 0.94 11858
satisfaction 0.95 0.88 0.91 8923
accuracy 0.93 20781
macro avg 0.93 0.92 0.92 20781
weighted avg 0.93 0.93 0.93 20781
Accuracy score for training dataset 0.9460075069776395
Accuracy score for validation dataset 0.9263750541359896
ROC AUC Score: 96.69%
Observations:
- The ROC AUC score is 96.69%.
- The Recall and F1 scores are good.
- Recall for the satisfied class (0.88) is the weakest metric, so a few true positives are still missed
Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems.
- The general idea is to tweak parameters iteratively in order to minimize the cost function.
- An important parameter of Gradient Descent (GD) is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm will need many iterations to converge, which takes a long time; if it is too high, we may overshoot the optimal value.
Note: When using Gradient Descent, we should ensure that all features have a similar scale (e.g. using Scikit-Learn’s StandardScaler class), or else it will take much longer to converge.
Types of Gradient Descent: There are three types of Gradient Descent:
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent
Stochastic Gradient Descent
-
The word 'stochastic' means a system or process linked with a random probability. Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole data set for each iteration.
-
If the sample size is very large, it becomes computationally very expensive to find the golbal minima over the entire dataset. With SGD a random sample is selected to perform each iteration. This sample is randomly shuffled and selected for performing the iteration.
In SGDClassifier from scikit learn implements regularized linear models with stochastic gradient descent (SGD) learning. The model it fits can be controlled with the loss parameter; by default, it fits a linear support vector machine (SVM). The various loss function supported is
- 'hinge' gives a linear SVM.
- 'log_loss' gives logistic regression, a probabilistic classifier.
- 'modified_huber' is another smooth loss that brings tolerance to outliers as well as probability estimates.
- 'squared_hinge' is like hinge but is quadratically penalized.
- 'perceptron' is the linear loss used by the perceptron algorithm.
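As a quick, hedged illustration of how the loss choice matters in practice (assuming the same X_train/X_val split used throughout this notebook): with the hinge loss the fitted model exposes only decision_function, while 'log_loss' and 'modified_huber' also provide predict_proba.
from sklearn.linear_model import SGDClassifier
# hinge -> linear SVM: margin scores only, no calibrated probabilities
svm_like = SGDClassifier(loss='hinge', random_state=42).fit(X_train, y_train)
print(hasattr(svm_like, 'predict_proba'))      # expected: False, use decision_function instead
# log_loss -> logistic regression: class probabilities are available
logreg_like = SGDClassifier(loss='log_loss', random_state=42).fit(X_train, y_train)
print(logreg_like.predict_proba(X_val)[:3])    # probabilities for the first three samples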
# import the model
from sklearn.linear_model import SGDClassifier
#fit the model
model =SGDClassifier(loss='modified_huber',n_jobs=-1,random_state=42)
model.fit(X_train,y_train)
# prediction
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)
# model name
model_name = str(type(model)).split(".")[-1][0:-2]
print(f"\t\t{model_name.upper()} MODEL\n")
print('Training part:')
print(classification_report(y_train, pred_train,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print('validation part:')
print(classification_report(y_val, pred_val,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print("Accuracy score for training dataset",accuracy_score(y_train, pred_train))
print("Accuracy score for validation dataset",accuracy_score(y_val, pred_val))
make_classification_report(model,X_val,y_val)
SGDCLASSIFIER MODEL
Training part:
precision recall f1-score support
neutral or dissatisfaction 0.91 0.80 0.85 35308
satisfaction 0.77 0.89 0.83 27034
accuracy 0.84 62342
macro avg 0.84 0.84 0.84 62342
weighted avg 0.85 0.84 0.84 62342
validation part:
precision recall f1-score support
neutral or dissatisfaction 0.91 0.79 0.85 11858
satisfaction 0.76 0.90 0.83 8923
accuracy 0.84 20781
macro avg 0.84 0.84 0.84 20781
weighted avg 0.85 0.84 0.84 20781
Accuracy score for training dataset 0.837781912675243
Accuracy score for validation dataset 0.8373514267840816
ROC AUC Score : 92.37%
Observations:
- The ROC AUC score is 92.37%, but the recall and F1 scores (around 0.83–0.85) are noticeably lower than KNN's.
A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. A decision tree starts at a single point (or ‘node’) which then branches (or ‘splits’) in two or more directions. Each branch offers different possible outcomes, incorporating a variety of decisions and chance events until a final outcome is achieved.
While there are multiple ways to select the best attribute at each node, two methods, information gain and Gini impurity, act as popular splitting criteria for decision tree models. They help to evaluate the quality of each test condition and how well it will be able to classify samples into a class.
Entropy and Information Gain
- Entropy is a concept that stems from information theory, which measures the impurity of the sample values. It is defined as

  $$H(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$$

  where $S$ is the set of all instances, $N$ is the number of distinct class values, and $p_i$ is the probability of class $i$.
- Information gain indicates how much information a particular variable or feature gives us about the final outcome. It is found by subtracting the entropy of a particular attribute inside the data set from the entropy of the whole data set:

  $$H(A, S) = \sum_{j \in v} \frac{|S_j|}{|S|}\, H(S_j), \qquad IG(A, S) = H(S) - H(A, S)$$

  where $H(S)$ is the entropy of the whole data set $S$, $|S_j|$ is the number of instances with value $j$ of attribute $A$, $|S|$ is the total number of instances in the dataset, $v$ is the set of distinct values of attribute $A$, $H(S_j)$ is the entropy of the subset of instances with that value, and $H(A, S)$ is the entropy of attribute $A$. (A small worked example follows.)
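To make the two formulas concrete, here is a small sketch that computes entropy and information gain on a hypothetical toy split (the column names and values below are made up for illustration, not taken from the survey data).
import numpy as np
import pandas as pd
def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the distinct class values
    p = pd.Series(labels).value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())
def information_gain(df, attribute, target):
    # IG(A, S) = H(S) - sum(|S_j| / |S| * H(S_j)) over the values of attribute A
    h_s = entropy(df[target])
    h_a_s = sum(len(subset) / len(df) * entropy(subset[target])
                for _, subset in df.groupby(attribute))
    return h_s - h_a_s
toy = pd.DataFrame({'delayed':   ['yes', 'yes', 'no', 'no', 'no', 'yes'],
                    'satisfied': ['no',  'no',  'yes', 'yes', 'no', 'no']})
print(information_gain(toy, 'delayed', 'satisfied'))   # roughly 0.46 bits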
# import the model
from sklearn.tree import DecisionTreeClassifier
#fit the model
model =DecisionTreeClassifier(random_state=42)
model.fit(X_train,y_train)
# prediction
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)
# model name
model_name = str(type(model)).split(".")[-1][0:-2]
print(f"\t\t{model_name.upper()} MODEL\n")
print('Training part:')
print(classification_report(y_train, pred_train,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print('validation part:')
print(classification_report(y_val, pred_val,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print("Accuracy score for training dataset",accuracy_score(y_train, pred_train))
print("Accuracy score for validation dataset",accuracy_score(y_val, pred_val))
make_classification_report(model,X_val,y_val)
DECISIONTREECLASSIFIER MODEL
Training part:
precision recall f1-score support
neutral or dissatisfaction 1.00 1.00 1.00 35308
satisfaction 1.00 1.00 1.00 27034
accuracy 1.00 62342
macro avg 1.00 1.00 1.00 62342
weighted avg 1.00 1.00 1.00 62342
validation part:
precision recall f1-score support
neutral or dissatisfaction 0.96 0.95 0.95 11858
satisfaction 0.94 0.94 0.94 8923
accuracy 0.95 20781
macro avg 0.95 0.95 0.95 20781
weighted avg 0.95 0.95 0.95 20781
Accuracy score for training dataset 1.0
Accuracy score for validation dataset 0.9486069005341418
ROC AUC Score: 94.79%
Observations:
- The ROC AUC score is 94.79%.
- The Recall and F1 scores are good.
- But the model is overfitting, as the accuracy score on the training dataset is 1.0 while validation accuracy is about 0.95.
Random Forest Classifier is an Ensemble algorithm. Random forest classifier creates a set of decision trees from randomly selected subset of the training set. It then aggregates the votes from different decision trees to decide the final class of the test object.
This works well because a single decision tree may be prone to noise, but the aggregate of many decision trees reduces the effect of noise giving more accurate results.
Random forest algorithms have three main hyperparameters, which need to be set before training. These include node size, the number of trees, and the number of features sampled. From there, the random forest classifier can be used to solve for regression or classification problems.
- The random forest algorithm is made up of a collection of decision trees, and each tree in the ensemble is comprised of a data sample drawn from a training set with replacement, called the bootstrap sample.
- About one-third of the training data is left out of each bootstrap sample; these left-out observations are known as the out-of-bag (oob) sample.
- Another instance of randomness is then injected through feature bagging, adding more diversity to the dataset and reducing the correlation among decision trees.
- Depending on the type of problem, the determination of the prediction will vary. For a regression task, the individual decision trees will be averaged, and for a classification task, a majority vote—i.e. the most frequent categorical variable—will yield the predicted class.
- Finally, the oob sample can be used as a built-in validation set to estimate the forest's generalization error (see the short example after this list).
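As a small aside, scikit-learn exposes this oob estimate directly; a minimal sketch, assuming the same X_train/y_train split used below:
from sklearn.ensemble import RandomForestClassifier
# oob_score=True scores each tree on the bootstrap samples it never saw
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42, n_jobs=-1)
rf_oob.fit(X_train, y_train)
print('Out-of-bag accuracy estimate:', rf_oob.oob_score_)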
#import the model
from sklearn.ensemble import RandomForestClassifier
#fit the model
model =RandomForestClassifier(random_state=42)
model.fit(X_train,y_train)
# prediction
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)
# model name
model_name = str(type(model)).split(".")[-1][0:-2]
print(f"\t\t{model_name.upper()} MODEL\n")
print('Training part:')
print(classification_report(y_train, pred_train,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print('validation part:')
print(classification_report(y_val, pred_val,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print("Accuracy score for training dataset",accuracy_score(y_train, pred_train))
print("Accuracy score for validation dataset",accuracy_score(y_val, pred_val))
make_classification_report(model,X_val,y_val)
RANDOMFORESTCLASSIFIER MODEL
Training part:
precision recall f1-score support
neutral or dissatisfaction 1.00 1.00 1.00 35308
satisfaction 1.00 1.00 1.00 27034
accuracy 1.00 62342
macro avg 1.00 1.00 1.00 62342
weighted avg 1.00 1.00 1.00 62342
validation part:
precision recall f1-score support
neutral or dissatisfaction 0.96 0.98 0.97 11858
satisfaction 0.97 0.94 0.95 8923
accuracy 0.96 20781
macro avg 0.96 0.96 0.96 20781
weighted avg 0.96 0.96 0.96 20781
Accuracy score for training dataset 1.0
Accuracy score for validation dataset 0.9609739666041095
ROC AUC Score: 99.34%
Observations:
- The ROC AUC score is 99.34%.
- The Recall and F1 scores are good.
- But the model may be overfitting, as the accuracy score on the training dataset is 1.0.
- After hyperparameter tuning, this model works much better on the validation set compared to the other trained models.
AdaBoost is an ensemble learning method (also known as “meta-learning”) which was initially created to increase the efficiency of binary classifiers. AdaBoost uses an iterative approach to learn from the mistakes of weak classifiers, and turn them into strong ones.
Rather than being a model in itself, AdaBoost can be applied on top of any classifier to learn from its shortcomings and propose a more accurate model. It is usually called the “best out-of-the-box classifier” for this reason.
Stumps have one node and two leaves. AdaBoost uses a forest of such stumps rather than trees.
AdaBoost works in the following steps (a minimal sketch of the re-weighting step follows this list):
- Initially, AdaBoost selects a training subset randomly. It iteratively trains the model, selecting the training set based on the accuracy of the previous round. It assigns higher weights to wrongly classified observations so that in the next iteration these observations have a higher probability of being selected.
- It also assigns a weight to the trained classifier in each iteration according to the classifier's accuracy: the more accurate the classifier, the higher its weight.
- This process iterates until the training data fits without error or the specified maximum number of estimators is reached. To classify, a weighted "vote" is taken across all of the trained weak learners.
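To make the re-weighting step explicit, here is a minimal numpy sketch of the discrete AdaBoost weight update (the labels and predictions are toy values, not the notebook's data, and this is not scikit-learn's exact implementation):
import numpy as np
y = np.array([1, 1, -1, -1, 1])            # true labels encoded as +1 / -1
h = np.array([1, -1, -1, 1, 1])            # one weak learner's predictions
w = np.full(len(y), 1 / len(y))            # start with uniform sample weights
err = np.sum(w[h != y])                    # weighted error of the weak learner
alpha = 0.5 * np.log((1 - err) / err)      # this learner's say in the final vote
w = w * np.exp(-alpha * y * h)             # wrongly classified points get larger weights
w = w / w.sum()                            # renormalise so the weights sum to 1
print(alpha, w)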
Pros of AdaBoost
AdaBoost is easy to implement. It iteratively corrects the mistakes of the weak classifier and improves accuracy by combining weak learners. Many base classifiers can be used with AdaBoost. AdaBoost is generally not prone to overfitting; this is observed empirically, though there is no concrete theoretical explanation for it.
Cons of AdaBoost
AdaBoost is sensitive to noisy data. It is highly affected by outliers because it tries to fit each point perfectly. AdaBoost is also slower compared to XGBoost.
#import the model
from sklearn.ensemble import AdaBoostClassifier
#fit the model
model =AdaBoostClassifier()
model.fit(X_train,y_train)
# prediction
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)
# model name
model_name = str(type(model)).split(".")[-1][0:-2]
print(f"\t\t{model_name.upper()} MODEL\n")
print('Training part:')
print(classification_report(y_train, pred_train,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print('validation part:')
print(classification_report(y_val, pred_val,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print("Accuracy score for training dataset",accuracy_score(y_train, pred_train))
print("Accuracy score for validation dataset",accuracy_score(y_val, pred_val))
make_classification_report(model,X_val,y_val)
ADABOOSTCLASSIFIER MODEL
Training part:
precision recall f1-score support
neutral or dissatisfaction 0.93 0.94 0.94 35308
satisfaction 0.92 0.91 0.92 27034
accuracy 0.93 62342
macro avg 0.93 0.93 0.93 62342
weighted avg 0.93 0.93 0.93 62342
validation part:
precision recall f1-score support
neutral or dissatisfaction 0.94 0.94 0.94 11858
satisfaction 0.92 0.92 0.92 8923
accuracy 0.93 20781
macro avg 0.93 0.93 0.93 20781
weighted avg 0.93 0.93 0.93 20781
Accuracy score for training dataset 0.9275929549902152
Accuracy score for validation dataset 0.9282517684423272
ROC AUC Score : 97.74%
Observations:
- The ROC AUC score is 97.74%.
- The Recall and F1 scores are good but comparatively lower than the random forest.
Gradient boosting classifiers are a group of machine learning algorithms that combine many weak learning models together to create a strong predictive model. Decision trees are usually used when doing gradient boosting. Gradient boosting models are becoming popular because of their effectiveness at classifying complex datasets.
The gradient boosting algorithm is one of the most powerful algorithms in the field of machine learning. Errors in machine learning models are broadly classified into two categories: bias error and variance error. As one of the boosting algorithms, gradient boosting is mainly used to minimize the bias error of the model.
Gradient Boosting has three main components:
1. Loss Function - The role of the loss function is to estimate how good the model is at making predictions with the given data. This varies with the problem at hand. For example, if we're trying to predict the weight of a person from some input variables (a regression problem), the loss function would help us measure the difference between the predicted weights and the observed weights. On the other hand, if we're trying to categorize whether a person will like a certain movie based on their personality, we'll require a loss function that tells us how accurate the model is at classifying people who did or didn't like certain movies.
2. Weak Learner - A weak learner is one that classifies our data but does so poorly, perhaps no better than random guessing. In other words, it has a high error rate. These are typically shallow decision trees (single-split trees are called decision stumps because they are far simpler than full decision trees).
3. Additive Model - This is the iterative and sequential approach of adding the trees (weak learners) one step at a time. After each iteration, we need to be closer to our final model. In other words, each iteration should reduce the value of our loss function (a short illustration using staged predictions follows this list).
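One way to watch the additive model at work is scikit-learn's staged_predict_proba, which yields the ensemble's predictions after each added tree; a minimal sketch using this notebook's X_train/X_val split:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
# the validation log loss should shrink as more weak learners are added
for i, proba in enumerate(gb.staged_predict_proba(X_val), start=1):
    if i % 20 == 0:
        print(f"{i:3d} trees -> validation log loss {log_loss(y_val, proba):.4f}")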
#import the model
from sklearn.ensemble import GradientBoostingClassifier
#fit the model
model =GradientBoostingClassifier()
model.fit(X_train,y_train)
# prediction
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)
# model name
model_name = str(type(model)).split(".")[-1][0:-2]
print(f"\t\t{model_name.upper()} MODEL\n")
print('Training part:')
print(classification_report(y_train, pred_train,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print('validation part:')
print(classification_report(y_val, pred_val,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print("Accuracy score for training dataset",accuracy_score(y_train, pred_train))
print("Accuracy score for validation dataset",accuracy_score(y_val, pred_val))
make_classification_report(model,X_val,y_val)
GRADIENTBOOSTINGCLASSIFIER MODEL
Training part:
precision recall f1-score support
neutral or dissatisfaction 0.94 0.96 0.95 35308
satisfaction 0.95 0.92 0.93 27034
accuracy 0.94 62342
macro avg 0.94 0.94 0.94 62342
weighted avg 0.94 0.94 0.94 62342
validation part:
precision recall f1-score support
neutral or dissatisfaction 0.94 0.96 0.95 11858
satisfaction 0.94 0.92 0.93 8923
accuracy 0.94 20781
macro avg 0.94 0.94 0.94 20781
weighted avg 0.94 0.94 0.94 20781
Accuracy score for training dataset 0.942735234673254
Accuracy score for validation dataset 0.9418218565035369
ROC AUC Score : 98.71%
Observations:
- The ROC AUC score is 98.71%.
- The Recall and F1 scores are good.
- We can shortlist this model for further tuning.
XGBoost stands for Extreme Gradient Boosting. It implements machine learning algorithms under the Gradient Boosting framework.
- In this algorithm, decision trees are created in sequential form. Weights play an important role in XGBoost.
- Weights are assigned to all the independent variables which are then fed into the decision tree which predicts results.
- The weight of variables predicted wrong by the tree is increased and these variables are then fed to the second decision tree. These individual classifiers/predictors then ensemble to give a strong and more precise model.
- It can work on regression, classification, ranking, and user-defined prediction problems.
#import the model
from xgboost import XGBClassifier
#fit the model
model =XGBClassifier()
model.fit(X_train,y_train)
# prediction
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)
# model name
model_name = str(type(model)).split(".")[-1][0:-2]
print(f"\t\t{model_name.upper()} MODEL\n")
print('Training part:')
print(classification_report(y_train, pred_train,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print('validation part:')
print(classification_report(y_val, pred_val,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print("Accuracy score for training dataset",accuracy_score(y_train, pred_train))
print("Accuracy score for validation dataset",accuracy_score(y_val, pred_val))
make_classification_report(model,X_val,y_val)
XGBCLASSIFIER MODEL
Training part:
precision recall f1-score support
neutral or dissatisfaction 0.98 0.99 0.98 35308
satisfaction 0.99 0.97 0.98 27034
accuracy 0.98 62342
macro avg 0.98 0.98 0.98 62342
weighted avg 0.98 0.98 0.98 62342
validation part:
precision recall f1-score support
neutral or dissatisfaction 0.96 0.98 0.97 11858
satisfaction 0.97 0.94 0.96 8923
accuracy 0.96 20781
macro avg 0.96 0.96 0.96 20781
weighted avg 0.96 0.96 0.96 20781
Accuracy score for training dataset 0.9802861634211286
Accuracy score for validation dataset 0.9622732303546508
ROC AUC Score: 99.51%
Observations:
- The ROC AUC score is 99.51%, slightly higher than Gradient Boosting.
- The Recall and F1 scores are good.
- We can shortlist this model and improve it further with hyperparameter tuning; a minimal tuning sketch follows.
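A minimal tuning sketch with RandomizedSearchCV; the search ranges below are illustrative assumptions, not tuned values:
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
param_dist = {'n_estimators': [200, 400, 600],
              'max_depth': [4, 6, 8],
              'learning_rate': [0.05, 0.1, 0.2],
              'subsample': [0.7, 0.85, 1.0],
              'colsample_bytree': [0.7, 0.85, 1.0]}
search = RandomizedSearchCV(XGBClassifier(random_state=42, n_jobs=-1),
                            param_distributions=param_dist,
                            n_iter=20,             # number of random combinations to try
                            scoring='roc_auc',
                            cv=3,
                            random_state=42,
                            n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)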
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:
- Faster training speed and higher efficiency.
- Lower memory usage.
- Better accuracy.
- Support of parallel, distributed, and GPU learning.
- Capable of handling large-scale data.
LightGBM uses histogram-based algorithms, which bucket continuous feature (attribute) values into discrete bins. This speeds up training and reduces memory usage.
LightGBM grows trees leaf-wise (best-first). It will choose the leaf with max delta loss to grow. Holding #leaf fixed, leaf-wise algorithms tend to achieve lower loss than level-wise algorithms.
Leaf-wise may cause over-fitting when #data is small, so LightGBM includes the max_depth parameter to limit tree depth. However, trees still grow leaf-wise even when max_depth is specified.
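For reference, a minimal sketch of how leaf-wise growth can be constrained (the parameter values are illustrative assumptions; the baseline model below uses the defaults):
import lightgbm as lgb
constrained = lgb.LGBMClassifier(num_leaves=31,         # main complexity control for leaf-wise growth
                                 max_depth=8,           # hard cap on tree depth
                                 min_child_samples=50,  # minimum number of samples per leaf
                                 random_state=42)
constrained.fit(X_train, y_train)
print(constrained.score(X_val, y_val))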
#import the model
import lightgbm as lgb
#fit the model
model =lgb.LGBMClassifier()
model.fit(X_train,y_train)
# prediction
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)
# model name
model_name = str(type(model)).split(".")[-1][0:-2]
print(f"\t\t{model_name.upper()} MODEL\n")
print('Training part:')
print(classification_report(y_train, pred_train,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print('validation part:')
print(classification_report(y_val, pred_val,
target_names=['neutral or dissatisfaction', 'satisfaction']))
print("Accuracy score for training dataset",accuracy_score(y_train, pred_train))
print("Accuracy score for validation dataset",accuracy_score(y_val, pred_val))
make_classification_report(model,X_val,y_val)
[LightGBM] [Info] Number of positive: 27034, number of negative: 35308
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002503 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 929
[LightGBM] [Info] Number of data points in the train set: 62342, number of used features: 22
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.433640 -> initscore=-0.267014
[LightGBM] [Info] Start training from score -0.267014
LGBMCLASSIFIER MODEL
Training part:
precision recall f1-score support
neutral or dissatisfaction 0.96 0.98 0.97 35308
satisfaction 0.98 0.94 0.96 27034
accuracy 0.97 62342
macro avg 0.97 0.96 0.97 62342
weighted avg 0.97 0.97 0.97 62342
validation part:
precision recall f1-score support
neutral or dissatisfaction 0.96 0.98 0.97 11858
satisfaction 0.97 0.94 0.96 8923
accuracy 0.96 20781
macro avg 0.96 0.96 0.96 20781
weighted avg 0.96 0.96 0.96 20781
Accuracy score for training dataset 0.9671970742035867
Accuracy score for validation dataset 0.9630431644290458
ROC AUC Score: 99.49%
Observations:
- This model performs best on our dataset so far.
- The ROC AUC score is 99.49%.
- The Recall and F1 scores are very good.
- We can shortlist this model for final tuning.