Analysis of Credit score dataset from Kaggle for a case study
The classification label is the Credit_Score column, the last column of train.csv. test.csv does not contain this column, so the model cannot be evaluated on test.csv.
ID - Identifier of each transaction; the primary key in relational database terms.
Customer_ID - Identifier unique to each customer; not a primary key, since a customer can appear in multiple rows.
Month - According to the original Kaggle dataset description: "Represents the month of the year", presumably the month in which the request was made. Different values are possible for the same Customer_ID.
Name - Name of the person.
Age - Age of the person (integer). Some values are invalid, such as "-500".
SSN - Represents the social security number of a person. Should (probably) be unique for each Customer_ID, but some values appear to be corrupted in the dataset. In the US, the first 3 digits used to be related to geographic area, but since 2011 this is no longer the case, so depending on when the dataset was constructed it might no longer be useful.
Occupation - Job of the customer. Stored as a string; a nominal attribute. Missing values are not left empty but are denoted by a run of _ characters.
Annual_Income - Annual income of the customer. Some values have a stray _ at the end of the number, for example: 35547.71_. There also seem to be significant outliers or errors: for example, the customer with Customer_ID CUS_0x284a has Annual_Income=131313.4 for transaction ID=0x164f but Annual_Income=10909427.0 for transaction ID=0x1650.
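The stray _ characters mean Annual_Income has to be read as text and cleaned before conversion to a number. A minimal sketch of one way to do this, using only the Python standard library; the helper name parse_number is illustrative and not part of the dataset or of any library:

```python
from typing import Optional


def parse_number(raw: str) -> Optional[float]:
    """Convert a raw cell such as '35547.71_' to a float, or None if unusable."""
    # Remove every underscore, then trim surrounding whitespace.
    cleaned = raw.replace("_", "").strip()
    if not cleaned:
        return None  # empty cell, or a cell made only of underscores
    try:
        return float(cleaned)
    except ValueError:
        return None  # still not a number after cleanup


print(parse_number("35547.71_"))  # 35547.71
print(parse_number("131313.4"))   # 131313.4
print(parse_number("___"))        # None
```

Returning None instead of raising keeps unusable cells visible as missing values, so they can be counted or imputed later.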
Monthly_Inhand_Salary - Represents the monthly base salary of a person. Contains many missing values and may also contain outliers, although this is not certain. Unlike Annual_Income, the values do not seem to have a trailing _, but this may require further analysis.
Num_Bank_Accounts - Number of bank accounts owned by the given customer. Possible errors/outliers like: 1414.
Num_Credit_Card - Represents the number of other credit cards held by a person. Also contains errors/outliers like: 1385.
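Implausible counts such as 1414 or 1385 can be screened with a simple range check. A minimal sketch, assuming bounds of 0 to 20 for both columns; the bounds and the names PLAUSIBLE_RANGE and keep_if_plausible are hypothetical and should be tuned after looking at the actual value distribution:

```python
from typing import Optional

# Hypothetical plausibility bounds; these are assumptions, not dataset facts.
PLAUSIBLE_RANGE = {
    "Num_Bank_Accounts": (0, 20),
    "Num_Credit_Card": (0, 20),
}


def keep_if_plausible(column: str, value: float) -> Optional[float]:
    """Return the value if it lies within the column's bounds, else None."""
    low, high = PLAUSIBLE_RANGE[column]
    return value if low <= value <= high else None


print(keep_if_plausible("Num_Bank_Accounts", 1414))  # None
print(keep_if_plausible("Num_Credit_Card", 3))       # 3
```

Values replaced by None can then be treated like any other missing value, for example filled in from the same customer's other rows.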
Interest_Rate - Represents the interest rate on the credit card. Contains some errors: a value such as 5318 is not a plausible interest rate.
Num_of_Loan - Represents the number of loans taken from the bank. Some values have a _ at the end, and some are invalid, such as -100.
Type_of_Loan - String of values separated by ','. Once separated, the individual values appear to be nominal.
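Because one Type_of_Loan cell can hold several nominal values, it can be split and then encoded as one 0/1 indicator per loan type. A minimal sketch under that reading; the function names and the loan names in the example are hypothetical placeholders, not values taken from the dataset:

```python
from typing import List, Tuple


def split_loan_types(raw: str) -> List[str]:
    """Split a comma-separated Type_of_Loan cell into individual loan types."""
    parts = [part.strip() for part in raw.split(",")]
    return [part for part in parts if part]  # drop empty fragments


def multi_hot(cells: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Encode each cell as a 0/1 vector over the sorted set of loan types seen."""
    split_cells = [set(split_loan_types(cell)) for cell in cells]
    vocabulary = sorted(set().union(*split_cells))
    rows = [[1 if name in types else 0 for name in vocabulary]
            for types in split_cells]
    return vocabulary, rows


# Hypothetical cells for illustration only.
vocabulary, rows = multi_hot(["Loan A, Loan B", "Loan B"])
print(vocabulary)  # ['Loan A', 'Loan B']
print(rows)        # [[1, 1], [0, 1]]
```

This encoding records which loan types a customer has, but not how many loans of the same type; a count per type would be needed if duplicates matter.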
Delay_from_due_date - Represents the average number of days a payment was delayed past its due date. Might be integer or float.
Num_of_Delayed_Payment - Represents the average number of payments delayed by a person. Might be integer or float.
Changed_Credit_Limit - Represents the percentage change in credit card limit. Float numbers, can be negative.
Num_Credit_Inquiries - Represents the number of credit card inquiries. Possibly contains errors/outliers like: 1936.0
Credit_Mix - Ordinal, seems to refer to the rating given to previous credits, but I am not certain. Definition according to ChatGPT: "Credit mix refers to the types of accounts that make up your credit report. It determines 10% of your FICO score. The different types of credit that might be part of your credit mix include credit cards, student loans, automobile loans, and mortgages.". Missing values denoted with '-'.
Outstanding_Debt - Represents the remaining debt to be paid (in USD). Some values have a _ at the end.
Credit_Utilization_Ratio - Represents the utilization ratio of credit card. Credit utilization is the ratio of your outstanding credit balances (on both credit cards and lines of credit) compared to your overall credit limit combined across your accounts. For example, if you currently have a balance of $500 against your $1,000 credit limit, your credit utilization is 50%.
Credit_History_Age - Represents the age of credit history of the person. Represented as a string like: "22 Years and 1 Months". Missing values denoted using "NA".
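A string such as "22 Years and 1 Months" is easier to use once converted to a single number. A minimal sketch that converts it to a total number of months and maps "NA" (or any other unexpected text) to None; the function name credit_history_months is illustrative, and the pattern assumes every valid value follows the format shown above:

```python
import re
from typing import Optional

# Assumed format: "<years> Years and <months> Months" (singular forms tolerated).
_AGE_PATTERN = re.compile(r"^\s*(\d+)\s+Years?\s+and\s+(\d+)\s+Months?\s*$")


def credit_history_months(raw: str) -> Optional[int]:
    """Return the credit history length in months, or None if it cannot be parsed."""
    match = _AGE_PATTERN.match(raw)
    if match is None:
        return None  # covers "NA" and any unexpected format
    years, months = int(match.group(1)), int(match.group(2))
    return years * 12 + months


print(credit_history_months("22 Years and 1 Months"))  # 265
print(credit_history_months("NA"))                     # None
```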
Payment_of_Min_Amount - Represents whether only the minimum amount was paid by the person. Values like: "Yes, No, NM". NM could mean Not Mentioned but it is not stated explicitly.
Total_EMI_per_month - Represents the monthly EMI payments (in USD). EMI (equated monthly installment) is a fixed payment amount made by a borrower to a lender at a specified date each calendar month.
Amount_invested_monthly - Represents the monthly amount invested by the customer (in USD). Contains missing values and some values written in an unexpected way, like: "_ _1000_ _".
Payment_Behaviour - Seems categorical but very verbose, and contains corrupted values like: !@9#%8
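Corrupted Payment_Behaviour values like !@9#%8 can be flagged with a pattern check. A minimal sketch, assuming that valid labels consist only of letters joined by underscores; that assumption, the function name is_valid_label, and the placeholder label in the example are not confirmed by the dataset description and should be checked against the actual distinct values:

```python
import re

# Assumption: a valid label is made of letter-only words joined by underscores.
_LABEL_PATTERN = re.compile(r"^[A-Za-z]+(?:_[A-Za-z]+)*$")


def is_valid_label(raw: str) -> bool:
    """Return True if the cell looks like a well-formed category label."""
    return _LABEL_PATTERN.match(raw.strip()) is not None


print(is_valid_label("!@9#%8"))            # False
print(is_valid_label("Some_valid_label"))  # True (placeholder label)
```

Listing the distinct values of the column first and building an explicit allow-list is a stricter alternative once the valid categories are known.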
Monthly_Balance - Represents the monthly balance amount of the customer (in USD).
Credit_Score - Classification variable. Possible values: "Poor, Standard, Good"