I had two datasets:
First dataset:
- Date
- Account ID Number
- Opening Date of Account
- Account Status: Active or Closed
- Account Type
- Account Balance at the end of the day
Second dataset:
- Account ID Number
- Transaction Amount
- Transaction Category
- Transaction Time
I found one null value and dropped it, since it carried no information.
Feature Introduction: I introduced 6 features based on my knowledge of bank transactions.
- Feature 1: The total number of transactions for each account over these three months.
- Feature 2: The average transaction amount for each account.
- Feature 3: The variance of the transaction amount for each customer.
- Feature 4: The average account balance at the end of each day.
- Feature 5: The variance of the account balance.
- Feature 6: The duration of account activity, which has two parts: the time from the account opening date to the beginning of the dataset, plus the account's active days within the dataset. I had to convert the system date from Gregorian to Jalali so that the dataset dates and the opening dates use the same calendar.
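Features 1 to 5 are per-account aggregates, so they can be sketched with a pandas `groupby`. This is a minimal illustration on made-up data, not the original code: the column names (`account_id`, `amount`, `balance`) and the toy values are assumptions, and pandas' default sample variance (`ddof=1`) is used.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions, not the original schema.
transactions = pd.DataFrame({
    "account_id": [1, 1, 1, 2, 2],
    "amount": [100.0, 200.0, 300.0, 50.0, 150.0],
})
balances = pd.DataFrame({
    "account_id": [1, 1, 2, 2],
    "balance": [1000.0, 1200.0, 500.0, 700.0],
})

# Features 1-3: count, mean and variance of transaction amounts per account.
tx_features = transactions.groupby("account_id")["amount"].agg(
    n_transactions="count", amount_mean="mean", amount_var="var"
)

# Features 4-5: mean and variance of the end-of-day balance per account.
bal_features = balances.groupby("account_id")["balance"].agg(
    balance_mean="mean", balance_var="var"
)

# One row per account, one column per feature.
features = tx_features.join(bal_features, how="inner")
print(features)
```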
I scaled the financial features using a log transform followed by min-max scaling. (I also tested the model without scaling and found that scaling gave a better clustering.) I scaled the active-time duration using min-max scaling.
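The scaling step might look like the sketch below. The column names and values are made up, and `np.log1p` is an assumption: the text only says "log function", and `log1p` is chosen here because it is defined at zero.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature table; names and values are illustrative only.
features = pd.DataFrame({
    "amount_mean": [10.0, 1_000.0, 100_000.0],
    "balance_mean": [50.0, 5_000.0, 500_000.0],
    "active_days": [30.0, 365.0, 3_650.0],
})
financial_cols = ["amount_mean", "balance_mean"]

scaled = features.copy()
# Log transform compresses the long right tail of the financial features.
scaled[financial_cols] = np.log1p(scaled[financial_cols])

# Min-max scaling maps every column (including active_days) to [0, 1].
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(scaled),
    columns=scaled.columns,
    index=scaled.index,
)
print(scaled)
```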
I made a dataframe from all features.
Then, I used scatter_matrix to visualize the features.
I used a box plot to check for outliers:
I computed the standard score (z-score) of each value and dropped the rows with a standard score greater than 3. After that, the box plot became:
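A minimal sketch of this filter on synthetic data follows. It assumes the threshold is applied to the absolute z-score of every feature, which the text does not state explicitly; the column names `f1` and `f2` are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature table, with one planted outlier.
features = pd.DataFrame(rng.normal(size=(200, 2)), columns=["f1", "f2"])
features.loc[0, "f1"] = 25.0

# Column-wise standard score: (value - mean) / standard deviation.
z = (features - features.mean()) / features.std(ddof=0)

# Keep a row only if all of its features are within 3 standard deviations.
mask = (z.abs() <= 3).all(axis=1)
cleaned = features[mask]

print(len(features), "->", len(cleaned))
```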
I used the elbow curve of inertia to choose the number of clusters:
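The values behind an elbow curve can be computed as in this sketch, which fits scikit-learn's `KMeans` for a range of `k` and records `inertia_` (the within-cluster sum of squared distances). The data comes from `make_blobs` and is purely illustrative; the range of `k` is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for the scaled feature table.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=0)

# Fit k-means for each candidate k and record the inertia.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# Plotting these pairs gives the elbow curve; the "elbow" is where the
# drop in inertia flattens out.
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```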
I clustered the data using k-means with n=6 and used the scatter matrix to visualize the result:
The clusters with label=5,6 represent customers with:
- longer active time
- bigger account balance
- bigger transaction amounts
- greater number of transactions
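Besides the scatter matrix, one way to read off what each cluster represents is to average every feature per cluster label. The sketch below does this on synthetic data with two obvious groups; the feature names and `n_clusters=2` are assumptions for the toy example, whereas the analysis above used n=6.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups: low-activity and high-activity accounts.
low = rng.normal(loc=0.2, scale=0.03, size=(50, 2))
high = rng.normal(loc=0.8, scale=0.03, size=(50, 2))
features = pd.DataFrame(
    np.vstack([low, high]), columns=["balance_mean", "n_transactions"]
)

# Assign each row to a cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Mean of every feature within each cluster: a simple cluster profile.
profile = features.groupby(labels).mean()
print(profile)
```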
I ran a second k-means on the clusters with label=5,6. Based on the elbow curve, I chose n=3.
In this clustering, the cluster with label=2 corresponds to customers with:
- bigger account balance
- bigger transaction amounts
As a second method, I used DBSCAN. I used the k-nearest-neighbor (KNN) distances to choose the value of epsilon:
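A common way to do this is a k-distance curve: compute each point's distance to its k-th nearest neighbor, sort the values, and look for the knee. The sketch below computes the sorted distances with scikit-learn's `NearestNeighbors`; the synthetic data and `k = 5` are assumptions, not values from the analysis.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic data standing in for the scaled feature table.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.4, random_state=0)

k = 5  # assumed; often matched to DBSCAN's min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)

# Querying with the training data returns each point itself in column 0
# (distance 0), so the last column is the farthest of the k neighbors found.
k_distances = np.sort(distances[:, -1])

# Plotting k_distances gives the curve; the knee suggests a value for eps.
print(k_distances[:5], k_distances[-5:])
```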
Running DBSCAN on the data resulted in the following clustering:
Clustering was not successful with this method.
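A quick way to judge a DBSCAN result is to count the clusters it found and the points it labeled as noise (scikit-learn marks noise with the label `-1`). The sketch below shows that check on synthetic data; the `eps` and `min_samples` values are illustrative assumptions, not the ones used in the analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data standing in for the scaled feature table.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.4, random_state=0)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Noise points carry the label -1 and do not count as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print("clusters:", n_clusters, "noise points:", n_noise)
```

A single large cluster, or a large share of noise points, usually suggests that `eps` or `min_samples` does not suit the density of the data.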