This repository contains the source code, data, and documentation for the project Fairness Analysis in Mortgage Lending, which investigates biases in U.S. mortgage lending using the 2022 Home Mortgage Disclosure Act (HMDA) dataset. The project employs machine learning techniques and fairness evaluation tools to analyze and address demographic disparities affecting mortgage approvals.
Mortgage lending plays a critical role in financial stability and wealth creation. However, biases against certain demographic groups, such as racial minorities, women, and older individuals, persist despite regulatory frameworks. This project aims to:
- Detect and quantify these biases.
- Build machine learning models for mortgage approval predictions.
- Suggest fairness techniques to mitigate biases and promote equity.
The project leverages machine learning models such as Random Forest and XGBoost, alongside IBM's AI Fairness 360 toolkit, to assess and address potential disparities.
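As a rough illustration of the modeling step, the sketch below trains an XGBoost classifier on the preprocessed splits in `data/processed/`. The binary label encoding (1 = approved) and the hyperparameters are assumptions, not the project's exact configuration:

```python
# Minimal sketch: train XGBoost on the preprocessed HMDA splits.
# Assumes the CSVs hold numeric features and a binary approval
# label (1 = approved); adjust paths and columns to your setup.
import pandas as pd
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X_train = pd.read_csv("data/processed/preprocessed_X_train.csv")
X_test = pd.read_csv("data/processed/preprocessed_X_test.csv")
y_train = pd.read_csv("data/processed/preprocessed_y_train.csv").squeeze()
y_test = pd.read_csv("data/processed/preprocessed_y_test.csv").squeeze()

model = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.4f}")
```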
The directory structure is organized as follows:
```
analysis/
├── models.py
├── preprocessing.py
└── random_forest.py
data/
├── processed/
│   ├── preprocessed_X_test.csv
│   ├── preprocessed_X_train.csv
│   ├── preprocessed_y_test.csv
│   ├── preprocessed_y_train.csv
│   └── sampled_preprocessed_data.csv
├── X_test.csv
├── X_train.csv
├── y_test.csv
└── y_train.csv
docs/
├── project_proposal.pdf
├── project_proposal_graded.pdf
├── project_spotlight.pptx
└── report.pdf
graphs/
├── age_composition.png
├── approval_rate.png
├── ethnicity_composition.png
├── feature_heatmap.png
├── race_composition.png
└── sex_composition.png
notebooks/
├── models.ipynb
├── preprocessing_demo.ipynb
└── visualization.ipynb
references/
├── Cherian_2014.pdf
├── Gill_et_al_2020.pdf
├── Hodges_et_al_2024.pdf
├── Lee_2017.pdf
└── Zou_et_al_2022.pdf
.gitignore
README.md
```
- Source: 2022 HMDA Loan/Application Register (LAR) public dataset.
- Attributes: Includes demographic variables (e.g., race, ethnicity, sex, age) and loan-specific details.
- Protected Attributes:
- Ethnicity: Hispanic or Latino (Unprivileged) vs. Not Hispanic or Latino (Privileged).
- Race: Minority races (Unprivileged) vs. White (Privileged).
- Sex: Female (Unprivileged) vs. Male (Privileged).
- Age: <25 or >74 (Unprivileged) vs. 25–74 (Privileged).
- Preprocessing: Includes cleaning, feature selection, imputation, and one-hot encoding.
- Machine Learning Models:
- Logistic Regression (Baseline)
- Random Forest
- XGBoost
- Fairness Assessment:
- IBM's AI Fairness 360 toolkit.
- Metrics: Disparate Impact, Statistical Parity Difference, Average Odds Difference, and Equal Opportunity Difference (see the sketch after this list).
- Performance:
- XGBoost achieved the highest accuracy (83.28%) and yielded a robust feature-importance analysis.
- Random Forest performed similarly with slightly lower accuracy.
- Fairness Evaluation:
- Significant biases against racial minority groups were observed in both the dataset and model predictions.
- Minimal bias was found with respect to ethnicity, sex, and age.
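Continuing from the training sketch above, the group metrics listed earlier can be computed with AI Fairness 360. The protected-attribute column name (`race_white`) and the label convention are assumptions about the preprocessed data, so treat this as a sketch of the evaluation pattern rather than the repository's exact code:

```python
# Sketch: model fairness via AIF360's ClassificationMetric.
# `race_white` (1 = White) is an assumed column name.
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import ClassificationMetric

priv = [{"race_white": 1}]    # privileged: White applicants
unpriv = [{"race_white": 0}]  # unprivileged: minority races

def to_bld(X, y):
    """Wrap a feature frame and its labels in an AIF360 dataset."""
    df = X.copy()
    df["approved"] = y
    return BinaryLabelDataset(
        df=df, label_names=["approved"],
        protected_attribute_names=["race_white"],
        favorable_label=1, unfavorable_label=0)

truth = to_bld(X_test, y_test)
preds = truth.copy()
preds.labels = model.predict(X_test).reshape(-1, 1).astype(float)

m = ClassificationMetric(truth, preds,
                         unprivileged_groups=unpriv,
                         privileged_groups=priv)
print("Disparate Impact:             ", m.disparate_impact())
print("Statistical Parity Difference:", m.statistical_parity_difference())
print("Average Odds Difference:      ", m.average_odds_difference())
print("Equal Opportunity Difference: ", m.equal_opportunity_difference())
```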
Fairness results in detail:
- Dataset Fairness:
- Significant bias against racial minorities, with a Disparate Impact of 0.84 and a Statistical Parity Difference of -0.12.
- Model Fairness (XGBoost):
- Race-related bias persisted in model predictions (Disparate Impact: 0.77).
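Dataset-level numbers like these come from metrics computed on the labeled data alone, before any model is involved. Reusing the `to_bld()` helper and group definitions from the sketch above:

```python
# Sketch: dataset (pre-model) fairness with BinaryLabelDatasetMetric,
# reusing to_bld(), priv, and unpriv from the earlier sketch.
from aif360.metrics import BinaryLabelDatasetMetric

dm = BinaryLabelDatasetMetric(to_bld(X_train, y_train),
                              unprivileged_groups=unpriv,
                              privileged_groups=priv)
print("Disparate Impact:             ", dm.disparate_impact())
print("Statistical Parity Difference:", dm.statistical_parity_difference())
```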
Mitigation strategies proposed:
- Preprocessing: Reweighing and optimized preprocessing.
- Inprocessing: Adversarial debiasing.
- Postprocessing: Equalized odds postprocessing.
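Of these, reweighing is the easiest to wire into the existing pipeline: AIF360's `Reweighing` transformer computes instance weights that balance favorable outcomes across groups, and XGBoost accepts them through `sample_weight`. A hedged sketch, again reusing the helpers above:

```python
# Sketch: Reweighing (a pre-processing mitigation), reusing to_bld(),
# priv, and unpriv. The learned weights are fed to the classifier.
from aif360.algorithms.preprocessing import Reweighing
from xgboost import XGBClassifier

rw = Reweighing(unprivileged_groups=unpriv, privileged_groups=priv)
train_rw = rw.fit_transform(to_bld(X_train, y_train))

fair_model = XGBClassifier(n_estimators=300, max_depth=6,
                           eval_metric="logloss")
fair_model.fit(X_train, y_train, sample_weight=train_rw.instance_weights)
```

AIF360 also ships implementations of the other two strategies: `AdversarialDebiasing` (in-processing) and `EqOddsPostprocessing` (post-processing).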
- Dataset: HMDA Data
- Fairness Toolkit: AI Fairness 360
- Key references:
- Hodges et al. (2024)
- Cherian (2014)
- Gill et al. (2020)
- Lee (2017)
- Zou et al. (2022)
If you need any revisions or additional details, feel free to contact me here.