In this project, we analyzed the Book-Crossing dataset, which consists of three CSV files: ratings, users, and books. We explored and preprocessed the data, then applied a dimensionality reduction technique (PCA), three classification algorithms (Logistic Regression, Decision Tree, and K-Nearest Neighbors), and two clustering algorithms (K-Means and Hierarchical) to build models from the dataset. The project contains the following parts:
- Dataset
- Exploratory data analysis
- Visualization techniques
- Imbalanced data set
- Missing data imputation
- Multicollinearity
- Logistic Regression
- PCA
  - PCA with Logistic Regression
- Clustering
  - K-Means Clustering
  - Hierarchical Clustering
  - Missing Data with Hierarchical Clustering
- Classification
  - Decision Tree
  - Decision tree with imbalanced data
  - K-Nearest Neighbors (K-NN)