Data Analysis Project for DSC424: Advanced Data Analysis at DePaul University
Highlights: Feature Dimensionality Reduction with PCA and Factor Analysis
Analysis of breast cancer clinical features and gene expression features to explore their relationship using principal component analysis (PCA), factor analysis, and cluster analysis. In this group project, my focus was on preprocessing the data and reducing the dimensionality of the gene expression features from 663 to 25 using PCA. Principal factor analysis and common factor analysis were performed to discover patterns in the gene expression features, and cluster analysis was performed as a further exploratory step.
The dataset was acquired from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database through Kaggle. It contains 1,904 instances and 693 features. Each instance represents a breast cancer patient, with the patient's clinical attributes and gene expression attributes.
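As a rough illustration of the PCA step described above, here is a minimal sketch that reduces a 663-column matrix to 25 principal components. It is not the project's actual code. The function name `pca_reduce` is hypothetical, the data are random placeholders with the same shape as the gene expression block (1,904 rows by 663 columns), and centering without rescaling is an assumption about preprocessing.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (hypothetical helper)."""
    # Center each feature; scaling to unit variance is omitted here (an assumption).
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are principal axes, ordered by variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Component scores: coordinates of each row in the reduced space.
    scores = X_centered @ components.T
    # Fraction of total variance captured by each retained component.
    explained_ratio = (S ** 2 / np.sum(S ** 2))[:n_components]
    return scores, components, explained_ratio


# Synthetic placeholder data, NOT the METABRIC dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1904, 663))

scores, components, explained_ratio = pca_reduce(X, n_components=25)
print(scores.shape)  # (1904, 25)
```

On real data, the cumulative sum of `explained_ratio` is one common way to judge how many components to keep. The choice of 25 here simply mirrors the number stated above.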