This project investigates breast cancer data using tools learned during the course "R for Bio Data Science" at DTU. It will contain the raw and tidy data, several ways to visualize the data and some analysis methods including PCA analysis, K-means and heatmap.
Raw data can be found in the folder /data/_raw and otherwise on this link.
The dataset contains three data files, but we chose only to use two of them:
File: 77_cancer_proteomes_CPTAC_itraq.csv
- Contains gene expression data (12553 observations) from 77 cancer patients and 3 healthy persons.
File: clinical_data_breast_cancer.csv
- Contains clinical data from the 77 cancer patients (not the 3 healthy), such as age, gender, tumor class etc.
File: PAM50_proteins.csv
- We decided not to use this data, since we did not need the information it could provide in order to make the analyses we wanted. It is somewhat a dictionary of the genes, their names and ID's.
In order to execute the full data analysis, generate final plots and the presentation, the script 00doit.R from folder /R should be run. The presentation will appear in the folder /doc.
Following packages has been used:
- tidyverse
- stringr
- broom
- purrr
- cowplot
- patchwork
- kableExtra
- modelr
- vroom
- Naja Bahrenscheer (najabahren)
- Sille Hendriksen (xxxsille)
- Sofia Otero (SofiaOtero)
- Mie Björk (MieDB)