This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Updated analysis: Unified color palette for plots #510

Closed
cansavvy opened this issue Feb 4, 2020 · 12 comments
@cansavvy
Collaborator

cansavvy commented Feb 4, 2020

What analysis module should be updated and why?

All modules that have plots with colors.

We should probably prioritize plots that will be in the main document? But we probably want the unified color palette to also extend to non-main figures.

What changes need to be made? Please provide enough detail for another participant to make the update.

We should have a unified color palette. This helps interpretability and aesthetics.
The simplecolors R package has some helpful tools and a nice vignette:
https://cran.r-project.org/web/packages/simplecolors/vignettes/intro.html

For ggplot2 plots, colors can be designated using scale_fill_manual and scale_color_manual.
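As a minimal sketch of that approach (the data frame, group names, and hex codes below are placeholders for illustration, not the project's palette), a named vector can be passed to scale_fill_manual so each group always gets the same color:

```r
library(ggplot2)

# Hypothetical data: the group names and counts are made up for illustration.
df <- data.frame(
  short_histology = c("ATRT", "Medulloblastoma", "Other"),
  n = c(12, 30, 7)
)

# A named vector ties each group to one color, so the mapping does not
# depend on factor order or on which groups appear in a given plot.
# These hex codes are placeholders, not the project's palette.
histology_colors <- c(
  "ATRT" = "#3370cc",
  "Medulloblastoma" = "#ffaa00",
  "Other" = "#b5b5b5"
)

p <- ggplot(df, aes(x = short_histology, y = n, fill = short_histology)) +
  geom_col() +
  scale_fill_manual(values = histology_colors)
```

scale_color_manual takes the same kind of named vector for the color aesthetic.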

Which colors do we generally want to default to?

  • For histology: There are many places where data are colored by short_histology, so having fixed colors for each group would help readers follow along. We can use a numeric approach to try to get ~36 colors that are as different as possible. I started implementing this.
  • Could use colorblindr's palette selection for guidance on some variable color choices.
    Here's an example of what I mean, but I haven't yet tested these colors:
# Get values that can be used to make colors equidistant in hue for the
# number of histology groups we have
col_val <- seq(from = 0, to = 1, 
               length.out = length(unique(df$short_histology)))

# Translate into colors
col_key <- hsv(h = col_val, s = col_val, v = 1)

# Make this named based on histology
names(col_key) <- unique(df$short_histology)

# Make the same order as the data.frame
col_key <-  as.character(dplyr::recode(df$short_histology, !!!col_key))

# Name each color by the row name of its sample
names(col_key) <- as.character(rownames(df))
  • For heatmaps or other continuous numeric data, we should choose a general color palette. Some instances need color functions, so we can use circlize::colorRamp2 for those, but we should decide which hex codes/colors to use (I'm not necessarily suggesting the ones below).
col_fun <- circlize::colorRamp2(
  c(0, .25, .5, 1, 3),
  c("#edf8fb", "#b2e2e2", "#66c2a4", "#2ca25f", "#006d2c")
)
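One thing to watch in the untested histology sketch above: hue is circular, so a sequence from 0 to 1 gives the first and last groups the same hue, and using the same values for saturation gives the first group zero saturation (white when v = 1). A possible variation using only base R, with a made-up vector of group names standing in for unique(df$short_histology):

```r
# Hypothetical group names; in the issue these come from df$short_histology.
groups <- c("ATRT", "Ependymoma", "LGAT", "Medulloblastoma")
n_groups <- length(groups)

# Hue is circular (0 and 1 are both red), so take n + 1 evenly spaced
# values and drop the last one to keep every hue distinct.
hues <- head(seq(from = 0, to = 1, length.out = n_groups + 1), -1)

# Fixed saturation and value keep every color fully saturated.
col_key <- hsv(h = hues, s = 1, v = 1)
names(col_key) <- groups
```

With around 36 groups, hue alone is unlikely to be enough to tell neighbors apart, so varying saturation or value in a controlled way may still be needed.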

Modules with plots that will need to be color palette unified:

I've tagged myself on the modules I will be responsible for updating the palette for, others can add themselves for other modules.

Module | Person who will update the plots | Plots in this module to be updated?
  • chromosomal-instability | @cansavvy | breaks_cdf_plot.png, 3 heatmaps, tumor-type plots
  • cnv-chrom-plot | @cansavvy | gistic.png and histology group plots
  • cnv-comparison
  • focal-cn-file-preparation
  • immune-deconv
  • interaction-plots
  • molecular-subtyping-ATRT
  • mutational-signatures | @cansavvy | The bubble matrix plots, all cosmic/ and nature/ plots, individual and grouped barplots
  • oncoprint-landscape | @cbethell | The 4 oncoprint plots (all_participants_ png plots)
  • sample-distribution-analysis | @cbethell
  • selection-strategy-comparison
  • sex-prediction-from-RNASeq
  • snv-callers | @cansavvy | All comparison plots
  • ssgsea-hallmark
  • survival-analysis | @cansavvy | survival_curve_gender.pdf
  • tmb-compare-tcga | @cansavvy | Main TMB compare plot
  • tp53_nf1_score
  • transcriptomic-dimension-reduction

When do you expect the revised analysis will be completed?

? We should also better refine which plots are the priority before we can make this call.

@jaclyn-taroni
Member

Tagging @sjspielman to weigh in about color palette.

@sjspielman
Member

A strategy like this was suggested to me for color palette. Use an existing smaller color palette that is colorblind friendly and reasonable to distinguish, and then have different gradients. This would work out well if we have some kind of higher level grouping for subtypes? https://stackoverflow.com/questions/50163072/different-colors-with-gradient-for-subgroups-on-a-treemap-ggplot2-r/50164882#50164882

@jaclyn-taroni
Member

Wow, that S.O. post makes me want to consider reworking our treemaps in sample-distribution-analysis, too! (Related to publication-ready figures: #571)

@jashapiro
Member

Okay, this is not exactly what we want, but it was something I did in the past (and mentioned before in person) to generate a lot of colors that are pretty distinguishable. This example has 49 colors, which is definitely pushing it. It could be a place to start.

colorscheme = hsv(h = 1:49/49 * .85, v = c(.8,1,1), s = c(1,1, .6))


@jashapiro
Member

jashapiro commented Mar 3, 2020

Ooh, I just found: http://phrogz.net/css/distinct-colors.html

Which allowed me to generate this set:
Screenshot 2020-03-03 13 53 27
#400000, #ffaa00, #bfffd0, #3370cc, #bf0099, #bf3030, #8c5e00, #005924, #0030b3, #731d4b, #d9a3a3, #f2da79, #2d5950, #1140, #ff0066, #ff4400, #8c8569, #3df2e6, #2b2633, #594943, #474d00, #00ccff, #a200f2, #bf6930, #cef23d, #b6def2, #796080, #331c0d, #00b330, #2d4459, #ffbffb

Not perfect, but not bad... Dropping some of the dark colors would help, I expect
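One way to try dropping the dark colors, sketched with base R on a handful of the hex codes above. The 0.4 brightness cutoff is an arbitrary assumption for illustration, not a value anyone in this thread proposed:

```r
# A few of the hex codes from the list above.
candidates <- c("#400000", "#ffaa00", "#bfffd0", "#3370cc", "#2b2633", "#331c0d")

# rgb2hsv() returns a matrix with rows "h", "s", "v";
# "v" is the brightness of each color on a 0 to 1 scale.
brightness <- rgb2hsv(col2rgb(candidates))["v", ]

# The 0.4 cutoff is an arbitrary assumption chosen for illustration.
min_brightness <- 0.4
kept <- candidates[brightness >= min_brightness]
```

Here the three darkest candidates (#400000, #2b2633, #331c0d) fall below the cutoff and are dropped.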

@cansavvy
Collaborator Author

@dvenprasad and I chatted a bit about color palettes. Here are the colorsets I believe we need:

  1. Color palette for each histology group in short_histology.
  2. A gradient color scale (for things like TMB).
  3. A divergent color scale (for things like seg.means) ...
  4. A binary color key (for things like CN status). The most extreme colors in the divergent color scale can be used for this binary color key.
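A rough sketch of how items 3 and 4 could relate, using only base R. The hex codes and the "loss"/"gain" labels are placeholders, since no palette had been chosen at this point:

```r
# Placeholder hex codes for the low, middle, and high ends of the scale.
divergent_colors <- c("#2166ac", "#f7f7f7", "#b2182b")

# colorRampPalette() returns a function that interpolates n colors.
divergent_scale <- colorRampPalette(divergent_colors)(11)

# Binary key: reuse the two extremes of the divergent scale.
# The "loss" and "gain" labels are hypothetical names for CN status.
binary_key <- c(loss = divergent_scale[1], gain = divergent_scale[11])
```

Deriving the binary key from the divergent scale means the two stay consistent if the divergent colors are changed later.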

@dvenprasad also found these two tools that we can use to poke around:
https://www.colorbox.io/
https://projects.susielu.com/viz-palette

Next steps:

  1. I'm going to attempt to pick a color palette for items 1 - 4 using the tools listed above and also the suggestions that have been placed on this issue.
  2. I will test them for colorblind friendliness with Color Oracle.
  3. I'll file a draft PR with suggested color palettes and options.
  4. I'll try to create R color key objects that we can apply to all plots, write instructions for how to apply them, and put the hex codes in a table that can live in a README (not sure which README is appropriate).

cbethell added a commit to cbethell/OpenPBTA-analysis that referenced this issue Mar 18, 2020
- convert final png figure to pdf
- add figure generation script to `run-figures.sh`
- use color palette generated in PR AlexsLemonade#510 for figure color scheme
- format treemap to have less redundant values (treemap as is in final figure 1 panel does not show redundant values and represents the `short_histology` and `integrated_diagnosis` values which I believe should be fine in this case)
- redundant text does show up on the treemap plot in `analyses/sample-distribution-analysis/plots` directory (still looking into this)
@cansavvy
Collaborator Author

With #622 merged, we are ready to update figures to the unified color palette (See the README in figures for instructions). If there are any changes that need to be made to the color palette as we are starting to implement, you can note them here and I can help with that.

jaclyn-taroni pushed a commit that referenced this issue Mar 26, 2020
* Make sample distribution plot publication ready

- save treemap plot
- rerun module
- add `figures/pngs` and `figures/scripts` directories to hold pub ready plots and scripts

* Create and incorporate treemap into figure1

* Install `treemapify` package on docker

* add series of `dplyr::` as needed

* Add figure generation script to .circleCI

- convert final png figure to pdf
- add figure generation script to `run-figures.sh`
- use color palette generated in PR #510 for figure color scheme
- format treemap to have less redundant values (treemap as is in final figure 1 panel does not show redundant values and represents the `short_histology` and `integrated_diagnosis` values which I believe should be fine in this case)
- redundant text does show up on the treemap plot in `analyses/sample-distribution-analysis/plots` directory (still looking into this)

* update branch and rerun plots

* Fix represented proportions on treemap figures

- attempt 1 to fix overlapping text

* Add `dplyr::` to distinct and update comment

* Use `broad_histology` instead of `short_histology` in treemap

- change around the placement of labels and sizes to try to eliminate text overlapping in plot
- rerun plot

* remove old figure shell script

* add `scale_fill_identity` argument and rerun plot

* Decrease text size to try getting rid of overlapping text

* add subplot labels per @jashapiro suggestion

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* rerun plot with subplot label added

Co-authored-by: jashapiro <josh.shapiro@ccdatalab.org>
@jashapiro
Member

In #622, colors are defined for short_histology, but not for other histology definitions. I propose we have colors defined for broad_histology as well and a table that assigns the same colors (or slightly modified versions) to integrated_diagnosis. Unfortunately, it appears that short_histology does not neatly nest in broad_histology, which could make this a challenge.

I am also unsure of the difference between na_color and Other, but from examining the results in #633, it appears that Other should probably be colored the same (or a similarly neutral grey) to na_color in that the Other short_histology includes tumors of quite varied types. With its current prominent reddish color, this "category" seems to indicate meaning where none is likely to exist.

As a minor clarification, I propose that the histology_color_palette.tsv use short_histology rather than color_names as the column header. (I also like singular column headers, but that is super minor and probably too late to change!).
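To make the two suggestions above concrete, here is a small sketch. The data frame stands in for histology_color_palette.tsv; the hex_codes column name, the hex values, and the grey used for na_color are all assumptions for illustration:

```r
# Stand-in for histology_color_palette.tsv; the hex_codes column name and
# all hex values here are assumptions for illustration.
palette_df <- data.frame(
  short_histology = c("ATRT", "Medulloblastoma", "Other"),
  hex_codes = c("#3370cc", "#ffaa00", "#bf3030"),
  stringsAsFactors = FALSE
)
na_color <- "#f1f1f1"  # hypothetical neutral grey

# Give "Other" the same neutral color as missing values.
palette_df$hex_codes[palette_df$short_histology == "Other"] <- na_color

# Named vector suitable for scale_fill_manual(values = ...).
histology_colors <- setNames(palette_df$hex_codes, palette_df$short_histology)
```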

What I might like to see is something like the following for histology_color_palette, but I fear this is not possible, given constraints above.

integrated_diagnosis | short_histology | broad_histology | integrated_diagnosis_color | short_histology_color | broad_histology_color
Atypical Teratoid Rhabdoid Tumor | ATRT | Embryonal tumor | | |
Medulloblastoma | Medulloblastoma | Embryonal tumor | | |

Note: I will be filing a separate data issue about the short_histology labels, specifically Other which seems to include benign and metastatic tumors, as well as other broad histologies that do not seem like they should be collapsed for any reasonable grouping.

@cansavvy
Collaborator Author

> I am also unsure of the difference between na_color and Other, but from examining the results in #633, it appears that Other should probably be colored the same (or a similarly neutral grey) to na_color in that the Other short_histology includes tumors of quite varied types. With its current prominent reddish color, this "category" seems to indicate meaning where none is likely to exist.

Yes, I wasn't sure how Other vs no assignment for short_histology were being assigned, so I didn't want to merge that and lose the information, so I left it as is.

@jashapiro
Member

> > I am also unsure of the difference between na_color and Other, but from examining the results in #633, it appears that Other should probably be colored the same (or a similarly neutral grey) to na_color in that the Other short_histology includes tumors of quite varied types. With its current prominent reddish color, this "category" seems to indicate meaning where none is likely to exist.

> Yes, I wasn't sure how Other vs no assignment for short_histology were being assigned, so I didn't want to merge that and lose the information, so I left it as is.

Looks like NA is exclusively "non-tumor", which makes sense. But "Other" is a mix of unrelated things, as discussed in #647

@cbethell
Contributor

cbethell commented Apr 1, 2020

The oncoprint landscape plots in the oncoprint-landscape module of this repository currently implement a color palette sourced from https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/oncoprint-landscape/util/oncoplot-palette.R.

This color palette contains hex codes for unique categories of SNVs, CNVs, and fusion data.

It is being implemented in the PR getting the oncoprint landscape figure publication ready (WIP PR #666), and as @cansavvy noted in a review comment, it should probably be adjusted (made uniform) and incorporated into the color palette strategy.

@jaclyn-taroni
Member

We now have unified color palettes and their usage is documented here: https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/figures#color-palette-usage

The majority of figures in figures/png use these palettes. I am going to close this issue in favor of more focused issues for individual figures as they come up.
