class: center, middle, inverse, title-slide

# Introduction to ggplot2
## download @ bit.ly/2ZOShd4
### Victor Yuan
### 2020-07-09

---

# Set up

Install the tidyverse packages

```r
install.packages("tidyverse")
```

Load libraries

```r
library(tidyverse)
```

```
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
```

```
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
```

```
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
```

---

# Load gene expression / methylation data

```r
geo_data <- read_csv('https://raw.githubusercontent.com/wvictor14/TOG/master/data/GSE98224.csv')
geo_data
```

```
## # A tibble: 48 x 159
##    expr_geo_id meth_geo_id diagnosis tissue maternal_age maternal_bmi
##    <chr>       <chr>       <chr>     <chr>         <dbl>        <dbl>
##  1 GSM1940495  GSM2589532  PE        Place~           37         19.5
##  2 GSM1940496  GSM2589533  PE        Place~           40         25.7
##  3 GSM1940499  GSM2589534  PE        Place~           37         25
##  4 GSM1940500  GSM2589535  PE        Place~           38         26.2
##  5 GSM1940501  GSM2589536  PE        Place~           33         31.2
##  6 GSM1940502  GSM2589537  PE        Place~           26         31.2
##  7 GSM1940505  GSM2589538  PE        Place~           31         18.6
##  8 GSM1940506  GSM2589539  PE        Place~           37         25.2
##  9 GSM1940507  GSM2589540  non-PE    Place~           35         18.6
## 10 GSM1940508  GSM2589541  PE        Place~           32         26.6
## # ...
## #   with 38 more rows, and 153 more variables: maternal_ethnicity <chr>,
## #   ga_weeks <dbl>, ga_days <dbl>, transcript_8033795 <dbl>,
## #   transcript_8103881 <dbl>, transcript_7904014 <dbl>,
## #   transcript_8127692 <dbl>, transcript_7990031 <dbl>,
## #   transcript_8121144 <dbl>, transcript_8150846 <dbl>,
## #   transcript_7962246 <dbl>, transcript_7941890 <dbl>,
## #   transcript_7896644 <dbl>, transcript_7992897 <dbl>,
## #   transcript_7973002 <dbl>, transcript_7979800 <dbl>,
## #   transcript_8112007 <dbl>, transcript_8036686 <dbl>,
## #   transcript_8001325 <dbl>, transcript_8180328 <dbl>,
## #   transcript_8109283 <dbl>, transcript_8041223 <dbl>,
## #   transcript_8144703 <dbl>, transcript_7997556 <dbl>,
## #   transcript_7955896 <dbl>, transcript_7939897 <dbl>,
## #   transcript_8035078 <dbl>, transcript_8113094 <dbl>,
## #   transcript_7893397 <dbl>, transcript_8110708 <dbl>,
## #   transcript_8102610 <dbl>, transcript_8083407 <dbl>,
## #   transcript_8174592 <dbl>, transcript_7922299 <dbl>,
## #   transcript_7979269 <dbl>, transcript_8074593 <dbl>,
## #   transcript_7967810 <dbl>, transcript_8052562 <dbl>,
## #   transcript_7927775 <dbl>, transcript_8005601 <dbl>,
## #   transcript_8129974 <dbl>, transcript_8070295 <dbl>,
## #   transcript_7952795 <dbl>, transcript_8044743 <dbl>,
## #   transcript_7896053 <dbl>, transcript_7894489 <dbl>,
## #   transcript_8048889 <dbl>, transcript_7894063 <dbl>,
## #   transcript_8171539 <dbl>, transcript_8011396 <dbl>,
## #   transcript_7983157 <dbl>, transcript_8171848 <dbl>,
## #   transcript_8097443 <dbl>, cg04950931 <dbl>, cg21697851 <dbl>,
## #   cg20092728 <dbl>, cg12804791 <dbl>, cg11619216 <dbl>, cg07802350 <dbl>,
## #   cg13175060 <dbl>, cg25632577 <dbl>, cg11811391 <dbl>, cg20981848 <dbl>,
## #   cg14025883 <dbl>, cg25493658 <dbl>, cg01491071 <dbl>, cg03777414 <dbl>,
## #   cg20586124 <dbl>, cg16175792 <dbl>, cg25961733 <dbl>, cg13912117 <dbl>,
## #   cg27307465 <dbl>, cg23825057 <dbl>, cg17949440 <dbl>, cg04098985 <dbl>,
## #   cg16886987 <dbl>, cg22860917 <dbl>,
## #   cg21594328 <dbl>, cg23903035 <dbl>,
## #   cg14393923 <dbl>, cg25103160 <dbl>, cg04640920 <dbl>, cg01522692 <dbl>,
## #   cg23249922 <dbl>, cg15903956 <dbl>, cg10688297 <dbl>, cg07989490 <dbl>,
## #   cg16090790 <dbl>, cg01519765 <dbl>, cg18444702 <dbl>, cg16404259 <dbl>,
## #   cg12077460 <dbl>, cg22517735 <dbl>, cg01713086 <dbl>, cg16734734 <dbl>,
## #   cg00886182 <dbl>, cg07891440 <dbl>, cg15715892 <dbl>, cg21368161 <dbl>,
## #   cg03766264 <dbl>, ...
```

---

layout: false
class: inverse center middle text-white

# 3 essential components
## to every ggplot2 graph
### **Data**, **Geom**etry, **Aes**thetics

---

The first step of every ggplot2 call is to *declare* the data.

.pull-left[
```r
*ggplot(data = geo_data)
```
]

.pull-right[
<img src="intro-to-ggplot2_files/figure-html/our-first-plot-1-out-1.png" width="100%" height="100%" />
]

---

Then, we can assign variables in our data to different *aesthetics* of the plot.

.pull-left[
```r
ggplot(data = geo_data,
*      aes(x = ga_weeks,
*          y = cg20970886))
```

This is referred to as *aesthetic mapping*.
]

.pull-right[
<img src="intro-to-ggplot2_files/figure-html/our-first-plot-2-out-1.png" width="100%" height="100%" />
]

---

Add **geometries (geoms)** to complete the plot.

.pull-left[
```r
ggplot(data = geo_data, aes(x = ga_weeks, y = cg20970886)) +
* geom_point()
```

A geom specifies the type of plot you want (e.g. scatterplot, boxplot, histogram).
]

.pull-right[
<img src="intro-to-ggplot2_files/figure-html/our-first-plot-3-out-1.png" width="100%" height="100%" />
]

---

There are many *geoms*. Sometimes it makes sense to combine several.

.pull-left[
```r
ggplot(data = geo_data, aes(x = ga_weeks, y = cg20970886)) +
* geom_point() +
* geom_smooth(method = "lm")
```
]

.pull-right[
<img src="intro-to-ggplot2_files/figure-html/our-first-plot-4-out-1.png" width="100%" height="100%" />
]

---

We can assign other variables to other aesthetics, e.g. color.
.pull-left[
```r
ggplot(data = geo_data,
       aes(x = ga_weeks,
           y = cg20970886,
*          color = maternal_ethnicity)) +
  geom_point() +
  geom_smooth(method = "lm")
```

But note that this assigned maternal ethnicity to the color of both points and lines!
]

.pull-right[
<img src="intro-to-ggplot2_files/figure-html/our-first-plot-5-out-1.png" width="100%" height="100%" />
]

---

To assign color exclusively to points (and not lines), put the `aes()` call inside the specific geom:

.pull-left[
```r
ggplot(data = geo_data, aes(x = ga_weeks, y = cg20970886)) +
* geom_point(aes(color = maternal_ethnicity)) +
  geom_smooth(method = "lm")
```
]

.pull-right[
<img src="intro-to-ggplot2_files/figure-html/our-first-plot-6-out-1.png" width="100%" height="100%" />
]

---

We can also change the *shape* of the points.

.pull-left[
```r
ggplot(data = geo_data, aes(x = ga_weeks, y = cg20970886)) +
  geom_point(aes(color = maternal_ethnicity),
*            shape = 3) +
  geom_smooth(method = "lm")
```

See the [reference](https://ggplot2.tidyverse.org/reference/scale_shape.html) for a complete list of shapes.
]

.pull-right[
<img src="intro-to-ggplot2_files/figure-html/our-first-plot-6-2-out-1.png" width="100%" height="100%" />
]

---

A common mistake is to forget the `aes()` call.

.pull-left[
```r
ggplot(data = geo_data, aes(x = ga_weeks, y = cg20970886)) +
* geom_point(color = "blue", shape = 3) +
  geom_smooth(method = "lm")
```
]

.pull-right[
<img src="intro-to-ggplot2_files/figure-html/our-first-plot-7-out-1.png" width="100%" height="100%" />
]

Without `aes()`, the color is a single fixed value applied to all the data points.

---

At this point, we've covered the 3 essential components of any ggplot2 plot:

1. **Data** - declare with a `ggplot(data = ...)` call
2. **Aesthetics** - assign input to plot components with `aes()` (e.g. x/y position, color)
3. **Geoms** - declare the type of geometry, e.g.
`+ geom_point()` for points

---

# There are so many geoms

Each geom has its own required aesthetics, plus optional ones:

- `geom_point` requires `x` and `y`, usually numeric variables for a scatterplot
- `geom_boxplot` takes `x` and `y`, where `x` is usually categorical
- `geom_histogram` and `geom_density` require `x`
- `geom_text` requires `x`, `y`, and `label`

Check out the [tidyverse site](https://ggplot2.tidyverse.org/reference/#section-geoms) for the full list.

You can visit the help pages for more information on a specific geom's options (e.g. `?geom_point`).

Now that we know the basics, we can explore ways to customize our plots.

---

.left-code[
```r
*ggplot(data = geo_data)
```

We'll start by comparing the methylation of this CpG site between preeclamptic and non-preeclamptic samples.

First, we declare the data.
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-1-1-out-1.png" width="100%" height="100%" />
]

PE: diagnosed with preeclampsia

---

.left-code[
```r
ggplot(data = geo_data,
*      aes(x = diagnosis,
*          y = cg20970886,
*          fill = diagnosis))
```

Then we declare the mappings of our variables to aesthetics.
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-1-2-out-1.png" width="100%" height="100%" />
]

PE: diagnosed with preeclampsia

---

.left-code[
```r
ggplot(data = geo_data,
       aes(x = diagnosis,
           y = cg20970886,
           fill = diagnosis)) +
* geom_boxplot()
```

To specify that we want boxplots, we use `geom_boxplot`.
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-1-3-out-1.png" width="100%" height="100%" />
]

PE: diagnosed with preeclampsia

---

.left-code[
```r
ggplot(data = geo_data,
       aes(x = diagnosis,
           y = cg20970886,
           fill = diagnosis)) +
  geom_boxplot() +
* geom_point()
```

It can be informative to plot all individual data points on top of the boxplots.

To add individual data points, we simply add another geometry, `geom_point`.

But the points are hard to see when they overlap each other.
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-2-1-out-1.png" width="100%" height="100%" />
]

---

.left-code[
```r
ggplot(data = geo_data,
       aes(x = diagnosis,
           y = cg20970886,
           fill = diagnosis)) +
  geom_boxplot() +
* geom_jitter()
```

`geom_jitter` adds "noise" so that the points are spread out horizontally.
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-2-2-out-1.png" width="100%" height="100%" />
]

---

layout: false
class: inverse center middle text-white

# Customizing your graphs
# Scales and themes

---

# Scales

`aes` determines which data variables are mapped to each component of the graph.

`scale_*_*` functions determine *how* this mapping is done.

`scale_<aes>_<type>` calls all start with "`scale_`", followed by the target aesthetic (e.g. x, y, color, fill), and end with the type (e.g. discrete, continuous).

For example:

Want to change the limits on the y-axis, where the ticks appear, or switch to a log scale?

Use `scale_y_continuous(limits = c(0, 1))` or `scale_y_log10()`.

Want to change colors?

Use `scale_color_discrete()` for categorical variables and `scale_color_continuous()` for continuous variables.

---

.left-code[
```r
ggplot(data = geo_data,
       aes(x = diagnosis,
           y = cg20970886,
           fill = diagnosis)) +
  geom_boxplot() +
  geom_jitter() +
* scale_fill_manual(values = c("orange", "#7ED7F2"))
```

Here I assign specific colors to the categories of the diagnosis variable.

I supplied a vector of colors (hex codes are allowed) with the same length as the number of categories in the variable `diagnosis`.
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-3-out-1.png" width="100%" height="100%" />
]

---

.left-code[
```r
ggplot(data = geo_data,
       aes(x = diagnosis,
           y = cg20970886,
           fill = diagnosis)) +
  geom_boxplot() +
  geom_jitter() +
  scale_fill_manual(values = c("orange", "#7ED7F2")) +
* scale_x_discrete(labels = c("Controls",
*                             "Cases"))
```

Here I change the labels of my x-axis.
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-4-1-out-1.png" width="100%" height="100%" />
]

---

.left-code[
```r
ggplot(data = geo_data,
       aes(x = diagnosis,
           y = cg20970886,
           fill = diagnosis)) +
  geom_boxplot() +
  geom_jitter() +
  scale_fill_manual(values = c("orange", "#7ED7F2")) +
* scale_x_discrete(labels = c("non-PE" = "Controls",
*                             "PE" = "Cases"))
```

It's better to be explicit about which label corresponds to which category.
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-4-2-out-1.png" width="100%" height="100%" />
]

---

.left-code[
```r
ggplot(data = geo_data,
       aes(x = diagnosis,
           y = cg20970886,
           fill = diagnosis)) +
  geom_boxplot() +
  geom_jitter() +
  scale_fill_manual(values = c("orange", "#7ED7F2")) +
  scale_x_discrete(labels = c("non-PE" = "Controls",
                              "PE" = "Cases")) +
* scale_y_continuous(limits = c(0, 1),
*                    breaks = c(0, 0.5, 1))
```

Here I expand the y-axis to span 0 to 1, the natural range of methylation. I also change where I want the ticks (i.e. "breaks") to appear.

Note that the y-axis variable is numeric and the x-axis variable is categorical, and that the respective scale calls (`scale_y_continuous`, `scale_x_discrete`) reflect that.
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-5-out-1.png" width="100%" height="100%" />
]

---

.left-code[
```r
ggplot(data = geo_data,
       aes(x = diagnosis,
           y = cg20970886,
           fill = diagnosis)) +
  geom_boxplot() +
  geom_jitter() +
  scale_fill_manual(values = c("orange", "#7ED7F2")) +
  scale_x_discrete(labels = c("non-PE" = "Controls",
                              "PE" = "Cases")) +
  scale_y_continuous(limits = c(0, 1),
                     breaks = c(0, 0.5, 1)) +
* theme(axis.text = element_text(colour = 'blue'))
```

The **`theme()`** function call allows customization of the non-data components of a plot.

This includes things like the title, labels, font size, and gridlines.
Pull up `?theme` to see a full description of all options.
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-6-out-1.png" width="100%" height="100%" />
]

---

.left-code[
```r
ggplot(data = geo_data,
       aes(x = diagnosis,
           y = cg20970886,
           fill = diagnosis)) +
  geom_boxplot() +
  geom_jitter() +
  scale_fill_manual(values = c("orange", "#7ED7F2")) +
  scale_x_discrete(labels = c("non-PE" = "Controls",
                              "PE" = "Cases")) +
  scale_y_continuous(limits = c(0, 1),
                     breaks = c(0, 0.5, 1)) +
* theme(axis.text = element_text(colour = 'blue'),
*       panel.grid.major = element_line(colour = 'black'),
*       panel.grid.minor = element_blank())
```

Most `theme()` arguments require an `element_*()` call as input.

The type of element depends on the component being styled (e.g. `element_text` for `axis.text`, `element_rect` for `panel.border`).

Use `element_blank()` to remove a component.
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-7-out-1.png" width="100%" height="100%" />
]

---

.left-code[
```r
ggplot(data = geo_data,
       aes(x = diagnosis,
           y = cg20970886,
           fill = diagnosis)) +
  geom_boxplot() +
  geom_jitter() +
  scale_fill_manual(values = c("orange", "#7ED7F2")) +
  scale_x_discrete(labels = c("non-PE" = "Controls",
                              "PE" = "Cases")) +
  scale_y_continuous(limits = c(0, 1),
                     breaks = c(0, 0.5, 1)) +
* theme_bw(base_size = 20)
```

There are some predefined themes that look nice and are easy to use:
- `theme_gray` - default ggplot2 theme
- `theme_classic` - minimal with no gridlines
- `theme_bw` - clean look with white background

[List of complete ggplot2 themes](https://ggplot2.tidyverse.org/reference/ggtheme.html)
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-8-out-1.png" width="100%" height="100%" />
]

---

.left-code[
```r
ggplot(data = geo_data,
       aes(x = diagnosis,
           y = cg20970886,
           fill = diagnosis)) +
  geom_boxplot() +
  geom_jitter() +
  scale_fill_manual(values = c("orange", "#7ED7F2")) +
  scale_x_discrete(labels = c("non-PE" = "Controls",
                              "PE" = "Cases")) +
  scale_y_continuous(limits = c(0, 1),
                     breaks = c(0, 0.5, 1)) +
  theme_bw(base_size = 20) +
* theme(legend.position = 'top')
```

You can customize these complete themes by calling `theme()` afterwards, e.g. after `theme_bw()`.
]

.right-plot[
<img src="intro-to-ggplot2_files/figure-html/fine-tune-9-out-1.png" width="100%" height="100%" />
]

---

.left-code[
```r
*p <- ggplot(data = geo_data,
             aes(x = diagnosis,
                 y = cg20970886,
                 fill = diagnosis)) +
  geom_boxplot() +
  geom_jitter() +
  scale_fill_manual(values = c("orange", "#7ED7F2")) +
  scale_x_discrete(labels = c("non-PE" = "Controls",
                              "PE" = "Cases")) +
  scale_y_continuous(limits = c(0, 1),
                     breaks = c(0, 0.5, 1)) +
  theme_bw(base_size = 20) +
  theme(legend.position = 'top')
```
]

.right-plot[
There are a couple of options to save plots in R.

Probably the simplest way is to use `ggsave` from `ggplot2`.

The first thing to do is to assign your plot to an object.
I assigned our plot to the object named `p`.
]

---

.left-code[
```r
p <- ggplot(data = geo_data,
            aes(x = diagnosis,
                y = cg20970886,
                fill = diagnosis)) +
  geom_boxplot() +
  geom_jitter() +
  scale_fill_manual(values = c("orange", "#7ED7F2")) +
  scale_x_discrete(labels = c("non-PE" = "Controls",
                              "PE" = "Cases")) +
  scale_y_continuous(limits = c(0, 1),
                     breaks = c(0, 0.5, 1)) +
  theme_bw(base_size = 20) +
  theme(legend.position = 'top')

*ggsave(plot = p,
*       filename = "this-plot.png",
*       device = 'png',
*       dpi = 72,
*       height = 5,
*       width = 7)
```
]

.right-plot[
Then we can call `ggsave` on the object `p`.

I would recommend specifying the following options:

- `filename`, the name and location where you want the plot to be saved
- `device`, the type of image file (e.g. "pdf", "png", "tiff")
- `height`, `width`, the dimensions of your plot (in inches by default)
- `dpi`, the resolution

After you run the code, check your working directory for the png file.
]

---

# Resources

- Stack Exchange for online help
- TOG study group / slack
- [Past TOG workshops](https://github.com/BCCHR-trainee-omics-group/StudyGroup)
- [ggplot2 extensions](https://exts.ggplot2.tidyverse.org/)
- [ggplot2 cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
- [R for Data Science: data visualization chapter](https://r4ds.had.co.nz/data-visualisation.html)
- [Eva Maerey's ggplot2 grammar guide](https://evamaerey.github.io/ggplot2_grammar_guide/about)

---
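
# Recap sketch: mapped vs. fixed aesthetics

As a recap of the three components (data, aesthetics, geoms), here is a minimal sketch that runs without downloading anything. It uses the built-in `mtcars` dataset, which is an assumption made for illustration and not part of the workshop data; the object names `p_mapped` and `p_fixed` are hypothetical. It contrasts a color mapped inside `aes()` with a fixed color set outside it.

```r
library(ggplot2)

# mtcars ships with R, so no download is needed (illustrative data only).

# Mapped aesthetic: color varies with a variable, so it goes inside aes().
p_mapped <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "lm")

# Fixed aesthetic: one color for every point, so it goes outside aes().
p_fixed <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue", shape = 3) +
  geom_smooth(method = "lm")

# Printing a ggplot object draws it.
p_mapped
p_fixed
```

Because `color = factor(cyl)` sits inside `aes()`, ggplot2 builds a legend for it; the fixed `"blue"` gets no legend.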