Multiple Sequence Alignments with r-gt!

Author

Victor Yuan

Published

October 3, 2025

Multiple sequence alignment (MSA) is a computational technique for comparing biological sequences and identifying regions of similarity and difference. It is used to identify conserved functional domains and to understand evolutionary relationships between related proteins.

For the 2025 Posit Table contest I wanted to explore how MSAs can be effectively visualized using the R package gt. This has been something I have wanted to do for a long time, and I’m excited to share this exploration.

The data we are working with is an MSA from Wang et al. (2021), who identified a potential universally conserved “weak spot” in coronavirus spike proteins that is targeted by specific cross-reactive monoclonal antibodies. The spike protein was the primary target for developing the COVID-19 vaccines, which saved millions of lives in the fight against the pandemic.

Figure 5 from their study presents an MSA comparing spike proteins from several coronavirus species: SARS-CoV, SARS-CoV-2, MERS-CoV, and HCoV-OC43 (which causes the common cold). This visualization left an immediate impression on me. Despite the large divergence between these species, the authors were able to identify a region with enough similarity to serve as a target for broadly reactive antibodies.

Let’s explore how we can leverage gt to better understand the evolutionary relationships between coronavirus protein sequences!

This section is to prepare for generating the visualizations. We keep this hidden in the output.

Libraries

Code

library(dplyr)
library(here)
library(tidyr)
library(gt)
library(stringr)
library(readr)
library(htmltools)
library(glue)

Data

Code

# palette
palette_aa <- readr::read_csv(here::here("data", "palette_amino_acids.csv"))

Rows: 27 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): symbol, Chemistry, Shapely, Zappo, Taylor, LETTER, Hydrophobicity

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

# msa data
FILE_MSA <- here::here("data", "msa.csv")
msa <- readr::read_csv(FILE_MSA)

Rows: 8 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): group, name, seq
dbl (1): start

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

# long format
msa_long <- msa |>
  mutate(seq = as.character(seq) |> str_split("")) |>
  tidyr::unnest(seq) |>
  mutate(position = row_number(), .by = name)

# table data
tbl_data <- msa_long |>
  pivot_wider(
    id_cols = c(group, name, start),
    names_from = "position",
    values_from = "seq",
    names_prefix = "pos_"
  )
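
To see what this reshaping produces, here is a minimal base-R sketch on two made-up, already-aligned sequences. The `toy` data frame and its sequences are hypothetical and only illustrate the one-column-per-position layout. The real pipeline uses tidyr as shown above, and this sketch assumes all sequences have the same length.

```r
# Hypothetical toy alignment (not the coronavirus data): two aligned sequences
toy <- data.frame(
  name = c("seqA", "seqB"),
  seq = c("TS-P", "PLQP"),
  stringsAsFactors = FALSE
)

# split each sequence into single characters (one list element per sequence)
residues <- strsplit(toy$seq, "")

# stack the character vectors into a matrix: one row per sequence, one column per position
wide <- do.call(rbind, residues)
colnames(wide) <- paste0("pos_", seq_len(ncol(wide)))
rownames(wide) <- toy$name

wide
```

Each `pos_*` column then corresponds to one alignment column, which is the shape gt needs to draw one table cell per residue.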

Custom functions

Palette

Code

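#' Returns a coloring function based on the chosen palette
#' @param palette name of a palette column in `palette_aa`
apply_color_to_aa <- function(
    palette = c(
      "Chemistry",
      "Shapely",
      "Zappo",
      "Taylor",
      "LETTER",
      "Hydrophobicity"
    )) {
  # validate the choice; defaults to the first palette ("Chemistry")
  palette <- match.arg(palette)
  # named vector: names are amino acid symbols, values are colors
  target_palette <- palette_aa[, c("symbol", palette)] |> tibble::deframe()

  # the returned function maps amino acid letters to colors;
  # symbols missing from the palette are colored grey
  function(letter) {
    color <- target_palette[letter]
    color[is.na(color)] <- "grey"
    return(color)
  }
}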

Consensus

Code

#' Creates a vertical bar inside a parent div (by default 100% wide and 50px tall)
#' @param bar_height a number between 0 and 1 that sets the height of the bar as a fraction of the total height of the container (the parent) div
#' @param bar_color valid css color for the bar
#' @param parent_div_height valid css unit for height of the parent div
#' @param parent_div_width valid css unit for width of the parent div
create_html_vertical_bar <- function(
    bar_height = .5,
    bar_color = "#4a6fa5",
    parent_div_height = "50px",
    parent_div_width = "100%") {
  .bar_height <- bar_height * 100
  div(
    style = glue("width:{parent_div_width};height:{parent_div_height};margin:0;padding:0;"),
    div(style = glue("height:{100-.bar_height}%;background:transparent;margin:0;padding:0;")),
    div(style = glue("height:{.bar_height}%;background:{bar_color};margin:0;padding:0;"))
  )
}

#' Get consensus frequency
#'
#' returns a length-1 named numeric value (a one-cell table) for the most frequent amino acid, where:
#'  - the value is the frequency
#'  - the name is amino acid
#'
#' Does not handle ties (one of the ties will be returned)
#' @param x a vector, ideally of interesting biological sequences
get_consensus <- function(x) {
  # count every symbol; gaps ("-") are not dropped and are counted like any residue
  frequency_tbl <- x |>
    table() |>
    sort(decreasing = TRUE)
  consensus_freq <- (frequency_tbl / sum(frequency_tbl))[1]

  return(consensus_freq)
}

#' gets the consensus of the vector x, and returns an html bar
get_consensus_return_bar <- function(x) {
  consensus_freq <- get_consensus(x)
  create_html_vertical_bar(consensus_freq)
}

# usage:
c("X", "X", "O") |>
  get_consensus_return_bar() |>
  browsable()

Code

c("X", "O") |>
  get_consensus_return_bar() |>
  browsable()
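
To make the consensus calculation concrete, here is a minimal base-R sketch applied to one column of the alignment. `consensus_of` and its `drop_gaps` argument are hypothetical names introduced for illustration and are not part of the functions above. The input is column `pos_14` copied from the table shown later in this post.

```r
# Hypothetical helper (not from the post's code): most frequent symbol in one alignment column
consensus_of <- function(column, drop_gaps = FALSE) {
  # optionally ignore gap characters before counting
  if (drop_gaps) column <- column[column != "-"]
  counts <- table(column)
  # on ties, which.max() keeps the first symbol in alphabetical order
  top <- which.max(counts)
  list(
    residue = names(counts)[top],
    frequency = counts[[top]] / length(column)
  )
}

# pos_14 for OC43, MHV, HKU1, SARS, SARS2, MERS, 229E, NL63
pos_14 <- c("D", "D", "S", "D", "D", "D", "Q", "Q")
consensus_of(pos_14)
```

Here aspartic acid (D) appears in 5 of the 8 sequences, so the consensus residue is D with a frequency of 0.625. That frequency is what sets the height of the bar.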

Breaks

generate_breaks takes the position columns and returns the ones that should be labeled. It uses scales::breaks_width to generate a break every 5 amino acids.

Code

#' Treats the column names as an axis, and generates breaks using `scales::breaks_width`
#'
#' @return a vector of column names at every `width` position
generate_breaks <- function(tbl, width = 5) {
  seq_cols <- tbl |> select(contains("pos"))
  n_col <- seq_cols |> ncol()

  breaks <- scales::breaks_width(width, 0)(c(1, n_col))

  # handle case where breaks extend beyond data
  breaks <- intersect(breaks, 1:n_col)

  col_breaks <- colnames(seq_cols[, breaks])
  return(col_breaks)
}

breaks <- generate_breaks(tbl_data)
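
For intuition, the labels I would expect for a 95-column alignment can be reproduced with plain `seq()`. This is a sketch of the expected result, assuming breaks fall on multiples of `width`. It is not a replacement for the scales-based function above, and `n_col`, `break_positions`, and `break_cols` are names introduced here.

```r
# Sketch: which columns get a label when breaks fall on every 5th position
n_col <- 95   # number of position columns in the alignment
width <- 5    # label every 5th column

# multiples of `width` that fall inside 1..n_col
break_positions <- seq(from = width, to = n_col, by = width)

# column names in the same "pos_<n>" style used by the table data
break_cols <- paste0("pos_", break_positions)

head(break_cols, 3)
```

All other position columns get an empty label, which keeps the header readable at 13px per column.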

Highlight a region with a rectangle

Annotates the sequences with a rectangle around a specific region. Only works on ungrouped gt.

This function is not well generalized.

Code

#' Draws a rectangle around a section in a gt table
#'
#' @param .gt_tbl a gt table
#' @param start start column
#' @param end end column
annotate_rectangle <- function(.gt_tbl, start, end) {
  .gt_tbl |>
    # add a rectangle to highlight region
    tab_style(
      style = cell_borders(sides = "left", weight = px(2)),
      locations = cells_body(columns = {{ start }})
    ) |>
    tab_style(
      style = cell_borders(sides = "right", weight = px(2)),
      locations = cells_body(columns = {{ end }})
    ) |>
    tab_style(
      style = cell_borders(sides = c("top"), weight = px(2)),
      locations = cells_body(columns = {{ start }}:{{ end }}, rows = 1)
    ) |>
    # For the bottom border of the rectangle, need to target the top border of grand summary, if present
    # if grand summary is not present, need to target the bottom of the cells_body
    tab_style(
      style = cell_borders(sides = c("top"), weight = px(2), style = "solid"),
      locations = cells_grand_summary(columns = {{ start }}:{{ end }}, rows = 1)
    )
}

Visualizing Conserved Regions Across Coronavirus Spike Proteins

Click through the tabs below to explore each region of the coronavirus spike protein alignment in detail.

This is my re-creation of the multiple sequence alignment of coronavirus spike proteins from Figure 5 of Wang et al. (2021).

Code

base_table <- tbl_data |>
  gt(rowname_col = c("name"), groupname_col = "group") |>
  # add consensus sequence
  grand_summary_rows(
    columns = contains("pos"),
    fns = list(
      Consensus ~ get_consensus_return_bar(.) |>
        div(style = "width:100%;height:50px;") |>
        as.character(),
      Sequence ~ names(get_consensus(.)) |>
        # change to double dash, otherwise fmt_markdown turns it into a list (ul)
        stringr::str_replace("-", "--")
    ),
    fmt = list(
      bar ~ fmt_markdown(.)
    ),
    missing = ""
  ) |>
  ## style consensus sequence
  ### make sure that the consensus sequence elements are centered
  tab_style(
    style = cell_text(align = "center"),
    locations = cells_grand_summary(columns = contains("pos_"))
  ) |>
  tab_style(
    style = cell_borders(sides = "bottom", style = "hidden"),
    locations = list(cells_grand_summary(rows = 1), cells_stub_grand_summary(rows = 1))
  ) |>
  # style the sequence elements: center elements, adjust size
  tab_style(
    style = cell_text(
      size = "small",
      align = "center",
      indent = 0
    ),
    locations = list(cells_body(columns = contains("pos_")), cells_grand_summary(columns = contains("pos_")))
  ) |>
  cols_width(
    1 ~ px(60),
    name ~ px(50),
    start ~ px(40),
    everything() ~ px(13)
  ) |>
  cols_align("right", group:start) |>
  # breaks
  cols_label_with(
    fn = ~ ifelse(. %in% breaks, ., "") |> str_remove("pos_"),
    columns = contains("pos_")
  ) |>
  # remove borders from the table body
  tab_style(
    style = list(
      cell_borders(
        sides = "all",
        weight = px(0)
      )
    ),
    locations = list(
      cells_body()
    )
  ) |>
  # annotation regions in the alignment
  tab_spanner(
    columns = pos_1:pos_21,
    label = "Stem helix"
  ) |>
  tab_spanner(
    columns = pos_33:pos_80,
    label = "HR2 region"
  ) |>
  tab_spanner(
    columns = pos_91:pos_95,
    label = "TM region"
  ) |>
  # epitope
  cols_label(
    pos_14 ~ "*",
    pos_15 ~ "",
    pos_16 ~ "*",
    pos_17 ~ "*",
    pos_19 ~ "!",
  ) |>
  # borders, style
  tab_options(
    row_group.as_column = TRUE,
    table.font.size = 14,

    # adjust padding in the cell body
    data_row.padding.horizontal = px(2),
    data_row.padding = px(2),

    # adjust padding in the grand summary
    # Noting that padding creates space between the bars (which have 100% width)
    grand_summary_row.padding.horizontal = px(0),
    grand_summary_row.padding = px(2),

    # # remove borders
    table.border.top.style = "hidden",
    grand_summary_row.border.width = px(2)
  )

base_table |>
  gt::data_color(columns = contains("pos_"), fn = apply_color_to_aa(palette = "Chemistry"))

		start	Stem helix																								25					30			HR2 region																																																				85					90	TM region
		start					5					10				*		*	*		!	20					25					30					35					40					45					50					55					60					65					70					75					80					85					90					95
beta	OC43	1225	T	S	I	P	N	L	P	D	F	K	E	E	L	D	Q	W	F	K	N	Q	T	S	-	V	A	P	D	L	S	L	D	Y	-	-	I	N	V	T	F	L	D	L	Q	V	-	-	-	-	-	-	-	-	-	-	-	-	-	-	E	M	N	R	L	Q	E	A	I	K	V	L	N	Q	S	Y	I	N	L	K	D	I	G	T	Y	E	Y	Y	V	K	W	P	W	Y	V	W	L
	MHV	1191	T	S	I	P	N	P	P	D	F	K	E	E	L	D	Q	W	F	K	K	Q	T	S	-	I	A	P	D	L	S	L	D	F	E	K	L	N	V	T	L	L	D	L	T	Y	-	-	-	-	-	-	-	-	-	-	-	-	-	-	E	M	N	R	I	Q	D	A	I	K	K	L	N	E	S	Y	I	N	L	K	E	V	G	T	Y	E	M	Y	V	K	W	P	W	Y	V	W	L
	HKU1	1226	H	S	V	P	K	L	S	D	F	E	S	E	L	S	H	W	F	K	N	Q	T	S	-	I	A	P	N	L	T	L	N	L	H	T	I	N	A	T	F	L	D	L	Y	Y	-	-	-	-	-	-	-	-	-	-	-	-	-	-	E	M	N	L	I	Q	E	S	K	L	S	L	N	N	S	Y	I	N	L	K	D	I	G	T	Y	E	M	Y	V	K	W	P	W	Y	V	W	L
	SARS	1122	P	L	Q	P	E	L	D	S	F	K	E	E	L	D	K	Y	F	K	N	H	T	S	-	P	D	V	D	F	G	-	D	I	S	G	I	N	A	S	V	V	N	I	Q	K	-	-	-	-	-	-	-	-	-	-	-	-	-	-	E	I	D	R	L	N	E	V	A	K	N	L	N	E	S	L	I	D	L	Q	E	L	G	K	Y	E	Q	Y	I	K	W	P	W	Y	V	W	L
	SARS2	1140	P	L	Q	P	E	L	D	S	F	K	E	E	L	D	K	Y	F	K	N	H	T	S	-	P	D	V	D	L	G	-	D	I	S	G	I	N	A	S	V	V	N	I	Q	K	-	-	-	-	-	-	-	-	-	-	-	-	-	-	E	I	D	R	L	N	E	V	A	K	N	L	N	E	S	Y	I	D	L	K	E	L	G	N	Y	T	Y	Y	N	K	W	P	W	Y	I	W	L
	MERS	1223	L	G	N	S	T	G	I	D	F	Q	D	E	L	D	E	Y	F	K	N	V	S	T	-	S	I	P	N	F	G	-	S	L	T	Q	I	N	T	T	L	L	D	L	T	Y	-	-	-	-	-	-	-	-	-	-	-	-	-	-	E	M	L	S	L	Q	Q	V	V	K	A	L	N	E	S	Y	I	D	L	K	E	L	G	N	Y	T	Y	Y	N	K	W	P	W	Y	I	W	L
alpha	229E	1025	T	I	V	P	E	Y	I	D	V	N	K	T	L	Q	E	L	S	Y	K	L	P	N	Y	T	V	P	D	L	-	-	V	V	E	Q	Y	N	Q	T	I	L	N	L	T	S	E	I	S	T	L	E	N	K	S	A	E	L	N	Y	T	V	Q	K	L	Q	T	L	I	D	N	I	N	S	T	L	V	D	L	K	W	L	N	R	V	E	T	Y	I	K	W	P	W	Y	V	W	V
alpha	NL63	1208	T	V	I	P	D	Y	V	D	V	N	K	T	L	Q	E	F	A	Q	N	L	P	K	Y	V	K	P	N	F	-	-	D	L	T	P	F	N	L	T	Y	L	N	L	S	S	E	L	K	Q	L	E	A	K	T	A	S	L	F	Q	T	T	V	E	L	Q	G	L	I	D	Q	I	N	S	T	Y	V	D	L	K	L	L	N	R	F	E	N	Y	I	K	W	P	W	Y	V	W	V
Consensus
Sequence			T	S	I	P	E	L	D	D	F	K	E	E	L	D	E	W	F	K	N	Q	T	S	–	I	A	P	D	L	G	–	D	L	E	G	I	N	A	T	F	L	D	L	Q	Y	–	–	–	–	–	–	–	–	–	–	–	–	–	–	E	M	N	R	L	Q	E	V	I	K	N	L	N	E	S	Y	I	D	L	K	E	L	G	T	Y	E	Y	Y	I	K	W	P	W	Y	V	W	L

This 95-amino-acid section contains several functionally important regions: the epitope where antibodies bind, the stem helix, heptad repeat region 2 (HR2), and the transmembrane domain (TM).

Click through the rest of these sections, as I use gt to help understand the amino acid composition of these regions in detail.

A note on the color palettes

MSAs commonly use color to help readers understand the variability in amino acid composition across biological sequences. Amino acids have all sorts of properties, such as hydrophobicity, size, and 3D structure. By coloring amino acids according to their biochemical properties, we can better understand the variation in amino acid composition.

The color palette used above comes from ggmsa and groups chemically similar amino acids together:

Acidic amino acids glutamic (E) and aspartic acid (D) are red
Nonpolar/hydrophobic amino acids proline (P), alanine (A), valine (V), methionine (M), leucine (L), isoleucine (I), and glycine (G) are orange
Basic amino acids lysine (K) and arginine (R) are blue
Polar amino acids asparagine (N), threonine (T), cysteine (C), glutamine (Q), and serine (S) are green
Aromatic amino acids phenylalanine (F), tyrosine (Y), and tryptophan (W) are yellow
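
The grouping above amounts to a lookup from residue to color. The sketch below builds that lookup in base R. The plain color names are taken from the list above and stand in for whatever exact values are stored in the palette file, and `chemistry_groups`, `chemistry_lookup`, and `color_residues` are hypothetical names introduced here.

```r
# Groups as described in the list above; color names are placeholders for the real palette values
chemistry_groups <- list(
  red    = c("E", "D"),
  orange = c("P", "A", "V", "M", "L", "I", "G"),
  blue   = c("K", "R"),
  green  = c("N", "T", "C", "Q", "S"),
  yellow = c("F", "Y", "W")
)

# invert the list into a named vector: names are residues, values are colors
chemistry_lookup <- setNames(
  rep(names(chemistry_groups), lengths(chemistry_groups)),
  unlist(chemistry_groups, use.names = FALSE)
)

# map residues to colors; anything not listed (gaps, unlisted residues) becomes grey
color_residues <- function(residues) {
  colors <- unname(chemistry_lookup[residues])
  colors[is.na(colors)] <- "grey"
  colors
}

color_residues(c("D", "K", "W", "-"))
```

A function with this shape (a vector of letters in, a vector of colors out) is the kind of function that `apply_color_to_aa()` returns and passes to `data_color()`.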

In this analysis, I implemented several alternative palettes and explore how effectively they can be used to understand the amino acid composition of protein sequences.

The region where the antibodies bind

Code

base_table |>
  gt::data_color(columns = pos_11:pos_19, fn = apply_color_to_aa(palette = "Chemistry")) |>
  annotate_rectangle(start = pos_11, end = pos_18)

[Table output: the same alignment as the first table above, with pos_11 to pos_19 colored using the Chemistry palette and a rectangle drawn around pos_11 to pos_18.]

At the end of the stem helix lies the epitope region, where the antibodies 28D9/1.6C7 (the very creative names of the antibodies) bind. This is where the magic happens. In this region, also referred to as the “core epitope”, asterisks mark the amino acid positions that are key for this interaction. If these positions are replaced with other amino acids, the antibodies can no longer bind effectively.

How did the researchers determine this? By quite literally changing each amino acid one at a time and measuring antibody binding. This mutagenesis experiment revealed that some amino acids in this region can change without affecting binding too much, but these 3 amino acids are critical.

The exclamation point marks an amino acid adjacent to the epitope, which the authors refer to as a “conserved glycosylation sequon”, or “NxS/T”. This means it is an amino acid that a sugar can attach to, and it follows the pattern of “the amino acid N, followed by any amino acid X, and then either an S or a T”. Apparently this is a conserved pattern among coronavirus species. What’s significant here is that the authors suggest this sugar molecule may be important for the binding of these antibodies to this epitope.
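
The “NxS/T” pattern is easy to express as a regular expression. The sketch below checks the three residues at `pos_19:pos_21` for each sequence, copied by hand from the table above. `sequon_window` and `has_sequon` are names introduced here, and the pattern follows the description above literally ("any amino acid" for X).

```r
# Residues at pos_19, pos_20, pos_21 for each sequence, copied from the alignment table
sequon_window <- c(
  OC43  = "NQT",
  MHV   = "KQT",
  HKU1  = "NQT",
  SARS  = "NHT",
  SARS2 = "NHT",
  MERS  = "NVS",
  `229E` = "KLP",
  NL63  = "NLP"
)

# N, then any amino acid, then S or T
has_sequon <- grepl("^N[A-Z][ST]$", sequon_window)
names(has_sequon) <- names(sequon_window)

has_sequon
```

At this exact alignment column, the motif matches for OC43, HKU1, SARS, SARS2, and MERS. This only tests one fixed window, so it says nothing about sequons that might sit at neighboring positions in the other sequences.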

This type of domain forms a three-dimensional, spiral-like structure.

Code

base_table |>
  gt::data_color(columns = pos_1:pos_21, fn = apply_color_to_aa(palette = "Shapely")) |>
  annotate_rectangle(start = pos_1, end = pos_21)

[Table output: the same alignment as the first table above, with pos_1 to pos_21 colored using the Shapely palette and a rectangle drawn around pos_1 to pos_21.]

The Stem Helix is the orange part highlighted in this protein modeling figure.

A color palette from the bioinformatics software RasMol is shown here, which is based on Bob Fletterick’s “Shapely Models”.

HR2 stands for “Heptad Repeat Region 2”

Code

base_table |>
  gt::data_color(columns = pos_33:pos_80, fn = apply_color_to_aa(palette = "LETTER")) |>
  annotate_rectangle(start = pos_33, end = pos_80)

[Table output: the same alignment as the first table above, with pos_33 to pos_80 colored using the LETTER palette and a rectangle drawn around pos_33 to pos_80.]

The epitope occurs upstream of the Heptad Repeat Region 2 (HR2). Why is it called “Heptad Repeat”? Because this is a region where a pattern tends to repeat every 7 amino acids.

Can you spot the pattern? (If you can, do let me know because I can’t)

We can adjust the palette to better identify the repeat pattern.

Code

base_table |>
  gt::data_color(columns = pos_33:pos_80, fn = apply_color_to_aa(palette = "Hydrophobicity")) |>
  annotate_rectangle(start = pos_33, end = pos_80)

[Table output: the same alignment as the first table above, with pos_33 to pos_80 colored using the Hydrophobicity palette and a rectangle drawn around pos_33 to pos_80.]

Heptad repeat regions are typically identified via 3D structural modelling, meaning that their repetitiveness is difficult to observe in linear amino acid sequences. The length of each repeat is also imperfect, and the repeated residue is not necessarily the same amino acid, just one that is highly similar, usually a hydrophobic amino acid (e.g. leucine, valine, phenylalanine). These typically occur at positions 1 and 4 of each seven-amino-acid unit.

We can try to see this pattern using a color palette that highlights the relative hydrophobicity between amino acids.
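
The "positions 1 and 4" rule can also be checked numerically. The sketch below labels residues with a repeating seven-letter register (a to g, where a and d are positions 1 and 4) and, for each of the seven possible starting offsets, measures the share of hydrophobic residues at a and d. The `hydrophobic` set is my own assumption, the helper names are hypothetical, and the segment is the gap-free SARS-CoV-2 stretch at `pos_59:pos_80` copied from the table.

```r
# Assumed set of hydrophobic residues (a simplification, not taken from the palette file)
hydrophobic <- c("A", "V", "I", "L", "M", "F", "W", "Y")

# Label n residues with a repeating a-g register, starting `offset` steps into the repeat
heptad_register <- function(n, offset = 0) {
  letters[((seq_len(n) - 1 + offset) %% 7) + 1]
}

# Share of hydrophobic residues at register positions a and d (positions 1 and 4)
score_register <- function(residues, offset) {
  register <- heptad_register(length(residues), offset)
  core <- residues[register %in% c("a", "d")]
  mean(core %in% hydrophobic)
}

# SARS-CoV-2 residues at pos_59:pos_80 (no gaps in this stretch)
segment <- strsplit("EIDRLNEVAKNLNESYIDLKEL", "")[[1]]

# try all seven possible registers
scores <- vapply(0:6, function(o) score_register(segment, o), numeric(1))
names(scores) <- paste0("offset_", 0:6)
round(scores, 2)
```

For this stretch, one register (`offset_6`) puts a hydrophobic residue at every a and d position (I, L, A, L, Y, L), which is the 3-4-3-4 spacing expected of a heptad repeat. This is a sequence-only heuristic under the assumed hydrophobic set, so it hints at a register and does not confirm the structural one.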

This region anchors the protein to the outer-most membrane of the virus.

Code

base_table |>
  gt::data_color(columns = pos_91:pos_95, fn = apply_color_to_aa(palette = "Taylor")) |>
  annotate_rectangle(start = pos_91, end = pos_95)

[Table output: the same alignment as the first table above, with pos_91 to pos_95 colored using the Taylor palette and a rectangle drawn around pos_91 to pos_95.]

The final region is called the Transmembrane (TM) region, which is the part of the protein that spans across a phospholipid bilayer. It is an important structural component that anchors the spike protein to the viral envelope.

Because of the lipid-rich environment of this layer, TM regions contain many hydrophobic and nonpolar residues, such as tryptophan (W), tyrosine (Y), valine (V), and leucine (L).

The Taylor color palette is taken from the popular MSA program Jalview.

If you enjoyed this article and want to use some of these functions yourself, check out the development of gtseq

References

Wang, Chunyan, Rien van Haperen, Javier Gutiérrez-Álvarez, Wentao Li, Nisreen M. A. Okba, Irina Albulescu, Ivy Widjaja, et al. 2021. “A Conserved Immunogenic and Vulnerable Site on the Coronavirus Spike Protein Delineated by Cross-Reactive Monoclonal Antibodies.” Nature Communications 12 (1): 1715. https://doi.org/10.1038/s41467-021-21968-w.
