Speeding up UMAP plots for single cell gene expression analysis

data visualization

bioinformatics

Published

October 21, 2025

Analyzing single cell data often requires visualizing thousands to millions of data points on a graph. Current R plotting functions such as Seurat::DimPlot are limited by long plotting times, impeding efficient exploratory analysis.

For example, this is how long it takes to visualize 10 genes on a single cell RNAseq (scRNAseq) dataset of 14,000 cells.

Setup

Libraries

Code

library(Seurat)
library(ggplot2)
library(dplyr)
library(patchwork)
library(bench)
library(here)
library(purrr)
library(glue)
library(gt)
library(tidyr)
theme_custom <- function() {
  ggplot2::theme_void() +
    theme(
      axis.title.y = element_text(angle = 90),
      axis.title.x = element_text(),
      panel.border = element_rect(color = "black", fill = NA, linewidth = 0.5, linetype = "solid"),
      strip.text.x = element_text(size = rel(1.5)),
      plot.margin = margin_auto(6, unit = "pt")
    )
}
ggplot2::theme_set(
  theme_custom()
)

Parameters

Datasets

Load seurat

Code

seu <- readRDS(here("posts", "2025-10-13-fast-umap-plots-in-r", "ifnb.rds"))
rownames(seu)

 [1] "S100A9"   "FCER1A"   "FCGR3A"   "SELL"     "CACYBP"   "GNLY"    
 [7] "CD8A"     "CD8B"     "IGJ"      "PPBP"     "CD14"     "HLA-DQA1"
[13] "TSPAN13"  "GNG11"    "GIMAP5"   "IL3RA"    "FOXP3"    "CREM"    
[19] "HBB"      "MS4A1"    "CD3E"     "CD3D"     "CD4"      "LYZ"     
[25] "HSPH1"    "GPR183"   "HBA2"     "VMO1"     "CCL2"     "CCL5"    
[31] "NME1"     "PRSS57"   "CD79A"    "NKG7"     "MIR155HG"

Code

xy <- FetchData(seu, vars = c("umap_1", "umap_2", "seurat_annotations", rownames(seu)))

Code

bnch |> select(expression, min, median, n_itr)

# A tibble: 1 × 3
  expression               min   median
  <bch:expr>          <bch:tm> <bch:tm>
1 Seurat, not sampled    5.31s    6.32s

It takes about 6.3 seconds (median) to plot 10 features for a dataset of 14,000 cells. This dataset is on the smaller side: considering that single cell datasets often reach hundreds of thousands of cells, plotting speed is a significant hindrance to single cell analysis.

Sampling to speed up plotting

Plotting tens to hundreds of thousands of cells is likely unnecessary. We can explore whether plotting a sample of the dataset is enough to give a faithful representation of the entire dataset while improving speed.

Code

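# Randomly keep 10% of the cells (no seed is set, so the subset differs between runs)
seu_sampled <- seu[, sample(seq_len(ncol(seu)), size = floor(0.1 * ncol(seu)))]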
plots <- list(
  seu |> Seurat::FeaturePlot(features = c("CD8A", "FOXP3"), order = TRUE) & coord_equal(),
  seu_sampled |> Seurat::FeaturePlot(features = c("CD8A", "FOXP3"), order = TRUE) & coord_equal()
)
(plots[[1]] & labs(subtitle = "Not sampled")) /
  (plots[[2]] & labs(subtitle = "Sampled 10%"))

Code

bind_rows(bnch, bnch_sampled) |> select(expression, min, median, n_itr)

# A tibble: 2 × 3
  expression                         min   median
  <bch:expr>                    <bch:tm> <bch:tm>
1 Seurat, not sampled              5.31s    6.32s
2 Seurat, sampled 10% (n=1,400)    3.61s     3.8s

Sampling improves speed somewhat (median 3.8 s versus 6.32 s), but plotting is still very slow. Seurat::FeaturePlot may have some internal steps that are slow. Let’s try a naive solution:

Code

naive_plot(seu_sampled, features)
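
The definition of naive_plot is not shown here. As a rough illustration only, a function of this kind could look like the sketch below. It assumes the expression data has already been pulled into a data frame (as FetchData would return), and the names pivot_expression, naive_plot_sketch and the toy cells data are hypothetical, not the code used for the benchmarks in this post.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for the output of FetchData(): one row per cell,
# UMAP coordinates plus one expression column per gene.
set.seed(1)
cells <- data.frame(
  umap_1 = rnorm(200),
  umap_2 = rnorm(200),
  CD8A = rpois(200, lambda = 0.5),
  FOXP3 = rpois(200, lambda = 0.05)
)

# Assumed reshaping step: wide (one column per gene) to long
# (one row per cell x gene combination).
pivot_expression <- function(df, features) {
  df |>
    pivot_longer(
      all_of(features),
      names_to = "feature",
      values_to = ".abundance"
    )
}

# Assumed naive plot: a single ggplot call, one facet per gene,
# and one color scale shared by all genes.
naive_plot_sketch <- function(df, features) {
  pivot_expression(df, features) |>
    arrange(.abundance) |> # draw expressing cells last, on top
    ggplot(aes(x = umap_1, y = umap_2, color = .abundance)) +
    geom_point(size = 0.5) +
    scale_color_viridis_c() +
    facet_wrap(vars(feature))
}

expr_long <- pivot_expression(cells, c("CD8A", "FOXP3"))
p <- naive_plot_sketch(cells, c("CD8A", "FOXP3"))
```

Building all facets in one ggplot call, instead of one plot per gene combined afterwards, is a plausible reason for the speed difference, though that is not verified here.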

Code

bnch_naive |> select(expression, min, median, n_itr)

# A tibble: 1 × 3
  expression      min   median
  <bch:expr> <bch:tm> <bch:tm>
1 naive         1.24s    1.25s

The naive plot takes 1.25 seconds (median), about 5.05x faster than the unsampled Seurat::FeaturePlot baseline (6.32 seconds).

But there are drawbacks to this naive solution. Notably, it’s missing some of the features that Seurat smartly incorporates:

Point sizing based on the number of points. When plotting larger datasets, the optimal point size is smaller, to avoid overplotting.
Visibility of sparse and lowly expressed genes, e.g. FOXP3 and CD4. These are hard to see because of the combined color scale, which maps low and high expression to one common scale across all genes. Seurat maintains an independent color scale for each gene.

Let’s see if we can address these shortcomings without trading off speed.

Point sizing

Seurat::FeaturePlot uses a simple formula to calculate point size from the number of cells. But it doesn’t account for plotting multiple features at once.

Here we scale the point size by the total number of cells times the total number of features, i.e. the number of rows in the long-format table.
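
To make the difference concrete, here is a small worked example of a point-size rule of the form min(k / n, 1). The helper name auto_ptsize is hypothetical, and the example assumes the long-format table has one row per cell and feature; the constants 1583 and 3000 are the ones that appear in the code in this section.

```r
# Point-size rule of the form min(k / n_points, 1).
# `auto_ptsize` is a hypothetical helper name, not a Seurat function.
auto_ptsize <- function(n_points, k = 3000) {
  min(k / n_points, 1)
}

n_cells <- 1400    # sampled dataset
n_features <- 10   # genes shown as facets

# Sized by the number of cells only: the rule hits its cap of 1.
size_cells_only <- auto_ptsize(n_cells, k = 1583)

# Sized by cells x features (rows of the long table): much smaller points.
size_cells_x_features <- auto_ptsize(n_cells * n_features)

c(cells_only = size_cells_only, cells_x_features = size_cells_x_features)
```

With 1,400 cells the cells-only rule is capped at 1, while sizing by 14,000 rows gives roughly 0.21, so points shrink as more features are added to the same figure.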

Code

naive_plot2 <- function(seu, features, ptsize = NULL) {
  expr_long <- extract_and_pivot(seu, features)

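  # Adapted from Seurat's point sizing: shrink points as the number of
  # plotted rows (cells x features) grows, capped at 1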
  if (is.null(ptsize)) {
    ptsize <- min(3000 / nrow(expr_long), 1)
  }

  p <- expr_long |>
    filter(.abundance > 0) |>
    ggplot(aes(x = umap_1, y = umap_2)) +
    geom_point(
      data = expr_long |> filter(.abundance == 0),
      size = ptsize,
      color = "lightgrey"
    ) +
    geom_point(
      size = ptsize,
      aes(color = .abundance)
    ) +
    scale_color_viridis_c() +
    facet_wrap(vars(feature))
  p
}

Code

naive_plot2(seu_sampled, features, ptsize = min(1583 / nrow(seu_sampled), 1))

Code

naive_plot2(seu_sampled, features)

Lowly expressed genes

To avoid losing lowly expressed genes when sampling, we can add logic that keeps all expressing cells when a gene is sparsely expressed.

The question is at what level of sparseness we should decide to keep all cells.
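
A quick calculation shows why uniform sampling is a problem for sparse genes. This sketch uses the per-gene counts of expressing cells reported in the table later in this post, and assumes a dataset of exactly 14,000 cells and a uniform 10% sample.

```r
n_cells <- 14000   # approximate dataset size stated in the post
fraction <- 0.1    # uniform 10% sample
n_sampled <- n_cells * fraction

# Number of expressing cells per gene, taken from the table in this post.
n_expressing <- c(FOXP3 = 13, FCER1A = 101, CD4 = 251)

# Expected number of expressing cells that survive uniform sampling.
expected_kept <- n_expressing * fraction

# Probability that the sample contains no expressing cell at all
# (hypergeometric: n_sampled cells drawn without replacement).
p_none <- dhyper(0, m = n_expressing, n = n_cells - n_expressing, k = n_sampled)

round(data.frame(n_expressing, expected_kept, p_none), 3)
```

For FOXP3, a 10% sample is expected to keep only about one expressing cell, and there is a substantial chance of keeping none, which is the motivation for treating sparse genes differently.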

Code

prepare_plot_data <- function(seu, features, .n = 1400) {
  seu_expr <- FetchData(seu, vars = c("umap_1", "umap_2", features))
  background_data <- seu_expr[, c("umap_1", "umap_2")] |>
    slice_sample(n = .n)

  seu_expr_long <- seu_expr |>
    pivot_longer(
      all_of(features),
      names_to = "feature",
      values_to = ".abundance"
    ) |>
    filter(.abundance != 0) |>
    slice_sample(by = feature, n = .n / 2)

  return(
    list(
      background = background_data, abundance = seu_expr_long
    )
  )
}

naive_plot3 <- function(seu, features, ptsize = NULL, .n = 1400) {
  plot_data <- prepare_plot_data(seu, features, .n = .n)

  # Adapted from Seurat's point sizing, here based on the number of
  # expressing rows that will be plotted
  if (is.null(ptsize)) {
    ptsize <- min(3000 / nrow(plot_data$abundance), 1)
  }
  plot_data$background |>
    ggplot(aes(x = umap_1, y = umap_2)) +
    geom_point(size = ptsize, color = "grey") +
    geom_point(
      data = plot_data$abundance,
      size = ptsize,
      aes(color = .abundance)
    ) +
    scale_color_viridis_c() +
    facet_wrap(vars(feature))
}

naive_plot3(seu, features)

We implemented conditional sampling based on the number of cells expressing each gene and the desired sample size. All expressing cells are retained for genes expressed in at most 700 cells, half of the desired sample size (1400/2 = 700); genes expressed in more cells are downsampled to 700 expressing cells. This results in a visualization where widely expressed genes are downsampled and sparsely expressed genes are kept in full. This can be helpful in identifying the cells that express these sparse genes.
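
The capping behavior comes from slice_sample() with a by argument: groups larger than n are downsampled to n rows, and smaller groups are returned whole. A minimal sketch on toy data (the gene names and values are made up for illustration):

```r
library(dplyr)

set.seed(42)
# Toy long-format table: one row per expressing cell per gene.
toy <- data.frame(
  feature = rep(c("common_gene", "rare_gene"), times = c(1000, 30)),
  .abundance = runif(1030, min = 0.1, max = 5)
)

cap <- 700
sampled <- toy |>
  slice_sample(by = feature, n = cap)

# Rows kept per gene: the common gene is capped, the rare gene is untouched.
counts <- count(sampled, feature)
counts
```

Here common_gene is reduced from 1,000 to 700 rows while rare_gene keeps all 30, so no explicit if/else on sparseness is needed.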

Here is the exact number of expressing cells plotted for each gene, with and without sampling:

Code

list(
  with_sampling = prepare_plot_data(seu, features, .n = 1400)$abundance,
  without_sampling = prepare_plot_data(seu, features, .n = 14000)$abundance
) |>
  purrr::map(~ count(.x, feature)) |>
  bind_rows(.id = "type") |>
  pivot_wider(
    names_from = type, values_from = n
  ) |>
  gt() |>
  grand_summary_rows(-feature, fns = list(sum ~ sum(.)))

feature  with_sampling  without_sampling
CD14               700              2888
CD3E               700              2198
CD4                251               251
CD8B               700               925
FCER1A             101               101
FOXP3               13                13
GNLY               700              1299
LYZ                700              4973
MS4A1              700               742
sum               4565             13390

Final comparison

Finally, let’s compare our solution with Seurat::FeaturePlot over 10 genes for 14,000 cells:

Code

bnch_final <- bench::mark(
  `Seurat, not sampled` = Seurat::FeaturePlot(seu, features = features) |> plot(),
  `Custom solution` = naive_plot3(seu, features) |> plot(),
  memory = FALSE,
  check = FALSE,
  iterations = 4
) |> select(expression, min, median, n_itr)

We improved plotting speed by 5.05 times on a 14,000-cell dataset. I expect the relative improvement to be even greater for larger single cell datasets, since our sampling approach caps the number of plotted points regardless of dataset size.
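
As a rough check on the claim about dataset size, the sketch below counts plotted points rather than measuring time. It assumes the default .n = 1400, that the background layer is repeated in every facet (as happens when a layer's data lacks the faceting variable), and that every gene reaches its cap; the helper names are hypothetical.

```r
# Upper bound on plotted points with capped sampling: each facet shows up to
# `n` background cells plus up to `n / 2` expressing cells.
max_points_sampled <- function(n_features, n = 1400) {
  n_features * (n + n / 2)
}

# Points drawn when every cell is plotted once per feature.
points_full <- function(n_cells, n_features) {
  n_cells * n_features
}

n_features <- 10
dataset_sizes <- c(14000, 100000, 1000000)

data.frame(
  n_cells = dataset_sizes,
  full = points_full(dataset_sizes, n_features),
  sampled_max = max_points_sampled(n_features)
)
```

The sampled plot never exceeds 21,000 points for 10 features, while the full plot grows linearly with the number of cells. Actual timings also depend on data extraction and rendering overhead, which this count does not capture.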

We accomplish this speedup while improving the ability to detect lowly expressed or sparse genes. See the results yourself:

Code

Seurat::FeaturePlot(seu, features = features)

Code

naive_plot3(seu, features)

The drawback of this increased sensitivity is more noise. Especially for the more highly expressed genes, many cells with low expression appear highlighted, which might be distracting.