
STPuppeteer

STPuppeteer generates synthetic spatial transcriptomics datasets with ground-truth cell annotations, transcript locations, and expression profiles. Think of the data as puppets that you can design, update, and play with. Because real spatial transcriptomics data are complex and noisy, STPuppeteer opts for a clean, flexible, and interpretable design for generating synthetic datasets. With the ground truth in hand, the package can be used to benchmark and stress-test deconvolution, cell segmentation, and transcript-to-cell assignment methods.


How it works

The simulator builds a synthetic tissue section in four stages:

SimulationConfig
1. Gene Parameters     — sample μ (expression level) and θ (overdispersion)
                          per gene per cell type using Gamma priors
2. Cell Geometry       — place nuclei via Poisson-disk sampling,
                          grow log-normal polygons, tile boundaries with Voronoi
3. Count Matrix        — draw transcript counts from NegBinom(μ·scale, θ)
                          per cell per gene
4. Transcript Locations — place each transcript inside its cell polygon;
                           a configurable fraction leaks outside (leakage model)
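
Stages 1 and 3 above can be sketched in plain NumPy. This is an illustration under stated assumptions, not STPuppeteer's implementation: the Gamma hyperparameters, the array names, the log-normal size factor, and the reading of θ as the Negative-Binomial size parameter (variance μ + μ²/θ) are all chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_celltypes, n_genes, n_cells = 3, 50, 200

# Stage 1 (assumed hyperparameters): Gamma priors for the mean expression mu
# and the NB size parameter theta, one value per cell type per gene.
mu = rng.gamma(shape=2.0, scale=5.0, size=(n_celltypes, n_genes))
theta = rng.gamma(shape=2.0, scale=1.0, size=(n_celltypes, n_genes))

# Hypothetical per-cell inputs: a cell-type label and a size scale factor.
cell_type = rng.choice(n_celltypes, size=n_cells, p=[0.5, 0.3, 0.2])
scale = rng.lognormal(mean=0.0, sigma=0.3, size=n_cells)

# Stage 3: counts ~ NegBinom(mean = mu * scale, size = theta).
# NumPy parameterises the distribution by (n, p); choosing n = theta and
# p = theta / (theta + mean) yields the requested mean.
mean = mu[cell_type] * scale[:, None]   # shape (n_cells, n_genes)
size = theta[cell_type]                 # shape (n_cells, n_genes)
counts = rng.negative_binomial(size, size / (size + mean))
```

Multiplying μ by a per-cell scale factor is what lets larger cells receive proportionally more transcripts while keeping the per-gene dispersion unchanged.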

Key design choices:

  • Cell-type-specific marker, housekeeping, and silent gene classes
  • Per-cell size scaling so larger cells receive more transcripts proportionally
  • Per-cell-type leakage probability for realistic cross-boundary contamination
  • Shapely 2.x vector geometry throughout — no rasterisation
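
The leakage and vector-geometry choices above could look roughly like the sketch below. It is not taken from the package: the helper names (`sample_in`, `place_transcripts`), the `leak_width` parameter, and the idea that leaked transcripts fall uniformly in a thin band just outside the cell are assumptions made for illustration. Only Shapely calls that operate on vector geometry are used.

```python
import numpy as np
from shapely.geometry import Point, Polygon

def sample_in(region, n, rng):
    """Rejection-sample n points uniformly inside a Shapely geometry."""
    minx, miny, maxx, maxy = region.bounds
    points = []
    while len(points) < n:
        x, y = rng.uniform(minx, maxx), rng.uniform(miny, maxy)
        if region.contains(Point(x, y)):
            points.append((x, y))
    return np.array(points, dtype=float).reshape(len(points), 2)

def place_transcripts(cell, n, leak_prob, leak_width, rng):
    """Return (xy, leaked): n positions and a boolean mask of leaked ones."""
    leaked = rng.random(n) < leak_prob
    # Band just outside the cell boundary (assumed leakage region).
    ring = cell.buffer(leak_width).difference(cell)
    xy = np.empty((n, 2))
    xy[~leaked] = sample_in(cell, int((~leaked).sum()), rng)
    xy[leaked] = sample_in(ring, int(leaked.sum()), rng)
    return xy, leaked

rng = np.random.default_rng(0)
cell = Polygon([(0, 0), (4, 0), (5, 3), (2, 5), (-1, 3)])
xy, leaked = place_transcripts(cell, n=200, leak_prob=0.1, leak_width=0.5, rng=rng)
```

A per-cell-type leakage probability would then amount to looking up `leak_prob` from a list indexed by the cell's type before calling the placement function.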

Installation

First, clone the repository:

git clone git@github.com:Jieran-S/STpuppeteer.git

Then create a working environment from the environment.yml file in the repository:

cd STpuppeteer
conda env create -f environment.yml
conda activate STpuppeteer

To write output in a SpatialData-compatible format, also install spatialdata:

pip install spatialdata

Quickstart

from STpuppeteer.simulation import SimulationConfig, SpotlessSimulator

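config = SimulationConfig(
    n_cells=200,                            # number of cells to simulate
    n_celltype=3,                           # number of cell types
    celltype_proportion=[0.5, 0.3, 0.2],    # one proportion per cell type
    n_genes=500,
    n_markers=[100, 80, 60],                # marker genes per cell type
    leakage_by_celltype=[0.1, 0.15, 0.05],  # leakage probability per cell type
    seed=42,
)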

sim = SpotlessSimulator(config)
sim.run_full_simulation()

sim.save_simple("output/")           # CSV, Parquet, NPY
sim.save_spatialdata("output.zarr")  # SpatialData/Zarr (requires the optional spatialdata package)
sim.save_xenium("output_xenium/")    # 10x Xenium-compatible format
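
Once a simulation has run, the ground truth can be used to score a method under test. The sketch below shows one simple metric for transcript-to-cell assignment. It is independent of the package: the function name `assignment_accuracy` and the toy label arrays are hypothetical, and how the labels are loaded from the saved outputs is left open because it depends on the output format chosen.

```python
import numpy as np

def assignment_accuracy(true_cell, predicted_cell):
    """Fraction of transcripts assigned to their ground-truth cell."""
    true_cell = np.asarray(true_cell)
    predicted_cell = np.asarray(predicted_cell)
    if true_cell.shape != predicted_cell.shape:
        raise ValueError("label arrays must have the same shape")
    return float(np.mean(true_cell == predicted_cell))

# Toy labels: one entry per transcript, -1 meaning "unassigned".
truth = np.array([0, 0, 1, 1, 2, 2])
predicted = np.array([0, 1, 1, 1, 2, -1])
accuracy = assignment_accuracy(truth, predicted)  # 4 of 6 labels match
```

Because leaked transcripts keep their ground-truth cell label, a metric like this penalises methods that assign purely by polygon containment, which is the kind of stress test the leakage model is meant to enable.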

Tutorials

Tutorial                             Description
Quick Start                          Run a complete simulation in under a minute; covers all four pipeline steps and built-in data-input options
Step 1 — Gene Expression Parameters  Gamma priors for μ and θ, gene classes (marker / housekeeping / silent), and parameter effects
Step 2 — Cell Generation             Background positions, cell-type assignment, nucleus polygons, Voronoi expansion, and prototype insertion
Step 3 — Simulate Counts             Negative-Binomial count model, count matrix overview, and effect of key parameters
Step 4 — Simulate Transcripts        Spatial transcript placement, leakage model, and parameter scan
Configuration                        Building configs from minimal to complex multi-prototype tissue architectures

Deep Dives

Notebook                      Description
Cell Parameters & Morphology  Exhaustive parameter scans for spatial patterns (cluster, ring, chain), nucleus morphology, continuity, and fuzziness
Prototype Patterns            Full prototype reference: PrototypeSpec, PrototypeScene, multi-prototype layouts, and generation paths