STPuppeteer

STPuppeteer generates synthetic spatial transcriptomics datasets with ground-truth cell annotations, transcript locations, and expression profiles. Think of the data as puppets that you can design, update, and play around with. Because real spatial transcriptomics data are complex and noisy, STPuppeteer opts for a clean, flexible, and interpretable design for generating synthetic datasets. With known ground truth in hand, the package can be used to benchmark and stress-test deconvolution, cell segmentation, and transcript-to-cell assignment methods.

How it works

The simulator builds a synthetic tissue section in four stages:

SimulationConfig
      │
      ▼
1. Gene Parameters     — sample μ (expression level) and θ (overdispersion)
                          per gene per cell type using Gamma priors
      │
      ▼
2. Cell Geometry       — place nuclei via Poisson-disk sampling,
                          grow log-normal polygons, tile boundaries with Voronoi
      │
      ▼
3. Count Matrix        — draw transcript counts from NegBinom(μ·scale, θ)
                          per cell per gene
      │
      ▼
4. Transcript Locations — place each transcript inside its cell polygon;
                           a configurable fraction leaks outside (leakage model)
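Stages 1 and 3 above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not STPuppeteer's implementation: the function names, the Gamma hyperparameters, and the per-cell scale values are hypothetical. It also assumes that NegBinom(μ·scale, θ) denotes a negative binomial with mean μ·scale and variance mean + mean²/θ, which is drawn here as a Gamma-Poisson mixture.

```python
import math
import random


def sample_gene_params(n_genes, rng, mu_prior=(2.0, 5.0), theta_prior=(2.0, 1.0)):
    """Draw one (mu, theta) pair per gene from Gamma(shape, scale) priors.

    The prior values are hypothetical placeholders.
    """
    mu = [rng.gammavariate(*mu_prior) for _ in range(n_genes)]
    theta = [rng.gammavariate(*theta_prior) for _ in range(n_genes)]
    return mu, theta


def poisson(lam, rng):
    """Poisson draw via Knuth's method, chunked so exp(-lam) never underflows."""
    total = 0
    while lam > 0:
        step = min(lam, 500.0)
        threshold = math.exp(-step)
        prod = rng.random()
        while prod > threshold:
            total += 1
            prod *= rng.random()
        lam -= step
    return total


def neg_binom(mean, theta, rng):
    """Negative binomial with the given mean and dispersion theta.

    Drawn as Poisson(lam) with lam ~ Gamma(shape=theta, scale=mean / theta),
    which gives variance mean + mean**2 / theta.
    """
    if mean <= 0:
        return 0
    lam = rng.gammavariate(theta, mean / theta)
    return poisson(lam, rng)


def simulate_counts(cells, params, rng):
    """cells: list of (celltype, scale); params: celltype -> (mu, theta)."""
    counts = []
    for celltype, scale in cells:
        mu, theta = params[celltype]
        counts.append([neg_binom(m * scale, t, rng) for m, t in zip(mu, theta)])
    return counts


rng = random.Random(42)
n_genes = 6
# Stage 1: independent gene parameters for each of two cell types.
params = {celltype: sample_gene_params(n_genes, rng) for celltype in range(2)}
# Stage 3: three cells, each with a cell type and a size scale.
cells = [(0, 0.5), (0, 1.0), (1, 2.0)]
counts = simulate_counts(cells, params, rng)
for (celltype, scale), row in zip(cells, counts):
    print(celltype, scale, row)
```

Smaller θ values produce burstier counts for the same mean, so this is the parameter to vary when stress-testing a method's robustness to overdispersion.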

Key design choices:

Cell-type-specific marker, housekeeping, and silent gene classes
Per-cell size scaling so larger cells receive more transcripts proportionally
Per-cell-type leakage probability for realistic cross-boundary contamination
Shapely 2.x vector geometry throughout — no rasterisation
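The size-scaling and leakage choices above can be illustrated with a small, dependency-free sketch. It deliberately uses plain Python instead of Shapely, and everything in it is an assumption made for illustration: the helper names, the `margin` parameter, the definition of scale as area divided by mean area, and the rule that a leaked transcript is placed uniformly in the padded bounding box outside the polygon. The package's actual leakage model may differ.

```python
import random


def polygon_area(poly):
    """Shoelace formula; poly is a list of (x, y) vertices in order."""
    twice_area = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def contains(poly, x, y):
    """Even-odd ray-casting point-in-polygon test."""
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def size_scales(polygons):
    """Scale each cell by its area relative to the mean area (assumed rule)."""
    areas = [polygon_area(p) for p in polygons]
    mean_area = sum(areas) / len(areas)
    return [a / mean_area for a in areas]


def place_transcripts(poly, n, p_leak, rng, margin=1.0):
    """Return n tuples (x, y, leaked) using rejection sampling.

    Each transcript leaks with probability p_leak. Leaked points are accepted
    only outside the polygon, all others only inside it.
    """
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    lo_x, hi_x = min(xs) - margin, max(xs) + margin
    lo_y, hi_y = min(ys) - margin, max(ys) + margin
    placed = []
    for _ in range(n):
        leaked = rng.random() < p_leak
        while True:
            x = rng.uniform(lo_x, hi_x)
            y = rng.uniform(lo_y, hi_y)
            if contains(poly, x, y) != leaked:
                break
        placed.append((x, y, leaked))
    return placed


rng = random.Random(42)
small = [(0, 0), (2, 0), (2, 2), (0, 2)]      # area 4
large = [(5, 0), (9, 0), (9, 4), (5, 4)]      # area 16
print(size_scales([small, large]))             # [0.4, 1.6]

transcripts = place_transcripts(small, n=200, p_leak=0.1, rng=rng)
n_leaked = sum(leaked for _, _, leaked in transcripts)
print(f"{n_leaked} of {len(transcripts)} transcripts placed outside the cell")
```

Because each transcript carries its own `leaked` flag, a benchmark can score transcript-to-cell assignment separately for in-cell and leaked transcripts.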

Installation

First, clone the repository:

git clone git@github.com:Jieran-S/STpuppeteer.git

Then create a working environment from the environment.yml recipe file in the repository:

cd STpuppeteer
conda env create -f environment.yml
conda activate STpuppeteer

If you would like your output in a spatialdata-compatible format, also install spatialdata:

pip install spatialdata

Quickstart

from STpuppeteer.simulation import SimulationConfig, SpotlessSimulator

config = SimulationConfig(
    n_cells=200,
    n_celltype=3,
    celltype_proportion=[0.5, 0.3, 0.2],
    n_genes=500,
    n_markers=[100, 80, 60],           # marker genes per cell type
    leakage_by_celltype=[0.1, 0.15, 0.05],
    seed=42,
)

sim = SpotlessSimulator(config)
sim.run_full_simulation()

sim.save_simple("output/")           # CSV, Parquet, NPY
sim.save_spatialdata("output.zarr")  # SpatialData/Zarr
sim.save_xenium("output_xenium/")    # 10x Xenium-compatible format

Tutorials

Tutorial	Description
Quick Start	Run a complete simulation in under a minute; covers all four pipeline steps and built-in data-input options
Step 1 — Gene Expression Parameters	Gamma priors for μ and θ, gene classes (marker / housekeeping / silent), and parameter effects
Step 2 — Cell Generation	Background positions, cell-type assignment, nucleus polygons, Voronoi expansion, and prototype insertion
Step 3 — Simulate Counts	Negative-Binomial count model, count matrix overview, and effect of key parameters
Step 4 — Simulate Transcripts	Spatial transcript placement, leakage model, and parameter scan
Configuration	Building configs from minimal to complex multi-prototype tissue architectures

Deep Dives

Notebook	Description
Cell Parameters & Morphology	Exhaustive parameter scans for spatial patterns (cluster, ring, chain), nucleus morphology, continuity, and fuzziness
Prototype Patterns	Full prototype reference: `PrototypeSpec`, `PrototypeScene`, multi-prototype layouts, and generation paths