STPuppeteer
STPuppeteer generates synthetic spatial transcriptomics datasets with ground-truth cell annotations, transcript locations, and expression profiles. Let spatial transcriptomics data be the puppets that you can design, update, and play around with. Because real spatial transcriptomics data are complex and noisy, STPuppeteer opts for a clean, flexible, and interpretable design for generating synthetic datasets. With ground truth in hand, you can use the package to benchmark and stress-test deconvolution, cell segmentation, and transcript-to-cell assignment methods.
How it works
The simulator builds a synthetic tissue section in four stages:
SimulationConfig
   │
   ▼
1. Gene Parameters: sample μ (expression level) and θ (overdispersion)
   per gene per cell type using Gamma priors
   │
   ▼
2. Cell Geometry: place nuclei via Poisson-disk sampling,
   grow log-normal polygons, tile boundaries with Voronoi
   │
   ▼
3. Count Matrix: draw transcript counts from NegBinom(μ·scale, θ)
   per cell per gene
   │
   ▼
4. Transcript Locations: place each transcript inside its cell polygon;
   a configurable fraction leaks outside (leakage model)
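To make stages 1 and 3 concrete, the following is a minimal NumPy sketch of the idea, not STPuppeteer's own implementation. The Gamma shape and scale values, the log-normal size factor, and the reading of θ as the negative-binomial size parameter (so that the variance is μ + μ²/θ) are all assumptions chosen for illustration; the package may use different defaults or a different parametrization.

```python
import numpy as np

rng = np.random.default_rng(42)
n_celltypes, n_genes, n_cells = 3, 50, 200

# Stage 1 (sketch): Gamma priors for per-gene, per-cell-type parameters.
# Shape/scale values are illustrative, not STPuppeteer defaults.
mu = rng.gamma(shape=2.0, scale=5.0, size=(n_celltypes, n_genes))
theta = rng.gamma(shape=2.0, scale=1.0, size=(n_celltypes, n_genes))

# Give every cell a type and a hypothetical size-scaling factor.
cell_type = rng.choice(n_celltypes, size=n_cells, p=[0.5, 0.3, 0.2])
scale = rng.lognormal(mean=0.0, sigma=0.25, size=n_cells)

# Stage 3 (sketch): NegBinom(mu * scale, theta) per cell per gene.
mean = mu[cell_type] * scale[:, None]   # shape (n_cells, n_genes)
size = theta[cell_type]                 # shape (n_cells, n_genes)

# NumPy parametrizes the distribution by (n, p); with n = theta and
# p = theta / (theta + mean) the expected count equals `mean`.
p = size / (size + mean)
counts = rng.negative_binomial(size, p)
```

Here `counts` is a cells-by-genes integer matrix. Smaller θ values give heavier-tailed counts for the same mean under this parametrization.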
Key design choices:
- Cell-type-specific marker, housekeeping, and silent gene classes
- Per-cell size scaling so larger cells receive proportionally more transcripts
- Per-cell-type leakage probability for realistic cross-boundary contamination
- Shapely 2.x vector geometry throughout — no rasterisation
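The leakage idea from stage 4 can be sketched in a few lines. This is an illustration under stated assumptions, not the package's code: it uses a plain ray-casting point-in-polygon test from the standard library instead of Shapely to stay dependency-free, and the helper names `point_in_polygon` and `place_transcripts`, the `margin` parameter, and the toy polygon are all hypothetical. Each transcript leaks with probability `leak_prob` (which would come from the per-cell-type leakage value); non-leaked points are rejection-sampled inside the polygon and leaked points just outside it.

```python
import random

def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def place_transcripts(vertices, n_transcripts, leak_prob, margin, rng):
    """Return (x, y, leaked) tuples for one cell.

    Leaked points are sampled outside the polygon but inside its bounding
    box padded by `margin` (which should be > 0); others are sampled inside.
    """
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    placed = []
    for _ in range(n_transcripts):
        leaked = rng.random() < leak_prob
        pad = margin if leaked else 0.0
        while True:  # rejection sampling
            x = rng.uniform(min(xs) - pad, max(xs) + pad)
            y = rng.uniform(min(ys) - pad, max(ys) + pad)
            if point_in_polygon(x, y, vertices) != leaked:
                placed.append((x, y, leaked))
                break
    return placed

rng = random.Random(42)
cell = [(0.0, 0.0), (4.0, 0.0), (5.0, 3.0), (2.0, 5.0), (-1.0, 3.0)]  # toy cell polygon
transcripts = place_transcripts(cell, n_transcripts=500, leak_prob=0.1, margin=1.0, rng=rng)
leak_fraction = sum(t[2] for t in transcripts) / len(transcripts)
```

Because every point carries its `leaked` flag, a transcript-to-cell assignment method can be scored separately on in-cell and leaked transcripts.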
Installation
First, clone the repository.
Then create a working environment from the recipe file in the repo.
If you want your output in a SpatialData-compatible format, also install spatialdata.
Quickstart
from STpuppeteer.simulation import SimulationConfig, SpotlessSimulator

config = SimulationConfig(
    n_cells=200,
    n_celltype=3,
    celltype_proportion=[0.5, 0.3, 0.2],
    n_genes=500,
    n_markers=[100, 80, 60],  # marker genes per cell type
    leakage_by_celltype=[0.1, 0.15, 0.05],
    seed=42,
)

sim = SpotlessSimulator(config)
sim.run_full_simulation()

sim.save_simple("output/")            # CSV, Parquet, NPY
sim.save_spatialdata("output.zarr")   # SpatialData/Zarr
sim.save_xenium("output_xenium/")     # 10x Xenium-compatible format
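Before running a simulation, it can help to sanity-check the per-cell-type lists against each other. The sketch below is a standalone helper, not part of STPuppeteer: `check_config_values` is a hypothetical name, and the constraints it checks (one entry per cell type, proportions summing to 1, marker genes fitting within `n_genes`, leakage values in [0, 1]) are assumptions about what a consistent configuration looks like; the package may enforce different or additional rules.

```python
import math

def check_config_values(n_celltype, celltype_proportion, n_genes,
                        n_markers, leakage_by_celltype):
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    per_type_lists = {
        "celltype_proportion": celltype_proportion,
        "n_markers": n_markers,
        "leakage_by_celltype": leakage_by_celltype,
    }
    # Assumed rule: every per-cell-type list has one entry per cell type.
    for name, values in per_type_lists.items():
        if len(values) != n_celltype:
            problems.append(f"{name} has {len(values)} entries, expected {n_celltype}")
    # Assumed rule: proportions form a probability vector.
    if not math.isclose(sum(celltype_proportion), 1.0, abs_tol=1e-9):
        problems.append("celltype_proportion does not sum to 1")
    # Assumed rule: marker sets are disjoint, so their total must fit in n_genes.
    if sum(n_markers) > n_genes:
        problems.append("sum of n_markers exceeds n_genes")
    # Leakage values are probabilities.
    if any(not 0.0 <= p <= 1.0 for p in leakage_by_celltype):
        problems.append("leakage_by_celltype has values outside [0, 1]")
    return problems

# The values from the quickstart pass all of the assumed checks.
problems = check_config_values(
    n_celltype=3,
    celltype_proportion=[0.5, 0.3, 0.2],
    n_genes=500,
    n_markers=[100, 80, 60],
    leakage_by_celltype=[0.1, 0.15, 0.05],
)
```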
Tutorials
| Tutorial | Description |
|---|---|
| Quick Start | Run a complete simulation in under a minute; covers all four pipeline steps and built-in data-input options |
| Step 1 — Gene Expression Parameters | Gamma priors for μ and θ, gene classes (marker / housekeeping / silent), and parameter effects |
| Step 2 — Cell Generation | Background positions, cell-type assignment, nucleus polygons, Voronoi expansion, and prototype insertion |
| Step 3 — Simulate Counts | Negative-Binomial count model, count matrix overview, and effect of key parameters |
| Step 4 — Simulate Transcripts | Spatial transcript placement, leakage model, and parameter scan |
| Configuration | Building configs from minimal to complex multi-prototype tissue architectures |
Deep Dives
| Notebook | Description |
|---|---|
| Cell Parameters & Morphology | Exhaustive parameter scans for spatial patterns (cluster, ring, chain), nucleus morphology, continuity, and fuzziness |
| Prototype Patterns | Full prototype reference: PrototypeSpec, PrototypeScene, multi-prototype layouts, and generation paths |