STpuppeteer Simulation — Quick Start¶
Run a complete spatial transcriptomics simulation in under a minute with a handful of lines.
This notebook provides an overview of the workflow, showcasing a minimial example to start simulating the datasets. For each individual steps and their relevant parameters, please check the individual chapters on the left.
Pipeline in one picture
SimulationConfig
│
├─ Step 1 │ Gene parameters (μ, θ per cell type)
├─ Step 2 │ Cell layout (nucleus + Voronoi cell polygons)
├─ Step 3 │ Count matrix (Negative-Binomial sampling)
└─ Step 4 │ Transcript locs (spatial placement + leakage)
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from STpuppeteer.simulation import SimulationConfig, SpotlessSimulator
CT_PALETTE = {"ct_0": "#49997c", "ct_1": "#1ebecd", "ct_2": "#ae3918"}
plt.rcParams.update({"figure.dpi": 110, "axes.spines.top": False, "axes.spines.right": False})
print("Imports OK")
Imports OK
1. Define a minimal configuration¶
SimulationConfig holds every parameter. Almost all have sensible defaults, and only a handful need to be set for a basic run. You can run the model without changing any parameters, but for our cases, we modify the cell number and gene number to allow for containable runtime.
config = SimulationConfig(
# simulation playground
canvas_size=100.0, # µm × µm tissue canvas
# Cell structure
n_cells=200,
# Gene expression
n_genes=100,
n_markers=20, # marker genes per cell type
)
print(config.summary())
============================================================
STpuppeteer Simulation Configuration
============================================================
Random Seed: 42
Cell/Tissue Structure:
Cells: 200
Cell types: 3
Cell type distribution: {'ct_0': 0.4, 'ct_1': 0.3, 'ct_2': 0.3}
Spatial continuity: 0.2
Boundary fuzziness: 0.05
Default morphology: MorphologySpec(mean_area=24.0, cv_area=0.5, elongation=1.2, boundary_noise_apt=0.1, n_vertices=12, orientation_mode='random', orientation_noise=0.1, orientation_axis=None, expansion_ratio=1.8)
Gene Expression:
Total genes: 100
Marker genes per type: {'ct_0': 20, 'ct_1': 20, 'ct_2': 20}
Housekeeping genes: 40
Marker expression: µ=0.75, CV=0.6
Silence expression: µ=0.05, CV=1.0
Theta prior: α=2.0, rate=1.0
Transcript Behavior:
Leakage by celltype: {'ct_0': 0.1, 'ct_1': 0.1, 'ct_2': 0.1}
Leakage distance factor: 1.0
============================================================
2. Run the full simulation¶
One call — runs all four steps and stores results on the simulator object.
sim = SpotlessSimulator(config)
sim.run_full_simulation()
sim
Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 200 Genes: 100 Transcripts: 5102
SpotlessSimulator
══════════════════════════════════════════════════════════════
Config seed=42 n_cells=200 n_celltype=3 n_genes=100
continuity=0.2 fuzziness=0.05 canvas=100.0
──────────────────────────────────────────────────────────────
gpar_df ✓ (100 × 9) cols: feature_name, gene_type, ct_0_theta, ct_0_mu, ct_1_theta, … +4 more
cell_gdf ✓ (200 × 13) cols: cell_id, celltype, prototype_id, centroid_x, centroid_y, … +8 more
celltypes: ct_0=80 ct_1=63 ct_2=57
count_array ✓ (200 × 100) total=5,102 mean/cell=25.5 zeros=82.8%
trs_df ✓ (5,102 × 9) cols: transcript_id, cell_id, feature_name, x_location, y_location, … +4 more
leaked=509 (10.0%) nuclear=200 (3.9%)
══════════════════════════════════════════════════════════════
Notice: run_full_simulation() is a convenience wrapper.
Call the four steps individually when you need to inspect or modify intermediate results.
sim2 = SpotlessSimulator(config)
sim2.generate_gene_parameters() # → sim2.gpar_df
sim2.initialize_cells() # → sim2.cell_gdf
sim2.simulate_counts() # → sim2.count_array
sim2.simulate_transcript_locations() # → sim2.trs_df
print(sim2)
SpotlessSimulator
══════════════════════════════════════════════════════════════
Config seed=42 n_cells=200 n_celltype=3 n_genes=100
continuity=0.2 fuzziness=0.05 canvas=100.0
──────────────────────────────────────────────────────────────
gpar_df ✓ (100 × 9) cols: feature_name, gene_type, ct_0_theta, ct_0_mu, ct_1_theta, … +4 more
cell_gdf ✓ (200 × 13) cols: cell_id, celltype, prototype_id, centroid_x, centroid_y, … +8 more
celltypes: ct_0=80 ct_1=63 ct_2=57
count_array ✓ (200 × 100) total=5,102 mean/cell=25.5 zeros=82.8%
trs_df ✓ (5,102 × 9) cols: transcript_id, cell_id, feature_name, x_location, y_location, … +4 more
leaked=509 (10.0%) nuclear=200 (3.9%)
══════════════════════════════════════════════════════════════
3. Inspect outputs¶
Four objects are populated after the run:
| Attribute | Content |
|---|---|
sim.gpar_df |
Gene parameter table (μ, θ per cell type) |
sim.cell_gdf |
Cell GeoDataFrame (nucleus + cell polygons, cell type, …) |
sim.count_array |
Count matrix — shape (n_cells, n_genes) |
sim.trs_df |
Transcript DataFrame (x, y, gene, cell, is_leaked, …) |
print(f"Gene params : {sim.gpar_df.shape}")
print(f"Cells : {sim.cell_gdf.shape}")
print(f"Count matrix : {sim.count_array.shape}")
print(f"Transcripts : {len(sim.trs_df):,} | leaked: {sim.trs_df['is_leaked'].sum():,}")
sim.cell_gdf[["cell_id", "celltype", "nucleus_area", "cell_area"]].head()
Gene params : (100, 9) Cells : (200, 13) Count matrix : (200, 100) Transcripts : 5,102 | leaked: 509
| cell_id | celltype | nucleus_area | cell_area | |
|---|---|---|---|---|
| 0 | cell_1 | ct_2 | 23.617470 | 31.488871 |
| 1 | cell_2 | ct_1 | 17.498407 | 27.472178 |
| 2 | cell_3 | ct_2 | 15.063900 | 40.401505 |
| 3 | cell_4 | ct_1 | 13.521986 | 26.823121 |
| 4 | cell_5 | ct_0 | 19.456904 | 55.732535 |
4. Visualise¶
Two essential plots: cell layout (coloured by cell type) and the full transcript map.
cell_gdf = sim.cell_gdf
cell_gdf["_color"] = cell_gdf["celltype"].map(CT_PALETTE)
fig, axs = plt.subplots(1, 2, figsize=(14, 6))
# ── Left: cell layout ─────────────────────────────────────────────────────────
ax = axs[0]
cell_gdf.set_geometry("cell_geometry").plot(
ax=ax, color=cell_gdf["_color"], alpha=0.4, edgecolor="k", linewidth=0.3)
cell_gdf.set_geometry("nucleus_geometry").plot(
ax=ax, color=cell_gdf["_color"], edgecolor="none", alpha=0.8)
ax.set_aspect("equal")
ax.set_title("Cell layout (nucleus + Voronoi cell polygon)")
ax.set_xlabel("x (µm)"); ax.set_ylabel("y (µm)")
handles = [mpatches.Patch(facecolor=c, label=k) for k, c in CT_PALETTE.items()]
ax.legend(handles=handles, title="Cell type")
# ── Right: transcript map ─────────────────────────────────────────────────────
ax2 = axs[1]
trs_df = sim.trs_df
ok = trs_df[trs_df["is_leaked"] == 0]
lk = trs_df[trs_df["is_leaked"] == 1]
sns.scatterplot(data=ok, x="x_location", y="y_location",
hue="celltype_sim", palette=CT_PALETTE,
s=1.5, alpha=0.4, edgecolor="none", ax=ax2, legend=False)
ax2.scatter(lk["x_location"], lk["y_location"],
s=3, c="black", marker="x", alpha=0.3, linewidths=0.4,
label=f"leaked ({len(lk):,})")
ax2.set_aspect("equal")
ax2.set_title("Transcript spatial locations")
ax2.set_xlabel("x (µm)"); ax2.set_ylabel("y (µm)")
ct_handles = [mpatches.Patch(facecolor=c, alpha=0.7, label=k) for k, c in CT_PALETTE.items()]
lk_handle = plt.Line2D([0], [0], marker="x", color="black", linestyle="none",
markersize=6, label=f"leaked ({len(lk):,})")
ax2.legend(handles=ct_handles + [lk_handle], title="Cell type / status", fontsize=8)
fig.suptitle("STpuppeteer — Quick-Start Simulation", fontsize=13)
fig.tight_layout()
plt.show()
5. Customised data input¶
STpuppeteer can replace any synthetic step with your own real data. Two common scenarios:
| Scenario | What you bring | Which step to replace |
|---|---|---|
| A — Existing counts | A real scRNA-seq count matrix (cells × genes) | Skip / replace simulate_counts() |
| B — Existing polygons | Cell segmentation polygons (e.g. from Xenium, MERFISH) | Skip initialize_cells() geometry generation |
Both scenarios still end with simulate_transcript_locations() to place transcripts spatially.
Scenario A — Existing counts¶
Two sub-options depending on what you want:
- A1
fit_gene_parameters_from_counts— estimates NB (μ, θ) parameters per gene per cell type from your data, then simulates new counts that follow those distributions. Good for augmenting or re-sampling with the same expression profile. - A2
set_counts— injects your count matrix directly;simulate_counts()is skipped entirely. Good for placing an exact observed count table into a spatial context.
# ── Scenario A: existing count matrix ────────────────────────────────────────
# Simulate a stand-in "real" count matrix (replace with your own data).
rng = np.random.default_rng(7)
N_CELLS, N_GENES = 120, 60
real_counts = rng.negative_binomial(n=2, p=0.4, size=(N_CELLS, N_GENES))
real_ctypes = np.array(["ct_0"] * 48 + ["ct_1"] * 36 + ["ct_2"] * 36)
real_genes = [f"gene_{i}" for i in range(N_GENES)]
# Config needs to match the shape of the incoming data
cfg_A = SimulationConfig(seed=0, n_cells=N_CELLS, n_genes=N_GENES,
n_markers=0, # no synthetic markers — expression comes from data
canvas_size=100.0)
# ── A1: fit NB parameters, then simulate new counts ──────────────────────────
sim_A1 = SpotlessSimulator(cfg_A)
sim_A1.fit_gene_parameters_from_counts(real_counts, real_ctypes, gene_names=real_genes)
sim_A1.initialize_cells() # generates spatial layout
sim_A1.simulate_counts() # draws new counts from the fitted NB distributions
sim_A1.simulate_transcript_locations()
print("A1 — re-simulated counts:", sim_A1.count_array.shape,
" | transcripts:", len(sim_A1.trs_df))
# ── A2: inject counts directly, skip simulate_counts ─────────────────────────
sim_A2 = SpotlessSimulator(cfg_A)
sim_A2.initialize_cells(input_cell_types=real_ctypes) # use provided cell-type labels
sim_A2.set_counts(real_counts, gene_names=real_genes) # bypass NB simulation entirely
sim_A2.simulate_transcript_locations()
print("A2 — injected counts :", sim_A2.count_array.shape,
" | transcripts:", len(sim_A2.trs_df))
A1 — re-simulated counts: (120, 60) | transcripts: 21926 A2 — injected counts : (120, 60) | transcripts: 21335
Scenario B — Existing cell segmentation polygons¶
Pass your real segmentation masks directly to initialize_cells() via input_cell_polygons. STpuppeteer will skip all position / nucleus-generation steps and use your shapes as-is.
Optionally supply:
input_nucleus_polygons— if you have nucleus segmentations; otherwise each nucleus is estimated by scaling the cell polygon to ~50 % of its area.input_cell_types— if you already have cell type annotations; otherwise the spatial assignment algorithm runs on the supplied geometry.
After cell initialisation, gene parameters and counts can come from either synthetic generation (generate_gene_parameters + simulate_counts) or from real data (Scenario A).
from shapely.geometry import Polygon as ShapelyPolygon
# ── Scenario B: existing segmentation polygons ───────────────────────────────
# Build stand-in polygons on a 10×10 grid (replace with your real shapes).
def _hex_polygon(cx, cy, r=2.8):
"""Return a regular hexagon centred at (cx, cy) with radius r."""
angles = np.linspace(0, 2 * np.pi, 7)[:-1]
return ShapelyPolygon(zip(cx + r * np.cos(angles), cy + r * np.sin(angles)))
grid_size = 10 # 10×10 = 100 cells
xs = np.arange(grid_size) * 9.0
ys = np.arange(grid_size) * 9.0
cell_polygons = [
_hex_polygon(x + (3.0 if j % 2 else 0.0), y)
for j, y in enumerate(ys)
for x in xs
]
poly_ctypes = np.array(
["ct_0"] * 40 + ["ct_1"] * 30 + ["ct_2"] * 30
)
# Config only needs to know about genes; cell geometry comes from outside.
cfg_B = SimulationConfig(seed=1, n_cells=len(cell_polygons),
n_genes=80, n_markers=15, canvas_size=100.0)
sim_B = SpotlessSimulator(cfg_B)
# Pass real polygons + cell-type labels — spatial generation is skipped
sim_B.initialize_cells(
input_cell_polygons=cell_polygons,
input_cell_types=poly_ctypes,
# input_nucleus_polygons=nucleus_polygons, # optional; scaled automatically if omitted
)
# Gene parameters and counts are still synthetic here; swap with Scenario A if needed
sim_B.generate_gene_parameters()
sim_B.simulate_counts()
sim_B.simulate_transcript_locations()
print(f"Cells from external polygons : {len(sim_B.cell_gdf)}")
print(f"Transcripts placed : {len(sim_B.trs_df):,}")
sim_B.cell_gdf[["cell_id", "celltype", "nucleus_area", "cell_area"]].head()
Cells from external polygons : 100 Transcripts placed : 1,972
| cell_id | celltype | nucleus_area | cell_area | |
|---|---|---|---|---|
| 0 | cell_1 | ct_0 | 10.184459 | 20.368917 |
| 1 | cell_2 | ct_0 | 10.184459 | 20.368917 |
| 2 | cell_3 | ct_0 | 10.184459 | 20.368917 |
| 3 | cell_4 | ct_0 | 10.184459 | 20.368917 |
| 4 | cell_5 | ct_0 | 10.184459 | 20.368917 |
6. Save outputs¶
Three export formats are supported — uncomment the one you need.
# Simple CSV / Parquet
# paths = sim.save_simple("output/simple/", prefix="qs")
# 10x Xenium format
# paths = sim.save_xenium("output/xenium/", prefix="qs", save_boundaries=False)
# SpatialData (zarr) — requires spatialdata package
# sdata = sim.save_spatialdata("output/sim.zarr", overwrite=True)
print("Uncomment a save call above and set your output path.")