STpuppeteer Simulation — Quick Start¶

Run a complete spatial transcriptomics simulation in under a minute with a handful of lines.

This notebook provides an overview of the workflow, showcasing a minimial example to start simulating the datasets. For each individual steps and their relevant parameters, please check the individual chapters on the left.

Pipeline in one picture

SimulationConfig
      │
      ├─ Step 1 │ Gene parameters   (μ, θ per cell type)
      ├─ Step 2 │ Cell layout       (nucleus + Voronoi cell polygons)
      ├─ Step 3 │ Count matrix      (Negative-Binomial sampling)
      └─ Step 4 │ Transcript locs   (spatial placement + leakage)

In [1]:

Copied!





import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

from STpuppeteer.simulation import SimulationConfig, SpotlessSimulator

CT_PALETTE = {"ct_0": "#49997c", "ct_1": "#1ebecd", "ct_2": "#ae3918"}
plt.rcParams.update({"figure.dpi": 110, "axes.spines.top": False, "axes.spines.right": False})
print("Imports OK")
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

from STpuppeteer.simulation import SimulationConfig, SpotlessSimulator

CT_PALETTE = {"ct_0": "#49997c", "ct_1": "#1ebecd", "ct_2": "#ae3918"}
plt.rcParams.update({"figure.dpi": 110, "axes.spines.top": False, "axes.spines.right": False})
print("Imports OK")

Imports OK

1. Define a minimal configuration¶

SimulationConfig holds every parameter. Almost all have sensible defaults, and only a handful need to be set for a basic run. You can run the model without changing any parameters, but for our cases, we modify the cell number and gene number to allow for containable runtime.

In [2]:

Copied!





config = SimulationConfig(
    # simulation playground
    canvas_size=100.0,      # µm × µm tissue canvas

    # Cell structure
    n_cells=200,

    # Gene expression
    n_genes=100,
    n_markers=20,           # marker genes per cell type
)

print(config.summary())
config = SimulationConfig(
    # simulation playground
    canvas_size=100.0,      # µm × µm tissue canvas

    # Cell structure
    n_cells=200,

    # Gene expression
    n_genes=100,
    n_markers=20,           # marker genes per cell type
)

print(config.summary())

============================================================
STpuppeteer Simulation Configuration
============================================================
Random Seed: 42

Cell/Tissue Structure:
  Cells: 200
  Cell types: 3
  Cell type distribution: {'ct_0': 0.4, 'ct_1': 0.3, 'ct_2': 0.3}
  Spatial continuity: 0.2
  Boundary fuzziness: 0.05
  Default morphology: MorphologySpec(mean_area=24.0, cv_area=0.5, elongation=1.2, boundary_noise_apt=0.1, n_vertices=12, orientation_mode='random', orientation_noise=0.1, orientation_axis=None, expansion_ratio=1.8)

Gene Expression:
  Total genes: 100
  Marker genes per type: {'ct_0': 20, 'ct_1': 20, 'ct_2': 20}
  Housekeeping genes: 40
  Marker expression: µ=0.75, CV=0.6
  Silence expression: µ=0.05, CV=1.0
  Theta prior: α=2.0, rate=1.0

Transcript Behavior:
  Leakage by celltype: {'ct_0': 0.1, 'ct_1': 0.1, 'ct_2': 0.1}
  Leakage distance factor: 1.0
============================================================

2. Run the full simulation¶

One call — runs all four steps and stores results on the simulator object.

In [3]:

Copied!

sim = SpotlessSimulator(config)
sim.run_full_simulation()
sim
sim = SpotlessSimulator(config)
sim.run_full_simulation()
sim

Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 200
  Genes: 100
  Transcripts: 5102

Out[3]:

SpotlessSimulator
══════════════════════════════════════════════════════════════
  Config      seed=42  n_cells=200  n_celltype=3  n_genes=100
              continuity=0.2  fuzziness=0.05  canvas=100.0
──────────────────────────────────────────────────────────────
  gpar_df      ✓  (100 × 9)  cols: feature_name, gene_type, ct_0_theta, ct_0_mu, ct_1_theta, … +4 more
  cell_gdf     ✓  (200 × 13)  cols: cell_id, celltype, prototype_id, centroid_x, centroid_y, … +8 more
                   celltypes: ct_0=80  ct_1=63  ct_2=57
  count_array  ✓  (200 × 100)  total=5,102  mean/cell=25.5  zeros=82.8%
  trs_df       ✓  (5,102 × 9)  cols: transcript_id, cell_id, feature_name, x_location, y_location, … +4 more
                   leaked=509 (10.0%)  nuclear=200 (3.9%)
══════════════════════════════════════════════════════════════

Notice: run_full_simulation() is a convenience wrapper.
Call the four steps individually when you need to inspect or modify intermediate results.

In [4]:

Copied!





sim2 = SpotlessSimulator(config)

sim2.generate_gene_parameters()       # → sim2.gpar_df
sim2.initialize_cells()               # → sim2.cell_gdf
sim2.simulate_counts()                # → sim2.count_array
sim2.simulate_transcript_locations()  # → sim2.trs_df

print(sim2)
sim2 = SpotlessSimulator(config)

sim2.generate_gene_parameters()       # → sim2.gpar_df
sim2.initialize_cells()               # → sim2.cell_gdf
sim2.simulate_counts()                # → sim2.count_array
sim2.simulate_transcript_locations()  # → sim2.trs_df

print(sim2)

SpotlessSimulator
══════════════════════════════════════════════════════════════
  Config      seed=42  n_cells=200  n_celltype=3  n_genes=100
              continuity=0.2  fuzziness=0.05  canvas=100.0
──────────────────────────────────────────────────────────────
  gpar_df      ✓  (100 × 9)  cols: feature_name, gene_type, ct_0_theta, ct_0_mu, ct_1_theta, … +4 more
  cell_gdf     ✓  (200 × 13)  cols: cell_id, celltype, prototype_id, centroid_x, centroid_y, … +8 more
                   celltypes: ct_0=80  ct_1=63  ct_2=57
  count_array  ✓  (200 × 100)  total=5,102  mean/cell=25.5  zeros=82.8%
  trs_df       ✓  (5,102 × 9)  cols: transcript_id, cell_id, feature_name, x_location, y_location, … +4 more
                   leaked=509 (10.0%)  nuclear=200 (3.9%)
══════════════════════════════════════════════════════════════

3. Inspect outputs¶

Four objects are populated after the run:

Attribute	Content
`sim.gpar_df`	Gene parameter table (μ, θ per cell type)
`sim.cell_gdf`	Cell GeoDataFrame (nucleus + cell polygons, cell type, …)
`sim.count_array`	Count matrix — shape `(n_cells, n_genes)`
`sim.trs_df`	Transcript DataFrame (x, y, gene, cell, is_leaked, …)

In [5]:

Copied!





print(f"Gene params  : {sim.gpar_df.shape}")
print(f"Cells        : {sim.cell_gdf.shape}")
print(f"Count matrix : {sim.count_array.shape}")
print(f"Transcripts  : {len(sim.trs_df):,}  |  leaked: {sim.trs_df['is_leaked'].sum():,}")
sim.cell_gdf[["cell_id", "celltype", "nucleus_area", "cell_area"]].head()
print(f"Gene params  : {sim.gpar_df.shape}")
print(f"Cells        : {sim.cell_gdf.shape}")
print(f"Count matrix : {sim.count_array.shape}")
print(f"Transcripts  : {len(sim.trs_df):,}  |  leaked: {sim.trs_df['is_leaked'].sum():,}")
sim.cell_gdf[["cell_id", "celltype", "nucleus_area", "cell_area"]].head()

Gene params  : (100, 9)
Cells        : (200, 13)
Count matrix : (200, 100)
Transcripts  : 5,102  |  leaked: 509

Out[5]:

	cell_id	celltype	nucleus_area	cell_area
0	cell_1	ct_2	23.617470	31.488871
1	cell_2	ct_1	17.498407	27.472178
2	cell_3	ct_2	15.063900	40.401505
3	cell_4	ct_1	13.521986	26.823121
4	cell_5	ct_0	19.456904	55.732535

4. Visualise¶

Two essential plots: cell layout (coloured by cell type) and the full transcript map.

In [6]:

Copied!





cell_gdf = sim.cell_gdf
cell_gdf["_color"] = cell_gdf["celltype"].map(CT_PALETTE)

fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# ── Left: cell layout ─────────────────────────────────────────────────────────
ax = axs[0]
cell_gdf.set_geometry("cell_geometry").plot(
    ax=ax, color=cell_gdf["_color"], alpha=0.4, edgecolor="k", linewidth=0.3)
cell_gdf.set_geometry("nucleus_geometry").plot(
    ax=ax, color=cell_gdf["_color"], edgecolor="none", alpha=0.8)
ax.set_aspect("equal")
ax.set_title("Cell layout (nucleus + Voronoi cell polygon)")
ax.set_xlabel("x (µm)"); ax.set_ylabel("y (µm)")
handles = [mpatches.Patch(facecolor=c, label=k) for k, c in CT_PALETTE.items()]
ax.legend(handles=handles, title="Cell type")

# ── Right: transcript map ─────────────────────────────────────────────────────
ax2 = axs[1]
trs_df = sim.trs_df
ok = trs_df[trs_df["is_leaked"] == 0]
lk = trs_df[trs_df["is_leaked"] == 1]
sns.scatterplot(data=ok, x="x_location", y="y_location",
                hue="celltype_sim", palette=CT_PALETTE,
                s=1.5, alpha=0.4, edgecolor="none", ax=ax2, legend=False)
ax2.scatter(lk["x_location"], lk["y_location"],
            s=3, c="black", marker="x", alpha=0.3, linewidths=0.4,
            label=f"leaked ({len(lk):,})")
ax2.set_aspect("equal")
ax2.set_title("Transcript spatial locations")
ax2.set_xlabel("x (µm)"); ax2.set_ylabel("y (µm)")
ct_handles = [mpatches.Patch(facecolor=c, alpha=0.7, label=k) for k, c in CT_PALETTE.items()]
lk_handle  = plt.Line2D([0], [0], marker="x", color="black", linestyle="none",
                         markersize=6, label=f"leaked ({len(lk):,})")
ax2.legend(handles=ct_handles + [lk_handle], title="Cell type / status", fontsize=8)

fig.suptitle("STpuppeteer — Quick-Start Simulation", fontsize=13)
fig.tight_layout()
plt.show()
cell_gdf = sim.cell_gdf
cell_gdf["_color"] = cell_gdf["celltype"].map(CT_PALETTE)

fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# ── Left: cell layout ─────────────────────────────────────────────────────────
ax = axs[0]
cell_gdf.set_geometry("cell_geometry").plot(
    ax=ax, color=cell_gdf["_color"], alpha=0.4, edgecolor="k", linewidth=0.3)
cell_gdf.set_geometry("nucleus_geometry").plot(
    ax=ax, color=cell_gdf["_color"], edgecolor="none", alpha=0.8)
ax.set_aspect("equal")
ax.set_title("Cell layout (nucleus + Voronoi cell polygon)")
ax.set_xlabel("x (µm)"); ax.set_ylabel("y (µm)")
handles = [mpatches.Patch(facecolor=c, label=k) for k, c in CT_PALETTE.items()]
ax.legend(handles=handles, title="Cell type")

# ── Right: transcript map ─────────────────────────────────────────────────────
ax2 = axs[1]
trs_df = sim.trs_df
ok = trs_df[trs_df["is_leaked"] == 0]
lk = trs_df[trs_df["is_leaked"] == 1]
sns.scatterplot(data=ok, x="x_location", y="y_location",
                hue="celltype_sim", palette=CT_PALETTE,
                s=1.5, alpha=0.4, edgecolor="none", ax=ax2, legend=False)
ax2.scatter(lk["x_location"], lk["y_location"],
            s=3, c="black", marker="x", alpha=0.3, linewidths=0.4,
            label=f"leaked ({len(lk):,})")
ax2.set_aspect("equal")
ax2.set_title("Transcript spatial locations")
ax2.set_xlabel("x (µm)"); ax2.set_ylabel("y (µm)")
ct_handles = [mpatches.Patch(facecolor=c, alpha=0.7, label=k) for k, c in CT_PALETTE.items()]
lk_handle  = plt.Line2D([0], [0], marker="x", color="black", linestyle="none",
                         markersize=6, label=f"leaked ({len(lk):,})")
ax2.legend(handles=ct_handles + [lk_handle], title="Cell type / status", fontsize=8)

fig.suptitle("STpuppeteer — Quick-Start Simulation", fontsize=13)
fig.tight_layout()
plt.show()

No description has been provided for this image

5. Customised data input¶

STpuppeteer can replace any synthetic step with your own real data. Two common scenarios:

Scenario	What you bring	Which step to replace
A — Existing counts	A real scRNA-seq count matrix (cells × genes)	Skip / replace `simulate_counts()`
B — Existing polygons	Cell segmentation polygons (e.g. from Xenium, MERFISH)	Skip `initialize_cells()` geometry generation

Both scenarios still end with simulate_transcript_locations() to place transcripts spatially.

Scenario A — Existing counts¶

Two sub-options depending on what you want:

A1 fit_gene_parameters_from_counts — estimates NB (μ, θ) parameters per gene per cell type from your data, then simulates new counts that follow those distributions. Good for augmenting or re-sampling with the same expression profile.
A2 set_counts — injects your count matrix directly; simulate_counts() is skipped entirely. Good for placing an exact observed count table into a spatial context.

In [8]:

Copied!





# ── Scenario A: existing count matrix ────────────────────────────────────────
# Simulate a stand-in "real" count matrix (replace with your own data).
rng = np.random.default_rng(7)
N_CELLS, N_GENES = 120, 60
real_counts  = rng.negative_binomial(n=2, p=0.4, size=(N_CELLS, N_GENES))
real_ctypes  = np.array(["ct_0"] * 48 + ["ct_1"] * 36 + ["ct_2"] * 36)
real_genes   = [f"gene_{i}" for i in range(N_GENES)]

# Config needs to match the shape of the incoming data
cfg_A = SimulationConfig(seed=0, n_cells=N_CELLS, n_genes=N_GENES,
                          n_markers=0,   # no synthetic markers — expression comes from data
                          canvas_size=100.0)

# ── A1: fit NB parameters, then simulate new counts ──────────────────────────
sim_A1 = SpotlessSimulator(cfg_A)
sim_A1.fit_gene_parameters_from_counts(real_counts, real_ctypes, gene_names=real_genes)
sim_A1.initialize_cells()          # generates spatial layout
sim_A1.simulate_counts()           # draws new counts from the fitted NB distributions
sim_A1.simulate_transcript_locations()
print("A1 — re-simulated counts:", sim_A1.count_array.shape,
      " | transcripts:", len(sim_A1.trs_df))

# ── A2: inject counts directly, skip simulate_counts ─────────────────────────
sim_A2 = SpotlessSimulator(cfg_A)
sim_A2.initialize_cells(input_cell_types=real_ctypes)  # use provided cell-type labels
sim_A2.set_counts(real_counts, gene_names=real_genes)   # bypass NB simulation entirely
sim_A2.simulate_transcript_locations()
print("A2 — injected counts  :", sim_A2.count_array.shape,
      " | transcripts:", len(sim_A2.trs_df))
# ── Scenario A: existing count matrix ────────────────────────────────────────
# Simulate a stand-in "real" count matrix (replace with your own data).
rng = np.random.default_rng(7)
N_CELLS, N_GENES = 120, 60
real_counts  = rng.negative_binomial(n=2, p=0.4, size=(N_CELLS, N_GENES))
real_ctypes  = np.array(["ct_0"] * 48 + ["ct_1"] * 36 + ["ct_2"] * 36)
real_genes   = [f"gene_{i}" for i in range(N_GENES)]

# Config needs to match the shape of the incoming data
cfg_A = SimulationConfig(seed=0, n_cells=N_CELLS, n_genes=N_GENES,
                          n_markers=0,   # no synthetic markers — expression comes from data
                          canvas_size=100.0)

# ── A1: fit NB parameters, then simulate new counts ──────────────────────────
sim_A1 = SpotlessSimulator(cfg_A)
sim_A1.fit_gene_parameters_from_counts(real_counts, real_ctypes, gene_names=real_genes)
sim_A1.initialize_cells()          # generates spatial layout
sim_A1.simulate_counts()           # draws new counts from the fitted NB distributions
sim_A1.simulate_transcript_locations()
print("A1 — re-simulated counts:", sim_A1.count_array.shape,
      " | transcripts:", len(sim_A1.trs_df))

# ── A2: inject counts directly, skip simulate_counts ─────────────────────────
sim_A2 = SpotlessSimulator(cfg_A)
sim_A2.initialize_cells(input_cell_types=real_ctypes)  # use provided cell-type labels
sim_A2.set_counts(real_counts, gene_names=real_genes)   # bypass NB simulation entirely
sim_A2.simulate_transcript_locations()
print("A2 — injected counts  :", sim_A2.count_array.shape,
      " | transcripts:", len(sim_A2.trs_df))

A1 — re-simulated counts: (120, 60)  | transcripts: 21926
A2 — injected counts  : (120, 60)  | transcripts: 21335

Scenario B — Existing cell segmentation polygons¶

Pass your real segmentation masks directly to initialize_cells() via input_cell_polygons. STpuppeteer will skip all position / nucleus-generation steps and use your shapes as-is.

Optionally supply:

input_nucleus_polygons — if you have nucleus segmentations; otherwise each nucleus is estimated by scaling the cell polygon to ~50 % of its area.
input_cell_types — if you already have cell type annotations; otherwise the spatial assignment algorithm runs on the supplied geometry.

After cell initialisation, gene parameters and counts can come from either synthetic generation (generate_gene_parameters + simulate_counts) or from real data (Scenario A).

In [9]:

Copied!





from shapely.geometry import Polygon as ShapelyPolygon

# ── Scenario B: existing segmentation polygons ───────────────────────────────
# Build stand-in polygons on a 10×10 grid (replace with your real shapes).
def _hex_polygon(cx, cy, r=2.8):
    """Return a regular hexagon centred at (cx, cy) with radius r."""
    angles = np.linspace(0, 2 * np.pi, 7)[:-1]
    return ShapelyPolygon(zip(cx + r * np.cos(angles), cy + r * np.sin(angles)))

grid_size = 10   # 10×10 = 100 cells
xs = np.arange(grid_size) * 9.0
ys = np.arange(grid_size) * 9.0
cell_polygons = [
    _hex_polygon(x + (3.0 if j % 2 else 0.0), y)
    for j, y in enumerate(ys)
    for x in xs
]
poly_ctypes = np.array(
    ["ct_0"] * 40 + ["ct_1"] * 30 + ["ct_2"] * 30
)

# Config only needs to know about genes; cell geometry comes from outside.
cfg_B = SimulationConfig(seed=1, n_cells=len(cell_polygons),
                          n_genes=80, n_markers=15, canvas_size=100.0)

sim_B = SpotlessSimulator(cfg_B)

# Pass real polygons + cell-type labels — spatial generation is skipped
sim_B.initialize_cells(
    input_cell_polygons=cell_polygons,
    input_cell_types=poly_ctypes,
    # input_nucleus_polygons=nucleus_polygons,  # optional; scaled automatically if omitted
)

# Gene parameters and counts are still synthetic here; swap with Scenario A if needed
sim_B.generate_gene_parameters()
sim_B.simulate_counts()
sim_B.simulate_transcript_locations()

print(f"Cells from external polygons : {len(sim_B.cell_gdf)}")
print(f"Transcripts placed           : {len(sim_B.trs_df):,}")
sim_B.cell_gdf[["cell_id", "celltype", "nucleus_area", "cell_area"]].head()
from shapely.geometry import Polygon as ShapelyPolygon

# ── Scenario B: existing segmentation polygons ───────────────────────────────
# Build stand-in polygons on a 10×10 grid (replace with your real shapes).
def _hex_polygon(cx, cy, r=2.8):
    """Return a regular hexagon centred at (cx, cy) with radius r."""
    angles = np.linspace(0, 2 * np.pi, 7)[:-1]
    return ShapelyPolygon(zip(cx + r * np.cos(angles), cy + r * np.sin(angles)))

grid_size = 10   # 10×10 = 100 cells
xs = np.arange(grid_size) * 9.0
ys = np.arange(grid_size) * 9.0
cell_polygons = [
    _hex_polygon(x + (3.0 if j % 2 else 0.0), y)
    for j, y in enumerate(ys)
    for x in xs
]
poly_ctypes = np.array(
    ["ct_0"] * 40 + ["ct_1"] * 30 + ["ct_2"] * 30
)

# Config only needs to know about genes; cell geometry comes from outside.
cfg_B = SimulationConfig(seed=1, n_cells=len(cell_polygons),
                          n_genes=80, n_markers=15, canvas_size=100.0)

sim_B = SpotlessSimulator(cfg_B)

# Pass real polygons + cell-type labels — spatial generation is skipped
sim_B.initialize_cells(
    input_cell_polygons=cell_polygons,
    input_cell_types=poly_ctypes,
    # input_nucleus_polygons=nucleus_polygons,  # optional; scaled automatically if omitted
)

# Gene parameters and counts are still synthetic here; swap with Scenario A if needed
sim_B.generate_gene_parameters()
sim_B.simulate_counts()
sim_B.simulate_transcript_locations()

print(f"Cells from external polygons : {len(sim_B.cell_gdf)}")
print(f"Transcripts placed           : {len(sim_B.trs_df):,}")
sim_B.cell_gdf[["cell_id", "celltype", "nucleus_area", "cell_area"]].head()

Cells from external polygons : 100
Transcripts placed           : 1,972

Out[9]:

	cell_id	celltype	nucleus_area	cell_area
0	cell_1	ct_0	10.184459	20.368917
1	cell_2	ct_0	10.184459	20.368917
2	cell_3	ct_0	10.184459	20.368917
3	cell_4	ct_0	10.184459	20.368917
4	cell_5	ct_0	10.184459	20.368917

6. Save outputs¶

Three export formats are supported — uncomment the one you need.

In [ ]:

Copied!

# Simple CSV / Parquet
# paths = sim.save_simple("output/simple/", prefix="qs")

# 10x Xenium format
# paths = sim.save_xenium("output/xenium/", prefix="qs", save_boundaries=False)

# SpatialData (zarr) — requires spatialdata package
# sdata = sim.save_spatialdata("output/sim.zarr", overwrite=True)

print("Uncomment a save call above and set your output path.")
# Simple CSV / Parquet
# paths = sim.save_simple("output/simple/", prefix="qs")

# 10x Xenium format
# paths = sim.save_xenium("output/xenium/", prefix="qs", save_boundaries=False)

# SpatialData (zarr) — requires spatialdata package
# sdata = sim.save_spatialdata("output/sim.zarr", overwrite=True)

print("Uncomment a save call above and set your output path.")