Step 4 — Simulate Transcript Locations¶
This notebook explains how discrete count values are converted to spatial (x, y) coordinates, including the leakage model that places a fraction of transcripts outside their cell boundary.
Two placement modes
| Mode | Placement | Fraction controlled by |
|---|---|---|
| In-cell | Uniform rejection-sampled inside the cell polygon | 1 - p_effective |
| Leaked | Truncated exponential distance outside the cell boundary | p_effective |
The leaked transcript distance follows a truncated exponential with rate λ, clipped at
max_dist = leak_dist_factor × r_eff (where r_eff = √(cell_area / π)):
P(distance > d) ∝ exp(−λ · d) for d ∈ [0, max_dist]
λ = −ln(1 − coverage) / max_dist
= −ln(1 − coverage) / (leak_dist_factor × r_eff)
With the default coverage = 0.97, this means 97 % of leaked transcripts land within max_dist,
giving λ ≈ 3.507 / (leak_dist_factor × r_eff). A larger leak_dist_factor or a larger cell
both reduce λ and spread transcripts further from the boundary.
Effective leak probability combines cell-type and gene-level rates via a union rule:
p_effective = 1 − (1 − p_celltype) × (1 − p_gene)
Key parameters:
| Parameter | Effect |
|---|---|
leakage_by_celltype |
Per-cell-type fraction of transcripts that leak |
leak_dist_factor |
Diffusion scale: multiplier on effective cell radius |
leakage_by_gene |
Per-gene additional leak probability (additive via union rule) |
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from scipy.stats import expon
from shapely.geometry import Point
from STpuppeteer.simulation import SimulationConfig, SpotlessSimulator
CT_PALETTE = {"ct_0": "#1f77b4", "ct_1": "#ff7f0e", "ct_2": "#2ca02c"}
plt.rcParams.update({"figure.dpi": 110, "axes.spines.top": False, "axes.spines.right": False})
print("Imports OK")
Imports OK
4.1 Run a full simulation¶
config = SimulationConfig(
seed=42, canvas_size=200.0,
n_cells=500, n_celltype=3,
celltype_proportion=[0.5, 0.3, 0.2],
continuity=0.25,
n_genes=80, n_markers=20,
marker_mu=0.75,
leakage_by_celltype=[0.08, 0.15, 0.05], # different rates per cell type
leak_dist_factor=1.0,
)
sim = SpotlessSimulator(config)
sim.run_full_simulation()
trs_df = sim.trs_df
cell_gdf = sim.cell_gdf
cell_gdf["_color"] = cell_gdf["celltype"].map(CT_PALETTE)
n_leaked = trs_df["is_leaked"].sum()
print(f"Total transcripts : {len(trs_df):,}")
print(f"Leaked : {n_leaked:,} ({100*n_leaked/len(trs_df):.1f}%)")
trs_df.head(3)
Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 500 Genes: 80 Transcripts: 11087 Total transcripts : 11,087 Leaked : 1,016 (9.2%)
| transcript_id | cell_id | feature_name | x_location | y_location | z_location | overlaps_nucleus | is_leaked | celltype_sim | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tr_0 | cell_1 | ct_2_mkr_4 | 101.892839 | 16.868313 | 0.773956 | 1 | 0 | ct_2 |
| 1 | tr_1 | cell_1 | ct_2_mkr_13 | 103.062105 | 10.618926 | 0.438878 | 1 | 0 | ct_2 |
| 2 | tr_2 | cell_1 | ct_2_mkr_13 | 101.971864 | 14.124372 | 0.858598 | 1 | 0 | ct_2 |
4.2 Full spatial transcript map¶
In-cell transcripts are coloured by cell type; leaked transcripts are shown as small black crosses.
fig, ax = plt.subplots(figsize=(9, 9))
cell_gdf.set_geometry("cell_geometry").plot(
ax=ax, color=cell_gdf["_color"], alpha=0.25, edgecolor="k", linewidth=0.25)
cell_gdf.set_geometry("nucleus_geometry").plot(
ax=ax, color=cell_gdf["_color"], alpha=0.55, edgecolor="none")
ok = trs_df[trs_df["is_leaked"] == 0]
lk = trs_df[trs_df["is_leaked"] == 1]
sns.scatterplot(data=ok, x="x_location", y="y_location",
hue="celltype_sim", palette=CT_PALETTE,
s=2, alpha=0.5, edgecolor="none", ax=ax, legend=False)
ax.scatter(lk["x_location"], lk["y_location"],
s=5, c="black", marker="x", alpha=0.4, linewidths=0.4,
label=f"leaked ({len(lk):,})")
ct_handles = [mpatches.Patch(facecolor=c, alpha=0.7, label=k) for k, c in CT_PALETTE.items()]
lk_handle = plt.Line2D([0],[0], marker="x", color="black", linestyle="none",
markersize=6, label=f"leaked ({len(lk):,})")
ax.legend(handles=ct_handles + [lk_handle], title="Cell type / status", fontsize=8)
ax.set_aspect("equal")
ax.set_title("Transcript spatial locations", fontsize=13)
ax.set_xlabel("x (µm)"); ax.set_ylabel("y (µm)")
fig.tight_layout()
plt.show()
4.3 Leakage analysis¶
Two diagnostics: per-cell-type leakage rate, and the distance distribution of leaked transcripts from their cell boundary.
fig, axs = plt.subplots(1, 3, figsize=(17, 4.5))
# ── Left: leakage rate per cell type ─────────────────────────────────────────
leak_rate = (
trs_df.groupby("celltype_sim")["is_leaked"]
.agg(["sum", "count"])
.assign(rate=lambda d: d["sum"] / d["count"] * 100)
.reset_index()
)
bars = axs[0].bar(leak_rate["celltype_sim"], leak_rate["rate"],
color=[CT_PALETTE[c] for c in leak_rate["celltype_sim"]], width=0.5)
axs[0].axhline(leak_rate["rate"].mean(), color="k", ls="--", lw=1, label="mean")
axs[0].set_title("Leakage rate by cell type")
axs[0].set_ylabel("% leaked transcripts")
axs[0].legend()
for bar, val in zip(bars, leak_rate["rate"]):
axs[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.2,
f"{val:.1f}%", ha="center", va="bottom", fontsize=9)
# ── Compute per-transcript distance and max_dist ──────────────────────────────
COVERAGE = 0.97
leaked_df = trs_df[trs_df["is_leaked"] == 1].copy()
cell_polys = dict(zip(cell_gdf["cell_id"], cell_gdf["cell_geometry"]))
leaked_df["dist_to_poly"] = [
cell_polys[cid].exterior.distance(Point(x, y))
for cid, x, y in zip(leaked_df["cell_id"], leaked_df["x_location"], leaked_df["y_location"])
]
# Per-transcript max_dist = leak_dist_factor × r_eff (r_eff = √(cell_area/π))
r_eff_map = {cid: np.sqrt(poly.area / np.pi) for cid, poly in cell_polys.items()}
leaked_df["r_eff"] = leaked_df["cell_id"].map(r_eff_map)
leaked_df["max_dist"] = config.leak_dist_factor * leaked_df["r_eff"]
leaked_df["dist_norm"] = (leaked_df["dist_to_poly"] / leaked_df["max_dist"]).clip(0, None)
d = leaked_df["dist_to_poly"].values
d = d[d > 0]
# ── Middle: raw distance + fitted exp + theoretical λ line ───────────────────
_, scale_fit = expon.fit(d, floc=0)
lam_fit = 1.0 / scale_fit
# Theoretical λ uses each cell's own max_dist; approximate with median max_dist
med_max_dist = leaked_df["max_dist"].median()
lam_theory = -np.log(1.0 - COVERAGE) / med_max_dist
x_fit = np.linspace(0, np.percentile(d, 99), 200)
axs[1].hist(d, bins=40, density=True, color="#6acc65", edgecolor="white",
linewidth=0.4, alpha=0.85, label="Observed")
axs[1].plot(x_fit, expon.pdf(x_fit, 0, 1 / lam_fit),
"k-", lw=2, label=f"Fitted Exp (λ = {lam_fit:.3f} µm⁻¹)")
axs[1].plot(x_fit, expon.pdf(x_fit, 0, 1 / lam_theory),
"r--", lw=2, label=f"Theoretical Exp (λ = {lam_theory:.3f} µm⁻¹)")
axs[1].set_title("Distance to cell boundary")
axs[1].set_xlabel("Distance (µm)")
axs[1].legend(fontsize=8)
# ── Right: distance normalised by max_dist ────────────────────────────────────
# After normalisation s = d/max_dist the distribution should collapse to
# f(s) = λ_norm · exp(−λ_norm · s) / coverage with λ_norm = −ln(1−coverage)
# regardless of cell size or dist_factor.
s = leaked_df["dist_norm"].values
s = s[s > 0]
lam_norm = -np.log(1.0 - COVERAGE) # ≈ 3.507
x_norm = np.linspace(0, 1.1, 200)
pdf_norm = lam_norm * np.exp(-lam_norm * x_norm) / COVERAGE
axs[2].hist(s, bins=40, density=True, color="#7c9dcc", edgecolor="white",
linewidth=0.4, alpha=0.85, label="Observed")
axs[2].plot(x_norm, pdf_norm, "r--", lw=2,
label=f"Theoretical (λ = {lam_norm:.2f})")
axs[2].axvline(1.0, color="k", ls=":", lw=1, label="max_dist (coverage=97%)")
axs[2].set_title("Normalised distance (d / max_dist)")
axs[2].set_xlabel("d / (leak_dist_factor × r_eff)")
axs[2].legend(fontsize=8)
fig.suptitle("4.3 Leakage analysis", fontsize=12)
fig.tight_layout()
plt.show()
4.3.2 Gene-level effects: leakage_by_gene¶
leakage_by_gene adds a per-gene leak probability on top of the cell-type rate:
p_effective = 1 − (1 − p_celltype) × (1 − p_gene)
Below we run a matched simulation where two housekeeping genes (hk_0, hk_1) have an elevated
p_gene = 0.5, while all marker genes stay at zero. The bar chart confirms the union-rule lift;
the zoom then shows the resulting spatial difference between a no-leakage marker gene and a
high-leakage housekeeping gene.
# Re-run with gene-level leakage isolated: hk_0 and hk_1 get p_gene=0.5, markers get 0
cfg_g = SimulationConfig(
seed=42, canvas_size=200.0,
n_cells=500, n_celltype=3,
celltype_proportion=[0.5, 0.3, 0.2],
continuity=0.25,
n_genes=80, n_markers=20,
marker_mu=0.75,
leakage_by_celltype=0.05, # uniform baseline
leakage_by_gene={"hk_0": 0.5, "hk_1": 0.5}, # extra leakage for two hk genes
leak_dist_factor=1.2,
)
sim_g = SpotlessSimulator(cfg_g)
sim_g.run_full_simulation()
tg = sim_g.trs_df
cg = sim_g.cell_gdf
def _gene_cat(name):
if name in ("hk_0", "hk_1"): return "hk_0/hk_1\n(p_gene=0.5)"
if name.startswith("hk_"): return "other hk\n(p_gene=0.0)"
return "marker\n(p_gene=0.0)"
tg["_cat"] = tg["feature_name"].apply(_gene_cat)
rate_g = (
tg.groupby("_cat")["is_leaked"]
.agg(n_leaked="sum", n_total="count")
.assign(pct=lambda d: 100 * d["n_leaked"] / d["n_total"])
.reset_index()
)
fig, axs = plt.subplots(1, 2, figsize=(13, 5))
# ── Left: leak rate per gene category ────────────────────────────────────────
pal_g = {"marker\n(p_gene=0.0)": "#1f77b4",
"other hk\n(p_gene=0.0)": "#aec7e8",
"hk_0/hk_1\n(p_gene=0.5)": "#ff7f0e"}
cats = rate_g["_cat"].tolist()
axs[0].bar(range(len(cats)), rate_g["pct"].values,
color=[pal_g[c] for c in cats], width=0.45)
axs[0].set_xticks(range(len(cats)))
axs[0].set_xticklabels(cats, fontsize=9)
for i, val in enumerate(rate_g["pct"].values):
axs[0].text(i, val + 0.4, f"{val:.1f}%", ha="center", va="bottom", fontsize=9)
axs[0].set_ylabel("% leaked transcripts")
axs[0].set_title("Leak rate by gene category\n"
"(leakage_by_celltype=0.05 baseline; p_eff = 1−(1−p_ct)(1−p_gene))")
# ── Right: zoom — marker gene (no gene leak) vs hk_0 (p_gene=0.5) ────────────
xlo, xhi, ylo, yhi = 60.0, 140.0, 60.0, 140.0
zm = ((tg["x_location"] > xlo) & (tg["x_location"] < xhi) &
(tg["y_location"] > ylo) & (tg["y_location"] < yhi))
tg_z = tg[zm]
ax = axs[1]
cg.set_geometry("cell_geometry").plot(ax=ax, color="#eeeeee", edgecolor="k",
linewidth=0.3, alpha=0.6)
for gene, col in [("ct_0_mkr_0", "#1f77b4"), ("hk_0", "#ff7f0e")]:
sub = tg_z[tg_z["feature_name"] == gene]
inside = sub[sub["is_leaked"] == 0]
outside = sub[sub["is_leaked"] == 1]
ax.scatter(inside["x_location"], inside["y_location"],
s=12, c=col, alpha=0.9, marker="o", zorder=3,
label=f"{gene} in-cell ({len(inside)})")
ax.scatter(outside["x_location"], outside["y_location"],
s=18, c=col, alpha=0.9, marker="x", linewidths=1.4, zorder=4,
label=f"{gene} leaked ({len(outside)})")
ax.set_xlim(xlo, xhi); ax.set_ylim(ylo, yhi)
ax.set_aspect("equal")
ax.set_title("Zoom: ct_0_mkr_0 (marker, p_gene=0) vs hk_0 (p_gene=0.5)\n"
"crosses = leaked outside cell boundary")
ax.set_xlabel("x (µm)"); ax.set_ylabel("y (µm)")
ax.legend(fontsize=8, loc="upper right", framealpha=0.8)
fig.suptitle("Gene-level leakage via leakage_by_gene", fontsize=12)
fig.tight_layout()
plt.show()
Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 500 Genes: 80 Transcripts: 11087
4.4 Parameter scan: leakage_by_celltype × leak_dist_factor¶
A 5 × 5 sweep shows how the two parameters act largely independently:
leakage_by_celltype(columns) → controls the fraction of transcripts that escapeleak_dist_factor(rows) → controls how far leaked transcripts diffuse from the boundary
Results are summarised as two heatmaps: actual leak percentage and median leaked-transcript distance from the cell boundary (computed from the theoretical truncated-exponential median).
Note:
leak_dist_factorscalesmax_dist = dist_factor × r_effper cell, so larger cells will always show greater absolute distances even at the samedist_factor. The right heatmap therefore reflects bothdist_factorand the typical cell size of the simulated tissue (here 150 µm canvas, 150 cells). The theoretical median distance is≈ 0.189 × max_dist(from the truncated exponential withcoverage = 0.97).
import warnings; warnings.filterwarnings("ignore")
# Theoretical median fraction for truncated Exp(λ) on [0, max_dist] with coverage=0.97:
# F(median) = 0.5 → median = −ln(1 − 0.5·coverage) / λ = MED_FRAC × max_dist
COVERAGE = 0.97
MED_FRAC = -np.log(1.0 - 0.5 * COVERAGE) / (-np.log(1.0 - COVERAGE)) # ≈ 0.189
leak_vals = [0.02, 0.08, 0.15, 0.25, 0.40]
dist_vals = [0.3, 0.7, 1.2, 2.0, 3.5]
pct_grid = np.zeros((len(dist_vals), len(leak_vals)))
med_grid = np.zeros((len(dist_vals), len(leak_vals)))
for i, df in enumerate(dist_vals):
for j, lk in enumerate(leak_vals):
cfg_s = SimulationConfig(
seed=7, n_cells=150, n_celltype=3, n_genes=20, n_markers=6,
continuity=0.2, canvas_size=150.0,
leakage_by_celltype=lk, leak_dist_factor=df,
)
s = SpotlessSimulator(cfg_s)
s.run_full_simulation()
pct_grid[i, j] = s.trs_df["is_leaked"].mean() * 100
# Theoretical median distance from boundary
median_r = np.sqrt(s.cell_gdf["cell_geometry"].area.median() / np.pi)
med_grid[i, j] = MED_FRAC * df * median_r
print(f"Scan complete ({len(dist_vals)*len(leak_vals)} configurations)")
# ── Two heatmaps ──────────────────────────────────────────────────────────────
row_labels = [f"{v:.1f}" for v in dist_vals]
col_labels = [f"{v:.2f}" for v in leak_vals]
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
for ax, data, cmap, title, unit in [
(axs[0], pct_grid, "YlOrRd", "% leaked transcripts (actual)", "%"),
(axs[1], med_grid, "Blues", "Median leak distance from boundary (µm)", "µm"),
]:
im = ax.imshow(data, aspect="auto", cmap=cmap)
ax.set_xticks(range(len(leak_vals))); ax.set_xticklabels(col_labels, fontsize=9)
ax.set_yticks(range(len(dist_vals))); ax.set_yticklabels(row_labels, fontsize=9)
ax.set_xlabel("leakage_by_celltype", fontsize=10)
ax.set_ylabel("leak_dist_factor", fontsize=10)
ax.set_title(title, fontsize=11)
plt.colorbar(im, ax=ax, label=unit, shrink=0.85)
vmax = data.max()
for r in range(len(dist_vals)):
for c in range(len(leak_vals)):
txt = "white" if data[r, c] > 0.60 * vmax else "black"
ax.text(c, r, f"{data[r, c]:.1f}", ha="center", va="center",
fontsize=8, color=txt)
fig.suptitle("Parameter scan: leakage_by_celltype × leak_dist_factor (5 × 5 grid)",
fontsize=13, fontweight="bold")
fig.tight_layout()
plt.show()
Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Generating gene parameters... Initializing cells... Simulating transcript counts... Simulating transcript locations... Simulation complete! Cells: 150 Genes: 20 Transcripts: 677 Scan complete (25 configurations)
Summary¶
| Parameter | Effect | Typical range |
|---|---|---|
leakage_by_celltype |
Fraction leaked — can be per-celltype dict | 0.0 – 0.4 |
leak_dist_factor |
Diffusion scale (multiplier on cell radius) | 0.5 – 3.0 |
Transcript DataFrame columns:
| Column | Description |
|---|---|
x_location, y_location |
Spatial coordinates (µm) |
feature_name |
Gene name |
cell_id |
Originating cell |
is_leaked |
1 if placed outside cell boundary |
overlaps_nucleus |
1 if inside the nucleus polygon |
celltype_sim |
Cell type of the originating cell |
Next: 05_configuration.ipynb — how to build configs from simple to complex.