Step 4 — Simulate Transcript Locations¶

This notebook explains how discrete count values are converted to spatial (x, y) coordinates, including the leakage model that places a fraction of transcripts outside their cell boundary.

Two placement modes

Mode	Placement	Fraction controlled by
In-cell	Uniform rejection-sampled inside the cell polygon	`1 - p_effective`
Leaked	Truncated exponential distance outside the cell boundary	`p_effective`

The leaked transcript distance follows a truncated exponential with rate λ, clipped at max_dist = leak_dist_factor × r_eff (where r_eff = √(cell_area / π)):

P(distance > d) ∝ exp(−λ · d)      for  d ∈ [0, max_dist]

λ = −ln(1 − coverage) / max_dist
  = −ln(1 − coverage) / (leak_dist_factor × r_eff)

With the default coverage = 0.97, this means 97 % of leaked transcripts land within max_dist, giving λ ≈ 3.507 / (leak_dist_factor × r_eff). A larger leak_dist_factor or a larger cell both reduce λ and spread transcripts further from the boundary.

Effective leak probability combines cell-type and gene-level rates via a union rule:

p_effective = 1 − (1 − p_celltype) × (1 − p_gene)

Key parameters:

Parameter	Effect
`leakage_by_celltype`	Per-cell-type fraction of transcripts that leak
`leak_dist_factor`	Diffusion scale: multiplier on effective cell radius
`leakage_by_gene`	Per-gene additional leak probability (additive via union rule)

In [1]:

Copied!





import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from scipy.stats import expon
from shapely.geometry import Point

from STpuppeteer.simulation import SimulationConfig, SpotlessSimulator

CT_PALETTE = {"ct_0": "#1f77b4", "ct_1": "#ff7f0e", "ct_2": "#2ca02c"}
plt.rcParams.update({"figure.dpi": 110, "axes.spines.top": False, "axes.spines.right": False})
print("Imports OK")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from scipy.stats import expon
from shapely.geometry import Point

from STpuppeteer.simulation import SimulationConfig, SpotlessSimulator

CT_PALETTE = {"ct_0": "#1f77b4", "ct_1": "#ff7f0e", "ct_2": "#2ca02c"}
plt.rcParams.update({"figure.dpi": 110, "axes.spines.top": False, "axes.spines.right": False})
print("Imports OK")

Imports OK

4.1 Run a full simulation¶

In [2]:

Copied!





config = SimulationConfig(
    seed=42, canvas_size=200.0,
    n_cells=500, n_celltype=3,
    celltype_proportion=[0.5, 0.3, 0.2],
    continuity=0.25,
    n_genes=80, n_markers=20,
    marker_mu=0.75,
    leakage_by_celltype=[0.08, 0.15, 0.05],  # different rates per cell type
    leak_dist_factor=1.0,
)

sim = SpotlessSimulator(config)
sim.run_full_simulation()

trs_df   = sim.trs_df
cell_gdf = sim.cell_gdf
cell_gdf["_color"] = cell_gdf["celltype"].map(CT_PALETTE)

n_leaked = trs_df["is_leaked"].sum()
print(f"Total transcripts : {len(trs_df):,}")
print(f"Leaked            : {n_leaked:,}  ({100*n_leaked/len(trs_df):.1f}%)")
trs_df.head(3)
config = SimulationConfig(
    seed=42, canvas_size=200.0,
    n_cells=500, n_celltype=3,
    celltype_proportion=[0.5, 0.3, 0.2],
    continuity=0.25,
    n_genes=80, n_markers=20,
    marker_mu=0.75,
    leakage_by_celltype=[0.08, 0.15, 0.05],  # different rates per cell type
    leak_dist_factor=1.0,
)

sim = SpotlessSimulator(config)
sim.run_full_simulation()

trs_df   = sim.trs_df
cell_gdf = sim.cell_gdf
cell_gdf["_color"] = cell_gdf["celltype"].map(CT_PALETTE)

n_leaked = trs_df["is_leaked"].sum()
print(f"Total transcripts : {len(trs_df):,}")
print(f"Leaked            : {n_leaked:,}  ({100*n_leaked/len(trs_df):.1f}%)")
trs_df.head(3)

Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 500
  Genes: 80
  Transcripts: 11087
Total transcripts : 11,087
Leaked            : 1,016  (9.2%)

Out[2]:

	transcript_id	cell_id	feature_name	x_location	y_location	z_location	overlaps_nucleus	celltype_sim
0	tr_0	cell_1	ct_2_mkr_4	101.892839	16.868313	0.773956	1	ct_2
1	tr_1	cell_1	ct_2_mkr_13	103.062105	10.618926	0.438878	1	ct_2
2	tr_2	cell_1	ct_2_mkr_13	101.971864	14.124372	0.858598	1	ct_2

4.2 Full spatial transcript map¶

In-cell transcripts are coloured by cell type; leaked transcripts are shown as small black crosses.

In [3]:

Copied!





fig, ax = plt.subplots(figsize=(9, 9))

cell_gdf.set_geometry("cell_geometry").plot(
    ax=ax, color=cell_gdf["_color"], alpha=0.25, edgecolor="k", linewidth=0.25)
cell_gdf.set_geometry("nucleus_geometry").plot(
    ax=ax, color=cell_gdf["_color"], alpha=0.55, edgecolor="none")

ok = trs_df[trs_df["is_leaked"] == 0]
lk = trs_df[trs_df["is_leaked"] == 1]

sns.scatterplot(data=ok, x="x_location", y="y_location",
                hue="celltype_sim", palette=CT_PALETTE,
                s=2, alpha=0.5, edgecolor="none", ax=ax, legend=False)
ax.scatter(lk["x_location"], lk["y_location"],
           s=5, c="black", marker="x", alpha=0.4, linewidths=0.4,
           label=f"leaked ({len(lk):,})")

ct_handles = [mpatches.Patch(facecolor=c, alpha=0.7, label=k) for k, c in CT_PALETTE.items()]
lk_handle  = plt.Line2D([0],[0], marker="x", color="black", linestyle="none",
                         markersize=6, label=f"leaked ({len(lk):,})")
ax.legend(handles=ct_handles + [lk_handle], title="Cell type / status", fontsize=8)
ax.set_aspect("equal")
ax.set_title("Transcript spatial locations", fontsize=13)
ax.set_xlabel("x (µm)"); ax.set_ylabel("y (µm)")
fig.tight_layout()
plt.show()
fig, ax = plt.subplots(figsize=(9, 9))

cell_gdf.set_geometry("cell_geometry").plot(
    ax=ax, color=cell_gdf["_color"], alpha=0.25, edgecolor="k", linewidth=0.25)
cell_gdf.set_geometry("nucleus_geometry").plot(
    ax=ax, color=cell_gdf["_color"], alpha=0.55, edgecolor="none")

ok = trs_df[trs_df["is_leaked"] == 0]
lk = trs_df[trs_df["is_leaked"] == 1]

sns.scatterplot(data=ok, x="x_location", y="y_location",
                hue="celltype_sim", palette=CT_PALETTE,
                s=2, alpha=0.5, edgecolor="none", ax=ax, legend=False)
ax.scatter(lk["x_location"], lk["y_location"],
           s=5, c="black", marker="x", alpha=0.4, linewidths=0.4,
           label=f"leaked ({len(lk):,})")

ct_handles = [mpatches.Patch(facecolor=c, alpha=0.7, label=k) for k, c in CT_PALETTE.items()]
lk_handle  = plt.Line2D([0],[0], marker="x", color="black", linestyle="none",
                         markersize=6, label=f"leaked ({len(lk):,})")
ax.legend(handles=ct_handles + [lk_handle], title="Cell type / status", fontsize=8)
ax.set_aspect("equal")
ax.set_title("Transcript spatial locations", fontsize=13)
ax.set_xlabel("x (µm)"); ax.set_ylabel("y (µm)")
fig.tight_layout()
plt.show()

No description has been provided for this image

4.3 Leakage analysis¶

Two diagnostics: per-cell-type leakage rate, and the distance distribution of leaked transcripts from their cell boundary.

In [4]:

Copied!





fig, axs = plt.subplots(1, 3, figsize=(17, 4.5))

# ── Left: leakage rate per cell type ─────────────────────────────────────────
leak_rate = (
    trs_df.groupby("celltype_sim")["is_leaked"]
    .agg(["sum", "count"])
    .assign(rate=lambda d: d["sum"] / d["count"] * 100)
    .reset_index()
)
bars = axs[0].bar(leak_rate["celltype_sim"], leak_rate["rate"],
                  color=[CT_PALETTE[c] for c in leak_rate["celltype_sim"]], width=0.5)
axs[0].axhline(leak_rate["rate"].mean(), color="k", ls="--", lw=1, label="mean")
axs[0].set_title("Leakage rate by cell type")
axs[0].set_ylabel("% leaked transcripts")
axs[0].legend()
for bar, val in zip(bars, leak_rate["rate"]):
    axs[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.2,
                f"{val:.1f}%", ha="center", va="bottom", fontsize=9)

# ── Compute per-transcript distance and max_dist ──────────────────────────────
COVERAGE   = 0.97
leaked_df  = trs_df[trs_df["is_leaked"] == 1].copy()
cell_polys = dict(zip(cell_gdf["cell_id"], cell_gdf["cell_geometry"]))

leaked_df["dist_to_poly"] = [
    cell_polys[cid].exterior.distance(Point(x, y))
    for cid, x, y in zip(leaked_df["cell_id"], leaked_df["x_location"], leaked_df["y_location"])
]

# Per-transcript max_dist = leak_dist_factor × r_eff  (r_eff = √(cell_area/π))
r_eff_map            = {cid: np.sqrt(poly.area / np.pi) for cid, poly in cell_polys.items()}
leaked_df["r_eff"]   = leaked_df["cell_id"].map(r_eff_map)
leaked_df["max_dist"] = config.leak_dist_factor * leaked_df["r_eff"]
leaked_df["dist_norm"] = (leaked_df["dist_to_poly"] / leaked_df["max_dist"]).clip(0, None)

d = leaked_df["dist_to_poly"].values
d = d[d > 0]

# ── Middle: raw distance + fitted exp + theoretical λ line ───────────────────
_, scale_fit  = expon.fit(d, floc=0)
lam_fit       = 1.0 / scale_fit
# Theoretical λ uses each cell's own max_dist; approximate with median max_dist
med_max_dist  = leaked_df["max_dist"].median()
lam_theory    = -np.log(1.0 - COVERAGE) / med_max_dist

x_fit = np.linspace(0, np.percentile(d, 99), 200)
axs[1].hist(d, bins=40, density=True, color="#6acc65", edgecolor="white",
            linewidth=0.4, alpha=0.85, label="Observed")
axs[1].plot(x_fit, expon.pdf(x_fit, 0, 1 / lam_fit),
            "k-",  lw=2, label=f"Fitted Exp (λ = {lam_fit:.3f} µm⁻¹)")
axs[1].plot(x_fit, expon.pdf(x_fit, 0, 1 / lam_theory),
            "r--", lw=2, label=f"Theoretical Exp (λ = {lam_theory:.3f} µm⁻¹)")
axs[1].set_title("Distance to cell boundary")
axs[1].set_xlabel("Distance (µm)")
axs[1].legend(fontsize=8)

# ── Right: distance normalised by max_dist ────────────────────────────────────
# After normalisation s = d/max_dist the distribution should collapse to
#   f(s) = λ_norm · exp(−λ_norm · s) / coverage   with λ_norm = −ln(1−coverage)
# regardless of cell size or dist_factor.
s        = leaked_df["dist_norm"].values
s        = s[s > 0]
lam_norm = -np.log(1.0 - COVERAGE)          # ≈ 3.507
x_norm   = np.linspace(0, 1.1, 200)
pdf_norm = lam_norm * np.exp(-lam_norm * x_norm) / COVERAGE

axs[2].hist(s, bins=40, density=True, color="#7c9dcc", edgecolor="white",
            linewidth=0.4, alpha=0.85, label="Observed")
axs[2].plot(x_norm, pdf_norm, "r--", lw=2,
            label=f"Theoretical (λ = {lam_norm:.2f})")
axs[2].axvline(1.0, color="k", ls=":", lw=1, label="max_dist  (coverage=97%)")
axs[2].set_title("Normalised distance  (d / max_dist)")
axs[2].set_xlabel("d  /  (leak_dist_factor × r_eff)")
axs[2].legend(fontsize=8)

fig.suptitle("4.3  Leakage analysis", fontsize=12)
fig.tight_layout()
plt.show()
fig, axs = plt.subplots(1, 3, figsize=(17, 4.5))

# ── Left: leakage rate per cell type ─────────────────────────────────────────
leak_rate = (
    trs_df.groupby("celltype_sim")["is_leaked"]
    .agg(["sum", "count"])
    .assign(rate=lambda d: d["sum"] / d["count"] * 100)
    .reset_index()
)
bars = axs[0].bar(leak_rate["celltype_sim"], leak_rate["rate"],
                  color=[CT_PALETTE[c] for c in leak_rate["celltype_sim"]], width=0.5)
axs[0].axhline(leak_rate["rate"].mean(), color="k", ls="--", lw=1, label="mean")
axs[0].set_title("Leakage rate by cell type")
axs[0].set_ylabel("% leaked transcripts")
axs[0].legend()
for bar, val in zip(bars, leak_rate["rate"]):
    axs[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.2,
                f"{val:.1f}%", ha="center", va="bottom", fontsize=9)

# ── Compute per-transcript distance and max_dist ──────────────────────────────
COVERAGE   = 0.97
leaked_df  = trs_df[trs_df["is_leaked"] == 1].copy()
cell_polys = dict(zip(cell_gdf["cell_id"], cell_gdf["cell_geometry"]))

leaked_df["dist_to_poly"] = [
    cell_polys[cid].exterior.distance(Point(x, y))
    for cid, x, y in zip(leaked_df["cell_id"], leaked_df["x_location"], leaked_df["y_location"])
]

# Per-transcript max_dist = leak_dist_factor × r_eff  (r_eff = √(cell_area/π))
r_eff_map            = {cid: np.sqrt(poly.area / np.pi) for cid, poly in cell_polys.items()}
leaked_df["r_eff"]   = leaked_df["cell_id"].map(r_eff_map)
leaked_df["max_dist"] = config.leak_dist_factor * leaked_df["r_eff"]
leaked_df["dist_norm"] = (leaked_df["dist_to_poly"] / leaked_df["max_dist"]).clip(0, None)

d = leaked_df["dist_to_poly"].values
d = d[d > 0]

# ── Middle: raw distance + fitted exp + theoretical λ line ───────────────────
_, scale_fit  = expon.fit(d, floc=0)
lam_fit       = 1.0 / scale_fit
# Theoretical λ uses each cell's own max_dist; approximate with median max_dist
med_max_dist  = leaked_df["max_dist"].median()
lam_theory    = -np.log(1.0 - COVERAGE) / med_max_dist

x_fit = np.linspace(0, np.percentile(d, 99), 200)
axs[1].hist(d, bins=40, density=True, color="#6acc65", edgecolor="white",
            linewidth=0.4, alpha=0.85, label="Observed")
axs[1].plot(x_fit, expon.pdf(x_fit, 0, 1 / lam_fit),
            "k-",  lw=2, label=f"Fitted Exp (λ = {lam_fit:.3f} µm⁻¹)")
axs[1].plot(x_fit, expon.pdf(x_fit, 0, 1 / lam_theory),
            "r--", lw=2, label=f"Theoretical Exp (λ = {lam_theory:.3f} µm⁻¹)")
axs[1].set_title("Distance to cell boundary")
axs[1].set_xlabel("Distance (µm)")
axs[1].legend(fontsize=8)

# ── Right: distance normalised by max_dist ────────────────────────────────────
# After normalisation s = d/max_dist the distribution should collapse to
#   f(s) = λ_norm · exp(−λ_norm · s) / coverage   with λ_norm = −ln(1−coverage)
# regardless of cell size or dist_factor.
s        = leaked_df["dist_norm"].values
s        = s[s > 0]
lam_norm = -np.log(1.0 - COVERAGE)          # ≈ 3.507
x_norm   = np.linspace(0, 1.1, 200)
pdf_norm = lam_norm * np.exp(-lam_norm * x_norm) / COVERAGE

axs[2].hist(s, bins=40, density=True, color="#7c9dcc", edgecolor="white",
            linewidth=0.4, alpha=0.85, label="Observed")
axs[2].plot(x_norm, pdf_norm, "r--", lw=2,
            label=f"Theoretical (λ = {lam_norm:.2f})")
axs[2].axvline(1.0, color="k", ls=":", lw=1, label="max_dist  (coverage=97%)")
axs[2].set_title("Normalised distance  (d / max_dist)")
axs[2].set_xlabel("d  /  (leak_dist_factor × r_eff)")
axs[2].legend(fontsize=8)

fig.suptitle("4.3  Leakage analysis", fontsize=12)
fig.tight_layout()
plt.show()

4.3.2 Gene-level effects: `leakage_by_gene`¶

leakage_by_gene adds a per-gene leak probability on top of the cell-type rate:

p_effective = 1 − (1 − p_celltype) × (1 − p_gene)

Below we run a matched simulation where two housekeeping genes (hk_0, hk_1) have an elevated p_gene = 0.5, while all marker genes stay at zero. The bar chart confirms the union-rule lift; the zoom then shows the resulting spatial difference between a no-leakage marker gene and a high-leakage housekeeping gene.

In [5]:

Copied!





# Re-run with gene-level leakage isolated: hk_0 and hk_1 get p_gene=0.5, markers get 0
cfg_g = SimulationConfig(
    seed=42, canvas_size=200.0,
    n_cells=500, n_celltype=3,
    celltype_proportion=[0.5, 0.3, 0.2],
    continuity=0.25,
    n_genes=80, n_markers=20,
    marker_mu=0.75,
    leakage_by_celltype=0.05,                    # uniform baseline
    leakage_by_gene={"hk_0": 0.5, "hk_1": 0.5}, # extra leakage for two hk genes
    leak_dist_factor=1.2,
)
sim_g = SpotlessSimulator(cfg_g)
sim_g.run_full_simulation()
tg    = sim_g.trs_df
cg    = sim_g.cell_gdf

def _gene_cat(name):
    if name in ("hk_0", "hk_1"): return "hk_0/hk_1\n(p_gene=0.5)"
    if name.startswith("hk_"):   return "other hk\n(p_gene=0.0)"
    return "marker\n(p_gene=0.0)"

tg["_cat"] = tg["feature_name"].apply(_gene_cat)

rate_g = (
    tg.groupby("_cat")["is_leaked"]
    .agg(n_leaked="sum", n_total="count")
    .assign(pct=lambda d: 100 * d["n_leaked"] / d["n_total"])
    .reset_index()
)

fig, axs = plt.subplots(1, 2, figsize=(13, 5))

# ── Left: leak rate per gene category ────────────────────────────────────────
pal_g = {"marker\n(p_gene=0.0)": "#1f77b4",
         "other hk\n(p_gene=0.0)": "#aec7e8",
         "hk_0/hk_1\n(p_gene=0.5)": "#ff7f0e"}
cats = rate_g["_cat"].tolist()
axs[0].bar(range(len(cats)), rate_g["pct"].values,
           color=[pal_g[c] for c in cats], width=0.45)
axs[0].set_xticks(range(len(cats)))
axs[0].set_xticklabels(cats, fontsize=9)
for i, val in enumerate(rate_g["pct"].values):
    axs[0].text(i, val + 0.4, f"{val:.1f}%", ha="center", va="bottom", fontsize=9)
axs[0].set_ylabel("% leaked transcripts")
axs[0].set_title("Leak rate by gene category\n"
                 "(leakage_by_celltype=0.05 baseline;  p_eff = 1−(1−p_ct)(1−p_gene))")

# ── Right: zoom — marker gene (no gene leak) vs hk_0 (p_gene=0.5) ────────────
xlo, xhi, ylo, yhi = 60.0, 140.0, 60.0, 140.0
zm = ((tg["x_location"] > xlo) & (tg["x_location"] < xhi) &
      (tg["y_location"] > ylo) & (tg["y_location"] < yhi))
tg_z = tg[zm]

ax = axs[1]
cg.set_geometry("cell_geometry").plot(ax=ax, color="#eeeeee", edgecolor="k",
                                       linewidth=0.3, alpha=0.6)

for gene, col in [("ct_0_mkr_0", "#1f77b4"), ("hk_0", "#ff7f0e")]:
    sub     = tg_z[tg_z["feature_name"] == gene]
    inside  = sub[sub["is_leaked"] == 0]
    outside = sub[sub["is_leaked"] == 1]
    ax.scatter(inside["x_location"],  inside["y_location"],
               s=12, c=col, alpha=0.9, marker="o", zorder=3,
               label=f"{gene}  in-cell ({len(inside)})")
    ax.scatter(outside["x_location"], outside["y_location"],
               s=18, c=col, alpha=0.9, marker="x", linewidths=1.4, zorder=4,
               label=f"{gene}  leaked ({len(outside)})")

ax.set_xlim(xlo, xhi); ax.set_ylim(ylo, yhi)
ax.set_aspect("equal")
ax.set_title("Zoom: ct_0_mkr_0 (marker, p_gene=0) vs hk_0 (p_gene=0.5)\n"
             "crosses = leaked outside cell boundary")
ax.set_xlabel("x (µm)"); ax.set_ylabel("y (µm)")
ax.legend(fontsize=8, loc="upper right", framealpha=0.8)

fig.suptitle("Gene-level leakage via leakage_by_gene", fontsize=12)
fig.tight_layout()
plt.show()
# Re-run with gene-level leakage isolated: hk_0 and hk_1 get p_gene=0.5, markers get 0
cfg_g = SimulationConfig(
    seed=42, canvas_size=200.0,
    n_cells=500, n_celltype=3,
    celltype_proportion=[0.5, 0.3, 0.2],
    continuity=0.25,
    n_genes=80, n_markers=20,
    marker_mu=0.75,
    leakage_by_celltype=0.05,                    # uniform baseline
    leakage_by_gene={"hk_0": 0.5, "hk_1": 0.5}, # extra leakage for two hk genes
    leak_dist_factor=1.2,
)
sim_g = SpotlessSimulator(cfg_g)
sim_g.run_full_simulation()
tg    = sim_g.trs_df
cg    = sim_g.cell_gdf

def _gene_cat(name):
    if name in ("hk_0", "hk_1"): return "hk_0/hk_1\n(p_gene=0.5)"
    if name.startswith("hk_"):   return "other hk\n(p_gene=0.0)"
    return "marker\n(p_gene=0.0)"

tg["_cat"] = tg["feature_name"].apply(_gene_cat)

rate_g = (
    tg.groupby("_cat")["is_leaked"]
    .agg(n_leaked="sum", n_total="count")
    .assign(pct=lambda d: 100 * d["n_leaked"] / d["n_total"])
    .reset_index()
)

fig, axs = plt.subplots(1, 2, figsize=(13, 5))

# ── Left: leak rate per gene category ────────────────────────────────────────
pal_g = {"marker\n(p_gene=0.0)": "#1f77b4",
         "other hk\n(p_gene=0.0)": "#aec7e8",
         "hk_0/hk_1\n(p_gene=0.5)": "#ff7f0e"}
cats = rate_g["_cat"].tolist()
axs[0].bar(range(len(cats)), rate_g["pct"].values,
           color=[pal_g[c] for c in cats], width=0.45)
axs[0].set_xticks(range(len(cats)))
axs[0].set_xticklabels(cats, fontsize=9)
for i, val in enumerate(rate_g["pct"].values):
    axs[0].text(i, val + 0.4, f"{val:.1f}%", ha="center", va="bottom", fontsize=9)
axs[0].set_ylabel("% leaked transcripts")
axs[0].set_title("Leak rate by gene category\n"
                 "(leakage_by_celltype=0.05 baseline;  p_eff = 1−(1−p_ct)(1−p_gene))")

# ── Right: zoom — marker gene (no gene leak) vs hk_0 (p_gene=0.5) ────────────
xlo, xhi, ylo, yhi = 60.0, 140.0, 60.0, 140.0
zm = ((tg["x_location"] > xlo) & (tg["x_location"] < xhi) &
      (tg["y_location"] > ylo) & (tg["y_location"] < yhi))
tg_z = tg[zm]

ax = axs[1]
cg.set_geometry("cell_geometry").plot(ax=ax, color="#eeeeee", edgecolor="k",
                                       linewidth=0.3, alpha=0.6)

for gene, col in [("ct_0_mkr_0", "#1f77b4"), ("hk_0", "#ff7f0e")]:
    sub     = tg_z[tg_z["feature_name"] == gene]
    inside  = sub[sub["is_leaked"] == 0]
    outside = sub[sub["is_leaked"] == 1]
    ax.scatter(inside["x_location"],  inside["y_location"],
               s=12, c=col, alpha=0.9, marker="o", zorder=3,
               label=f"{gene}  in-cell ({len(inside)})")
    ax.scatter(outside["x_location"], outside["y_location"],
               s=18, c=col, alpha=0.9, marker="x", linewidths=1.4, zorder=4,
               label=f"{gene}  leaked ({len(outside)})")

ax.set_xlim(xlo, xhi); ax.set_ylim(ylo, yhi)
ax.set_aspect("equal")
ax.set_title("Zoom: ct_0_mkr_0 (marker, p_gene=0) vs hk_0 (p_gene=0.5)\n"
             "crosses = leaked outside cell boundary")
ax.set_xlabel("x (µm)"); ax.set_ylabel("y (µm)")
ax.legend(fontsize=8, loc="upper right", framealpha=0.8)

fig.suptitle("Gene-level leakage via leakage_by_gene", fontsize=12)
fig.tight_layout()
plt.show()

Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 500
  Genes: 80
  Transcripts: 11087

4.4 Parameter scan: `leakage_by_celltype` × `leak_dist_factor`¶

A 5 × 5 sweep shows how the two parameters act largely independently:

leakage_by_celltype (columns) → controls the fraction of transcripts that escape
leak_dist_factor (rows) → controls how far leaked transcripts diffuse from the boundary

Results are summarised as two heatmaps: actual leak percentage and median leaked-transcript distance from the cell boundary (computed from the theoretical truncated-exponential median).

Note: leak_dist_factor scales max_dist = dist_factor × r_eff per cell, so larger cells will always show greater absolute distances even at the same dist_factor. The right heatmap therefore reflects both dist_factor and the typical cell size of the simulated tissue (here 150 µm canvas, 150 cells). The theoretical median distance is ≈ 0.189 × max_dist (from the truncated exponential with coverage = 0.97).

In [6]:

Copied!





import warnings; warnings.filterwarnings("ignore")

# Theoretical median fraction for truncated Exp(λ) on [0, max_dist] with coverage=0.97:
#   F(median) = 0.5  →  median = −ln(1 − 0.5·coverage) / λ = MED_FRAC × max_dist
COVERAGE = 0.97
MED_FRAC = -np.log(1.0 - 0.5 * COVERAGE) / (-np.log(1.0 - COVERAGE))  # ≈ 0.189

leak_vals = [0.02, 0.08, 0.15, 0.25, 0.40]
dist_vals = [0.3,  0.7,  1.2,  2.0,  3.5]

pct_grid = np.zeros((len(dist_vals), len(leak_vals)))
med_grid = np.zeros((len(dist_vals), len(leak_vals)))

for i, df in enumerate(dist_vals):
    for j, lk in enumerate(leak_vals):
        cfg_s = SimulationConfig(
            seed=7, n_cells=150, n_celltype=3, n_genes=20, n_markers=6,
            continuity=0.2, canvas_size=150.0,
            leakage_by_celltype=lk, leak_dist_factor=df,
        )
        s = SpotlessSimulator(cfg_s)
        s.run_full_simulation()
        pct_grid[i, j] = s.trs_df["is_leaked"].mean() * 100
        # Theoretical median distance from boundary
        median_r        = np.sqrt(s.cell_gdf["cell_geometry"].area.median() / np.pi)
        med_grid[i, j]  = MED_FRAC * df * median_r

print(f"Scan complete  ({len(dist_vals)*len(leak_vals)} configurations)")

# ── Two heatmaps ──────────────────────────────────────────────────────────────
row_labels = [f"{v:.1f}" for v in dist_vals]
col_labels = [f"{v:.2f}" for v in leak_vals]

fig, axs = plt.subplots(1, 2, figsize=(14, 5))

for ax, data, cmap, title, unit in [
    (axs[0], pct_grid, "YlOrRd", "% leaked transcripts (actual)",            "%"),
    (axs[1], med_grid, "Blues",  "Median leak distance from boundary (µm)", "µm"),
]:
    im = ax.imshow(data, aspect="auto", cmap=cmap)
    ax.set_xticks(range(len(leak_vals))); ax.set_xticklabels(col_labels, fontsize=9)
    ax.set_yticks(range(len(dist_vals))); ax.set_yticklabels(row_labels, fontsize=9)
    ax.set_xlabel("leakage_by_celltype", fontsize=10)
    ax.set_ylabel("leak_dist_factor", fontsize=10)
    ax.set_title(title, fontsize=11)
    plt.colorbar(im, ax=ax, label=unit, shrink=0.85)
    vmax = data.max()
    for r in range(len(dist_vals)):
        for c in range(len(leak_vals)):
            txt = "white" if data[r, c] > 0.60 * vmax else "black"
            ax.text(c, r, f"{data[r, c]:.1f}", ha="center", va="center",
                    fontsize=8, color=txt)

fig.suptitle("Parameter scan: leakage_by_celltype × leak_dist_factor  (5 × 5 grid)",
             fontsize=13, fontweight="bold")
fig.tight_layout()
plt.show()
import warnings; warnings.filterwarnings("ignore")

# Theoretical median fraction for truncated Exp(λ) on [0, max_dist] with coverage=0.97:
#   F(median) = 0.5  →  median = −ln(1 − 0.5·coverage) / λ = MED_FRAC × max_dist
COVERAGE = 0.97
MED_FRAC = -np.log(1.0 - 0.5 * COVERAGE) / (-np.log(1.0 - COVERAGE))  # ≈ 0.189

leak_vals = [0.02, 0.08, 0.15, 0.25, 0.40]
dist_vals = [0.3,  0.7,  1.2,  2.0,  3.5]

pct_grid = np.zeros((len(dist_vals), len(leak_vals)))
med_grid = np.zeros((len(dist_vals), len(leak_vals)))

for i, df in enumerate(dist_vals):
    for j, lk in enumerate(leak_vals):
        cfg_s = SimulationConfig(
            seed=7, n_cells=150, n_celltype=3, n_genes=20, n_markers=6,
            continuity=0.2, canvas_size=150.0,
            leakage_by_celltype=lk, leak_dist_factor=df,
        )
        s = SpotlessSimulator(cfg_s)
        s.run_full_simulation()
        pct_grid[i, j] = s.trs_df["is_leaked"].mean() * 100
        # Theoretical median distance from boundary
        median_r        = np.sqrt(s.cell_gdf["cell_geometry"].area.median() / np.pi)
        med_grid[i, j]  = MED_FRAC * df * median_r

print(f"Scan complete  ({len(dist_vals)*len(leak_vals)} configurations)")

# ── Two heatmaps ──────────────────────────────────────────────────────────────
row_labels = [f"{v:.1f}" for v in dist_vals]
col_labels = [f"{v:.2f}" for v in leak_vals]

fig, axs = plt.subplots(1, 2, figsize=(14, 5))

for ax, data, cmap, title, unit in [
    (axs[0], pct_grid, "YlOrRd", "% leaked transcripts (actual)",            "%"),
    (axs[1], med_grid, "Blues",  "Median leak distance from boundary (µm)", "µm"),
]:
    im = ax.imshow(data, aspect="auto", cmap=cmap)
    ax.set_xticks(range(len(leak_vals))); ax.set_xticklabels(col_labels, fontsize=9)
    ax.set_yticks(range(len(dist_vals))); ax.set_yticklabels(row_labels, fontsize=9)
    ax.set_xlabel("leakage_by_celltype", fontsize=10)
    ax.set_ylabel("leak_dist_factor", fontsize=10)
    ax.set_title(title, fontsize=11)
    plt.colorbar(im, ax=ax, label=unit, shrink=0.85)
    vmax = data.max()
    for r in range(len(dist_vals)):
        for c in range(len(leak_vals)):
            txt = "white" if data[r, c] > 0.60 * vmax else "black"
            ax.text(c, r, f"{data[r, c]:.1f}", ha="center", va="center",
                    fontsize=8, color=txt)

fig.suptitle("Parameter scan: leakage_by_celltype × leak_dist_factor  (5 × 5 grid)",
             fontsize=13, fontweight="bold")
fig.tight_layout()
plt.show()

Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Generating gene parameters...
Initializing cells...
Simulating transcript counts...
Simulating transcript locations...
Simulation complete!
  Cells: 150
  Genes: 20
  Transcripts: 677
Scan complete  (25 configurations)

Summary¶

Parameter	Effect	Typical range
`leakage_by_celltype`	Fraction leaked — can be per-celltype dict	0.0 – 0.4
`leak_dist_factor`	Diffusion scale (multiplier on cell radius)	0.5 – 3.0

Transcript DataFrame columns:

Column	Description
`x_location`, `y_location`	Spatial coordinates (µm)
`feature_name`	Gene name
`cell_id`	Originating cell
`is_leaked`	1 if placed outside cell boundary
`overlaps_nucleus`	1 if inside the nucleus polygon
`celltype_sim`	Cell type of the originating cell

Next: 05_configuration.ipynb — how to build configs from simple to complex.