Model

Architecture

X-Cell is a set-level diffusion transformer that predicts perturbed gene expression profiles from control cell populations. Unlike autoregressive single-cell foundation models, X-Cell operates on sets of cells and is trained explicitly on interventional data via distribution-matching objectives.

X-Cell Architecture

Key design choices

Diffusion-based training. In each training sample, a randomly chosen fraction (25%, 50%, 75%, or 100%) of the control gene expression positions is replaced with the ground-truth perturbed values. The model learns to predict the full perturbed profile from this partially revealed input. At inference, predictions are refined iteratively over 4 steps (coarse-to-fine).
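The construction of a partially revealed training input can be sketched in NumPy as below. This is a minimal illustration, not the released training code: the function name `make_training_input`, the array shapes, and the choice to reveal the same gene positions for every cell in a set are assumptions made for the example.

```python
import numpy as np

# Reveal fractions named in the text above.
REVEAL_FRACTIONS = (0.25, 0.50, 0.75, 1.00)

def make_training_input(control, perturbed, rng):
    """Mix control and perturbed profiles for one training sample.

    control, perturbed: arrays of shape (n_cells, n_genes) or (n_genes,).
    Returns the partially revealed input and the boolean reveal mask.
    """
    n_genes = control.shape[-1]
    frac = rng.choice(REVEAL_FRACTIONS)          # one fraction per sample
    n_reveal = int(round(frac * n_genes))
    revealed = np.zeros(n_genes, dtype=bool)
    revealed[rng.choice(n_genes, size=n_reveal, replace=False)] = True
    # Assumption: the same gene positions are revealed for every cell in the set.
    mixed = np.where(revealed, perturbed, control)
    return mixed, revealed

# Toy usage with random counts standing in for real expression data.
rng = np.random.default_rng(0)
control = rng.poisson(2.0, size=(4, 8)).astype(float)
perturbed = rng.poisson(2.0, size=(4, 8)).astype(float)
mixed, revealed = make_training_input(control, perturbed, rng)
```

The model would then be trained to recover `perturbed` from `mixed`; how the reveal mask is exposed to the network is not specified here.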

Multi-modal biological priors via cross-attention. At every third self-attention layer, Flamingo-style cross-attention conditions gene representations on six prior knowledge tokens per perturbation:

Source	Content	Dimension
ESM-2	Protein language model embeddings	5120
STRING	Protein-protein interaction network	512
GenePT	LLM gene representations	3072
DepMap	Genetic dependency profiles	1150
JUMP-Cell Painting	Morphological features	259
Gene identity	Stop-gradient gene embedding	—
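A rough NumPy sketch of how the six prior tokens could be assembled and attended to is shown below. It assumes each prior is linearly projected to the Mini hidden dimension (512), that cross-attention sits at layers 3, 6, 9, and 12 of a 12-layer stack, and that a Flamingo-style tanh gate scales the residual update. The names (`build_prior_tokens`, `gated_cross_attention`, the dictionary keys) are hypothetical, and the learned query/key/value projections and multiple heads are omitted for brevity.

```python
import numpy as np

HIDDEN = 512  # hidden dim of X-Cell Mini
# Prior dimensions from the table above; the keys are illustrative names.
PRIOR_DIMS = {"esm2": 5120, "string": 512, "genept": 3072,
              "depmap": 1150, "jump_cp": 259}

def cross_attn_layer_indices(n_layers=12, every=3):
    """1-indexed layers that carry cross-attention (assumed placement)."""
    return [i for i in range(1, n_layers + 1) if i % every == 0]

def build_prior_tokens(priors, gene_embedding, projections):
    """Project each prior to HIDDEN and stack with the gene identity token."""
    tokens = [priors[name] @ projections[name] for name in PRIOR_DIMS]
    # Sixth token: gene identity embedding. A real framework would apply a
    # stop-gradient here; NumPy has no autograd, so it is a plain copy.
    tokens.append(np.asarray(gene_embedding))
    return np.stack(tokens)                      # shape (6, HIDDEN)

def gated_cross_attention(x, tokens, gate):
    """x: (n_genes, HIDDEN) queries; tokens: (6, HIDDEN) keys and values."""
    scores = x @ tokens.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # tanh gate: with gate = 0 the layer is an identity map.
    return x + np.tanh(gate) * (weights @ tokens)
```

With `every=3` and 12 layers, `cross_attn_layer_indices()` yields four layers, which matches the cross-attention layer count listed for X-Cell Mini.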

Tied output embeddings. The output head projects back through the shared gene embedding matrix (PaLM-style 1/√d scaling), acting as an implicit regularizer against conservative collapse.
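One reading of the tied output head is sketched below: the final hidden states are multiplied by the transpose of the shared gene embedding matrix and scaled by 1/√d. The function name and shapes are assumptions for illustration, and how these scores are converted to expression values is not covered.

```python
import numpy as np

def tied_output_scores(hidden, gene_embeddings):
    """Project hidden states through the shared embedding matrix.

    hidden:          (n_positions, d) final-layer representations
    gene_embeddings: (n_genes, d) matrix also used at the input
    Returns (n_positions, n_genes) scores scaled by 1/sqrt(d).
    """
    d = gene_embeddings.shape[1]
    return (hidden @ gene_embeddings.T) / np.sqrt(d)
```

Because no separate output matrix is learned, the head adds no parameters beyond the embedding table itself.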

X-Cell Mini

	X-Cell Mini
Parameters	55M
Layers	12
Hidden dim	512
Attention heads	8
FFN	ReLU, 1×
Normalization	Post-LN (LayerNorm)
Cross-attn layers	4
Init	scGPT
Training	Replogle-Nadig
Min GPU VRAM	8 GB (1 GPU)

Scaling

X-Cell follows power-law scaling, consistent with what has been reported for large language models. Training loss scales with parameter count N as L(N) ∝ N^(-0.32) (exponent α = 0.32, R² = 0.96), fitted across five model sizes from 83M to 3.1B parameters.
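To make the exponent concrete: under this power law, doubling the parameter count multiplies the predicted training loss by 2^(-0.32) ≈ 0.80. The sketch below computes such ratios and shows how an exponent and R² can be estimated with a straight-line fit in log-log space. The function names are illustrative, and no measured losses from the X-Cell runs are included.

```python
import numpy as np

ALPHA = 0.32  # exponent reported in the text

def predicted_loss_ratio(n_from, n_to, alpha=ALPHA):
    """Ratio L(n_to) / L(n_from) under L(N) proportional to N**(-alpha)."""
    return (n_to / n_from) ** (-alpha)

def fit_exponent(param_counts, losses):
    """Least-squares line in log-log space; returns (alpha, r_squared)."""
    x = np.log(np.asarray(param_counts, dtype=float))
    y = np.log(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    r_squared = 1.0 - residual.var() / y.var()
    return -slope, r_squared
```

For example, `predicted_loss_ratio(83e6, 3.1e9)` gives the loss ratio the fitted law implies between the smallest and largest model sizes mentioned above.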

Weights

Model weights are hosted on Hugging Face:

Xaira-Therapeutics/X-Cell
└── mini/     # X-Cell Mini (55M)
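One way to fetch only the `mini/` folder is the `huggingface_hub` client, sketched below. The helper name `download_mini` is hypothetical, the file names inside `mini/` are not assumed, and the repository may require authentication or license acceptance that is not handled here.

```python
REPO_ID = "Xaira-Therapeutics/X-Cell"  # repository named above
SUBFOLDER = "mini"                     # X-Cell Mini (55M)

def download_mini(local_dir=None):
    """Download the mini/ folder and return the local snapshot path."""
    # Imported lazily so this module loads without the dependency installed
    # (install with: pip install huggingface_hub).
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[f"{SUBFOLDER}/*"],  # skip everything outside mini/
        local_dir=local_dir,
    )
```

Calling `download_mini()` returns the path of the downloaded snapshot; the weights would then sit under its `mini/` subdirectory.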
