We standardize cell data, publish it openly, and tie model usage to on‑chain rewards for the datasets that trained it. Ship plates, earn attribution, and help a biology foundation model learn faster.
Think “flight simulator” for cells: explore interventions safely before committing bench time.
A strict‑where‑it‑matters schema so morphology + perturbation data "just ingests". The unit of exchange is the plate (optionally bundled with per‑cell features). It captures provenance, perturbations, acquisition, processing trace, QC, and lineage, with controlled vocabularies for channel roles & dose units.
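To make "strict‑where‑it‑matters" concrete, here is a minimal sketch of what a plate manifest and its vocabulary checks might look like. All field names, vocabulary entries, and the `validate` helper are illustrative assumptions, not the actual OMS schema.

```python
# Illustrative plate manifest (field names are hypothetical, not real OMS).
plate_manifest = {
    "plate_id": "PLATE-0001",
    "provenance": {"lab": "example-lab", "instrument": "widefield-01"},
    "perturbations": [
        {"well": "A01", "compound": "DMSO", "dose": 0.1, "dose_unit": "uM"},
    ],
    "channels": [
        {"name": "ch1", "role": "nucleus"},  # roles come from a controlled vocabulary
        {"name": "ch2", "role": "actin"},
    ],
    "qc": {"focus_score": 0.92, "flags": []},
}

# Example controlled vocabularies (contents are illustrative).
ALLOWED_DOSE_UNITS = {"uM", "nM", "mg_per_ml"}
ALLOWED_CHANNEL_ROLES = {"nucleus", "actin", "mito", "er", "brightfield"}

def validate(manifest: dict) -> list[str]:
    """Strict where it matters: only the controlled-vocabulary
    fields are hard-checked; everything else passes through."""
    errors = []
    for p in manifest["perturbations"]:
        if p["dose_unit"] not in ALLOWED_DOSE_UNITS:
            errors.append(f"bad dose unit: {p['dose_unit']}")
    for ch in manifest["channels"]:
        if ch["role"] not in ALLOWED_CHANNEL_ROLES:
            errors.append(f"bad channel role: {ch['role']}")
    return errors
```

A conforming manifest validates with no errors; a typo in a dose unit or channel role is caught at ingest rather than at training time.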
Public index of OMS bundles pinned to decentralized storage. Each dataset has a verifiable on‑chain attestation of content hashes, license, and beneficiary, enabling transparent lineage & attribution.
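The attestation idea above can be sketched in a few lines: content‑address the bundle, then record the hash alongside license and beneficiary. The hash choice (SHA‑256) and the record shape are assumptions for illustration, not the actual on‑chain format.

```python
import hashlib

def content_hash(payload: bytes) -> str:
    """Content-address a bundle by its SHA-256 digest
    (hash function choice is illustrative)."""
    return hashlib.sha256(payload).hexdigest()

def attestation(bundle_bytes: bytes, license_id: str, beneficiary: str) -> dict:
    """Shape of what an on-chain attestation might record (hypothetical)."""
    return {
        "content_hash": content_hash(bundle_bytes),
        "license": license_id,
        "beneficiary": beneficiary,
    }

record = attestation(b"...oms bundle bytes...", "CC-BY-4.0", "0xBENEFICIARY")
```

Anyone holding the bundle bytes can recompute the hash and check it against the attestation, which is what makes lineage tamper‑evident.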
Read / validate / export OMS, auto‑normalize channel semantics, compute hashes & QC, generate manifests, and interact with the registry + training pipeline. Works as a CLI or library to layer onto existing LIMS.
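"Auto‑normalize channel semantics" means mapping the free‑text channel names found in lab metadata onto the controlled role vocabulary. A toy version of that mapping, with an alias table that is purely illustrative:

```python
# Hypothetical alias table; real ingestion would carry a much larger,
# curated mapping from stain/fluorophore names to canonical roles.
CHANNEL_ALIASES = {
    "dapi": "nucleus",
    "hoechst": "nucleus",
    "phalloidin": "actin",
    "mitotracker": "mito",
}

def normalize_channel(raw_name: str) -> str:
    """Map a free-text channel name to a canonical role,
    falling back to 'unknown' for unrecognized stains."""
    key = raw_name.strip().lower()
    return CHANNEL_ALIASES.get(key, "unknown")
```

This is the kind of normalization that lets two labs' plates, one labeled "DAPI" and one labeled "Hoechst", land in the same `nucleus` channel role.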
A continuously trained, multimodal foundation model of cellular state. Encoders fuse imaging + omics + perturbation context into a shared latent space; a dynamics module simulates dose‑time response; decoders reconstruct modalities; uncertainty heads flag out‑of‑distribution inputs.
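The encoder → dynamics → uncertainty data flow can be illustrated with toy stand‑ins. Nothing below is a real model; the functions only show how fused latents, dose‑time conditioning, and an OOD score fit together.

```python
import math

def encode(imaging: list[float], omics: list[float],
           perturbation: list[float]) -> list[float]:
    """Toy fusion 'encoder': concatenates modality summaries into one
    latent vector. Real encoders are learned networks."""
    return imaging + omics + perturbation

def simulate(latent: list[float], dose: float, time_h: float) -> list[float]:
    """Toy dynamics module: scales the state by a saturating
    dose-time factor (functional form is purely illustrative)."""
    factor = (dose / (1.0 + dose)) * (1.0 - math.exp(-time_h / 24.0))
    return [x * (1.0 + factor) for x in latent]

def ood_score(latent: list[float], train_mean: float) -> float:
    """Toy uncertainty head: distance of the latent mean from a
    training-set statistic; high values flag out-of-distribution inputs."""
    m = sum(latent) / len(latent)
    return abs(m - train_mean)
```

At dose zero the toy dynamics leave the state unchanged, mirroring the expectation that an unperturbed cell state should be a fixed point of the simulator.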
Use a hosted endpoint (SLAs, compliance) or run open weights locally. Every served model version publishes a manifest of training datasets + proportional contribution weights for transparent reward calculation.
Biology is multimodal, perturbation‑centric, temporal, sparse, and metadata‑sensitive. Generic text/vision models lack the explicit dose/time conditioning, causal awareness, acquisition context, and calibrated uncertainty required for experimental decision‑making.
Rewards are simple: only datasets included in a served model version accrue proportional payouts. Attribution is grounded in content hashes; each checkpoint stores a vector of dataset contribution weights (e.g. gradient or usage share) that parameterizes the distribution.
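The proportional payout rule reduces to a one‑liner: revenue is split by each dataset's share of the checkpoint's contribution‑weight vector. The function and numbers below are illustrative.

```python
def payouts(revenue: float, weights: dict[str, float]) -> dict[str, float]:
    """Split model revenue in proportion to the dataset contribution
    weights stored in the served checkpoint's manifest."""
    total = sum(weights.values())
    return {ds: revenue * w / total for ds, w in weights.items()}

# Example: three datasets with contribution weights from a manifest.
shares = payouts(1000.0, {"ds_a": 0.5, "ds_b": 0.3, "ds_c": 0.2})
```

Because the weights are published per model version, anyone can recompute the split and audit a payout against the manifest.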
┌───────────┐               ┌──────────────┐
│   Data    │ ───────────▶  │   Registry   │
└───────────┘               └──────────────┘
      ▲                            │
      │                            │
      │                            ▼
┌───────────┐               ┌──────────────┐
│  Revenue  │ ◀───────────  │ Virtual Cell │
└───────────┘               └──────────────┘
What is a Virtual Cell?
A biology‑native foundation model: it integrates modalities into a calibrated cell‑state representation, predicts perturbation dose‑time responses, attributes pathways, and quantifies uncertainty.
Why standardization first?
Without shared provenance / perturbation / channel semantics, models overfit lab artifacts and cannot generalize. OMS makes morphology + context interoperable.
Why crypto?
To make attribution & payouts automatic, tamper‑evident, and portable. Hashes anchor dataset identity; smart contracts distribute usage‑based rewards. Crucially, on‑chain incentives create a positive flywheel: datasets that materially contribute to served models earn transparent, proportional rewards and reputation signals, which encourages higher‑quality submissions, richer metadata, and sustained participation that improves the model over time.
What earns?
Datasets whose features materially contribute to a served model (inclusion & weighted usage), not mere uploads. Transparent manifests show proportions.
How do you handle quality?
QC metrics & standardized flags are stored with each bundle, allowing weighting and exclusion. Low‑quality regions receive lower contribution weight or are excluded entirely.
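One simple way to turn stored QC metrics into contribution weights is a floor‑then‑scale rule: below a QC threshold a region is excluded, above it the weight scales with the score. The threshold and scaling are illustrative, not the actual policy.

```python
def effective_weight(base_weight: float, qc_score: float,
                     qc_floor: float = 0.5) -> float:
    """Down-weight or exclude low-quality data (policy is illustrative):
    below the floor a region contributes nothing; above it, the
    contribution weight scales with the QC score."""
    if qc_score < qc_floor:
        return 0.0
    return base_weight * qc_score
```

Under this rule a pristine plate keeps most of its weight, a marginal one is discounted, and a failing one earns nothing, which aligns payouts with data quality.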
How are datasets licensed?
All datasets contributed to the registry are licensed for reuse with attribution: we default to CC‑BY for individual dataset content and ODC‑BY for dataset/collection database rights. Licensing metadata is stored with each manifest so reuse terms and attribution obligations are transparent.
What about Privacy / PII?
Contributors must remove or redact personally identifiable information before submission. Manifests support redaction flags and feature‑only bundles (no raw images). We provide guidance and tooling to help anonymize data; sensitive fields can be omitted from public manifests or held under restricted access where required.