How CellDAO Works: A Walkthrough of the Stack
CellDAO is designed to standardize, align, and continuously learn from high-value cellular datasets. Below is an end-to-end view of how raw experimental output becomes part of an evolving Virtual Cell.
1. Data Generation & Local Packaging
Labs perform experiments (imaging, perturbation screens, single-cell omics). Alongside raw files, they export a preliminary metadata bundle: plate maps, channel descriptions, perturbation annotations, acquisition logs.
2. OMS Manifest Construction
A local tooling CLI ingests the preliminary bundle and produces an OMS manifest:
- Sample provenance (cell line, passage, culture conditions)
- Perturbations (compound structures, doses, genetic edits, exposure times)
- Acquisition parameters (instrument ID, objective, channels with their roles and stains)
- Processing placeholders (to be filled post-feature extraction)
- Licensing & attribution metadata
The CLI then validates the manifest, flagging missing or ambiguous fields for user correction.
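The validation step above can be sketched as a required-field check over the nested manifest. Everything here is illustrative: the field names and the `REQUIRED_FIELDS` set are assumptions, not the actual OMS schema.

```python
# Hypothetical required fields -- the real OMS schema will differ.
REQUIRED_FIELDS = {
    "sample.cell_line",
    "perturbation.compound",
    "perturbation.dose",
    "acquisition.instrument_id",
    "license",
}

def flatten(d, prefix=""):
    """Flatten nested manifest keys into dotted paths."""
    out = {}
    for key, value in d.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))
        else:
            out[path] = value
    return out

def validate(manifest):
    """Return dotted paths of required fields that are missing or empty."""
    flat = flatten(manifest)
    return sorted(field for field in REQUIRED_FIELDS if not flat.get(field))

manifest = {
    "sample": {"cell_line": "U2OS", "passage": 12},
    "perturbation": {"compound": "nocodazole", "dose": None},
    "acquisition": {"instrument_id": "scope-01"},
    "processing": [],  # placeholder, filled after feature extraction
}

print(validate(manifest))  # ['license', 'perturbation.dose']
```

Reporting dotted paths gives the user an unambiguous pointer to each field that needs correction.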
3. Feature Extraction & Processing Trace
Containerized pipelines perform segmentation and feature extraction (e.g., cell morphology vectors). Each step appends to the processing trace:
- Segmentation model + version
- Feature extractor + parameter digest
- Normalization and QC filters applied
Outputs: per-cell or per-well feature matrices, QC flags, lineage links back to raw assets.
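A processing trace of this kind can be modeled as an append-only log where each step records its tool, version, and a digest of its parameters. The tool names and digest scheme below are assumptions for illustration.

```python
import hashlib
import json

def param_digest(params):
    """Deterministic short digest of a step's parameters."""
    blob = json.dumps(params, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def append_step(trace, tool, version, params):
    """Append one processing step to the trace (append-only by convention)."""
    trace.append({
        "tool": tool,
        "version": version,
        "param_digest": param_digest(params),
    })

trace = []
append_step(trace, "segmentation-model", "2.2", {"diameter": 30, "model": "cyto"})
append_step(trace, "feature-extractor", "4.2.6", {"channels": ["DNA", "actin"]})
append_step(trace, "qc-filter", "1.0", {"min_cells_per_well": 50})

print([s["tool"] for s in trace])
```

Hashing the sorted JSON of each parameter set means two runs with identical settings produce identical digests, which is what makes the trace useful for reproducibility audits.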
4. Submission & Integrity Anchoring
The dataset (raw assets optional, processed features mandatory) and its manifest are then prepared for submission:
- Content hashes computed (raw subset, features, manifest).
- An integrity anchor (hash of hash set) optionally registered on-chain as a contribution record.
- Dataset registered in a discovery index with search facets.
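The "hash of hash set" anchor can be sketched as follows: hash each artifact, then hash the sorted list of those hashes into a single digest. The file names and payloads are hypothetical.

```python
import hashlib

def sha256_bytes(data):
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical artifacts; in practice these are file contents on disk.
artifacts = {
    "features.parquet": b"...feature matrix bytes...",
    "manifest.json": b"...manifest bytes...",
}

hashes = {name: sha256_bytes(blob) for name, blob in artifacts.items()}

# Sorting makes the anchor independent of enumeration order.
anchor = sha256_bytes("\n".join(sorted(hashes.values())).encode())
print(anchor)  # single digest suitable for on-chain registration
```

Anchoring one digest instead of every file hash keeps the on-chain footprint constant regardless of dataset size, while the off-chain hash set still lets anyone verify individual artifacts.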
5. Ingestion & Standardization Layer
On the network side:
- Manifests parsed; schema version compatibility checked.
- Controlled vocabulary normalization (channel roles, perturbation units).
- Batch/site effect variables extracted for downstream modeling.
- Derived dataset lineage incorporated into a global graph.
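Controlled vocabulary normalization amounts to mapping free-text synonyms onto canonical terms and converting units to a common base. The synonym and unit tables below are illustrative assumptions, not the network's actual vocabularies.

```python
# Hypothetical synonym table: free-text channel labels -> canonical roles.
CHANNEL_ROLE_SYNONYMS = {
    "dapi": "nucleus",
    "hoechst": "nucleus",
    "dna": "nucleus",
    "phalloidin": "actin",
    "f-actin": "actin",
}

# Hypothetical conversion factors into micromolar.
UNIT_TO_MICROMOLAR = {"um": 1.0, "nm": 1e-3, "mm": 1e3}

def normalize_channel(role):
    """Map a free-text channel label to its canonical role (or pass through)."""
    key = role.strip().lower()
    return CHANNEL_ROLE_SYNONYMS.get(key, key)

def dose_in_uM(value, unit):
    """Convert a dose to micromolar using the canonical unit table."""
    return value * UNIT_TO_MICROMOLAR[unit.strip().lower()]

print(normalize_channel("Hoechst"))  # nucleus
print(dose_in_uM(500, "nM"))         # 0.5
```

Passing unknown labels through unchanged (rather than raising) lets downstream QC surface vocabulary gaps as compliance warnings instead of hard failures.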
6. Virtual Cell Training Pipeline
A scheduled (or continuous) training job:
- Loads new OMS-compliant datasets.
- Updates multimodal encoders (imaging, expression) to align latent spaces.
- Trains perturbation-aware dynamics (dose, time, sequential interventions).
- Calibrates uncertainty via likelihood heads + OOD detectors.
- Evaluates on benchmark suite (held-out cell types, perturbations, timepoints).
Model checkpoints record dataset contribution proportions (for attribution or incentives).
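Recording contribution proportions can be as simple as counting how many training examples each dataset supplied and normalizing at checkpoint time. The dataset IDs and counts below are hypothetical.

```python
from collections import Counter

def contribution_proportions(example_counts):
    """Normalize per-dataset example counts into proportions summing to 1."""
    total = sum(example_counts.values())
    return {ds: n / total for ds, n in example_counts.items()}

# Hypothetical tally of examples seen during this training run.
seen = Counter({"lab-a/screen-01": 60_000, "lab-b/imaging-07": 40_000})

checkpoint_meta = {
    "step": 12_000,
    "contributions": contribution_proportions(seen),
}
print(checkpoint_meta["contributions"])
```

Storing the proportions with the checkpoint (rather than recomputing later) keeps attribution auditable even if sampling weights change between runs.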
7. Inference & Simulation Services
APIs expose:
- State embedding of new observations.
- Predictive simulation: given a baseline state + perturbation plan (compound X @ dose D over T hours), return predicted morphology / expression trajectories.
- Counterfactual queries: difference between interventions A vs. B.
- Uncertainty + OOD indicators.
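A counterfactual query reduces to running the same baseline through two perturbation plans and differencing the trajectories. The client interface and the toy dose-proportional dynamics below are hypothetical stand-ins, not the actual service API.

```python
def counterfactual(simulate, baseline, plan_a, plan_b):
    """Per-timepoint difference between intervention A and B trajectories."""
    traj_a = simulate({"baseline": baseline, "plan": plan_a})
    traj_b = simulate({"baseline": baseline, "plan": plan_b})
    return [a - b for a, b in zip(traj_a, traj_b)]

# Toy stand-in for the model server: response scales linearly with dose.
def toy_simulate(request):
    dose = request["plan"]["dose_uM"]
    return [dose * t for t in (0, 6, 12, 24)]  # readout at hours 0..24

delta = counterfactual(
    toy_simulate,
    {"cell_line": "U2OS"},
    {"compound": "X", "dose_uM": 1.0},
    {"compound": "X", "dose_uM": 0.5},
)
print(delta)  # [0.0, 3.0, 6.0, 12.0]
```

Sharing the same baseline across both calls is what makes the difference an intervention effect rather than a comparison of two unrelated states.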
8. Active Learning Loop
The system ranks candidate experiments by expected information gain:
- High uncertainty regions in latent space.
- Poorly disentangled batch vs. biological signals.
- Underrepresented perturbation-time combinations.
Suggested experiments feed back to labs, closing the loop.
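The ranking above can be sketched as a simple acquisition score combining the three signals; the weights and candidate fields are illustrative assumptions, not the production scoring function.

```python
def acquisition_score(candidate):
    """Toy acquisition score: higher predictive uncertainty, worse
    batch/biology disentanglement, and rarer perturbation-time coverage
    all raise the priority of an experiment."""
    return (
        candidate["uncertainty"]
        + 0.5 * candidate["batch_confusion"]
        + 1.0 / (1 + candidate["n_similar_runs"])
    )

candidates = [
    {"id": "cmpd-X@1uM/24h", "uncertainty": 0.9, "batch_confusion": 0.2, "n_similar_runs": 0},
    {"id": "cmpd-Y@10uM/6h", "uncertainty": 0.3, "batch_confusion": 0.1, "n_similar_runs": 12},
]

ranked = sorted(candidates, key=acquisition_score, reverse=True)
print([c["id"] for c in ranked])  # most informative experiment first
```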
9. Incentive & Governance Layer (Optional / Progressive)
If a DAO / token layer is enabled:
- Usage-based reward distribution to contributing datasets.
- Reputation accrues via validation accuracy, reproducibility audits.
- Community voting on OMS schema evolution and model benchmark updates.
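Usage-based reward distribution can be sketched as a pro-rata split of a reward pool by how often each dataset was used in billed inference. The pool size and usage counts are hypothetical.

```python
def distribute(pool, usage):
    """Split a reward pool pro-rata by per-dataset usage counts."""
    total = sum(usage.values())
    return {ds: pool * n / total for ds, n in usage.items()}

rewards = distribute(1000.0, {"lab-a/screen-01": 75, "lab-b/imaging-07": 25})
print(rewards)  # {'lab-a/screen-01': 750.0, 'lab-b/imaging-07': 250.0}
```

A real scheme would likely weight usage by the contribution proportions recorded in checkpoints, but the pro-rata core stays the same.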
10. Observability & Audit
Dashboards track:
- Dataset ingestion velocity & compliance scores.
- Model performance drift over time.
- Uncertainty calibration metrics.
- Lineage queries (which datasets influenced prediction X?).
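A lineage query like "which datasets influenced prediction X?" is a reachability walk over the derivation graph from Step 5, following parent edges from a prediction back through its checkpoint to source datasets. The graph contents here are hypothetical.

```python
from collections import deque

# Hypothetical derivation graph: node -> the nodes it was derived from.
PARENTS = {
    "prediction/42": ["checkpoint/v7"],
    "checkpoint/v7": ["dataset/lab-a", "dataset/lab-b"],
    "dataset/lab-b": ["dataset/lab-b-raw"],
}

def ancestors(node):
    """Breadth-first walk of parent edges; returns all upstream nodes."""
    seen, queue = set(), deque([node])
    while queue:
        for parent in PARENTS.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(ancestors("prediction/42")))
```

The `seen` set makes the walk safe even if the graph grows diamond-shaped dependencies (two checkpoints sharing a dataset).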
Summary Flow Diagram (Conceptual)
Raw Data -> Local OMS Packaging -> Feature Extraction + Trace -> Submission & Integrity Anchor -> Ingestion Standardization -> Virtual Cell Training -> Inference & Simulation -> Active Learning Suggestions -> (Optional) Incentive Distribution & Governance
Conclusion
CellDAO operationalizes a virtuous cycle: standardized data fuels better models; better models propose informative experiments; new data refines the system. The stack is intentionally modular so labs can adopt components incrementally while moving toward a fully integrated Virtual Cell ecosystem.