How CellDAO Works: A Walkthrough of the Stack
CellDAO is designed to standardize, align, and continuously learn from high-value cellular datasets. Below is an end-to-end view of how raw experimental output becomes part of an evolving Virtual Cell.
1. Data Generation & Local Packaging
Labs perform experiments (imaging, perturbation screens, single-cell omics). Alongside raw files, they export a preliminary metadata bundle: plate maps, channel descriptions, perturbation annotations, acquisition logs.
2. OMS Manifest Construction
A local tooling CLI ingests the preliminary bundle and produces an OMS manifest:
- Sample provenance (cell line, passage, culture conditions)
- Perturbations (compound structures, doses, genetic edits, exposure times)
- Acquisition parameters (instrument ID, objective, channels with their roles and stains)
- Processing placeholders (to be filled post-feature extraction)
- Licensing & attribution metadata
The CLI then validates the manifest, flagging missing or ambiguous fields for user correction.
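The validation step above can be sketched as a required-field check over the nested manifest. Everything here is illustrative: the field names and the `REQUIRED_FIELDS` set are assumptions, not the actual OMS schema.

```python
# Hypothetical required fields -- the real OMS schema will differ.
REQUIRED_FIELDS = {
    "sample.cell_line",
    "perturbation.compound",
    "perturbation.dose",
    "acquisition.instrument_id",
    "license",
}

def flatten(d, prefix=""):
    """Flatten nested manifest keys into dotted paths."""
    out = {}
    for key, value in d.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))
        else:
            out[path] = value
    return out

def validate(manifest):
    """Return dotted paths of required fields that are missing or empty."""
    flat = flatten(manifest)
    return sorted(field for field in REQUIRED_FIELDS if not flat.get(field))

manifest = {
    "sample": {"cell_line": "U2OS", "passage": 12},
    "perturbation": {"compound": "nocodazole", "dose": None},
    "acquisition": {"instrument_id": "scope-01"},
    "processing": [],  # placeholder, filled after feature extraction
}

print(validate(manifest))  # ['license', 'perturbation.dose']
```

Reporting dotted paths gives the user an unambiguous pointer to each field that needs correction.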
3. Feature Extraction & Processing Trace
Containerized pipelines perform segmentation and feature extraction (e.g., cell morphology vectors). Each step appends to the processing trace:
- Segmentation model + version
- Feature extractor + parameter digest
- Normalization and QC filters applied
Outputs: per-cell or per-well feature matrices, QC flags, lineage links back to raw assets.
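A processing trace of this kind can be modeled as an append-only log where each step records its tool, version, and a digest of its parameters. The tool names and digest scheme below are assumptions for illustration.

```python
import hashlib
import json

def param_digest(params):
    """Deterministic short digest of a step's parameters."""
    blob = json.dumps(params, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def append_step(trace, tool, version, params):
    """Append one processing step to the trace (append-only by convention)."""
    trace.append({
        "tool": tool,
        "version": version,
        "param_digest": param_digest(params),
    })

trace = []
append_step(trace, "segmentation-model", "2.2", {"diameter": 30, "model": "cyto"})
append_step(trace, "feature-extractor", "4.2.6", {"channels": ["DNA", "actin"]})
append_step(trace, "qc-filter", "1.0", {"min_cells_per_well": 50})

print([s["tool"] for s in trace])
```

Hashing the sorted JSON of each parameter set means two runs with identical settings produce identical digests, which is what makes the trace useful for reproducibility audits.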
4. Submission & Integrity Anchoring
The dataset (raw assets optional, processed features mandatory) and its manifest are then prepared for submission:
- Content hashes computed (raw subset, features, manifest).
- An integrity anchor (hash of hash set) optionally registered on-chain as a contribution record.
- Dataset registered in a discovery index with search facets.
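The "hash of hash set" anchor can be sketched as follows: hash each artifact, then hash the sorted list of those hashes into a single digest. The file names and payloads are hypothetical.

```python
import hashlib

def sha256_bytes(data):
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical artifacts; in practice these are file contents on disk.
artifacts = {
    "features.parquet": b"...feature matrix bytes...",
    "manifest.json": b"...manifest bytes...",
}

hashes = {name: sha256_bytes(blob) for name, blob in artifacts.items()}

# Sorting makes the anchor independent of enumeration order.
anchor = sha256_bytes("\n".join(sorted(hashes.values())).encode())
print(anchor)  # single digest suitable for on-chain registration
```

Anchoring one digest instead of every file hash keeps the on-chain footprint constant regardless of dataset size, while the off-chain hash set still lets anyone verify individual artifacts.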
5. Ingestion & Standardization Layer
On the network side:
- Manifests parsed; schema version compatibility checked.
- Controlled vocabulary normalization (channel roles, perturbation units).
- Batch/site effect variables extracted for downstream modeling.
- Derived dataset lineage incorporated into a global graph.
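Controlled vocabulary normalization amounts to mapping free-text synonyms onto canonical terms and converting units to a common base. The synonym and unit tables below are illustrative assumptions, not the network's actual vocabularies.

```python
# Hypothetical synonym table: free-text channel labels -> canonical roles.
CHANNEL_ROLE_SYNONYMS = {
    "dapi": "nucleus",
    "hoechst": "nucleus",
    "dna": "nucleus",
    "phalloidin": "actin",
    "f-actin": "actin",
}

# Hypothetical conversion factors into micromolar.
UNIT_TO_MICROMOLAR = {"um": 1.0, "nm": 1e-3, "mm": 1e3}

def normalize_channel(role):
    """Map a free-text channel label to its canonical role (or pass through)."""
    key = role.strip().lower()
    return CHANNEL_ROLE_SYNONYMS.get(key, key)

def dose_in_uM(value, unit):
    """Convert a dose to micromolar using the canonical unit table."""
    return value * UNIT_TO_MICROMOLAR[unit.strip().lower()]

print(normalize_channel("Hoechst"))  # nucleus
print(dose_in_uM(500, "nM"))         # 0.5
```

Passing unknown labels through unchanged (rather than raising) lets downstream QC surface vocabulary gaps as compliance warnings instead of hard failures.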
6. Virtual Cell Training Pipeline
A scheduled (or continuous) training job:
- Loads new OMS-compliant datasets.
- Updates multimodal encoders (imaging, expression) to align latent spaces.
- Trains perturbation-aware dynamics (dose, time, sequential interventions).
- Calibrates uncertainty via likelihood heads + OOD detectors.
- Evaluates on benchmark suite (held-out cell types, perturbations, timepoints).
Model checkpoints record dataset contribution proportions (for attribution or incentives).
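Recording contribution proportions can be as simple as counting how many training examples each dataset supplied and normalizing at checkpoint time. The dataset IDs and counts below are hypothetical.

```python
from collections import Counter

def contribution_proportions(example_counts):
    """Normalize per-dataset example counts into proportions summing to 1."""
    total = sum(example_counts.values())
    return {ds: n / total for ds, n in example_counts.items()}

# Hypothetical tally of examples seen during this training run.
seen = Counter({"lab-a/screen-01": 60_000, "lab-b/imaging-07": 40_000})

checkpoint_meta = {
    "step": 12_000,
    "contributions": contribution_proportions(seen),
}
print(checkpoint_meta["contributions"])
```

Storing the proportions with the checkpoint (rather than recomputing later) keeps attribution auditable even if sampling weights change between runs.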
7. Inference & Simulation Services
APIs expose:
- State embedding of new observations.
- Predictive simulation: given a baseline state + perturbation plan (compound X @ dose D over T hours), return predicted morphology / expression trajectories.
- Counterfactual queries: difference between interventions A vs. B.
- Uncertainty + OOD indicators.
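A counterfactual query reduces to running the same baseline through two perturbation plans and differencing the trajectories. The client interface and the toy dose-proportional dynamics below are hypothetical stand-ins, not the actual service API.

```python
def counterfactual(simulate, baseline, plan_a, plan_b):
    """Per-timepoint difference between intervention A and B trajectories."""
    traj_a = simulate({"baseline": baseline, "plan": plan_a})
    traj_b = simulate({"baseline": baseline, "plan": plan_b})
    return [a - b for a, b in zip(traj_a, traj_b)]

# Toy stand-in for the model server: response scales linearly with dose.
def toy_simulate(request):
    dose = request["plan"]["dose_uM"]
    return [dose * t for t in (0, 6, 12, 24)]  # readout at hours 0..24

delta = counterfactual(
    toy_simulate,
    {"cell_line": "U2OS"},
    {"compound": "X", "dose_uM": 1.0},
    {"compound": "X", "dose_uM": 0.5},
)
print(delta)  # [0.0, 3.0, 6.0, 12.0]
```

Sharing the same baseline across both calls is what makes the difference an intervention effect rather than a comparison of two unrelated states.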
8. Active Learning Loop
The system ranks candidate experiments by expected information gain:
- High uncertainty regions in latent space.
- Poorly disentangled batch vs. biological signals.
- Underrepresented perturbation-time combinations.
Suggested experiments feed back to labs, closing the loop.
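The ranking above can be sketched as a simple acquisition score combining the three signals; the weights and candidate fields are illustrative assumptions, not the production scoring function.

```python
def acquisition_score(candidate):
    """Toy acquisition score: higher predictive uncertainty, worse
    batch/biology disentanglement, and rarer perturbation-time coverage
    all raise the priority of an experiment."""
    return (
        candidate["uncertainty"]
        + 0.5 * candidate["batch_confusion"]
        + 1.0 / (1 + candidate["n_similar_runs"])
    )

candidates = [
    {"id": "cmpd-X@1uM/24h", "uncertainty": 0.9, "batch_confusion": 0.2, "n_similar_runs": 0},
    {"id": "cmpd-Y@10uM/6h", "uncertainty": 0.3, "batch_confusion": 0.1, "n_similar_runs": 12},
]

ranked = sorted(candidates, key=acquisition_score, reverse=True)
print([c["id"] for c in ranked])  # most informative experiment first
```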
9. Incentive & Governance Layer (Optional / Progressive)
If a DAO / token layer is enabled:
- Usage-based reward distribution to contributing datasets.
- Reputation accrues via validation accuracy, reproducibility audits.
- Community voting on OMS schema evolution and model benchmark updates.
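Usage-based reward distribution can be sketched as a pro-rata split of a reward pool by how often each dataset was used in billed inference. The pool size and usage counts are hypothetical.

```python
def distribute(pool, usage):
    """Split a reward pool pro-rata by per-dataset usage counts."""
    total = sum(usage.values())
    return {ds: pool * n / total for ds, n in usage.items()}

rewards = distribute(1000.0, {"lab-a/screen-01": 75, "lab-b/imaging-07": 25})
print(rewards)  # {'lab-a/screen-01': 750.0, 'lab-b/imaging-07': 250.0}
```

A real scheme would likely weight usage by the contribution proportions recorded in checkpoints, but the pro-rata core stays the same.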
10. Observability & Audit
Dashboards track:
- Dataset ingestion velocity & compliance scores.
- Model performance drift over time.
- Uncertainty calibration metrics.
- Lineage queries (which datasets influenced prediction X?).
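A lineage query like "which datasets influenced prediction X?" is a reachability walk over the derivation graph from Step 5, following parent edges from a prediction back through its checkpoint to source datasets. The graph contents here are hypothetical.

```python
from collections import deque

# Hypothetical derivation graph: node -> the nodes it was derived from.
PARENTS = {
    "prediction/42": ["checkpoint/v7"],
    "checkpoint/v7": ["dataset/lab-a", "dataset/lab-b"],
    "dataset/lab-b": ["dataset/lab-b-raw"],
}

def ancestors(node):
    """Breadth-first walk of parent edges; returns all upstream nodes."""
    seen, queue = set(), deque([node])
    while queue:
        for parent in PARENTS.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(ancestors("prediction/42")))
```

The `seen` set makes the walk safe even if the graph grows diamond-shaped dependencies (two checkpoints sharing a dataset).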
Summary Flow Diagram (Conceptual)
Raw Data -> Local OMS Packaging -> Feature Extraction + Trace -> Submission & Integrity Anchor -> Ingestion Standardization -> Virtual Cell Training -> Inference & Simulation -> Active Learning Suggestions -> (Optional) Incentive Distribution & Governance
Conclusion
CellDAO operationalizes a virtuous cycle: standardized data fuels better models; better models propose informative experiments; new data refines the system. The stack is intentionally modular so labs can adopt components incrementally while moving toward a fully integrated Virtual Cell ecosystem.