Inspiration

Modern “AI science” often stops at persuasive narrative. We wanted a system that treats science as an auditable workflow: a claim becomes falsifiable tests, runs produce reproducible artifacts, and uncertainty is handled explicitly (not papered over by confident prose). Our guiding principle was: evidence first, traceability always.

What it does

Amawta turns a user hypothesis into an end-to-end scientific workflow:

  • Normalizes the claim into a minimal schema (domain, entities, relation, observables).
  • Generates a bounded falsification plan (tests + small variant matrix).
  • Performs grounded literature search to avoid rehashing existing work.
  • Executes a two-phase runner:
    • Toy run (always): quick falsifiers / sanity checks.
    • Field run (mandatory when toy isn’t falsified): resolve/download datasets and run with real evidence.
  • Emits deterministic gate reports: PASS / FAIL / UNRESOLVED.
  • If UNRESOLVED, it self-recovers (retry datasets, rerun field) within a bounded budget, and remains resumable.

All stages produce versioned JSON artifacts; the system does not depend on chat history.
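
The toy → field control flow described above can be sketched as follows. This is a minimal illustration, not Amawta's actual code: the type names, the `PhaseResult` shape, and the `runTwoPhase` function are all hypothetical, and the example claim is made up. Only the four schema fields and the three statuses come from the description above.

```typescript
// Illustrative sketch only: every name here is hypothetical.
type GateStatus = "PASS" | "FAIL" | "UNRESOLVED";

// Minimal claim schema using the four fields named above.
interface NormalizedClaim {
  domain: string;
  entities: string[];
  relation: string;
  observables: string[];
}

// Outcome of one runner phase (assumed shape).
interface PhaseResult {
  falsified: boolean;          // a test contradicted the claim
  evidenceSufficient: boolean; // enough data was available to judge
}

type Phase = (claim: NormalizedClaim) => PhaseResult;

// The toy run always executes; the field run is attempted whenever
// the toy run does not falsify the claim.
function runTwoPhase(claim: NormalizedClaim, toy: Phase, field: Phase): GateStatus {
  const toyResult = toy(claim);
  if (toyResult.falsified) return "FAIL";

  const fieldResult = field(claim);
  // Missing evidence is reported as UNRESOLVED, never as a false FAIL.
  if (!fieldResult.evidenceSufficient) return "UNRESOLVED";
  return fieldResult.falsified ? "FAIL" : "PASS";
}

// Made-up example claim, for illustration only.
const claim: NormalizedClaim = {
  domain: "ecology",
  entities: ["species richness", "island area"],
  relation: "increases with",
  observables: ["species_count", "area_km2"],
};

const status = runTwoPhase(
  claim,
  () => ({ falsified: false, evidenceSufficient: true }),
  () => ({ falsified: false, evidenceSufficient: false }), // e.g. a dataset download failed
);
console.log(status); // UNRESOLVED
```

Passing the phases in as functions keeps the decision logic free of I/O, which makes it easy to exercise in tests with stubbed runners.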

How we built it

  • Google ADK-powered multi-agent orchestration integrated directly in our TypeScript CLI runtime.
  • Gemini 3 enforced everywhere (hard-blocks non–Gemini 3 models).
  • A science workflow agent that runs:
    • dialectic + Bacon-style analysis → normalization → literature → falsification plan → runner → gates → repair loop.
  • A deterministic artifact layer:
    • hypothesis_normalization.json, literature_search.json, falsification_plan.json, runner code + logs, dataset manifests, results, and gate_<gate_id>.json.
  • Deterministic gates (pure functions over artifacts) and an autopoietic repair loop driven by recommended_actions.
  • A “truth anchor” inspired by Ledger Closure: operational existence requires closing an explicit ledger inside a feasible set \(\mathcal{F}\): $$O \in \mathcal{F} \ \wedge \ L\ \text{closes} \ \Rightarrow \ \text{EXISTS}$$ If either condition fails, Amawta reports METHOD-NOTE/UNRESOLVED and seeks the missing evidence instead of inventing it.
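
To illustrate what "pure functions over artifacts" might look like, here is a hedged sketch of a single gate. The `RunArtifacts` fields, the `"evidence"` gate id, and the action strings `retry_datasets` and `rerun_field` are assumptions made for the example. Only `recommended_actions` and the PASS / FAIL / UNRESOLVED statuses are taken from the description above.

```typescript
// Hypothetical artifact and report shapes, for illustration only.
type GateStatus = "PASS" | "FAIL" | "UNRESOLVED";

interface RunArtifacts {
  datasetsRequired: number; // datasets the plan asked for
  datasetsResolved: number; // datasets actually downloaded
  testsRun: number;
  testsFalsified: number;
}

interface GateReport {
  gate_id: string;
  status: GateStatus;
  recommended_actions: string[];
}

// Pure function: no I/O, clock, or randomness, so equal artifacts
// always produce equal reports.
function evaluateEvidenceGate(a: RunArtifacts): GateReport {
  const gate_id = "evidence"; // assumed identifier
  if (a.datasetsResolved < a.datasetsRequired) {
    return { gate_id, status: "UNRESOLVED", recommended_actions: ["retry_datasets", "rerun_field"] };
  }
  if (a.testsRun === 0) {
    return { gate_id, status: "UNRESOLVED", recommended_actions: ["rerun_field"] };
  }
  const status: GateStatus = a.testsFalsified > 0 ? "FAIL" : "PASS";
  return { gate_id, status, recommended_actions: [] };
}

const report = evaluateEvidenceGate({
  datasetsRequired: 2,
  datasetsResolved: 1,
  testsRun: 0,
  testsFalsified: 0,
});
console.log(report.status, report.recommended_actions.join(","));
// UNRESOLVED retry_datasets,rerun_field
```

Because the gate reads only its input, a stored artifact can be re-evaluated later and should yield the same report, which is what makes the result auditable.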

Challenges we ran into

  • Latency + routing: keeping greetings/simple Q&A fast while reliably triggering the full workflow for hypotheses.
  • Groundedness: ensuring URLs/DOIs and “what ran” claims are always backed by artifacts, never hallucinated.
  • Resumability: supporting partial failures (downloads, missing evidence) without losing state or corrupting runs.
  • Determinism: stabilizing JSON-only steps (canonical JSON, code-fence tolerance) and making gates artifact-driven.
  • E2E reliability: building smoke tests that cover interleaving quick chat + workflow + retry + resume.
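
The determinism point can be made concrete with a small sketch of code-fence tolerance plus canonical JSON. The helper names `stripCodeFence`, `parseModelJson`, and `canonicalJson` are hypothetical, and the canonical form here is a simplified one (recursively sorted keys, no whitespace), not necessarily the form Amawta uses.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Three backticks, built at runtime so this listing contains no literal fence.
const FENCE = "`".repeat(3);

// Remove a surrounding Markdown code fence if the model wrapped its JSON in one.
function stripCodeFence(text: string): string {
  const trimmed = text.trim();
  const fenced =
    trimmed.length >= 2 * FENCE.length &&
    trimmed.startsWith(FENCE) &&
    trimmed.endsWith(FENCE);
  if (!fenced) return trimmed;
  const firstNewline = trimmed.indexOf("\n");
  if (firstNewline === -1) return trimmed;
  // Drop the opening fence line (with any language tag) and the closing fence.
  return trimmed.slice(firstNewline + 1, trimmed.length - FENCE.length).trim();
}

function parseModelJson(text: string): Json {
  return JSON.parse(stripCodeFence(text)) as Json;
}

// Serialize with recursively sorted object keys and no whitespace,
// so semantically equal values produce byte-identical output.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => canonicalJson(v)).join(",") + "]";
  const obj = value;
  const keys = Object.keys(obj).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k])).join(",") + "}";
}

const raw = FENCE + 'json\n{"status": "PASS", "gate_id": "g1"}\n' + FENCE;
console.log(canonicalJson(parseModelJson(raw))); // {"gate_id":"g1","status":"PASS"}
```

Canonical output means two runs that produce the same data also produce the same bytes, so artifacts can be diffed or hashed directly.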

Accomplishments that we're proud of

  • A working scientific autopoietic loop with explicit state + artifacts + deterministic gates + bounded self-repair.
  • Two-phase execution (toy → field) with mandatory field attempt when toy isn’t falsified.
  • Actionable UNRESOLVED: when evidence is insufficient, the system retries acquisition/execution instead of producing false FAILs.
  • Auditability by construction: every meaningful claim about execution, datasets, or results maps to a stored artifact.
  • No regex heuristics for workflow decisions: LLM orchestration + deterministic evaluators only.
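
As a rough sketch of bounded self-repair, the loop below re-evaluates a gate until it resolves or a retry budget runs out. The `repairLoop` function, its parameters, and the `retry_datasets` action string are hypothetical names chosen for the example. The real loop presumably also persists state between attempts, which is omitted here.

```typescript
type GateStatus = "PASS" | "FAIL" | "UNRESOLVED";

interface GateReport {
  status: GateStatus;
  recommended_actions: string[];
}

interface RepairOutcome {
  final: GateReport;
  attempts: number;
}

// Apply the gate's recommended actions and re-evaluate, stopping as soon as
// the gate resolves (PASS or FAIL) or the attempt budget is spent.
function repairLoop(
  evaluate: () => GateReport,
  applyAction: (action: string) => void,
  maxAttempts: number,
): RepairOutcome {
  let report = evaluate();
  let attempts = 0;
  while (report.status === "UNRESOLVED" && attempts < maxAttempts) {
    for (const action of report.recommended_actions) applyAction(action);
    attempts += 1;
    report = evaluate();
  }
  return { final: report, attempts };
}

// Simulated environment: the evidence becomes available after two download retries.
let downloads = 0;
const outcome = repairLoop(
  (): GateReport =>
    downloads >= 2
      ? { status: "PASS", recommended_actions: [] }
      : { status: "UNRESOLVED", recommended_actions: ["retry_datasets"] },
  (action) => {
    if (action === "retry_datasets") downloads += 1;
  },
  3,
);
console.log(outcome.final.status, outcome.attempts); // PASS 2
```

If the budget is exhausted, the loop returns the last UNRESOLVED report unchanged rather than forcing a verdict.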

What we learned

  • “Autonomy” in science is less about longer reasoning and more about contracts: explicit schemas, artifact persistence, deterministic evaluators, and bounded retries.
  • The most important safety feature isn’t a refusal; it’s the discipline to say UNRESOLVED and request/obtain evidence.
  • A fast-path is essential: scientific rigor shouldn’t make basic interactions unusable.

What's next for Amawta

  • Add a regression “metabolism” with ADK Evaluate (hallucination checks + rubrics for expert vs guided behavior).
  • Expand field execution across more data modalities while keeping one auditable runner/gate contract.
  • Tighten Ledger Closure toward full “physics-as-accounting” semantics (more explicit ledger lines and invariants), then layer additional epistemic gates on top without sacrificing determinism or traceability.