Inspiration
Modern “AI science” often stops at persuasive narrative. We wanted a system that treats science as an auditable workflow: a claim becomes falsifiable tests, runs produce reproducible artifacts, and uncertainty is handled explicitly (not papered over by confident prose). Our guiding principle was: evidence first, traceability always.
What it does
Amawta turns a user hypothesis into an end-to-end scientific workflow:
- Normalizes the claim into a minimal schema (domain, entities, relation, observables).
- Generates a bounded falsification plan (tests + small variant matrix).
- Performs grounded literature search to avoid rehashing existing work.
- Executes a two-phase runner:
  - Toy run (always): quick falsifiers / sanity checks.
  - Field run (mandatory when toy isn’t falsified): resolve/download datasets and run with real evidence.
- Emits deterministic gate reports: PASS / FAIL / UNRESOLVED.
- If UNRESOLVED, it self-recovers (retry datasets, rerun field) within a bounded budget, and remains resumable.
All stages produce versioned JSON artifacts; the system does not depend on chat history.
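The control flow described above can be sketched roughly as follows. This is a minimal illustration, not Amawta's actual code: the type and function names (`NormalizedClaim`, `PhaseResult`, `runWorkflow`, `maxFieldAttempts`) are hypothetical, the phases are plain callbacks standing in for the real runner, and mapping a falsified toy run to FAIL is an assumption.

```typescript
// Hypothetical names throughout; none of these are taken from Amawta's codebase.
type Verdict = "PASS" | "FAIL" | "UNRESOLVED";

// Minimal claim schema with the four fields the list above names.
interface NormalizedClaim {
  domain: string;
  entities: string[];
  relation: string;
  observables: string[];
}

interface PhaseResult {
  falsified: boolean;        // at least one falsifier fired
  evidenceComplete: boolean; // every required dataset was resolved
}

type Phase = (claim: NormalizedClaim, attempt: number) => PhaseResult;

function runWorkflow(
  claim: NormalizedClaim,
  toy: Phase,
  field: Phase,
  maxFieldAttempts: number,
): { verdict: Verdict; fieldAttempts: number } {
  // The toy run always executes first. Assumption: a falsified toy run ends in FAIL.
  if (toy(claim, 0).falsified) {
    return { verdict: "FAIL", fieldAttempts: 0 };
  }
  // The field run is mandatory once the toy run has not falsified the claim.
  for (let attempt = 1; attempt <= maxFieldAttempts; attempt++) {
    const result = field(claim, attempt);
    if (!result.evidenceComplete) continue; // still unresolved: retry within the budget
    return { verdict: result.falsified ? "FAIL" : "PASS", fieldAttempts: attempt };
  }
  // Budget exhausted without complete evidence: report UNRESOLVED, never a guessed FAIL.
  return { verdict: "UNRESOLVED", fieldAttempts: maxFieldAttempts };
}

// Illustrative stubs: the field phase only resolves its data on the second attempt.
const exampleClaim: NormalizedClaim = {
  domain: "example-domain",
  entities: ["A", "B"],
  relation: "A increases B",
  observables: ["B measured against A"],
};
const outcome = runWorkflow(
  exampleClaim,
  () => ({ falsified: false, evidenceComplete: true }),
  (_claim, attempt) => ({ falsified: false, evidenceComplete: attempt >= 2 }),
  3,
);
// outcome is { verdict: "PASS", fieldAttempts: 2 }
```

The point of the sketch is that the retry budget is an explicit parameter, so the recovery loop terminates and can be resumed from the last recorded attempt.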
How we built it
- Google ADK-powered multi-agent orchestration integrated directly in our TypeScript CLI runtime.
- Gemini 3 enforced everywhere (hard-blocks non–Gemini 3 models).
- A science workflow agent that runs:
  - dialectic + Bacon-style analysis → normalization → literature → falsification plan → runner → gates → repair loop.
- A deterministic artifact layer: hypothesis_normalization.json, literature_search.json, falsification_plan.json, runner code + logs, dataset manifests, results, and gate_<gate_id>.json.
- Deterministic gates (pure functions over artifacts) and an autopoietic repair loop driven by recommended_actions.
- A “truth anchor” inspired by Ledger Closure: operational existence requires closing an explicit ledger inside a feasible set \(\mathcal{F}\):
$$O \in \mathcal{F} \ \wedge \ L\ \text{closes} \ \Rightarrow \ \text{EXISTS}$$
Otherwise, Amawta reports METHOD-NOTE/UNRESOLVED and seeks missing evidence instead of inventing it.
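A deterministic gate of the kind listed above might look like the following sketch. Only `gate_id` and `recommended_actions` echo names from this write-up; the artifact shape (`RunArtifacts`), the gate name `evidence`, and the action strings are hypothetical placeholders chosen for illustration.

```typescript
// Hypothetical artifact and report shapes; only gate_id and recommended_actions
// mirror names used in the write-up.
type Verdict = "PASS" | "FAIL" | "UNRESOLVED";

interface RunArtifacts {
  datasetsRequired: string[];
  datasetsResolved: string[];
  testsRun: { id: string; falsified: boolean }[];
}

interface GateReport {
  gate_id: string;
  verdict: Verdict;
  recommended_actions: string[];
}

// Pure function: the report depends only on the artifacts passed in
// (no clock, no network, no chat history), so the same input yields the same report.
function evidenceGate(artifacts: RunArtifacts): GateReport {
  const missing = artifacts.datasetsRequired
    .filter((d) => artifacts.datasetsResolved.indexOf(d) === -1)
    .sort(); // sorted so the action order is stable across runs

  if (missing.length > 0 || artifacts.testsRun.length === 0) {
    // Insufficient evidence is UNRESOLVED, with actions the repair loop can execute.
    const actions = missing.map((d) => `retry_dataset:${d}`);
    if (artifacts.testsRun.length === 0) actions.push("rerun_field");
    return { gate_id: "evidence", verdict: "UNRESOLVED", recommended_actions: actions };
  }

  const falsified = artifacts.testsRun.some((t) => t.falsified);
  return {
    gate_id: "evidence",
    verdict: falsified ? "FAIL" : "PASS",
    recommended_actions: [],
  };
}

// Example: one required dataset is still missing, so the gate asks for a retry.
const report = evidenceGate({
  datasetsRequired: ["a", "b"],
  datasetsResolved: ["a"],
  testsRun: [{ id: "t1", falsified: false }],
});
// report.verdict is "UNRESOLVED"; report.recommended_actions is ["retry_dataset:b"]
```

Because the gate never performs the repair itself, the loop that consumes `recommended_actions` stays separate from the evaluation and can be bounded and resumed independently.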
Challenges we ran into
- Latency + routing: keeping greetings/simple Q&A fast while reliably triggering the full workflow for hypotheses.
- Groundedness: ensuring URLs/DOIs and “what ran” claims are always backed by artifacts, never hallucinated.
- Resumability: supporting partial failures (downloads, missing evidence) without losing state or corrupting runs.
- Determinism: stabilizing JSON-only steps (canonical JSON, code-fence tolerance) and making gates artifact-driven.
- E2E reliability: building smoke tests that cover interleaved quick chat, workflow, retry, and resume.
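As an illustration of the determinism item above, here is one way canonical JSON and code-fence tolerance could be handled. It is a sketch under assumptions: `canonicalJson` and `parseModelJson` are hypothetical helper names, and the key-sorting approach is a simple stand-in rather than a full canonicalization standard such as RFC 8785.

```typescript
// Hypothetical helpers; not taken from Amawta's codebase.

// Rebuild the value with object keys inserted in sorted order, so two
// structurally equal values always serialize to the same string.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => canonicalize(item));
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(obj).sort()) {
      out[key] = canonicalize(obj[key]);
    }
    return out;
  }
  return value; // strings, numbers, booleans, null
}

function canonicalJson(value: unknown): string {
  return JSON.stringify(canonicalize(value));
}

// Accept model output that is either raw JSON or JSON wrapped in a Markdown
// code fence. \u0060 is the backtick character, written as an escape so this
// example does not contain a literal fence.
function parseModelJson(text: string): unknown {
  const fenced = text.match(/\u0060{3}(?:json)?\s*([\s\S]*?)\u0060{3}/i);
  const payload = fenced?.[1] ?? text;
  return JSON.parse(payload.trim()); // throws SyntaxError on invalid JSON
}

// A fenced reply and a raw reply with different key order give the same artifact text.
const fencedReply = "\u0060\u0060\u0060json\n{\"b\": 1, \"a\": 2}\n\u0060\u0060\u0060";
const rawReply = "{\"a\": 2, \"b\": 1}";
const sameArtifact =
  canonicalJson(parseModelJson(fencedReply)) === canonicalJson(parseModelJson(rawReply));
// sameArtifact is true; both serialize to {"a":2,"b":1}
```

Normalizing at the boundary like this keeps formatting noise from the model out of stored artifacts, so artifact-driven gates compare content rather than layout.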
Accomplishments that we're proud of
- A working scientific autopoietic loop with explicit state + artifacts + deterministic gates + bounded self-repair.
- Two-phase execution (toy → field) with mandatory field attempt when toy isn’t falsified.
- Actionable UNRESOLVED: when evidence is insufficient, the system retries acquisition/execution instead of producing false FAILs.
- Auditability by construction: every meaningful claim about execution, datasets, or results maps to a stored artifact.
- No regex heuristics for workflow decisions: LLM orchestration + deterministic evaluators only.
What we learned
- “Autonomy” in science is less about longer reasoning and more about contracts: explicit schemas, artifact persistence, deterministic evaluators, and bounded retries.
- The most important safety feature isn’t a refusal; it’s the discipline to say UNRESOLVED and request/obtain evidence.
- A fast-path is essential: scientific rigor shouldn’t make basic interactions unusable.
What's next for Amawta
- Add a regression “metabolism” with ADK Evaluate (hallucination checks + rubrics for expert vs guided behavior).
- Expand field execution across more data modalities while keeping one auditable runner/gate contract.
- Tighten Ledger Closure further toward full “physics-as-accounting” semantics (more explicit ledger lines and invariants) and then layer additional epistemic gates on top without sacrificing determinism or traceability.