Aquarius: Counterfactual World Engine

"Turn hindsight into foresight."

An agentic system that reconstructs and simulates alternate realities from real-world incident artifacts using Vision Language Models.


🎯 Inspiration

Every major incident, whether a highway collision, a production database meltdown, or a flash crash, leaves behind a trail of artifacts: dashcam footage, server logs, Slack transcripts, sensor telemetry. Post-mortems ask "what went wrong?" but rarely explore "what could have gone differently?"

We were inspired by three converging ideas:

  1. Counterfactual reasoning in AI safety: The ability to simulate "what-if" scenarios is fundamental to understanding causality, not just correlation.

  2. The untapped potential of multimodal AI: With Gemini 3's native video understanding and 2M token context, we can finally ingest entire incident corpora (video + logs + reports) in a single reasoning pass.

  3. Cross-domain applicability: The same causal reasoning that asks "what if the driver braked 2 seconds earlier?" can ask "what if the circuit breaker tripped at 50% instead of 80%?" or "what if the kill switch threshold was \$1M instead of \$2.5M?"

The goal: build a universal incident reconstruction and simulation engine that works across domains.


πŸ—οΈ How We Built It

Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                        User Interface                        │
│                 (API / CLI / Web Dashboard)                  │
└──────────────────────────────────────────────────────────────┘
                               │
┌──────────────────────────────────────────────────────────────┐
│                        Marathon Agent                        │
│             (Autonomous hypothesis exploration)              │
└──────────────────────────────────────────────────────────────┘
                               │
┌────────────────────┬────────────────────┬────────────────────┐
│   Counterfactual   │  Causal Graph      │ Physics/Kinematic  │
│   Simulator        │  Engine            │ Validator          │
└────────────────────┴────────────────────┴────────────────────┘
                               │
┌──────────────────────────────────────────────────────────────┐
│                      VLM Reasoning Core                      │
│              (Gemini 3 / Grok / Claude / GPT-4)              │
└──────────────────────────────────────────────────────────────┘
                               │
┌──────────────────────────────┬───────────────────────────────┐
│      Temporal Alignment      │     Multimodal Ingestion      │
│     (Cross-modal sync)       │ (Video/Logs/Reports/Sensors)  │
└──────────────────────────────┴───────────────────────────────┘

Core Components

| Component | Purpose |
| --- | --- |
| Ingestion Layer | Parses video, logs (syslog, JSON, custom), PDF reports, and sensor telemetry into a unified artifact schema |
| Temporal Alignment | Synchronizes multimodal streams using VLM-assisted anchor detection (visual cues ↔ log events) |
| VLM Reasoning Core | Multi-provider abstraction supporting Gemini, Grok, Claude, and GPT-4 with function calling |
| Causal Graph Engine | Extracts cause-effect relationships using structured function calls |
| Counterfactual Simulator | Generates interventions and simulates alternate timelines |
| Marathon Agent | Autonomous exploration of the hypothesis space |

The Simulation Algorithm

For each counterfactual scenario, we:

  1. Identify the divergence point: Where does the intervention change the timeline?
  2. Propagate effects: What cascading changes occur?
  3. Assess outcomes: Does the catastrophic event still happen?
  4. Score effectiveness: How much did the intervention help?
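The loop above can be sketched in a few lines of Python. This is a toy model: `Event`, `Timeline`, and the "drop downstream harm after the divergence point" propagation rule are illustrative stand-ins, not the production simulator.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    t: float              # seconds from incident start
    description: str
    harmful: bool = False

@dataclass
class Timeline:
    events: list = field(default_factory=list)

def simulate(baseline: Timeline, divergence_t: float) -> Timeline:
    """Toy propagation rule: history before the divergence point is copied
    verbatim; harmful events downstream of it are assumed to be averted."""
    alternate = Timeline()
    for e in baseline.events:
        if e.t < divergence_t or not e.harmful:
            alternate.events.append(e)
    return alternate

def outcome_prevented(timeline: Timeline) -> bool:
    return not any(e.harmful for e in timeline.events)
```

For a dashcam scenario with a collision at t = 3.0 s, an intervention at t = 2.5 s drops the downstream collision and `outcome_prevented` flips from False to True.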

Mathematically, we model intervention effectiveness as:

$$ \text{Score}(I) = E(I) \times F(I) \times C(I) $$

Where:

  • $E(I)$: Effectiveness, the normalized harm reduction (0 = no improvement, 1 = full prevention)
  • $F(I)$: Feasibility, can we realistically implement this? (0-1)
  • $C(I)$: Confidence, how certain is the simulation? (0-1)

Why This Formula?

We use a multiplicative scoring function inspired by decision analysis (Keeney & Raiffa, 1976). An effective intervention must satisfy three independent qualities:

| Quality | Question It Answers |
| --- | --- |
| Effectiveness | Does it actually prevent or reduce harm? |
| Feasibility | Can we realistically implement it? |
| Confidence | How certain is our simulation of the outcome? |

Why multiplication? It enforces that all factors must be good; a single zero kills the score:

  • 0.9 * 0.9 * 0.9 = 0.73 ✓ (good across the board)
  • 1.0 * 0.0 * 1.0 = 0.0 ✗ (perfect prevention, but impossible to implement)

Example: For intervention "AEB triggers 0.5s earlier":

  • Effectiveness: E = 1.0 (full prevention)
  • Feasibility: F = 0.85 (OTA software update)
  • Confidence: C = 0.9 (high certainty)

$$\text{Score} = 1.0 \times 0.85 \times 0.9 = 0.765$$

Compare to "car didn't exist": E = 1.0, F = 0.0 → Score = 0 (useless despite "preventing" the outcome).

This ensures the system recommends actionable, effective interventions rather than obvious or impossible ones.
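The scoring rule itself is one line; here is a hedged sketch (the function name and the range guard are ours, not taken from the codebase):

```python
def score(effectiveness: float, feasibility: float, confidence: float) -> float:
    """Multiplicative intervention score: any factor at zero zeroes the whole score."""
    for v in (effectiveness, feasibility, confidence):
        if not 0.0 <= v <= 1.0:
            raise ValueError("all factors must lie in [0, 1]")
    return effectiveness * feasibility * confidence

# Worked example from above: "AEB triggers 0.5s earlier"
round(score(1.0, 0.85, 0.9), 3)   # 0.765
```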


🔧 Gemini 3 Integration

Gemini 3 serves as the primary reasoning backbone, leveraging four key capabilities:

1. Native Video Understanding

response = await provider.generate_with_video(
    video_path="dashcam.mp4",
    prompt="Identify all vehicles, their trajectories, and the collision sequence",
    functions=[emit_event, establish_entity, link_cause_effect],
    fps=2.0,  # Sample at 2 FPS for fast action
    start_offset="10s",
    end_offset="45s",
)

No frame-extraction preprocessing: Gemini processes raw video with configurable FPS sampling and temporal clipping via VideoMetadata.

2. Structured Function Calling

Instead of parsing unstructured text, we use native function calling:

functions = [
    {"name": "emit_event", "description": "Register a timestamped event"},
    {"name": "link_cause_effect", "description": "Establish causal relationship"},
    {"name": "propose_intervention", "description": "Suggest counterfactual change"},
    {"name": "assess_outcome", "description": "Evaluate alternate timeline result"},
]

This ensures reliable structured output across all domains.
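On our side, each model-emitted call is routed to a Python handler. A minimal dispatcher sketch (the handler bodies and the `{'name': ..., 'args': {...}}` call shape are illustrative; the real handlers write into the causal graph store):

```python
def emit_event(timestamp: str, description: str) -> dict:
    # Illustrative handler: register a timestamped event
    return {"type": "event", "t": timestamp, "desc": description}

def link_cause_effect(cause: str, effect: str) -> dict:
    # Illustrative handler: record a causal edge
    return {"type": "causal_link", "cause": cause, "effect": effect}

HANDLERS = {"emit_event": emit_event, "link_cause_effect": link_cause_effect}

def dispatch(call: dict) -> dict:
    """Route one model-emitted function call to its registered handler."""
    if call["name"] not in HANDLERS:
        raise ValueError(f"model called unknown function {call['name']!r}")
    return HANDLERS[call["name"]](**call["args"])
```

Rejecting unknown names keeps a hallucinated function call from silently corrupting the graph.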

3. Extended Thinking Mode

generation_config.thinking_config = types.ThinkingConfig(
    thinking_budget=10000  # Allocate reasoning tokens
)

Complex causal chains require deep reasoning; thinking mode provides the cognitive budget for multi-step analysis.

4. Massive Context Window

The 2M token context allows ingesting complete incident corpora:

  • 30 minutes of dashcam footage
  • 50,000 lines of logs
  • Full postmortem reports
  • Slack incident channel transcripts

All in a single reasoning pass, without lossy summarization.


🎓 What We Learned

Technical Insights

  1. Function calling > text parsing: Structured outputs via function calling are dramatically more reliable than regex/JSON extraction from free-form text.

  2. Temporal alignment is hard: Clock drift, timezone mismatches, and missing timestamps require VLM-assisted anchor detection (e.g., correlating a visible clock in video with log timestamps).

  3. Domain-specific interventions matter: Generic counterfactuals are less useful than domain-aware ones. A traffic scenario needs "earlier braking" while a DevOps incident needs "lower circuit breaker threshold".

  4. Thinking tokens are worth it: For complex causal reasoning, enabling Gemini's thinking mode significantly improves the quality of counterfactual simulations.

Design Principles

  • Provider abstraction: Supporting multiple VLMs (Gemini, Grok, Claude, GPT-4) ensures resilience and allows benchmarking.
  • Artifact-first design: Everything flows from the unified artifact schema, making the system domain-agnostic.
  • Iterative simulation: Counterfactual exploration is a loop, not a single inference.

🚧 Challenges Faced

1. Multimodal Synchronization

Problem: A dashcam video, GPS telemetry, and police report all describe the same incident but with different timestamps, coordinate systems, and granularities.

Solution: We built a temporal alignment engine that uses VLM-assisted anchor point detection, identifying visual events (brake lights, collisions) and correlating them with log entries to establish a unified timeline.
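Once anchor pairs exist, the simplest alignment model is a constant clock offset between streams. A sketch under that assumption (function names are ours, and the real engine also has to handle drift, not just a fixed offset):

```python
from statistics import median

def estimate_offset(anchors):
    """anchors: list of (video_t, log_t) pairs, in seconds, where both values
    describe the same physical event (e.g. brake lights in frame <-> a BRAKE
    log entry). Returns the constant offset log_t - video_t; the median makes
    the estimate robust to a single misdetected anchor."""
    if not anchors:
        raise ValueError("need at least one anchor pair")
    return median(log_t - video_t for video_t, log_t in anchors)

def log_to_video_time(log_t, offset):
    """Map a log timestamp onto the video's clock."""
    return log_t - offset
```

With three anchors whose offsets agree to within 0.1 s, every log line can then be placed on the video timeline with sub-second precision.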

2. Hallucination in Counterfactual Reasoning

Problem: When simulating "what-if" scenarios, the model can generate physically impossible or logically inconsistent alternate events.

Solution: We added a physics validator for traffic scenarios (checking kinematic constraints like $v^2 = u^2 + 2as$) and domain-specific plausibility checks for other domains.
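For the traffic domain, one such check reduces to constant-deceleration kinematics. A sketch (the 7.5 m/s² deceleration ceiling is our illustrative assumption for dry pavement, not a constant from the codebase):

```python
def stop_is_plausible(speed: float, available_distance: float,
                      max_decel: float = 7.5) -> bool:
    """Validate a counterfactual 'the vehicle stops in time' claim against
    v^2 = u^2 + 2as. With final v = 0, the minimum stopping distance is
    u^2 / (2 * a_max); the claim is plausible only if that fits in the
    available distance."""
    if speed < 0 or available_distance < 0 or max_decel <= 0:
        raise ValueError("speed/distance must be >= 0 and max_decel > 0")
    required = speed * speed / (2.0 * max_decel)
    return required <= available_distance
```

At 20 m/s (72 km/h) the minimum stopping distance is about 26.7 m, so a simulated "braked with 30 m to spare" passes while "braked with 20 m to spare" gets flagged as physically impossible.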

3. Context Window Management

Problem: Even with 2M tokens, very long incidents can exceed limits. Naive truncation loses critical causal information.

Solution: Priority-based context allocation. Critical events near the incident get full detail, while background context is summarized; a rolling window handles extended timelines.
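A greedy sketch of that policy (the chunk schema, the stub placeholder, and the ordering are simplified; the real allocator also charges tokens for summaries and restores chronological order before sending):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    tokens: int
    dist: float   # seconds between this chunk and the incident moment

def allocate(chunks, budget):
    """Greedy priority allocation: chunks nearest the incident get full text
    until the token budget runs out; everything else degrades to a stub."""
    out = []
    remaining = budget
    for c in sorted(chunks, key=lambda c: c.dist):
        if c.tokens <= remaining:
            out.append(c.text)
            remaining -= c.tokens
        else:
            out.append(f"[summarized] {c.text[:24]}...")
    return out
```

With a 15-token budget and three 10-token chunks, only the chunk closest to the incident survives verbatim; the rest are stubbed, which is exactly the failure mode naive truncation gets wrong (it would instead drop whatever came last).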

4. Intervention Quality

Problem: Early versions generated too many trivial interventions ("the accident wouldn't have happened if the car didn't exist").

Solution: Feasibility scoring (F(I) in [0,1]) and domain-specific intervention templates ensure actionable, realistic counterfactuals.


🚀 Future Directions

  • Real-time ingestion: Stream incidents as they unfold, not just post-hoc analysis
  • Visual synthesis: Render alternate trajectories overlaid on original video
  • Multi-agent simulation: Model how different actors would respond to interventions
  • Benchmark dataset: Curated incidents with ground-truth counterfactual outcomes
