Aquarius: Counterfactual World Engine
"Turn hindsight into foresight."
An agentic system that reconstructs and simulates alternate realities from real-world incident artifacts using Vision Language Models.
🎯 Inspiration
Every major incident, whether a highway collision, a production database meltdown, or a flash crash, leaves behind a trail of artifacts: dashcam footage, server logs, Slack transcripts, sensor telemetry. Post-mortems ask "what went wrong?" but rarely explore "what could have gone differently?"
We were inspired by three converging ideas:
- **Counterfactual reasoning in AI safety**: the ability to simulate "what-if" scenarios is fundamental to understanding causality, not just correlation.
- **The untapped potential of multimodal AI**: with Gemini 3's native video understanding and 2M-token context, we can finally ingest entire incident corpora (video + logs + reports) in a single reasoning pass.
- **Cross-domain applicability**: the same causal reasoning that asks "what if the driver braked 2 seconds earlier?" can ask "what if the circuit breaker tripped at 50% instead of 80%?" or "what if the kill switch threshold was \$1M instead of \$2.5M?"
The goal: build a universal incident reconstruction and simulation engine that works across domains.
🏗️ How We Built It
Architecture Overview
```
+---------------------------------------------------------------+
|                        User Interface                         |
|                   (API / CLI / Web Dashboard)                  |
+---------------------------------------------------------------+
                               |
+---------------------------------------------------------------+
|                        Marathon Agent                          |
|               (Autonomous hypothesis exploration)              |
+---------------------------------------------------------------+
                               |
+--------------------+------------------+-----------------------+
|  Counterfactual    |   Causal Graph   |   Physics/Kinematic   |
|    Simulator       |      Engine      |       Validator       |
+--------------------+------------------+-----------------------+
                               |
+---------------------------------------------------------------+
|                       VLM Reasoning Core                       |
|               (Gemini 3 / Grok / Claude / GPT-4)               |
+---------------------------------------------------------------+
                               |
+------------------------------+--------------------------------+
|      Temporal Alignment      |      Multimodal Ingestion      |
|      (Cross-modal sync)      |  (Video/Logs/Reports/Sensors)  |
+------------------------------+--------------------------------+
```
Core Components
| Component | Purpose |
|---|---|
| Ingestion Layer | Parses video, logs (syslog, JSON, custom), PDF reports, and sensor telemetry into a unified artifact schema |
| Temporal Alignment | Synchronizes multimodal streams using VLM-assisted anchor detection (visual cues ↔ log events) |
| VLM Reasoning Core | Multi-provider abstraction supporting Gemini, Grok, Claude, and GPT-4 with function calling |
| Causal Graph Engine | Extracts cause-effect relationships using structured function calls |
| Counterfactual Simulator | Generates interventions and simulates alternate timelines |
| Marathon Agent | Autonomous exploration of hypothesis space |
The Simulation Algorithm
For each counterfactual scenario, we:
- **Identify the divergence point**: where does the intervention change the timeline?
- **Propagate effects**: what cascading changes occur?
- **Assess outcomes**: does the catastrophic event still happen?
- **Score effectiveness**: how much did the intervention help?
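The loop above can be sketched in Python. This is a minimal illustration, not the actual implementation: the `Event` type, `simulate_counterfactual`, and the brake/collision timeline are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Event:
    t: float          # seconds from incident start
    description: str
    catastrophic: bool = False

def simulate_counterfactual(timeline, intervention_t, propagate):
    """Replay a timeline with an intervention applied at `intervention_t`.

    `propagate` is a domain-model callback mapping each post-divergence
    event to its altered form (or None if the event no longer occurs).
    """
    # 1. Identify the divergence point: first event at/after the intervention.
    divergence = next((i for i, e in enumerate(timeline) if e.t >= intervention_t),
                      len(timeline))
    # 2. Propagate effects: events before the divergence are unchanged,
    #    later events are rewritten (or dropped) by the domain model.
    altered = timeline[:divergence]
    for event in timeline[divergence:]:
        new_event = propagate(event)
        if new_event is not None:
            altered.append(new_event)
    # 3. Assess the outcome: does any catastrophic event survive?
    prevented = not any(e.catastrophic for e in altered)
    return altered, prevented

# Usage: the intervention's domain model drops the collision event.
timeline = [Event(0.0, "vehicle ahead brakes"),
            Event(2.5, "driver brakes"),
            Event(3.0, "collision", catastrophic=True)]
altered, prevented = simulate_counterfactual(
    timeline, intervention_t=2.5,
    propagate=lambda e: None if e.description == "collision" else e)
print(prevented)  # → True
```

Step 4 (scoring) happens outside this loop, once the alternate timeline's outcome is known.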
Mathematically, we model intervention effectiveness as:
$$ \text{Score}(I) = E(I) \times F(I) \times C(I) $$
Where:
- $E(I)$ = Effectiveness: normalized harm reduction (0 = no improvement, 1 = full prevention)
- $F(I)$ = Feasibility: can we realistically implement this? (0–1)
- $C(I)$ = Confidence: how certain is the simulation? (0–1)
Why This Formula?
We use a multiplicative scoring function inspired by decision analysis (Keeney & Raiffa, 1976). An effective intervention must satisfy three independent qualities:
| Quality | Question It Answers |
|---|---|
| Effectiveness | Does it actually prevent or reduce harm? |
| Feasibility | Can we realistically implement it? |
| Confidence | How certain is our simulation of the outcome? |
Why multiplication? It enforces that all factors must be good; a single zero kills the score:
- 0.9 × 0.9 × 0.9 = 0.73 ✅ (good across the board)
- 1.0 × 0.0 × 1.0 = 0.0 ❌ (perfect prevention, but impossible to implement)
Example: For intervention "AEB triggers 0.5s earlier":
- Effectiveness: E = 1.0 (full prevention)
- Feasibility: F = 0.85 (OTA software update)
- Confidence: C = 0.9 (high certainty)
$$\text{Score} = 1.0 \times 0.85 \times 0.9 = 0.765$$
Compare to "car didn't exist": $E = 1.0$, $F = 0.0$, so Score $= 0$ (useless despite "preventing" the outcome).
This ensures the system recommends actionable, effective interventions rather than obvious or impossible ones.
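The formula and the two worked examples above translate directly into code. A sketch, with `intervention_score` as a hypothetical helper name:

```python
def intervention_score(effectiveness, feasibility, confidence):
    """Multiplicative score: any factor at zero zeroes the whole score."""
    for factor in (effectiveness, feasibility, confidence):
        if not 0.0 <= factor <= 1.0:
            raise ValueError("all factors must lie in [0, 1]")
    return effectiveness * feasibility * confidence

# "AEB triggers 0.5s earlier": effective, feasible, high confidence.
print(round(intervention_score(1.0, 0.85, 0.9), 3))  # → 0.765
# "car didn't exist": perfect prevention but zero feasibility.
print(intervention_score(1.0, 0.0, 1.0))             # → 0.0
```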
🧠 Gemini 3 Integration
Gemini 3 serves as the primary reasoning backbone, leveraging four key capabilities:
1. Native Video Understanding
```python
response = await provider.generate_with_video(
    video_path="dashcam.mp4",
    prompt="Identify all vehicles, their trajectories, and the collision sequence",
    functions=[emit_event, establish_entity, link_cause_effect],
    fps=2.0,  # Sample at 2 FPS for fast action
    start_offset="10s",
    end_offset="45s",
)
```
No frame-extraction preprocessing: Gemini processes raw video with configurable FPS sampling and temporal clipping via `VideoMetadata`.
2. Structured Function Calling
Instead of parsing unstructured text, we use native function calling:
```python
functions = [
    {"name": "emit_event", "description": "Register a timestamped event"},
    {"name": "link_cause_effect", "description": "Establish causal relationship"},
    {"name": "propose_intervention", "description": "Suggest counterfactual change"},
    {"name": "assess_outcome", "description": "Evaluate alternate timeline result"},
]
```
This ensures reliable structured output across all domains.
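The snippet above shows only names and descriptions; a full declaration also carries a parameter schema, which is what makes the model's output typed and machine-checkable. For illustration, `emit_event` might be declared like this; the exact fields shown are an assumption, not the project's actual schema:

```python
# Hypothetical full declaration for `emit_event`; the parameter schema
# (JSON-Schema style) is what enforces structured output.
emit_event = {
    "name": "emit_event",
    "description": "Register a timestamped event on the incident timeline",
    "parameters": {
        "type": "object",
        "properties": {
            "timestamp": {
                "type": "string",
                "description": "Offset from incident start, e.g. '00:12.5'",
            },
            "description": {"type": "string"},
            "entities": {
                "type": "array",
                "items": {"type": "string"},
                "description": "IDs of entities involved in the event",
            },
            "severity": {"type": "string", "enum": ["info", "warning", "critical"]},
        },
        "required": ["timestamp", "description"],
    },
}
```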
3. Extended Thinking Mode
```python
generation_config.thinking_config = types.ThinkingConfig(
    thinking_budget=10000  # Allocate reasoning tokens
)
```
Complex causal chains require deep reasoning; thinking mode provides the cognitive budget for multi-step analysis.
4. Massive Context Window
The 2M token context allows ingesting complete incident corpora:
- 30 minutes of dashcam footage
- 50,000 lines of logs
- Full postmortem reports
- Slack incident channel transcripts
All in a single reasoning pass, without lossy summarization.
📚 What We Learned
Technical Insights
- **Function calling > text parsing**: structured outputs via function calling are dramatically more reliable than regex/JSON extraction from free-form text.
- **Temporal alignment is hard**: clock drift, timezone mismatches, and missing timestamps require VLM-assisted anchor detection (e.g., correlating a visible clock in video with log timestamps).
- **Domain-specific interventions matter**: generic counterfactuals are less useful than domain-aware ones. A traffic scenario needs "earlier braking" while a DevOps incident needs "lower circuit breaker threshold".
- **Thinking tokens are worth it**: for complex causal reasoning, enabling Gemini's thinking mode significantly improves the quality of counterfactual simulations.
Design Principles
- **Provider abstraction**: supporting multiple VLMs (Gemini, Grok, Claude, GPT-4) ensures resilience and allows benchmarking.
- **Artifact-first design**: everything flows from the unified artifact schema, making the system domain-agnostic.
- **Iterative simulation**: counterfactual exploration is a loop, not a single inference.
🚧 Challenges Faced
1. Multimodal Synchronization
Problem: A dashcam video, GPS telemetry, and police report all describe the same incident but with different timestamps, coordinate systems, and granularities.
Solution: We built a temporal alignment engine that uses VLM-assisted anchor point detection: it identifies visual events (brake lights, collisions) and correlates them with log entries to establish a unified timeline.
2. Hallucination in Counterfactual Reasoning
Problem: When simulating "what-if" scenarios, the model can generate physically impossible or logically inconsistent alternate events.
Solution: We added a physics validator for traffic scenarios (checking kinematic constraints like $v^2 = u^2 + 2as$) and domain-specific plausibility checks for other domains.
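With final speed $v = 0$, the kinematic equation $v^2 = u^2 + 2as$ gives a required stopping distance of $s = u^2 / (2|a|)$, so a claimed stop is only plausible if that distance fits in the space available. A sketch of such a check; `stop_is_plausible` is a hypothetical helper, not the validator's actual API:

```python
def stop_is_plausible(initial_speed_ms, decel_ms2, available_distance_m):
    """Check a claimed full stop against v^2 = u^2 + 2as with v = 0.

    The required stopping distance is s = u^2 / (2 * |a|); the claim is
    plausible only if that fits within the available distance.
    """
    if decel_ms2 <= 0:
        raise ValueError("deceleration must be positive")
    required = initial_speed_ms ** 2 / (2 * decel_ms2)
    return required <= available_distance_m

# 20 m/s (72 km/h) braking at 8 m/s^2 needs 25 m to stop.
print(stop_is_plausible(20.0, 8.0, 30.0))  # → True
print(stop_is_plausible(20.0, 8.0, 20.0))  # → False
```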
3. Context Window Management
Problem: Even with 2M tokens, very long incidents can exceed limits. Naive truncation loses critical causal information.
Solution: Priority-based context allocation: critical events near the incident get full detail, while background context is summarized, with a rolling window for extended timelines.
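A greedy sketch of the allocation idea, assuming each chunk can be replaced by a fixed-cost summary (all names and numbers hypothetical):

```python
def allocate_context(chunks, budget_tokens, incident_t, summary_cost=50):
    """Greedy priority allocation: chunks closest in time to the incident
    get full detail; the rest are represented by fixed-cost summaries.

    `chunks` is a list of (timestamp_s, token_count) pairs.
    Returns a "full"/"summary" decision per chunk, in the original order.
    """
    # Closest to the incident = highest priority.
    order = sorted(range(len(chunks)),
                   key=lambda i: abs(chunks[i][0] - incident_t))
    decisions = ["summary"] * len(chunks)
    remaining = budget_tokens - summary_cost * len(chunks)  # reserve summaries
    for i in order:
        upgrade_cost = chunks[i][1] - summary_cost  # full detail replaces summary
        if upgrade_cost <= remaining:
            decisions[i] = "full"
            remaining -= upgrade_cost
    return decisions

# Incident at t=300s: nearby chunks win the budget, distant ones are summarized.
chunks = [(0, 400), (290, 900), (300, 1200), (310, 800), (600, 500)]
print(allocate_context(chunks, budget_tokens=3000, incident_t=300))
# → ['summary', 'full', 'full', 'full', 'summary']
```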
4. Intervention Quality
Problem: Early versions generated too many trivial interventions ("the accident wouldn't have happened if the car didn't exist").
Solution: Feasibility scoring ($F(I) \in [0, 1]$) and domain-specific intervention templates ensure actionable, realistic counterfactuals.
🚀 Future Directions
- **Real-time ingestion**: stream incidents as they unfold, not just post-hoc analysis
- **Visual synthesis**: render alternate trajectories overlaid on the original video
- **Multi-agent simulation**: model how different actors would respond to interventions