Sibyl

Inspiration

Sustainability reporting should be a matter of proof, not trust.

Corporate sustainability reports are dense 100- to 200-page documents packed with claims about emissions reductions, net-zero commitments, and governance structures that shape over $35 trillion in ESG investment decisions. Yet according to a 2020 European Commission study, over 53% of green claims were vague or misleading, and 40% were completely unsubstantiated. The problem grew severe enough that the EU drafted an entire Green Claims Directive to force independent verification. And that is just for marketing materials. Corporate sustainability reports face even less scrutiny.

The IFRS S1/S2 standards, effective for annual reporting periods beginning on or after 1 January 2024, require specific disclosures across Governance, Strategy, Risk Management, and Metrics and Targets. But we found no automated tooling that validates whether those claims are supported by real-world evidence.

Greenwashing is not only about fabricating claims. It is equally about strategic omission. A report that quietly drops Scope 3 emissions or never mentions transition plan dependencies is engaging in selective disclosure just as much as one that invents renewable energy commitments. We built Sibyl to catch both. The people who need this are ESG analysts, institutional investors, regulators, investigative journalists, and, frankly, anyone who suspects the sustainability report they are reading is too good to be true.

The name is a nod to the Sibyl System from the cyberpunk anime Psycho-Pass: a hive mind of specialized processors working in unison to deliver objective verdicts that no single actor could reach alone. We thought that was a pretty good description of what we built :)

What It Does

Sibyl is a multi-agent AI orchestration system that ingests sustainability report PDFs and produces a paragraph-level IFRS S1/S2 compliance analysis, exposing both what companies claim and what they choose not to disclose.

Upload a PDF. Eight specialized agents go to work.

Menny (Claims Agent) extracts every verifiable sustainability claim from the document and categorizes each by type and IFRS relevance before anything else in the pipeline runs.

Bron (Orchestrator) acts as traffic controller. It generates a routing plan for each claim, handles cross-agent information requests mid-investigation, and takes re-investigation orders from the Judge when evidence is too thin to issue a verdict.

Five specialists investigate in parallel:

Columbo (Geography), using Gemini 2.5 Pro's multimodal vision, is definitely the coolest agent. It geocodes location names via OpenStreetMap Nominatim, queries Microsoft Planetary Computer's STAC API for Sentinel-2 satellite imagery filtered for cloud cover, calculates NDVI from near-infrared and red spectral bands to detect vegetation change, then feeds the actual satellite images into Gemini for visual analysis. Output: change detected, estimated area in hectares, land cover features, and NDVI delta. A company claims it reforested 500 hectares in Riau Province. Sibyl checks from space.
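The NDVI step can be illustrated with a minimal sketch. NDVI is (NIR - Red) / (NIR + Red); for Sentinel-2 these are typically bands B08 and B04. The function names below are hypothetical, and plain Python lists stand in for the raster arrays a real pipeline would read from the STAC assets.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index for a single pixel."""
    total = nir + red
    if total == 0:
        return 0.0  # treat empty/no-data pixels as neutral instead of dividing by zero
    return (nir - red) / total


def mean_ndvi(nir_band, red_band) -> float:
    """Average NDVI over two equally sized bands of reflectance values."""
    values = [ndvi(n, r) for n, r in zip(nir_band, red_band)]
    if not values:
        raise ValueError("bands must contain at least one pixel")
    return sum(values) / len(values)


def ndvi_delta(before, after) -> float:
    """Mean NDVI change between two acquisitions, each given as (nir_band, red_band)."""
    return mean_ndvi(*after) - mean_ndvi(*before)
```

A positive delta between the earlier and later acquisition is consistent with vegetation gain; on its own it does not prove reforestation, which is why the imagery also goes through visual analysis.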

Mike (Legal) maps every claim to specific IFRS paragraph IDs like S2.14(a)(iv) and assesses compliance at the sub-requirement level, evaluating whether each bullet point in paragraphs like S1.26 through S1.27 is fully addressed, partially addressed, or missing entirely.

Izzy (News/Media) runs three parallel searches per claim: company-specific, industry-wide, and controversy-focused. Every result goes through a credibility gauntlet. Tier 1 sources like ProPublica and SEC enforcement filings carry 4x weight, established outlets like Bloomberg carry 3x, press releases carry 2x, and social media is excluded outright. Contradiction detection runs across four types: direct, contextual, omission, and timeline.
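One way the credibility weighting could be applied is sketched below. The 4x, 3x, and 2x weights and the social media exclusion come from the description above; the tier labels, the `Result` type, and the idea of reducing results to a weighted support ratio are assumptions made for illustration.

```python
from dataclasses import dataclass

# Weights from the description; social media has no entry, so it is excluded.
TIER_WEIGHTS = {"tier1": 4.0, "established": 3.0, "press_release": 2.0}


@dataclass
class Result:
    url: str
    tier: str             # hypothetical label: "tier1", "established", "press_release", "social"
    supports_claim: bool  # True if the source corroborates the claim


def weighted_support(results):
    """Weighted share of retained evidence that supports the claim, or None if nothing is retained."""
    kept = [r for r in results if r.tier in TIER_WEIGHTS]
    total = sum(TIER_WEIGHTS[r.tier] for r in kept)
    if total == 0:
        return None
    support = sum(TIER_WEIGHTS[r.tier] for r in kept if r.supports_claim)
    return support / total
```

With this scheme a single contradicting Tier 1 source outweighs a supporting press release, and social media posts never move the ratio.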

Newton (Academic) classifies what investigation framework each claim needs before searching: GHG Protocol methodology, certification legitimacy for RECs and REDD+ carbon credits, genuine SBTi Paris-alignment, or sector benchmark comparison.

Rhea (Data/Metrics) does not ask the LLM to do math. It uses a SafeEval calculator tool so arithmetic is computed deterministically rather than generated by the model. It checks Scope totals, year-over-year reduction percentages, baseline consistency across the report, whether emissions intensities are plausible for the sector, and whether 2030 and 2050 targets are mathematically achievable at the current reduction rate.
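The idea of delegating arithmetic to a restricted evaluator can be sketched with Python's `ast` module. This is not the SafeEval tool itself, whose internals are not described here; `safe_eval` is a hypothetical stand-in that accepts only numeric literals and the four basic operators.

```python
import ast
import operator

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str) -> float:
    """Evaluate a plain arithmetic expression; reject names, calls, and anything else."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval"))
```

For example, a year-over-year reduction from 1200 to 1050 tCO2e (made-up figures) would be checked with `safe_eval("(1200 - 1050) / 1200 * 100")`, which returns 12.5, and the result compared against the percentage the report states.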

Judy (Judge) scores every claim across four weighted dimensions: sufficiency (30%), consistency (25%), quality (25%, applying authority weights where legal scores 0.95 and news scores 0.7), and completeness (20%). Scores above 0.7 get a verdict: Verified, Contradicted, or Insufficient Evidence. Below 0.7, Judy fires a specific re-investigation request back through Bron with identified gaps and refined queries. This cycle repeats up to three times before a verdict is forced.
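The scoring and re-investigation rule can be sketched as follows. The weights, the 0.7 threshold, and the three-cycle cap are taken from the description above; the function names, the assumption that each dimension is scored between 0 and 1, and the treatment of a score of exactly 0.7 as "not above" are illustrative choices.

```python
WEIGHTS = {
    "sufficiency": 0.30,
    "consistency": 0.25,
    "quality": 0.25,
    "completeness": 0.20,
}
THRESHOLD = 0.7
MAX_CYCLES = 3


def composite_score(dimensions: dict) -> float:
    """Weighted sum of the four dimension scores (each assumed to lie in 0..1)."""
    return sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)


def next_step(dimensions: dict, cycles_used: int) -> str:
    """Issue a verdict when evidence is strong enough or the cycle budget is spent."""
    if composite_score(dimensions) > THRESHOLD or cycles_used >= MAX_CYCLES:
        return "issue_verdict"
    return "reinvestigate"
```

A claim scoring 0.4, 0.6, 0.7, and 0.5 on the four dimensions has a composite of about 0.545, so it goes back for another round until the third cycle has been used.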

The output maps every claim to specific IFRS paragraphs with a full evidence chain. A dedicated Disclosure Gaps section surfaces what the report never mentions at all. The investigation is visualized in real time through a Detective Dashboard where each agent appears as an animated avatar in a village. Watch reasoning streams pulse, claim edges flow between agents, and the message pool route cross-domain requests live.

Why Sibyl Stands Out

This is six hackathon submissions rolled into one project. It is not a demo with mocked data. Real PDF parsing, real Sentinel-2 satellite imagery, real IFRS/SASB RAG retrieval, real LLM reasoning across eight distinct agents, and real paragraph-level compliance mapping. Most ESG tools are checklist-based, verifying whether a topic exists, not whether the claim is true. Sibyl cross-references claims against satellite imagery, legal databases, news archives, peer-reviewed research, and quantitative analysis simultaneously. The cyclic Judge loop means no verdict is issued prematurely. Weak findings go back for another round.

Technical Highlights

Hybrid RAG with pgvector. Three corpora embedded in PostgreSQL: full IFRS S1/S2 texts chunked at paragraph level, SASB industry standards, and the uploaded report. Retrieval combines pgvector cosine similarity with PostgreSQL tsvector full-text search, re-ranked via reciprocal rank fusion to catch both conceptual and precise terminology matches.
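Reciprocal rank fusion itself is small enough to show in full. The sketch below assumes each retriever returns an ordered list of chunk IDs; the constant k = 60 is the value commonly used in the literature and is an assumption here, and the paragraph IDs in the usage note are purely illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Given a vector ranking `["S2.14", "S2.29", "S1.26"]` and a full-text ranking `["S2.29", "S1.27", "S2.14"]`, the fused order starts with the two chunks that both retrievers returned, which is the behavior that lets conceptual and exact-term matches reinforce each other.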

Real-Time SSE Streaming. Every agent node emits StreamEvent objects to the shared LangGraph state. A FastAPI SSE endpoint streams these to the frontend's useSSE hook with module-level caching so events survive navigation and restore instantly.
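On the wire, each event is a text frame in the Server-Sent Events format: an optional `event:` line, a `data:` line, and a blank line as terminator. The sketch below shows only that serialization step. The fields on `StreamEvent` are assumptions for illustration, not the project's actual schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class StreamEvent:
    agent: str       # hypothetical field: which agent emitted the event
    event_type: str  # hypothetical field: e.g. "reasoning" or "verdict"
    payload: dict    # hypothetical field: event-specific data


def to_sse_frame(event: StreamEvent) -> str:
    """Serialize one event as an SSE frame; json.dumps keeps the data on a single line."""
    data = json.dumps(asdict(event))
    return f"event: {event.event_type}\ndata: {data}\n\n"
```

A streaming endpoint would yield one such frame per event, and the frontend hook would parse the `data:` line back into an object.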

Strategic Model Selection. Gemini 2.5 Flash handles full report ingestion in a single pass using its 1M-token context window. DeepSeek V3.2 runs academic synthesis at a fraction of the cost. Claude Sonnet 4.5 is reserved for the tasks that genuinely need its reasoning quality: orchestration, legal interpretation, and final verdicts.

Design Decisions

Why multiple agents? We read Anthropic's 2026 Agentic Trends Report and saw that it identifies this as the defining architectural shift: "Multi-agent architectures use an orchestrator to coordinate specialized agents working in parallel, each with dedicated context, then synthesize results into integrated output." A single agent handling satellite imagery, legal compliance, emissions math, and news corroboration in one context window produces attention dilution and inconsistent results. Sibyl's agents each receive only the claims relevant to their domain. The principle is the same as a human investigative team with specialists working in parallel, each contributing focused expertise.

Why orchestrator and shared state over a swarm? Agents post InfoRequest and InfoResponse objects to a shared LangGraph state pool rather than communicating directly. This avoids the N-squared explosion of peer-to-peer channels and the unpredictable emergent behavior that comes with it. All communication flows through a checkpointed state, giving fault tolerance, replay capability, and clear audit trails.
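A minimal sketch of the shared-pool pattern, independent of LangGraph: the `InfoRequest` and `InfoResponse` names come from the description above, but their fields and the `SharedState` helper are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InfoRequest:
    request_id: str
    from_agent: str
    to_agent: str
    question: str


@dataclass
class InfoResponse:
    request_id: str
    from_agent: str
    answer: str


@dataclass
class SharedState:
    """Single pool that every agent reads from and writes to."""
    requests: list = field(default_factory=list)
    responses: list = field(default_factory=list)

    def pending_for(self, agent: str) -> list:
        """Requests addressed to `agent` that nobody has answered yet."""
        answered = {r.request_id for r in self.responses}
        return [
            q for q in self.requests
            if q.to_agent == agent and q.request_id not in answered
        ]
```

Because agents only touch the pool, adding an agent adds one connection rather than one per existing peer, and the pool doubles as the audit log of who asked what and who answered.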

Why LangGraph? The Judge's re-investigation loop requires explicit cyclic control flow. CrewAI does not support this natively. AutoGen's event-driven model adds unnecessary complexity. LangGraph's StateGraph handles conditional edges, PostgreSQL checkpointing, and SSE streaming callbacks out of the box.

Challenges

PDF claim highlighting was harder than it looked. PDF.js renders each text item as a separate span, and a single claim's text often crosses several spans with inconsistent whitespace between them. Cross-span DOM Ranges caused index drift where small positional errors compounded, and highlights landed on the wrong text entirely. We rebuilt using a span-by-span matching strategy: concatenate all spans into a searchable string, track which characters belong to which DOM element, normalize text, then reverse-map positions back to individual spans. Fallback highlight bars handle unmatched claims, staggered vertically to stay visible.
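The matching strategy can be sketched in a few lines. The real implementation runs in the browser against DOM spans; here it is shown in Python with a list of strings standing in for the spans. The function names are hypothetical, and normalization is reduced to stripping whitespace, which is an assumption about what the normalization step covers.

```python
def build_index(spans):
    """Concatenate span texts without whitespace; record which span owns each kept character."""
    chars, owners = [], []
    for span_idx, text in enumerate(spans):
        for ch in text:
            if ch.isspace():
                continue  # dropping whitespace removes the source of index drift
            chars.append(ch)
            owners.append(span_idx)
    return "".join(chars), owners


def find_claim_spans(spans, claim):
    """Return the indices of the spans a claim touches, or [] if it cannot be matched."""
    haystack, owners = build_index(spans)
    needle = "".join(ch for ch in claim if not ch.isspace())
    if not needle:
        return []
    start = haystack.find(needle)
    if start == -1:
        return []  # caller would fall back to a highlight bar
    return sorted(set(owners[start:start + len(needle)]))
```

Because every character in the searchable string carries its owning span, a match position maps back to spans by lookup instead of by accumulated offsets. A fuller version would also keep per-span character offsets so only the matching part of the first and last span is highlighted.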

What's Next

The immediate roadmap is to expand the RAG corpus beyond IFRS S1/S2 to cover GRI, TCFD, CSRD, and SEC Climate Disclosure rules, making Sibyl a single verification layer across the major reporting frameworks rather than one of several tools a compliance team runs separately. Ideally the framework would not be hard-coded at all: the platform would let users upload a framework and adapt to it. Beyond that, the more valuable product is continuous monitoring, where verdicts refresh automatically when new satellite imagery arrives, a regulatory action drops, or a paper challenges a methodology the company relies on. Verdicts should have an expiry date, not just an issue date. The natural scale-up is a portfolio-level view where an asset manager verifies an entire ESG fund at once and exports findings into their investment committee workflow. That is where the $35 trillion in ESG capital actually sits, and it is what makes Sibyl a product rather than a feature. We would also expose the Judge's scoring weights as configurable parameters, which would let institutional users tune verification to their own risk tolerance and let regulators define a compliance standard the whole market runs against uniformly. Finally, a clean production deployment so anyone can use it :)

Built With

alembic
docker
fast-api
framer-motion
nominatim
openstreetmap
pgvector
python
react
tailwind
tavily
typescript
vite

Submitted to

Hack for Humanity | 2026
- Winner First Place - $1226 value + other prizes
- Winner Participation - $1339 value
- Winner Best use of AI/ML - $180 value

Created by

Implemented "Source of Truth" report page and chatbot
Tawsif Mayaz
SWD @ Empire Life, Cineplex Digital Media | Computer Engineering @ UWaterloo

Implemented and designed the frontend, researched to design an optimal AI orchestration pipeline, created and refined systems diagrams
Abeer Das
Prev @ Questrade, UofT Enterprise, BorderPass | Systems Design Engineering @uWaterloo

Designed and implemented AI orchestration pipeline
Aaron Chow
Systems Design Engineering @ UWaterloo