ESG.audit

Inspiration

ESG reports are everywhere. Net-zero by 2040. 100% renewable energy. Zero forced labor in our supply chain. But behind most of these commitments sits a fundamental question nobody has a fast answer to: is there actual evidence in this report, or is it just well-written marketing? Manual ESG audits take weeks and cost thousands. We wanted to know if AI agents could do it in seconds.

What it does

ESG.audit takes any sustainability report, uploaded as a PDF, and runs it through a multi-agent pipeline that does three things no human analyst wants to do manually.

It extracts every ESG claim in the report across environmental, social, and governance categories. For each claim, an AI agent searches the document iteratively, looking for the data behind the words: a quantitative figure, a baseline year, a trend over time, interim milestones. The agent then returns a verdict: supported, inconsistent, or unsupported. Separately, a second set of agents checks whether the report actually discloses what major international standards like TCFD and GRI require, catching what companies quietly leave out, not just what they get wrong. Everything is scored, the reasoning is visible, and the results stream to the UI in real time as each agent finishes.

How we built it

The frontend uploads PDFs directly from the browser to Supabase Storage, so the upload itself never passes through our server. A Next.js API route then downloads the file from storage, extracts text with pdf-parse, chunks it into segments of roughly 500 tokens, and embeds each chunk using Gemini's text-embedding-001 model. Embeddings are stored in Supabase pgvector.
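The chunking step can be sketched roughly as follows. This is a minimal illustration, not our exact code: chunkText is a hypothetical helper, and it approximates tokens as whitespace-separated words, whereas a real pipeline would count with the tokenizer that matches the embedding model.

```typescript
// Hypothetical helper: splits text into chunks of at most `maxTokens` "tokens".
// ASSUMPTION: a token is approximated as one whitespace-separated word.
function chunkText(text: string, maxTokens = 500): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxTokens) {
    // slice() never returns more than maxTokens words, so the ceiling is hard.
    chunks.push(words.slice(i, i + maxTokens).join(" "));
  }
  return chunks;
}
```

Because the ceiling is applied while slicing, no chunk can exceed it by the time it reaches the embedding call.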

The analysis pipeline is built on OpenAI's GPT-4o with tool use. Each evidence checker is a real agent loop. It calls a search_document tool against our Supabase vector index as many times as it needs, refining its queries based on what it finds, before calling submit_verdict. Up to 30 claim agents and 12 framework agents run in parallel batches of 5. Results stream back to the frontend via Server-Sent Events as each stage completes. Scoring is pure logic with no LLM involved.
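The batching described above could look something like this sketch. The names toBatches and runInBatches are hypothetical, and the sketch assumes each batch is awaited in full before the next one starts, which is one simple way to cap concurrency at 5.

```typescript
// Hypothetical helper: splits a list into consecutive groups of at most `size`.
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical runner: agents inside a batch run concurrently,
// and the next batch starts only after the current one has finished.
async function runInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  batchSize = 5,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of toBatches(items, batchSize)) {
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}
```

With 12 framework agents and a batch size of 5, this produces batches of 5, 5, and 2.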

Challenges we ran into

Getting the agent loops to terminate cleanly was harder than expected. Without a hard iteration cap, agents would occasionally keep searching past the point of diminishing returns. We added a MAX_ITERATIONS guard of 6 and a fallback verdict for any loop that exits without calling submit_verdict.
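A stripped-down sketch of the guarded loop is shown below. The decide and search callbacks stand in for the GPT-4o call and the vector search and are hypothetical, as are the type names. The choice of "unsupported" as the fallback verdict is an assumption made for illustration.

```typescript
type Verdict = "supported" | "inconsistent" | "unsupported";

// The two tools the agent can call on each turn.
type AgentAction =
  | { tool: "search_document"; query: string }
  | { tool: "submit_verdict"; verdict: Verdict; reasoning: string };

interface AgentResult {
  verdict: Verdict;
  reasoning: string;
  searches: string[]; // queries run, kept for the reasoning trace
  usedFallback: boolean;
}

const MAX_ITERATIONS = 6;

async function runEvidenceAgent(
  claim: string,
  decide: (claim: string, findings: string[]) => Promise<AgentAction>, // stand-in for the LLM call
  search: (query: string) => Promise<string[]>, // stand-in for the vector search
): Promise<AgentResult> {
  const searches: string[] = [];
  const findings: string[] = [];
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const action = await decide(claim, findings);
    if (action.tool === "submit_verdict") {
      return { verdict: action.verdict, reasoning: action.reasoning, searches, usedFallback: false };
    }
    searches.push(action.query);
    findings.push(...(await search(action.query)));
  }
  // The loop ended without submit_verdict: return a fallback instead of hanging.
  // ASSUMPTION: the fallback is "unsupported"; the actual fallback value may differ.
  return {
    verdict: "unsupported",
    reasoning: "No verdict submitted within the iteration limit.",
    searches,
    usedFallback: true,
  };
}
```

Returning the list of searches alongside the verdict means even a fallback result still carries its trace.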

Claim deduplication was trickier than it looked. A large report might produce 60 raw claims where 20 are near-duplicates phrased slightly differently. We solved this by embedding all extracted claims and merging any pair with cosine similarity above 0.92, keeping the more specific phrasing.
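The merge step can be sketched as below. The function and type names are hypothetical. The sketch also makes two assumptions: it compares each claim against already-kept claims in a single greedy pass, and it uses text length as a stand-in for "more specific phrasing".

```typescript
interface EmbeddedClaim {
  text: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Greedy pass: a claim is merged into the first kept claim it closely matches.
function dedupeClaims(claims: EmbeddedClaim[], threshold = 0.92): EmbeddedClaim[] {
  const kept: EmbeddedClaim[] = [];
  for (const claim of claims) {
    const match = kept.findIndex(
      (k) => cosineSimilarity(k.embedding, claim.embedding) > threshold,
    );
    if (match === -1) {
      kept.push(claim);
    } else if (claim.text.length > kept[match].text.length) {
      // ASSUMPTION: the longer phrasing is treated as the more specific one.
      kept[match] = claim;
    }
  }
  return kept;
}
```

A greedy pass is order-dependent, so a clustering approach may merge borderline groups differently.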

Coordinating across a split codebase under time pressure was its own challenge. The ingest side and the agent pipeline side shared only a session ID and a Supabase table as a contract. Keeping that boundary clean while both sides moved fast required discipline.

Accomplishments that we're proud of

The reasoning trace. Every verdict in the UI shows exactly which searches the agent ran and what it found before reaching its conclusion. Nothing is a black box. You can see why a claim was flagged, not just that it was. For a tool making credibility judgments, that transparency matters.

The framework gap checker catching what claims-based analysis misses. A company can make all the right-sounding commitments and still score poorly because it quietly omitted Scope 3 emissions or never reported progress against last year's targets. That gap between what was said and what was left unsaid is where a lot of greenwashing actually lives.

What we learned

Handling token limits across two different APIs in the same pipeline taught us a lot. Gemini's text-embedding-001 model has an input token limit per embedding call. Chunks that exceed this get silently truncated, which corrupts the vector and produces bad retrieval. We learned to enforce a hard chunk size ceiling before embedding, not after. On the OpenAI side, long ESG reports with many claims meant agent message histories could grow large across iterations. We learned to keep tool result payloads lean, returning only chunk text and not metadata, to stay well within GPT-4o's context window across the full agent loop.
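Keeping tool results lean can be as simple as the sketch below. The SearchHit shape and its field names are hypothetical, as is the maxChunks cap. The idea is that only chunk text is serialized back into the agent's message history.

```typescript
// Hypothetical shape of a row returned by the vector search.
interface SearchHit {
  id: string;
  content: string;
  similarity: number;
  page?: number;
}

// Builds the string handed back to the model as the search_document tool result.
// Only the chunk text is kept; ids, scores, and page numbers are dropped.
function toToolResult(hits: SearchHit[], maxChunks = 5): string {
  const texts = hits.slice(0, maxChunks).map((hit) => hit.content);
  return JSON.stringify(texts);
}
```

Each tool result is re-sent to the model on every later iteration, so trimming it once saves tokens across the whole loop.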

Beyond token handling, we learned that retrieval quality lives and dies on query specificity. A broad query like "emissions" returns noisy chunks. A query like "Scope 1 baseline year 2019 historical figures" returns exactly what the agent needs. Prompting the agent to be specific in its search queries, not just to search, made a meaningful difference in verdict accuracy.
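One place to push for specific queries is the tool definition itself. The sketch below uses the function-tool format of OpenAI's Chat Completions API; the description wording is illustrative only and is not our actual prompt.

```typescript
// Tool definition in the Chat Completions function-tool format.
// ASSUMPTION: the description text is a hypothetical example of steering query specificity.
const searchDocumentTool = {
  type: "function" as const,
  function: {
    name: "search_document",
    description:
      "Search the uploaded ESG report. Use specific queries that name the metric, " +
      "scope, and year, for example 'Scope 1 baseline year 2019 historical figures'. " +
      "Avoid single broad terms such as 'emissions'.",
    parameters: {
      type: "object",
      properties: {
        query: {
          type: "string",
          description: "A specific search query naming the metric, scope, and time period.",
        },
      },
      required: ["query"],
    },
  },
};
```

Putting the guidance in the tool description keeps it next to the place where the model actually writes the query.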

What's next for ESG.audit

Persistent monitoring. Instead of one-off analysis, tracking a company's reports year over year and flagging when claims change, disappear, or go from supported to unsupported.

External signal integration. Cross-referencing report claims against regulatory filings, news, and controversy databases like RepRisk to catch cases where the report says one thing and reality says another.

Multi-document comparison. Benchmarking one company's disclosures against peers in the same sector, so you can see not just whether a report is credible but whether it is above or below industry standard.