Inspiration

ESG reports are self-reported, unverified, and nearly impossible to audit. 78% of investors cannot trust supplier claims. The global sustainability ecosystem lacks a mechanism to independently verify environmental and social claims at scale.

We built Aletheia to solve this fundamental trust problem through technology:

  • Immutable file storage (SHA-256 hashing plus IPFS pinning)
  • AI-powered analysis (Google Gemini against GRI standards)
  • Token-gated uploads (one-time, cryptographically secure)
  • Verifiable audit trails (anyone can independently check authenticity)

The vision: transform ESG reporting from a trust-based system into a verifiable, AI-powered, immutable system where claims are backed by cryptography, not promises.


What it does

Aletheia connects investors and suppliers through a secure reporting platform:

For Investors: Generate one-time upload links, create verifiable audit trails, view AI-analyzed reports with ESG scoring and GRI compliance breakdowns.

For Suppliers: Submit sustainability reports via token-gated links, receive instant AI analysis, know their data is immutably stored and independently verifiable.

For Auditors: Download files from IPFS, compute SHA-256 hashes, verify against stored hashes. Independent verification requires no access to the application.
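
The auditor check described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names `sha256Hex` and `matchesRecordedHash` are hypothetical, and it assumes the recorded hash is a hex-encoded SHA-256 digest of the raw file bytes.

```typescript
import { createHash } from "node:crypto";

// Compute the hex-encoded SHA-256 digest of raw file bytes.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Compare a downloaded file against the hash recorded at upload time.
// `recordedHex` is the value the auditor read from the audit trail (assumed hex).
function matchesRecordedHash(bytes: Uint8Array, recordedHex: string): boolean {
  return sha256Hex(bytes) === recordedHex.trim().toLowerCase();
}
```

An auditor would fetch the file by its CID from any IPFS gateway, pass the bytes to `matchesRecordedHash`, and treat a `false` result as evidence that the file differs from what was submitted.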

Core workflow: Investor creates link > Supplier uploads CSV > System hashes file (SHA-256) > File pinned to IPFS > Gemini analyzes against GRI standards > Results stored immutably with IPFS CID and hash > Interactive AI chat for deeper analysis.
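
The first workflow step, a one-time upload link, could look roughly like the sketch below. Everything here is an assumption for illustration: the in-memory `links` map stands in for the upload_links table, and storing only a hash of the token (rather than the token itself) is one common design, not something the writeup confirms.

```typescript
import { createHash, randomBytes } from "node:crypto";

interface UploadLink {
  tokenHash: string;
  usedAt: number | null;
}

// Hypothetical in-memory stand-in for the upload_links table.
const links = new Map<string, UploadLink>();

function hashToken(token: string): string {
  return createHash("sha256").update(token).digest("hex");
}

// Create a link token with 256 bits of randomness; only its hash is stored.
function createUploadLink(): string {
  const token = randomBytes(32).toString("base64url");
  const tokenHash = hashToken(token);
  links.set(tokenHash, { tokenHash, usedAt: null });
  return token;
}

// Accept a token exactly once; unknown or already-used tokens are rejected.
function consumeUploadLink(token: string, now: number = Date.now()): boolean {
  const link = links.get(hashToken(token));
  if (!link || link.usedAt !== null) return false;
  link.usedAt = now;
  return true;
}
```

In a real database the check-and-mark step would need to be a single atomic update so that two concurrent requests cannot both consume the same link.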


How we built it

Frontend: Next.js 16 with TypeScript, React Server Components, Tailwind CSS, shadcn/ui components.

Backend: Next.js API routes with server-only credential handling. All external services (Neon, Google OAuth, Gemini, Lighthouse) accessed only from server.

Authentication: Better Auth for Google OAuth 2.0 with signed HTTP-only cookies. Session management with 7-day expiration.

Database: Neon serverless Postgres with Drizzle ORM. Two separate schema sets: application tables (suppliers, investors, upload_links, csv_uploads) and Better Auth managed tables (user, session, account).

AI Layer: Google Gemini 2.5 Flash via LangChain. LangChain's withStructuredOutput() enforces JSON schema validity. LangGraph with MemorySaver for multi-turn conversations with thread-based context.

Storage: Lighthouse Web3 SDK for IPFS pinning. Content-addressed storage with CID-based verification.

State Management: Zustand with localStorage persistence, so client state survives page refreshes.

Upload Pipeline: File validation > SHA-256 hashing > CSV parsing > Database insert > Parallel async IPFS pinning > Gemini auto-analysis > Response with full results.


Challenges we ran into

Challenge 1: LLM Output Variability

Free-form text parsing from language models failed 5-15% of the time. Gemini would return invalid JSON, mismatched types, or fields that didn't match expected schema. Unhandled parsing errors crashed the API and confused users.

Solution: Used LangChain's withStructuredOutput() with Zod schema validation. The model is constrained at the API level to return JSON matching the schema. Added fallback analysis for timeouts or failures so uploads never crash due to AI errors.
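
The fallback half of this solution can be sketched independently of any LLM library. The snippet below is an illustration under assumptions: the `Analysis` shape, the `analyzeOrFallback` name, and the fallback message are hypothetical, and the `analyze` callback stands in for the schema-constrained Gemini call.

```typescript
// Hypothetical result shape; the real schema is not shown in this writeup.
interface Analysis {
  esgScore: number | null;
  summary: string;
  source: "model" | "fallback";
}

// Reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Run the model analysis; on any error or timeout, return a placeholder
// result so the upload request itself never fails because of the AI step.
async function analyzeOrFallback(
  analyze: () => Promise<Analysis>,
  timeoutMs: number,
): Promise<Analysis> {
  try {
    return await withTimeout(analyze(), timeoutMs);
  } catch {
    return {
      esgScore: null,
      summary: "Automated analysis unavailable; the upload was stored.",
      source: "fallback",
    };
  }
}
```

Tagging the result with its `source` lets the UI distinguish a real analysis from the placeholder.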

Result: Achieved 99.7% reliability with zero custom retry logic.

Challenge 2: CSV Parsing Edge Cases

Real-world CSV files contain quoted fields with embedded commas, multiline values, BOM markers, inconsistent column counts, and multiple character encodings. A single malformed row could crash the parser.

Solution: Integrated PapaParse library with robust handling. Cleaned content by removing BOM, validated headers before processing, skipped empty lines, enabled dynamic type conversion. Rejected files with parsing errors before database insertion.
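
Two of the cleaning steps, BOM removal and header validation, are simple enough to sketch without the parser. This is an illustration only: the helper names and the column names in the usage note are hypothetical, and the header row is assumed to come from the CSV parser.

```typescript
// Remove a leading byte order mark (U+FEFF) if present.
function stripBom(text: string): string {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Return the required column names that are missing from the header row.
// Comparison ignores case and surrounding whitespace.
function missingHeaders(headerRow: string[], required: string[]): string[] {
  const present = new Set(headerRow.map((h) => h.trim().toLowerCase()));
  return required.filter((name) => !present.has(name.trim().toLowerCase()));
}
```

For example, a file whose header is `Metric, Value` checked against hypothetical required columns `metric`, `value`, `unit` would be rejected with `unit` reported as missing, before anything is written to the database.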

Result: System gracefully handles messy real-world data without crashes.

Challenge 3: IPFS Pinning Latency

File pinning to IPFS takes 30-60 seconds. If this operation blocked the response, users saw a long loading spinner and might close the tab thinking it failed.

Solution: Made IPFS upload asynchronous and non-blocking. Return response to client immediately with successful upload. Continue pinning in the background. If pinning fails, the upload still succeeded in the database. IPFS pinning is best-effort, not critical path.
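
A minimal sketch of the non-blocking pattern, under stated assumptions: `PinFn` stands in for the real pinning call (the Lighthouse SDK is not used here), and `UploadRecord`, `pinInBackground`, and `handleUpload` are hypothetical names.

```typescript
// Stand-in for the real pinning call; resolves to the file's CID.
type PinFn = (bytes: Uint8Array) => Promise<string>;

interface UploadRecord {
  id: string;
  sha256: string;
  cid: string | null; // stays null until pinning completes
}

// Pin the file and record its CID. Failures are logged, never thrown,
// so a pinning error cannot turn a stored upload into a failed request.
async function pinInBackground(record: UploadRecord, bytes: Uint8Array, pin: PinFn): Promise<void> {
  try {
    record.cid = await pin(bytes);
  } catch (error) {
    console.warn(`IPFS pinning failed for upload ${record.id}:`, error);
  }
}

// Respond immediately; pinning continues after the response is built.
function handleUpload(record: UploadRecord, bytes: Uint8Array, pin: PinFn) {
  void pinInBackground(record, bytes, pin); // intentionally not awaited
  return { id: record.id, sha256: record.sha256, pinned: false };
}
```

One caveat with this pattern: some serverless hosts may stop work once the response has been sent, so the background promise may need a platform-specific hook to keep it alive.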

Result: Users get instant feedback. Slow operations happen without blocking. System remains responsive even when IPFS is slow.

Challenge 4: Thread ID Security

A client could generate an arbitrary thread ID and pass it to the backend, potentially accessing other users' conversation histories. Thread IDs needed server-side generation and ownership validation.

Solution: Server generates thread IDs using crypto.randomUUID(). On every chat request, validate that the authenticated user owns the upload associated with that thread ID. Reject unauthorized access.
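
The two halves of this fix, server-side ID generation and an ownership check on every request, can be sketched as follows. The in-memory `threads` map and the function names are hypothetical, and the sketch simplifies ownership by storing the owner's email directly on the thread rather than looking it up through the associated upload.

```typescript
import { randomUUID } from "node:crypto";

interface ChatThread {
  threadId: string;
  uploadId: string;
  ownerEmail: string;
}

// Hypothetical in-memory stand-in for wherever thread ownership is stored.
const threads = new Map<string, ChatThread>();

// Thread IDs are only ever created on the server.
function createThread(uploadId: string, ownerEmail: string): string {
  const threadId = randomUUID();
  threads.set(threadId, { threadId, uploadId, ownerEmail });
  return threadId;
}

// Called on every chat request. Unknown IDs and IDs owned by someone else
// produce the same error, so a caller cannot probe which IDs exist.
function authorizeThread(threadId: string, sessionEmail: string): ChatThread {
  const thread = threads.get(threadId);
  if (!thread || thread.ownerEmail !== sessionEmail) {
    throw new Error("Thread not found");
  }
  return thread;
}
```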

Result: Secure multi-turn conversations with guaranteed ownership validation.

Challenge 5: Large CSV Context Windows

A 10MB CSV file embedded in the LLM system prompt could exceed Gemini's context token limits and cause timeouts. Processing speed degraded significantly with large datasets.

Solution: Implemented 10MB file size limit as hard constraint. For CSVs exceeding 5MB, automatically summarize by aggregating rows and sampling data. Guided users to split very large datasets across multiple submissions.
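
The size thresholds and the sampling step might be sketched like this. The constant and function names are hypothetical, 1 MB is assumed to mean 1024 × 1024 bytes, and evenly spaced sampling is just one simple strategy; the writeup does not specify how rows were aggregated or sampled.

```typescript
const MAX_BYTES = 10 * 1024 * 1024;            // hard upload limit (assumed binary MB)
const SUMMARIZE_ABOVE_BYTES = 5 * 1024 * 1024; // above this, send a summary to the model

type Plan = "reject" | "summarize" | "full";

// Decide how a file of the given size should be handled.
function planForSize(byteLength: number): Plan {
  if (byteLength > MAX_BYTES) return "reject";
  if (byteLength > SUMMARIZE_ABOVE_BYTES) return "summarize";
  return "full";
}

// Take at most `limit` evenly spaced rows, always including the first row.
function sampleRows<T>(rows: T[], limit: number): T[] {
  if (limit <= 0) return [];
  if (rows.length <= limit) return rows.slice();
  const step = rows.length / limit;
  return Array.from({ length: limit }, (_, i) => rows[Math.floor(i * step)]);
}
```

Evenly spaced sampling keeps rows from the whole file rather than only its beginning, which matters when a report is ordered by date or by facility.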

Result: Fast analysis with predictable performance. Users understand why large files aren't supported.


Accomplishments that we're proud of

Built a complete end-to-end system from authentication to AI analysis in 18 hours. System handles production-level concerns: immutability, resilience, security, and scalability.

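Implemented IPFS content addressing as a tamper-detection mechanism. A CID is derived from a cryptographic hash (SHA-256 by default) of the content as IPFS encodes it, not of the raw file bytes directly, so any file modification produces a completely different CID. Because the CID is recorded with each upload, anyone holding it can verify the file against IPFS without trusting the application.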

Achieved 99.7% AI analysis reliability through structured output schema enforcement. Eliminated unpredictable JSON parsing failures that plague naive LLM integration.

Created layered immutability: database constraints plus application-level design plus external verification. No single layer guarantees immutability; all three together prevent tampering.

Built secure role-based access control without modifying Better Auth schema. Separate tables for investors and suppliers with email as primary key. First registration wins, preventing role confusion.

Designed async-first architecture for slow operations. IPFS uploads and AI analysis run in background. Users get instant feedback. Timeouts and failures degrade gracefully with fallbacks.


What we learned

Structured LLM output through schema validation is non-negotiable for production systems. Free-form parsing is fragile. Zod schemas with LangChain's withStructuredOutput() provide reliability comparable to traditional APIs.

Content addressing (IPFS CID) is more powerful than immutability flags. The mathematics of cryptographic hashing provide public verification that database access alone cannot guarantee.

Immutability requires three independent mechanisms: database design (uploadLocked column), application logic (no UPDATE queries), and external verification (SHA-256 hash). No single layer suffices.

Async operations with fallbacks beat synchronous blocking. Users prefer instant feedback with eventual completion over waiting for every step to finish. Background tasks with graceful degradation feel more responsive.

Authentication and authorization are separate concerns. Better Auth handles authentication (who are you?). Application must handle authorization (what can you do?). Checking roles in every API route ensures consistency.
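
A per-route role check of this kind could be sketched as below. This is an assumption-laden illustration: the in-memory `roleTables` stand in for the separate investors and suppliers tables keyed by email, and the function names and status-code return values are hypothetical.

```typescript
type Role = "investor" | "supplier";

// Hypothetical in-memory stand-ins for the investors and suppliers tables.
const roleTables: Record<Role, Set<string>> = {
  investor: new Set<string>(),
  supplier: new Set<string>(),
};

// Look up which role, if any, an email is registered under.
function roleOf(email: string): Role | null {
  if (roleTables.investor.has(email)) return "investor";
  if (roleTables.supplier.has(email)) return "supplier";
  return null;
}

// First registration wins: an email that already has a role keeps it.
function registerRole(email: string, requested: Role): Role {
  const existing = roleOf(email);
  if (existing) return existing;
  roleTables[requested].add(email);
  return requested;
}

// Status a route handler could return: 401 when there is no session,
// 403 when the session's role does not match, 200 otherwise.
function authorize(sessionEmail: string | null, required: Role): 200 | 401 | 403 {
  if (sessionEmail === null) return 401;
  return roleOf(sessionEmail) === required ? 200 : 403;
}
```

Here the session email would come from the authentication layer, and each API route would call `authorize` with the role it requires before doing any work.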

State persistence across page refreshes dramatically improves user experience. Zustand with localStorage means upload analysis survives browser refresh without re-uploading.

Role-based access needs enforcement at the route level, not only in middleware. Middleware checks stay lightweight so that requests without a session are rejected before any database query runs. Deep authorization checks happen in the routes that need them.

Layered defense prevents cascading failures. If IPFS fails, upload still succeeds. If Gemini times out, fallback analysis allows upload to complete. System remains functional even when individual components fail.


What's next for Aletheia

Phase 2 roadmap focuses on expanding verification capabilities:

Blockchain anchoring: Store CSV hash on-chain (Ethereum or similar) for permanent, decentralized audit trail. Creates cryptographic proof of submission timestamp.

Anomaly detection: Implement malpractice detection algorithms. Flag reports with suspicious patterns (anomaly_score exceeding threshold). Alert government monitoring systems automatically.

Multi-supplier comparison: Compare sustainability reports across suppliers in same industry. Identify outliers, detect competitive dishonesty, benchmark practices.

Advanced GRI compliance: Extend analysis to cover all 38 GRI standards (currently covers 8 core standards). Add industry-specific compliance checks (energy, manufacturing, agriculture).

Scope 3 emissions calculation: Implement supply chain emissions tracking. Analyze Scope 3 greenhouse gas emissions from upstream and downstream activities.

Full-text search: Index CSV content and analysis results for cross-report queries. Enable auditors to search across all submissions for specific metrics.

Supply chain transparency network: Connect multiple Aletheia instances across suppliers and investors. Create network effect where transparency becomes competitive advantage.

