Veritas AI — Tracking Misinformation with Autonomous Agents
Veritas AI turns prediction markets into a clear, data-driven signal of public belief. Instead of hunting for sentiment in noisy social chatter, Veritas reads where people put real money on real outcomes, combines that with robust tooling and a retrieval-augmented chat, and gives journalists, researchers, and citizens a trusted way to explore what the crowd actually expects.
Why this project matters
We live inside an information storm. Social platforms amplify rumors and coordinated campaigns, making it difficult to tell which narratives are simply loud and which are actually believed. Prediction markets are different: they attach financial stakes to beliefs, creating a measurable, incentive-aligned signal about what large groups expect to happen. Veritas AI uses that signal to make misinformation — and shifts in public belief — visible and actionable.
Who benefits:
- Journalists who want an evidence-backed view of which narratives are gaining traction.
- Researchers measuring the real-world influence of misinformation campaigns.
- Platform engineers and policymakers who need reliable early-warning signals for narratives that could cause harm.
What Veritas AI does (short)
- Autonomous data pipeline: A crew of specialized AI agents scrapes Polymarket, Manifold, PredictIt (and similar markets), normalizes heterogeneous outputs, and deduplicates records into a clean dataset.
- Interactive RAG chat: The cleaned data is embedded and indexed into FAISS. Users ask natural-language questions in a Streamlit chat and receive answers grounded only in the scraped market data — reducing hallucinations and keeping answers traceable.
Example queries the chat can answer:
- “Summarize markets about the upcoming election.”
- “Which markets about public health have shifted most in the last week?”
- “Are there signs that a rumor is gaining financial backing?”
How it works — architecture at a glance
Part A — CrewAI Autonomous Pipeline (ETL)
- Scraper Agent: Scrapes each source and returns raw JSON records. Each scraper is a focused tool to reduce brittle LLM decisions.
- Aggregator Agent: Merges and normalizes heterogeneous outputs into a canonical schema (`1_combined_raw_data.json`).
- Duplicate Analyzer Agent: Detects near-duplicates and flags them using deterministic rules and heuristics (`2_data_with_duplicates.json`).
- CSV Converter Agent: Exports final cleaned records as `3_final_output.csv` for downstream analysis.
Data flow:
[Scraper Agent] → [Aggregator Agent] → [Duplicate Analyzer] → [CSV Converter]
Each agent is limited to the small, deterministic tools it requires — this prevents “overthinking” and yields predictable, auditable steps.
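The Aggregator Agent's normalization step can be sketched as a small deterministic function that maps each source's raw shape onto one canonical schema. This is an illustrative sketch only — the field names (`outcomePrice`, `lastTradePrice`, `volume_usd`, etc.) and the canonical schema are assumptions, not the project's actual code:

```python
from datetime import datetime, timezone

# Hypothetical canonical schema; the real project's field names may differ.
CANONICAL_FIELDS = ("source", "question", "probability", "volume_usd", "scraped_at")

def normalize_record(source: str, raw: dict) -> dict:
    """Map one raw scraper record onto the canonical schema (deterministic, no LLM)."""
    if source == "polymarket":
        question = raw.get("question", "")
        prob = float(raw.get("outcomePrice", 0.0))
        volume = float(raw.get("volume", 0.0))
    elif source == "manifold":
        question = raw.get("question", "")
        prob = float(raw.get("probability", 0.0))
        volume = float(raw.get("volume", 0.0))
    else:  # e.g. PredictIt-style sources quoting a last trade price
        question = raw.get("name", "")
        prob = float(raw.get("lastTradePrice", 0.0))
        volume = 0.0
    return {
        "source": source,
        "question": question.strip(),
        "probability": prob,
        "volume_usd": volume,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }

rec = normalize_record("manifold", {"question": "Will it rain?", "probability": 0.42, "volume": 100})
print(rec["probability"])  # 0.42
```

Keeping this mapping in plain code (rather than asking an agent to "merge" records freely) is what makes each pipeline step auditable and repeatable.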
Part B — RAG Application (Search + Generate)
- Embeddings: `SentenceTransformers` (`all-MiniLM-L6-v2`) produces compact, high-quality vectors for each cleaned record.
- Index: FAISS stores vectors for sub-800 ms retrieval latency.
- Generator: Google Gemini synthesizes responses using only retrieved documents as context (no free-form web browsing during answer generation).
- UI: Streamlit chat where users ask questions and receive traceable answers.
Retrieve → Generate ensures the model is grounded in data and reduces hallucination risk.
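The retrieve step boils down to cosine similarity over normalized vectors — the same scheme FAISS's inner-product index implements. The toy sketch below stands in for both the real embedding model and FAISS (hand-made unit vectors and a brute-force dot product replace `all-MiniLM-L6-v2` and the index; the market texts are illustrative, not real data):

```python
import numpy as np

def normalize(v):
    """L2-normalize so that inner product equals cosine similarity."""
    v = np.asarray(v, dtype="float32")
    return v / np.linalg.norm(v)

# Pretend-embedded market records (toy vectors standing in for real embeddings).
docs = [
    "Will candidate X win the upcoming election?",
    "Will a new public health advisory be issued this month?",
    "Will the rumored merger be confirmed by Friday?",
]
doc_vecs = np.stack([normalize(v) for v in [[1, 0.2, 0], [0, 1, 0.1], [0.1, 0, 1]]])

def retrieve(query_vec, k=2):
    """Brute-force top-k by inner product, as a FAISS IndexFlatIP would do."""
    sims = doc_vecs @ normalize(query_vec)
    top = np.argsort(-sims)[:k]
    return [(docs[i], float(sims[i])) for i in top]

# "Generate" then means: hand only the retrieved snippets to the LLM as
# context, so every answer stays grounded in scraped market data.
hits = retrieve([1, 0.1, 0])
print(hits[0][0])  # the election market ranks first for this query vector
```

Because generation only ever sees retrieved records, an answer can always be traced back to the specific markets that produced it.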
Key features
- Multi-source scraping (Polymarket, Manifold, PredictIt)
- Schema normalization and error-tolerant parsing
- Deterministic duplicate detection with `is_duplicate` flags
- FAISS-backed retrieval for sub-second searches
- Streamlit conversational UI for easy exploration
- Toolified CrewAI agents for reproducible, debuggable automation
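A deterministic duplicate rule of the kind the Duplicate Analyzer applies can be as simple as normalizing titles and flagging repeats. The specific rule below (lowercase, strip punctuation, exact match on the normalized title) is an illustrative assumption, not the project's actual heuristic; only the `is_duplicate` flag name comes from the feature list above:

```python
import re

def norm_title(title: str) -> str:
    """Lowercase and strip punctuation so trivially reworded titles collide."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def flag_duplicates(records: list[dict]) -> list[dict]:
    """Mark every record whose normalized title was already seen."""
    seen: set[str] = set()
    for rec in records:
        key = norm_title(rec["question"])
        rec["is_duplicate"] = key in seen
        seen.add(key)
    return records

markets = [
    {"question": "Will Candidate X win in 2025?"},
    {"question": "will candidate x win in 2025"},  # same market, different casing/punctuation
    {"question": "Will inflation exceed 4%?"},
]
flagged = flag_duplicates(markets)
print([m["is_duplicate"] for m in flagged])  # [False, True, False]
```

This exact-match rule misses semantic rewordings, which is why the roadmap proposes embedding-based similarity as the next step.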
Performance snapshot
- Processed records: 101
- ETL runtime: ~120 seconds
- Duplicates flagged: 6 (≈ 6% of dataset)
- Retrieval latency: < 800 ms
- Generation latency: ~3.3 s (avg)
- Initial success rate (complex queries): ~25%
- Graceful no-data responses: 100%
These numbers reflect the prototype’s early-stage telemetry and highlight pipeline efficiency and areas for improvement (e.g., success rate on complex queries).
Challenges & solutions
- Agent overthinking: Agents returned summaries instead of raw outputs. Fix: restrict each agent to a single deterministic tool and tighten prompts.
- Data integrity & broken endpoints: One scraper failed due to a deprecated API. Fix: inspected network traffic, reimplemented the scraper with the new GraphQL endpoint.
- RAG retrieval mismatches: Chat sometimes answered "I don't know" despite available data. Fix: improved data-loading and caching logic and aligned the ingest pipeline with the RAG app’s expected schema.
What I’m proud of
- A fully autonomous multi-agent system that runs end-to-end with minimal human intervention.
- A compact, high-performance RAG stack built from FAISS + SentenceTransformers that demonstrates a clear grounding strategy for LLM outputs.
- A polished Streamlit UI that makes backend complexity accessible to non-technical users.
Technical decisions & lessons learned
- Use code for deterministic logic, AI for reasoning. Deterministic ETL steps belong in code; agents should orchestrate and choose tools.
- Specialist agents outperform generalists. Narrow job descriptions reduce hallucination and make failures easier to debug.
- Ground all LLM outputs. Retrieval-first design (embed → index → retrieve → generate) is essential when answers must remain factual and traceable.
Roadmap — what’s next
- More data sources: forums, subreddits, news sites, and markets tied to media outcomes.
- Time-series & alerting: capture market snapshots over time and add automated alerts for sudden belief shifts.
- Advanced duplicate detection: semantic similarity via sentence embeddings to catch reworded but equivalent markets.
- Public deployment: host the Streamlit app in the cloud so journalists and the public can use Veritas AI without local setup.
How to run (quick)
- Clone repo and create a Python 3.10+ virtualenv.
- Add `GOOGLE_API_KEY` to `.env`.
- Install `requirements.txt` (crewai, faiss-cpu, sentence-transformers, langchain-google-genai, pandas, etc.).
- Run the full pipeline:

```bash
python ALL_AGENTS.py
```

- Build the RAG index and run the chat:

```bash
python RAG_Handle.py
```
Outputs: outputs/1_combined_raw_data.json, outputs/2_data_with_duplicates.json, outputs/3_final_output.csv.