Inspiration

We were motivated by the urgent need to counter the spread of misinformation. Our team saw an opportunity to use generative AI to scrutinize and verify facts. We envision a world where documents can be rapidly ingested, dissected, and assessed for factual accuracy, helping users distinguish trusted content from dubious sources.

What it does

Truth Guard processes PDFs by splitting large files into smaller page chunks for easier ingestion and storing each chunk in Snowflake. Using PARSE_DOCUMENT in LAYOUT mode, it captures the text's structure. Statements are then extracted from unverified documents and compared against a trusted, already-verified corpus. If a sufficient portion of the statements is verified, the document is deemed trustworthy and automatically added to the verified corpus; otherwise, it is rejected. A Q&A interface lets users ask questions and get fact-checked answers grounded in the verified corpus.

How we built it

  1. Page-Level Chunking: We use PyPDF2 to split documents into chunks of up to 200 pages, then upload each chunk to a Snowflake stage (short code sketches of each step follow this list).
  2. Structured Parsing: We call SNOWFLAKE.CORTEX.PARSE_DOCUMENT(..., {'mode': 'LAYOUT'}) to capture layout-structured text, making it easier to handle tables and multi-column content.
  3. Text Chunking & Statements: Within Snowflake, we run a Python-based text chunker to split any large text into manageable pieces. We then call a language model (like llama3-70b) to extract discrete statements.
  4. Verification Workflow: For each statement, we query our verified corpus using the Snowflake Cortex Search Service. If the LLM sees enough supporting evidence, the statement is marked “verified”; if the evidence is contradictory or irrelevant, it is marked “contradicted” or “unverified.”
  5. Acceptance Threshold: A simple scoring system checks whether enough statements pass verification. If the document meets or exceeds this threshold, we move it to the verified corpus.
  6. Streamlit App: We built two main pages:
    • Add & Verify Document: Upload a PDF, see it chunked, and watch as statements are scored and either accepted or rejected.
    • Ask a Question (Chat): Interactively query the verified corpus. A chat flow uses Mistral- or Llama-based LLM calls to return thorough, reference-backed answers.
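
Below are short, hedged sketches of each step with placeholder names; the stage, table, service, and column names are illustrative, not our exact ones. First, step 1's page splitting and stage upload, assuming PyPDF2 3.x and a Snowpark session:

```python
# Sketch of step 1: split a PDF into chunks of at most 200 pages with PyPDF2
# and PUT each chunk onto a Snowflake stage. @DOC_STAGE and the file naming
# scheme are placeholder assumptions.
import os
from PyPDF2 import PdfReader, PdfWriter
from snowflake.snowpark import Session

PAGES_PER_CHUNK = 200  # the page-range size we settled on after experimentation

def split_pdf(path: str, out_dir: str) -> list[str]:
    """Write chunk files of at most PAGES_PER_CHUNK pages and return their paths."""
    reader = PdfReader(path)
    base = os.path.splitext(os.path.basename(path))[0]
    chunk_paths = []
    for start in range(0, len(reader.pages), PAGES_PER_CHUNK):
        end = min(start + PAGES_PER_CHUNK, len(reader.pages))
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        chunk_path = os.path.join(out_dir, f"{base}_{start}_{end}.pdf")
        with open(chunk_path, "wb") as f:
            writer.write(f)
        chunk_paths.append(chunk_path)
    return chunk_paths

def upload_chunks(session: Session, chunk_paths: list[str], stage: str = "@DOC_STAGE") -> None:
    """PUT each chunk onto the stage; auto_compress stays off so PARSE_DOCUMENT can read it."""
    for chunk_path in chunk_paths:
        session.file.put(chunk_path, stage, auto_compress=False, overwrite=True)
```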
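
A sketch of the step 2 PARSE_DOCUMENT call in LAYOUT mode; the returned object's content field holds the layout-structured text for the staged chunk:

```python
# Sketch of step 2: parse one staged PDF chunk with PARSE_DOCUMENT in LAYOUT
# mode. relative_path is the chunk's file name on the stage (see step 1);
# the stage name is a placeholder.
def parse_chunk(session, relative_path: str, stage: str = "@DOC_STAGE") -> str:
    """Return the LAYOUT-mode text for one staged PDF chunk."""
    row = session.sql(
        f"""
        SELECT TO_VARCHAR(
                   SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
                       '{stage}',
                       '{relative_path}',
                       {{'mode': 'LAYOUT'}}
                   ):content
               ) AS layout_text
        """
    ).collect()[0]
    return row[0]
```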
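
For step 3, a rough sketch of a sliding-window text chunker and the statement-extraction call; the chunk size, prompt wording, and JSON output contract are assumptions, with llama3-70b as the example model:

```python
# Sketch of step 3: chunk the layout text, then ask the LLM to extract
# discrete factual statements from each chunk.
import json

def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Naive sliding-window chunker; the real chunker runs as Python inside Snowflake."""
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)]

def extract_statements(session, chunk: str) -> list[str]:
    """Ask the LLM to pull discrete factual statements out of one text chunk."""
    prompt = (
        "Extract every discrete factual statement from the text below. "
        "Return a JSON array of strings and nothing else.\n\n" + chunk
    )
    raw = session.sql(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE('llama3-70b', ?)", params=[prompt]
    ).collect()[0][0]
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to line splitting when the model ignores the JSON instruction
        return [line.strip("-• ").strip() for line in raw.splitlines() if line.strip()]
```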
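
Steps 4 and 5 in one sketch: each statement is checked against the verified corpus through a Cortex Search Service, the LLM issues a verdict, and a simple ratio decides acceptance. The database, schema, service, and column names are placeholders, and 0.6 is just one value from the 50-70% range we tried:

```python
# Sketch of steps 4-5: retrieve evidence for a statement from the verified
# corpus, ask the LLM for a verdict, and accept the document if enough
# statements come back "verified".
from snowflake.core import Root

ACCEPTANCE_THRESHOLD = 0.6  # one value from the 50-70% range we experimented with

def verify_statement(session, statement: str) -> str:
    """Return 'verified', 'contradicted', or 'unverified' for a single statement."""
    svc = (
        Root(session)
        .databases["TRUTH_GUARD_DB"]          # placeholder names
        .schemas["PUBLIC"]
        .cortex_search_services["VERIFIED_CORPUS_SEARCH"]
    )
    hits = svc.search(query=statement, columns=["chunk_text"], limit=5).results
    evidence = "\n".join(hit["chunk_text"] for hit in hits)
    prompt = (
        "Given the EVIDENCE below, reply with exactly one word (verified, "
        "contradicted, or unverified) describing the STATEMENT.\n\n"
        f"EVIDENCE:\n{evidence}\n\nSTATEMENT:\n{statement}"
    )
    verdict = session.sql(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE('llama3-70b', ?)", params=[prompt]
    ).collect()[0][0]
    return verdict.strip().lower()

def document_is_trusted(session, statements: list[str]) -> bool:
    """Accept the document if enough of its statements pass verification."""
    if not statements:
        return False
    verified = sum(verify_statement(session, s) == "verified" for s in statements)
    return verified / len(statements) >= ACCEPTANCE_THRESHOLD
```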
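
Finally, a pared-down version of the step 6 “Ask a Question (Chat)” page, assuming the app runs as Streamlit in Snowflake and reusing the same placeholder search-service names; mistral-large stands in for whichever Mistral or Llama model the chat flow calls:

```python
# Sketch of step 6: a minimal Streamlit chat page that retrieves verified
# chunks for the user's question and asks the LLM for a referenced answer.
import streamlit as st
from snowflake.core import Root
from snowflake.snowpark.context import get_active_session

session = get_active_session()  # assumes Streamlit in Snowflake

def answer_from_corpus(question: str) -> str:
    """Retrieve verified chunks relevant to the question, then ask the LLM for a cited answer."""
    svc = (
        Root(session)
        .databases["TRUTH_GUARD_DB"]          # placeholder names
        .schemas["PUBLIC"]
        .cortex_search_services["VERIFIED_CORPUS_SEARCH"]
    )
    hits = svc.search(query=question, columns=["chunk_text", "doc_name"], limit=5).results
    context = "\n\n".join(f"[{hit['doc_name']}] {hit['chunk_text']}" for hit in hits)
    prompt = (
        "Answer the question using only the verified context below and cite the "
        f"documents you relied on.\n\nCONTEXT:\n{context}\n\nQUESTION:\n{question}"
    )
    return session.sql(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-large', ?)", params=[prompt]
    ).collect()[0][0]

st.title("Truth Guard: Ask a Question")

if "history" not in st.session_state:
    st.session_state.history = []

# Replay the conversation so far
for role, text in st.session_state.history:
    st.chat_message(role).write(text)

if question := st.chat_input("Ask a question about the verified corpus"):
    st.chat_message("user").write(question)
    answer = answer_from_corpus(question)
    st.chat_message("assistant").write(answer)
    st.session_state.history.extend([("user", question), ("assistant", answer)])
```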

Challenges we ran into

  • Chunking & Page Splits: Determining the right page-range size (200 pages) and cleaning up partial pages required experimentation.
  • Layout-Mode Parsing: Some PDFs with unusual layouts or tables didn’t parse as cleanly, forcing us to handle edge cases.
  • Scoring Thresholds: Deciding what fraction of verified statements is sufficient for trusting a document was nontrivial—we experimented with 50-70% cutoffs.
  • Statement Extraction Accuracy: The LLM sometimes under- or over-segmented statements. We refined prompts to improve reliability.
  • End-to-End Integration: Debugging across multiple tables and stages, and wiring Streamlit to the Snowflake Cortex calls, was a complex process.

Accomplishments that we're proud of

  • Fully Automated Misinformation Check: We integrated a streamlined workflow that can reject entire documents if enough statements fail verification.
  • Leveraging Layout Data: By using PARSE_DOCUMENT in LAYOUT mode, we can better preserve textual structures and potentially handle complex formats (like tables and columns).
  • User-Friendly Chat Interface: The Q&A page is both easy to use and capable of citing relevant chunks, ensuring transparency behind each answer.

What we learned

  • Modular Chunking Matters: Combining local page splits (PyPDF2) with layout-based chunking inside Snowflake gave us a much more robust ingestion pipeline.
  • Structured vs. OCR Text: LAYOUT mode drastically helps with more advanced documents, but we still need fallback strategies for messy PDFs.
  • Careful Prompts for Statement Extraction: Minor changes in the LLM prompt can significantly affect how accurately statements are segmented.
  • Scoring & Governance: Even advanced LLM-based checks require clear acceptance criteria—overly lenient thresholds introduce risk, while overly strict ones might discard good documents.

What's next for Truth Guard

  • Fine-Grained Statement Verification: We plan to develop a more refined scoring system that weighs each statement by importance or context, rather than simple tallies.
  • Support for More Domains: We want to broaden the verified corpus to include scientific, legal, or other specialized texts.
  • Enhanced Chat Context Memory: Store conversation history so users can follow up on previously answered queries without losing context.
  • User Feedback Mechanism: Let domain experts or power users override or confirm LLM verdicts, improving the system’s accuracy over time.

Built With

  • Python (PyPDF2, Streamlit)
  • Snowflake (Cortex PARSE_DOCUMENT, Cortex LLM completions, Cortex Search Service)