Inspiration
Our team chose this project out of curiosity about technology in healthcare, and we were ready to tackle whatever difficult problems Anton Rx presented. A real-world problem with a solution waiting to be found never disappoints us; that is what brought us together to work on this project. Through it, we envision an AI-powered, centralized information database that expands automatically and helps health plan consultants look up health plans and drug policies efficiently.
What it does
Sift is an end-to-end medical benefit drug policy intelligence pipeline. Given a drug name, Sift automatically searches major payer sources, including Cigna, UnitedHealthcare, and EmblemHealth, downloads the relevant policy PDFs, and runs them through an AI extraction pipeline. Users can also upload a policy PDF directly if they already have it. Either way, Sift classifies the document, selects the most relevant pages, and extracts structured data covering covered indications, prior authorization criteria, step therapy requirements, site-of-care restrictions, dosing limits, and policy metadata. That structured output is normalized into a consistent JSON schema and stored in a queryable database, giving users a unified, searchable view of medical benefit drug coverage across payers, all from a single interface.
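To make the "consistent JSON schema" concrete, here is an illustrative sketch of the kind of normalized record the pipeline produces. The field names and values below are hypothetical examples, not Sift's actual schema:

```python
import json

# Hypothetical example of a normalized policy record.
# Field names are illustrative; the real schema is defined by
# the extraction prompt described in "How we built it".
record = {
    "drug_name": "ExampleDrug",
    "payer": "Cigna",
    "policy_effective_date": "2024-01-01",
    "prior_authorization_required": True,
    "covered_indications": [
        {"indication": "Condition A", "age_minimum": 18},
    ],
    "step_therapy": [
        {"step": 1, "required_drug": "Generic B"},
    ],
    "site_of_care_restriction": None,  # null when the policy is silent
    "dosing_limits": [
        {"max_dose": "100 mg", "interval": "every 4 weeks"},
    ],
}

print(json.dumps(record, indent=2))
```

Keeping every payer's policy in one shape like this is what makes cross-payer querying possible downstream.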
How we built it
We built Sift as a modular Python pipeline with a web-based frontend. For document retrieval, we use DuckDuckGo search with payer-specific query templates, and Playwright-based browser interception for portals like EmblemHealth that serve policies through authenticated JavaScript APIs. PDF text is extracted page by page using pdfplumber, with keyword-aware page selection that adapts based on whether the document covers a single drug or an entire formulary. AI extraction is handled by Gemini 2.5 Flash, guided by a strict 20-rule schema prompt that enforces consistent field naming, type normalization, and null handling across wildly different source formats. Extracted records are written into a normalized SQLite database with separate tables for policies, indications, step therapy steps, and dosing limits, with upsert logic to handle re-ingestion cleanly. A unified pipeline orchestrator ties the full flow together from retrieval through storage.
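The upsert logic mentioned above can be sketched with SQLite's `INSERT ... ON CONFLICT` clause. Table and column names here are illustrative, not Sift's actual schema:

```python
import sqlite3

# Minimal sketch of upsert-based re-ingestion with SQLite.
# Schema is hypothetical; Sift's real database has separate tables
# for policies, indications, step therapy steps, and dosing limits.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE policies (
        payer TEXT NOT NULL,
        drug_name TEXT NOT NULL,
        effective_date TEXT,
        PRIMARY KEY (payer, drug_name)
    )
    """
)

def upsert_policy(conn, payer, drug, effective_date):
    # ON CONFLICT updates the existing row instead of raising an
    # integrity error, so re-running the pipeline on the same
    # document is idempotent.
    conn.execute(
        """
        INSERT INTO policies (payer, drug_name, effective_date)
        VALUES (?, ?, ?)
        ON CONFLICT(payer, drug_name)
        DO UPDATE SET effective_date = excluded.effective_date
        """,
        (payer, drug, effective_date),
    )

upsert_policy(conn, "Cigna", "ExampleDrug", "2024-01-01")
upsert_policy(conn, "Cigna", "ExampleDrug", "2024-06-01")  # re-ingestion
rows = conn.execute("SELECT effective_date FROM policies").fetchall()
print(rows)  # one row, carrying the newer date
```

The payoff is that ingesting an updated copy of a policy replaces the old record in place rather than creating duplicates.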
Challenges we ran into
- AI token usage was a major constraint: our plan's per-minute token limit prevented us from parsing large PDF files in a single request. We worked around this by building a page selection layer that intelligently narrows down which pages to send to the model, reducing token usage while preserving extraction quality.
- Semantic analysis across payers was also a challenge, as we struggled to recognize terminology that varies between plans. Different health plans use different names for the same concept (prior authorization, coverage determination, clinical policy bulletin) and structure their criteria in completely different ways. We addressed this through careful prompt engineering and a rigid output schema that forces the model to normalize during extraction rather than leaving it to post-processing.
- Unreliable search results from web queries added friction to the retrieval step, particularly for payers that don't expose clean PDF links. We handled this with retry logic, PDF header validation, and payer-specific retrieval modules for portals that require browser-level interaction.
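The retry-and-validate approach in the last bullet can be sketched as follows. `fetch` here is a stand-in for the real HTTP call, and the function names are hypothetical:

```python
# Sketch of retrieval hardening: retry a download and verify the
# response is actually a PDF before accepting it. fetch() stands in
# for the real HTTP client call.

def looks_like_pdf(data: bytes) -> bool:
    # Every valid PDF file begins with the "%PDF-" magic bytes.
    return data.startswith(b"%PDF-")

def fetch_pdf(fetch, url, max_attempts=3):
    for _ in range(max_attempts):
        data = fetch(url)
        if data is not None and looks_like_pdf(data):
            return data
    return None  # caller can fall back to a payer-specific module

# Simulated flaky source: the first response is an HTML error page,
# the second is a real PDF payload.
responses = iter([b"<html>error</html>", b"%PDF-1.7 ..."])
result = fetch_pdf(lambda url: next(responses), "https://example.com/policy.pdf")
print(result is not None)
```

Checking the `%PDF-` header catches the common failure mode where a search result resolves to an HTML landing page instead of the document itself.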
Accomplishments that we're proud of
We are proud of solving the normalization problem and of parsing PDFs efficiently. The normalization pipeline actually works across structurally different documents from different payers, which was the hardest part of the problem and the most valuable to get right. We are also proud of the two-input design: supporting both drug-name search and direct PDF upload in the same pipeline makes Sift flexible enough for real-world workflows where not every document comes from a clean web source. Throughout, we aimed for a clean, easy-to-use, high-performance application with a database built for future scalability, and we verify what the AI finds on the internet with our own validation logic to ensure correctness end to end.
What we learned
- We learned to break the problem down into its smallest parts: start with small steps, cover small areas, then scale up one piece at a time to maintain correctness.
- We learned to use an AI model effectively through careful prompting instead of hand-writing complex parsing algorithms. Getting consistent, schema-valid JSON from a language model across dozens of document formats required precise instructions, explicit rules for edge cases, and careful handling of nulls, booleans, and nested structures.
- We learned that the hardest problems in healthcare data are not technical; they are definitional. Deciding what "prior authorization required" means when a document implies it rather than states it forces judgment calls that an AI alone cannot make without clear guidance baked into the schema.
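Even with normalization pushed into the prompt, a light validation pass can catch stragglers like `"Yes"` or `"null"` arriving as strings. A minimal sketch, with a hypothetical helper and illustrative field names:

```python
# Hypothetical safety-net normalization for flag fields coming back
# from the model: coerce string variants into real booleans or None.

def normalize_flag(value):
    # Treat empty/placeholder strings as missing data.
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in {"", "null", "n/a"}:
        return None
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        # Common affirmative spellings seen across payer documents.
        return value.strip().lower() in {"yes", "true", "required"}
    return bool(value)

raw = {"prior_authorization_required": "Yes", "step_therapy_required": "null"}
clean = {key: normalize_flag(val) for key, val in raw.items()}
print(clean)
```

This kind of check turns the "definitional" judgment calls above into explicit, testable rules rather than silent model behavior.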
What's next for Sift
The immediate next steps are expanding payer coverage to include Blue Cross Blue Shield plans, Aetna, and additional regional payers. We also want to add change detection so users are automatically alerted when a policy they track is updated. On the query side, we plan to add side-by-side payer comparison views and natural language querying over the database, so a user can ask "which plans cover Drug X with no step therapy requirement?" and get a direct answer. Longer term, we see Sift becoming the foundation for a broader medical benefit intelligence platform, one that tracks policy changes over time, surfaces coverage gaps, and helps health plan consultants make faster, better-informed decisions.
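As a sketch of what the planned natural-language querying could compile down to, here is the kind of SQL a question like "which plans cover Drug X with no step therapy requirement?" might translate into. The schema and data are illustrative, not Sift's actual tables:

```python
import sqlite3

# Hypothetical mini-schema: a policies table plus a child table of
# step therapy steps, mirroring the normalized layout described above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE policies (id INTEGER PRIMARY KEY, payer TEXT, drug_name TEXT);
    CREATE TABLE step_therapy_steps (policy_id INTEGER, step INTEGER);
    INSERT INTO policies VALUES (1, 'Cigna', 'DrugX'), (2, 'EmblemHealth', 'DrugX');
    INSERT INTO step_therapy_steps VALUES (2, 1);
    """
)

# "Which plans cover DrugX with no step therapy requirement?"
rows = conn.execute(
    """
    SELECT p.payer
    FROM policies p
    WHERE p.drug_name = 'DrugX'
      AND NOT EXISTS (
          SELECT 1 FROM step_therapy_steps s WHERE s.policy_id = p.id
      )
    """
).fetchall()
print(rows)  # payers covering DrugX with no step therapy steps on file
```

Because the extraction step already normalizes policies into relational tables, answering questions like this becomes a query problem rather than a document-reading problem.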