Needle

Inspiration

Every developer has asked an AI for help with a library and gotten a confident, detailed, completely wrong answer. The model's training data is stale, the APIs have changed, and it fills the gaps with plausible hallucinations. Research from METR measured a 19% slowdown from AI-assisted coding - largely from debugging fabricated APIs and verifying suspect answers. For a 6-person team, that's nearly 23 hours lost per week.

We wanted something that could answer questions about any library - undocumented, bleeding-edge, internal, or just moving too fast for docs to keep up - by reading the actual source code instead of guessing from model memory.

What it does

Needle MCP is an MCP server that answers questions about libraries by grounding every answer in their actual source code. Give it an npm package, git repo, or local directory and a question. It resolves the resource, discovers the real API surface, searches the source with ripgrep, and synthesizes an answer using Amazon Nova Premier on AWS Bedrock - returning structured JSON with line-level citations.

Key capabilities:

Multi-model pipeline: Nova Lite handles query expansion, evidence reranking, and gap analysis; Nova Premier handles final synthesis
Iterative two-pass evidence collection: search, rerank, deep-read, follow imports, gap analysis, then a second targeted pass
Citation Integrity Guard: every citation is validated against real files and line ranges - fabricated references fail the response
Snippet Verification Engine: executes generated code in a sandboxed temp project with auto-imports, syntax repair, and multi-pass symbol resolution
Quality gates: prevents synthesis when evidence is too thin, returning low-confidence fallbacks with suggested follow-up queries instead
Real-time dashboard: SSE-powered web UI for watching the full pipeline execute live, with bundled demo presets including a deliberately difficult hard-mode target
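
A minimal sketch of the kind of check the Citation Integrity Guard performs, assuming citations carry a resource-relative path and a 1-indexed inclusive line range. The `Citation` shape, the function names, the in-memory `lineCounts` map, and the 0.5 downgrade factor are all hypothetical and used only for illustration; a real implementation would more likely resolve paths with `node:path` and read line counts from disk.

```typescript
// Hypothetical shapes: the write-up does not show Needle's real types.
interface Citation {
  path: string;      // path relative to the resolved resource root
  startLine: number; // 1-indexed, inclusive
  endLine: number;   // 1-indexed, inclusive
}

type Verdict = { ok: true } | { ok: false; reason: string };

// Collapse "." and ".." segments; return null if the path escapes the root.
function normalizeInsideRoot(relPath: string): string | null {
  if (relPath.startsWith("/")) return null; // absolute paths are rejected
  const out: string[] = [];
  for (const seg of relPath.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") {
      if (out.length === 0) return null; // would climb above the root
      out.pop();
    } else {
      out.push(seg);
    }
  }
  return out.length > 0 ? out.join("/") : null;
}

// lineCounts maps each known file (relative path) to its number of lines.
function validateCitation(c: Citation, lineCounts: Map<string, number>): Verdict {
  const normalized = normalizeInsideRoot(c.path);
  if (normalized === null) return { ok: false, reason: "path escapes resource root" };
  const total = lineCounts.get(normalized);
  if (total === undefined) return { ok: false, reason: "file not found in resource" };
  const inBounds =
    Number.isInteger(c.startLine) && Number.isInteger(c.endLine) &&
    c.startLine >= 1 && c.startLine <= c.endLine && c.endLine <= total;
  return inBounds ? { ok: true } : { ok: false, reason: "line range out of bounds" };
}

// Drop invalid citations; an answer with no valid citation left fails with confidence 0.
function guardCitations(
  citations: Citation[],
  lineCounts: Map<string, number>,
  confidence: number,
) {
  const valid = citations.filter((c) => validateCitation(c, lineCounts).ok);
  if (valid.length === 0) return { valid, confidence: 0, failed: true };
  // Assumed penalty: halve confidence when any citation had to be dropped.
  const downgraded = valid.length < citations.length ? confidence * 0.5 : confidence;
  return { valid, confidence: downgraded, failed: false };
}
```

Treating an answer with no surviving citations as a failure, rather than serving it uncited, is the design choice that keeps fabricated references from reaching the client.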

How we built it

The core is a TypeScript MCP stdio server with an iterative evidence pipeline:

Resource resolution - npm packages via pacote, shallow git clones with #ref, or local directories
API surface discovery - reads package.json entrypoints (main, module, types, exports), scans barrel exports and .d.ts files to build a shortlist of real exported symbols
Query generation - four sources: discovered symbols, code-like tokens from the question, structural intent patterns, and Nova Lite semantic expansion. Generic filler words are filtered before hitting ripgrep
Evidence scoring - hits ranked by file path (.d.ts +5, src/ +4, node_modules/ -10), then Nova Lite reranks for question relevance
Deep reading - top files are read in full, relative imports are followed, and oversized files get chunk extraction around evidence hits
Gap analysis - Nova Lite identifies what's still missing, then a second targeted ripgrep pass fills the gaps
Synthesis - Nova Premier generates the answer with citations, validated through Zod schemas with automatic retry and repair
Citation validation - paths must resolve inside the target resource and line ranges must be in-bounds. Invalid citations are dropped and confidence is downgraded; fully invalid answers fail safely
Optional verification - generated snippets run in a sandboxed temp project with three modes: direct execution, assisted import recovery, and nested module resolution. Non-JS snippets and CLI examples are skipped with explicit notes
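
The file-path heuristic from the evidence scoring step can be sketched as follows. The three weights (.d.ts +5, src/ +4, node_modules/ -10) are the ones quoted above; the `Hit` shape, the function names, and the assumption that the weights add up when several rules match are illustrative only.

```typescript
// Hypothetical hit shape, loosely modeled on one search match.
interface Hit {
  path: string;
  line: number;
  text: string;
}

// Score a path using the weights quoted in the text.
// Assumption: weights are additive when more than one rule matches.
function pathScore(filePath: string): number {
  const p = filePath.replace(/\\/g, "/"); // normalize Windows separators
  let score = 0;
  if (p.endsWith(".d.ts")) score += 5;
  if (p.startsWith("src/") || p.includes("/src/")) score += 4;
  if (p.startsWith("node_modules/") || p.includes("/node_modules/")) score -= 10;
  return score;
}

// Return hits ordered from highest to lowest path score.
function rankHits(hits: Hit[]): Hit[] {
  // Copy before sorting so the caller's array is left untouched.
  return [...hits].sort((a, b) => pathScore(b.path) - pathScore(a.path));
}
```

In this sketch the path score only orders candidates; question relevance would still come from the later Nova Lite rerank.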

The dashboard is a bundled SSE-powered web UI with demo presets backed by deliberately undocumented local libraries (knot-machine with SCC/Tarjan control flow, quiet-router, pulse-cache, framepack) for realistic stress-testing. It also includes a judge/control preset for @anthropic-ai/sdk@0.78.0 so reviewers can compare against a mainstream package.

The benchmark suite compares Needle against an ungrounded baseline (GPT-5.3 Codex or Nova Premier direct) using 8 deterministic cases with hand-written ground-truth facts per test case.
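
To make the benchmark metrics concrete, here is a rough sketch of how fact coverage and a correctness rate at a 75% threshold could be computed. The `BenchmarkCase` shape and the case-insensitive substring match are assumptions; the write-up does not say how the real harness matches ground-truth facts.

```typescript
// Hypothetical case shape: one model answer plus hand-written ground-truth facts.
interface BenchmarkCase {
  answer: string;
  facts: string[];
}

// Fraction of ground-truth facts that appear in the answer.
// Assumption: a fact counts as covered on a case-insensitive substring match.
function factCoverage(c: BenchmarkCase): number {
  if (c.facts.length === 0) return 0;
  const haystack = c.answer.toLowerCase();
  const found = c.facts.filter((f) => haystack.includes(f.toLowerCase())).length;
  return found / c.facts.length;
}

// Average coverage, and the share of cases at or above the threshold.
function summarize(cases: BenchmarkCase[], threshold = 0.75) {
  if (cases.length === 0) return { average: 0, correctRate: 0 };
  const coverages = cases.map(factCoverage);
  const average = coverages.reduce((a, b) => a + b, 0) / cases.length;
  const correctRate = coverages.filter((v) => v >= threshold).length / cases.length;
  return { average, correctRate };
}
```

With 8 cases, each case contributes 12.5 percentage points to the correctness rate, which is why rates such as 12.5% and 37.5% appear in the results.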

Challenges we ran into

Hallucinated line numbers: Even with strong evidence, Nova would sometimes fabricate citations. We built the Citation Integrity Guard - paths must resolve inside the target, line ranges are bounds-checked, and fully invalid answers fail with confidence 0 instead of being served with false authority.
Query generation noise: Early versions used naive keyword extraction and matched large amounts of irrelevant code. The four-source approach (symbols + tokens + intent patterns + Nova Lite expansion) with generic word filtering dramatically improved evidence precision.
Large packages: npm packages like zod have 50K+ line type definition files. We implemented file-size limits, deep-read budgets, chunk extraction around evidence hits, and import-following to stay within context windows without losing critical context.
Verification edge cases: Generated snippets often omit imports, use TypeScript annotations in JS contexts, or have malformed object literals. The verifier needed three execution modes (direct, assisted, resolved), TypeScript annotation stripping, syntax repair for missing braces, and explicit skip reasons for non-JS content.
Evidence depth vs. breadth: A single-pass search misses too much. The iterative approach - search, rerank, deep-read, follow imports, gap analysis, second pass - catches significantly more relevant evidence, which directly improves synthesis quality.
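
Chunk extraction around evidence hits, as used for oversized files, can be sketched like this. The `Chunk` shape, the function name, and the default of 20 context lines are assumptions for illustration, not values taken from Needle.

```typescript
// Hypothetical chunk shape; line numbers are 1-indexed and inclusive.
interface Chunk {
  startLine: number;
  endLine: number;
  text: string;
}

// Keep only windows of `context` lines around each hit, merging windows
// that overlap or touch so no line is emitted twice.
function extractChunks(source: string, hitLines: number[], context = 20): Chunk[] {
  const lines = source.split("\n");
  const ranges = hitLines
    .filter((n) => n >= 1 && n <= lines.length) // ignore out-of-range hits
    .sort((a, b) => a - b)
    .map((n): [number, number] => [
      Math.max(1, n - context),
      Math.min(lines.length, n + context),
    ]);

  const merged: [number, number][] = [];
  for (const [start, end] of ranges) {
    const last = merged[merged.length - 1];
    if (last && start <= last[1] + 1) {
      last[1] = Math.max(last[1], end); // extend the previous window
    } else {
      merged.push([start, end]);
    }
  }

  return merged.map(([s, e]) => ({
    startLine: s,
    endLine: e,
    text: lines.slice(s - 1, e).join("\n"),
  }));
}
```

Keeping the original line numbers on each chunk matters here, since citations are later validated against real line ranges in the full file.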

Accomplishments that we're proud of

85% reduction in expected time to first correct answer vs baseline (27.8s vs 185.1s)
53.1% average fact coverage vs 21.9% for baseline - about 2.4x the baseline's coverage
100% citation validity - every citation points to real code at real line numbers, vs 0% for baseline
3x the baseline's correctness rate (37.5% vs 12.5% at a 75% fact-coverage threshold)
6.6x speed multiplier to first correct answer
31.5 minutes saved per engineer per week, 189 minutes for a 6-person team
Fully reproducible benchmark suite - ./benchmark runs everything from scratch, generates reports and SVG charts
Works with any MCP client - Claude Desktop, Codex CLI, or anything supporting the protocol
Bundled hard-mode demo resources with minified multi-file internals for realistic stress-testing

What we learned

Grounding answers in source code is dramatically more effective than relying on training data for fast-moving or undocumented libraries
Citation validation is non-negotiable - without it, high confidence scores are meaningless and dangerous
Iterative evidence collection with gap analysis consistently catches evidence that single-pass search misses
File-path scoring is a simple but powerful heuristic - src/ files are almost always more relevant than README.md, and penalizing node_modules/ eliminates massive amounts of noise
The quality gate (minimum code hits before synthesis) did more to prevent hallucination in our testing than prompt engineering did
Multi-model pipelines work well - Nova Lite is fast and cheap for evidence tasks, Nova Premier brings the reasoning power for synthesis
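
As an illustration of the quality gate mentioned above, a minimal sketch might look like the following. The `Evidence` and `GateResult` shapes, the default of 3 minimum code hits, and the wording of the suggested follow-up queries are all hypothetical; the write-up only states that synthesis is skipped when code evidence is too thin.

```typescript
// Hypothetical evidence item: isCode marks hits in source files rather than docs.
interface Evidence {
  path: string;
  isCode: boolean;
}

type GateResult =
  | { proceed: true }
  | { proceed: false; confidence: "low"; suggestedQueries: string[] };

// Block synthesis unless enough code hits were collected.
// Assumption: the threshold of 3 is a placeholder, not Needle's real value.
function qualityGate(
  evidence: Evidence[],
  symbols: string[],
  minCodeHits = 3,
): GateResult {
  const codeHits = evidence.filter((e) => e.isCode).length;
  if (codeHits >= minCodeHits) return { proceed: true };

  // Fall back to follow-up suggestions built from discovered symbols.
  const suggestedQueries = symbols
    .slice(0, 3)
    .map((s) => `Where is ${s} defined and how is it used?`);
  return { proceed: false, confidence: "low", suggestedQueries };
}
```

Because the gate runs before any synthesis call, a thin-evidence question costs no Nova Premier request and never produces an answer that merely looks well supported.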

What's next for Needle

Streaming synthesis - show answers as they generate instead of waiting for the full response
Cross-package queries - answer questions spanning multiple dependencies in a project
Incremental indexing - cache API surface discovery and evidence for repeated queries against the same resource
TypeScript type-checking verification - validate generated code at the type level, not just runtime execution
Community benchmark contributions - open the benchmark harness for community-submitted test cases and ground-truth facts
More resource types - PyPI packages, crates.io, Go modules

Built With

aws-bedrock
jsonrepair
mcp-sdk
node.js
pacote
pnpm
ripgrep
typescript
vitest
zod

Updates

Lewis Allegedly started this project — Mar 15, 2026 12:47 PM EDT
