Pudding

Inspiration

"Need: AI that scans PR #1024 and comments: 'This is a duplicate of PR #20' Kinda hard sometimes to find who fixed a bug or implemented a feature first and give attribution." ~ Shadcn.

Open source maintainers are fighting a losing battle against people who make slight modifications to an existing PR, reimplement it, or, worse, duplicate it outright. This cannot continue, and Pudding is built to solve this problem.

What it does

Pudding is a Gemini-powered assistant that identifies duplicate PRs by understanding their intent, not just by comparing code diffs. It first uses Bag-of-Words embeddings and a Jaccard file overlap filter to narrow thousands of PRs down to a few candidates in milliseconds. On those few candidates, Gemini extracts structured intent (problem, component, behavior, code approach) and performs a semantic comparison to verify whether the PRs are duplicates. The verdict is then posted as a comment.

How we built it

We built a funnel-based architecture:

Stages 1-2 (Fast Local Filters): Bag-of-Words embeddings and a Jaccard file overlap filter narrow thousands of PRs down to a few candidates in milliseconds.

Stages 3-5 (LLM Reasoning): Gemini 3 extracts structured intent (problem, component, behavior, code approach) and performs semantic comparison on the remaining candidates.
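The file overlap part of the local filter can be sketched as a Jaccard index over the sets of changed file paths. This is a minimal illustration under stated assumptions, not the project's actual code: the PullRequest shape, the function names jaccard and filterByFileOverlap, and the 0.2 threshold are all hypothetical.

```typescript
// Hypothetical PR shape: only the fields this sketch needs.
interface PullRequest {
  number: number;
  files: string[]; // paths touched by the PR
}

// Jaccard index: |A ∩ B| / |A ∪ B|, defined here as 0 when both sets are empty.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let intersection = 0;
  setA.forEach((path) => {
    if (setB.has(path)) intersection++;
  });
  const union = setA.size + setB.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

// Keep only PRs whose changed files overlap enough with the target PR.
// The 0.2 default threshold is an illustrative assumption.
function filterByFileOverlap(
  target: PullRequest,
  candidates: PullRequest[],
  threshold = 0.2,
): PullRequest[] {
  return candidates.filter(
    (pr) => pr.number !== target.number && jaccard(target.files, pr.files) >= threshold,
  );
}
```

Because this step only compares small sets of strings, it is cheap enough to run against every open PR before any LLM call is made.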

Challenges we ran into

Mock embeddings producing garbage: The initial Math.random() vector generation produced embeddings unrelated to the text, so similarity scores carried no signal. We replaced it with a deterministic Bag-of-Words embedding (the hashing trick) so that similar text yields similar vectors.
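A minimal sketch of a hashing-trick Bag-of-Words embedding, to show why it is deterministic where Math.random() is not. The hash function (an FNV-1a style hash over UTF-16 code units), the 256-dimension default, the tokenizer, and the names fnv1a, embed and cosine are assumptions made for illustration.

```typescript
// Deterministic 32-bit string hash (FNV-1a style). Any stable hash would do.
function fnv1a(token: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < token.length; i++) {
    hash ^= token.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Hashing trick: every token increments the bucket its hash maps to.
function embed(text: string, dims = 256): number[] {
  const vector = new Array<number>(dims).fill(0);
  const tokens = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  for (const token of tokens) {
    vector[fnv1a(token) % dims] += 1;
  }
  return vector;
}

// Cosine similarity, defined here as 0 when either vector is all zeros.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}
```

The same text always lands in the same buckets, so two PR descriptions that share words share non-zero dimensions. The trade-off is that unrelated tokens can collide in one bucket, which a larger dimension count makes less likely.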

Gemini JSON response inconsistency: Gemini sometimes returned JSON wrapped in markdown code blocks instead of raw JSON. We had to implement regex stripping and fallback parsing, with JSON.parse() wrapped in try-catch.
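One way the stripping and fallback parsing could look, as a sketch only. The function name parseModelJson and the exact pattern are assumptions. The fence marker is built at runtime purely so that this sample does not contain a literal code fence.

```typescript
// Three backticks, built at runtime so the sample contains no literal fence.
const FENCE = "`".repeat(3);

// Matches a fenced block with an optional "json" tag and captures its body.
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Try the raw reply first, then the body of a fenced block if one is present.
// Returns null instead of throwing when nothing parses.
function parseModelJson(raw: string): unknown {
  const candidates = [raw.trim()];
  const fenced = raw.match(FENCED_BLOCK);
  if (fenced) candidates.push(fenced[1].trim());

  for (const candidate of candidates) {
    try {
      return JSON.parse(candidate);
    } catch {
      // Not valid JSON; fall through to the next candidate.
    }
  }
  return null;
}
```

Returning null keeps the failure explicit at the call site, at the cost of not distinguishing a parse failure from a reply that is literally the JSON value null.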

TypeScript generic inference: The generateJSON() function lacked a generic type parameter, causing TypeScript errors at call sites like generateJSON(prompt). Fixed by adding the type parameter to the function signature.
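A sketch of what a generic signature for this kind of helper might look like. The real generateJSON() is not shown in this write-up, so everything here is an assumption: the ModelCall parameter (injected so the example runs without an SDK), the Intent type, and the fakeModel stub.

```typescript
// Hypothetical stand-in for the real model call: takes a prompt, returns raw text.
type ModelCall = (prompt: string) => Promise<string>;

// Without <T>, a call that passes a type argument fails to compile
// ("Expected 0 type arguments, but got 1"). With it, the caller picks the shape.
async function generateJSON<T>(prompt: string, callModel: ModelCall): Promise<T> {
  const raw = await callModel(prompt);
  // The cast is unchecked: T is a compile-time hint, not a runtime validation.
  return JSON.parse(raw) as T;
}

// Illustrative result type, loosely modelled on the intent fields named earlier.
interface Intent {
  problem: string;
  component: string;
}

// Stub that returns a fixed reply, used only to make the sketch runnable.
const fakeModel: ModelCall = async () => '{"problem":"auth bug","component":"login"}';

const intentPromise = generateJSON<Intent>("extract intent", fakeModel);
```

Since the cast does no checking, a schema validator would be needed wherever the model's output cannot be trusted to match T.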

GitHub API diff gaps: The Files API returns patch (unified diff) only for text files under a size limit. Large binary files or renamed files have no patch data, which put undefined into our diff concatenation.
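A minimal sketch of guarding the concatenation against missing patch data. The PrFile interface is a hand-written subset of a pull request file entry, and the placeholder text and the name buildDiffText are assumptions.

```typescript
// Subset of a pull request file entry; patch may be absent.
interface PrFile {
  filename: string;
  status: string; // e.g. "added", "modified", "renamed"
  patch?: string;
}

// Concatenate per-file diffs, substituting a placeholder where no patch exists,
// so the string "undefined" never ends up in the text sent to the model.
function buildDiffText(files: PrFile[]): string {
  return files
    .map((file) => {
      const body = file.patch ?? `[no patch available: ${file.status}]`;
      return `--- ${file.filename}\n${body}`;
    })
    .join("\n\n");
}
```

Keeping the filename and status for patch-less files still tells the model that the file was touched, which can matter for the file overlap signal.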

Confidence floor logic: Initial weighted scoring produced confusingly low scores (60%) for clear duplicates when one factor was low. Added a "confidence floor" (85% minimum) when Gemini's semantic similarity exceeds 90%.
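The floor can be sketched as a clamp applied after the weighted average. Only the 85% floor and the 90% semantic trigger come from the description above. The signal names, the weights, and the function name confidence are illustrative assumptions.

```typescript
// Signal names and weights are illustrative assumptions; values are in [0, 1].
interface Signals {
  semantic: number; // similarity reported by the LLM stage
  embedding: number; // Bag-of-Words cosine similarity
  fileOverlap: number; // Jaccard overlap of changed files
}

const WEIGHTS: Signals = { semantic: 0.6, embedding: 0.2, fileOverlap: 0.2 };
const SEMANTIC_TRIGGER = 0.9; // the floor applies above this semantic similarity
const CONFIDENCE_FLOOR = 0.85;

function confidence(s: Signals): number {
  const weighted =
    s.semantic * WEIGHTS.semantic +
    s.embedding * WEIGHTS.embedding +
    s.fileOverlap * WEIGHTS.fileOverlap;
  // A strong semantic match lifts the score to at least the floor.
  return s.semantic > SEMANTIC_TRIGGER ? Math.max(weighted, CONFIDENCE_FLOOR) : weighted;
}
```

With these assumed weights, a pair with semantic 0.95, embedding 0.3 and file overlap 0 has a raw weighted score of about 0.63, which the floor raises to 0.85. A pair whose raw score is already above the floor is left unchanged.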

Array response normalization: Gemini occasionally returned [{pr1: ...}, {pr2: ...}] instead of {pr1: ..., pr2: ...}. The frontend had to detect arrays and reduce() them into a single object.
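A short sketch of that normalization step. The Comparison type and the name normalizeComparison are assumptions; the real response fields are not specified here.

```typescript
// Loose type for the sketch: the real response fields are not specified here.
type Comparison = Record<string, unknown>;

// Accept either {pr1, pr2} or [{pr1}, {pr2}] and always return one object.
function normalizeComparison(response: Comparison | Comparison[]): Comparison {
  if (!Array.isArray(response)) return response;
  // Merge the array parts left to right; later keys overwrite earlier ones.
  return response.reduce<Comparison>((merged, part) => ({ ...merged, ...part }), {});
}
```

If two array elements carried the same key, the later one would silently win, so a stricter version might reject such responses instead.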

Accomplishments that we're proud of

97%+ accuracy on test duplicate pairs with the weighted scoring system

Sub-second filtering for the first 2 stages using local embeddings

Intent-aware analysis: The system understands that "fix auth bug for special chars" and "handle symbols in passwords" are duplicates even with different code

What we learned

Response format enforcement: Adding responseMimeType: 'application/json' to Gemini config drastically reduces markdown-wrapped responses, but doesn't eliminate them entirely.

Temperature = 0.1 for consistency: Low temperature makes Gemini's structured outputs reproducible. Higher values caused random field ordering and inconsistent scoring.
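The two settings discussed above, written as a plain configuration object. The field names and values come from the text. The constant name generationConfig is an assumption, and how the object is passed to the client depends on the SDK and version, which this write-up does not specify.

```typescript
// Plain object only: no SDK is imported, so this shows the settings, not the call.
const generationConfig = {
  responseMimeType: "application/json", // ask for raw JSON rather than markdown
  temperature: 0.1, // low temperature for more repeatable structured output
};
```

Since the MIME type setting reduces but does not eliminate wrapped replies, a defensive parser is still worth keeping on the receiving side.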

Few-shot isn't always needed: Explicit JSON schema in the prompt with field descriptions worked better than few-shot examples for our use case.

Retry on 503, not 429: A 503 (overloaded) is transient and worth retrying. A 429 (rate limit) means back off and wait. Each needs different error handling.
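The 503 versus 429 distinction can be sketched as a small decision function that a retry loop would consult after each failed call. The attempt limit, the delays, the use of a Retry-After value, and the names RetryDecision and decide are assumptions, not the project's actual policy.

```typescript
type RetryDecision =
  | { action: "retry"; delayMs: number } // transient: try again soon
  | { action: "wait"; delayMs: number } // rate limited: pause before trying again
  | { action: "fail" };

const MAX_ATTEMPTS = 4; // illustrative limit

// attempt is zero-based: 0 means the first call has just failed.
function decide(status: number, attempt: number, retryAfterSeconds?: number): RetryDecision {
  if (attempt >= MAX_ATTEMPTS) return { action: "fail" };
  if (status === 503) {
    // Overloaded: exponential backoff starting at 0.5 s (0.5 s, 1 s, 2 s, 4 s).
    return { action: "retry", delayMs: 500 * 2 ** attempt };
  }
  if (status === 429) {
    // Rate limited: use the server's hint when present, else a long default.
    return { action: "wait", delayMs: (retryAfterSeconds ?? 60) * 1000 };
  }
  return { action: "fail" };
}
```

Keeping the decision separate from the sleeping and the HTTP call makes the policy easy to unit test without timers or network access.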

Confidence floors prevent confusion: Raw weighted averages can produce misleading scores. Floors and boosts based on key signals improve user trust.