Inspiration

I spend a lot of time around researchers, and one pattern I kept noticing is that the hardest part of science is not the idea. Ideas are cheap. The hard part is everything that comes after: figuring out which papers are relevant, writing a protocol that actually works, sourcing the right reagents with the right catalog numbers from the right suppliers, estimating a budget that a PI would approve, and building a timeline that accounts for dependencies between phases.

A junior grad student might spend two weeks just getting a protocol right for an experiment that a senior researcher could sketch out in an afternoon. That knowledge gap costs time, money, and sometimes entire projects. I wanted to build something that bridges that gap. Not a chatbot that gives vague answers, but a system that produces the same kind of detailed, literature-backed experiment plan that an experienced researcher would write.

The second thing that inspired me was the idea of institutional memory. In most labs, when a researcher leaves, their knowledge walks out the door with them. There is no system that captures "we tried 0.5M trehalose and it caused osmotic damage, so we switched to 0.2M." I wanted corrections like that to persist and automatically improve future plans. That became the learning loop.

What It Does

Neucleus takes a hypothesis written in plain English and produces a complete, runnable experiment plan. Here is everything it generates:

Literature-Grounded Protocol: A step-by-step protocol where each step is generated from real papers retrieved from scientific databases. Each step is then independently scored as HIGH, MEDIUM, or LOW based on how well published literature supports it. LOW-scored steps are flagged so the scientist knows exactly where to be cautious.

Verified Materials and Supply Chain: Every material is listed with a specific supplier, catalog number, quantity, and estimated cost. Each catalog number is then verified against actual supplier websites in real time. Materials are tagged as VERIFIED, PARTIALLY_VERIFIED, CORRECTED, or UNVERIFIED so the scientist knows exactly what needs manual checking.
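
Roughly, each material entry can be thought of as a small validated record carrying one of those four tags. Here is a minimal Pydantic sketch of the idea (field names are illustrative, not the exact Neucleus schema):

```python
from enum import Enum
from pydantic import BaseModel

class VerificationStatus(str, Enum):
    VERIFIED = "VERIFIED"                      # exact catalog match confirmed
    PARTIALLY_VERIFIED = "PARTIALLY_VERIFIED"  # product found, catalog unconfirmed
    CORRECTED = "CORRECTED"                    # correct catalog found, differs from generated one
    UNVERIFIED = "UNVERIFIED"                  # nothing found, flag for manual checking

class Material(BaseModel):
    name: str
    supplier: str
    catalog_number: str
    quantity: str
    estimated_cost_usd: float
    status: VerificationStatus = VerificationStatus.UNVERIFIED
    note: str = ""  # e.g. the corrected catalog number or why verification failed
```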

Novelty Assessment: Before generating anything, the system analyzes whether the hypothesis is truly novel, partially novel (similar work exists), or well-established. It cites the relevant papers so the researcher can make an informed decision about whether to proceed.

Realistic Budget: An itemized breakdown that separates reagents, consumables, equipment, personnel, overhead, and contingency. Not just a single dollar amount, but a structured budget a PI can actually use in a grant proposal.

Phased Timeline: A week-by-week schedule broken into phases, each with specific tasks, duration, and milestones.

Validation Criteria: Primary endpoints, success thresholds, failure indicators, and recommended statistical tests with sample size calculations.
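
To give a sense of the sample size piece: the underlying calculation is a standard power analysis. A sketch with statsmodels and purely illustrative numbers:

```python
from statsmodels.stats.power import TTestIndPower

# Illustrative: two-group comparison, expected effect size d = 0.8,
# alpha = 0.05, desired power = 0.80
n_per_group = TTestIndPower().solve_power(
    effect_size=0.8, alpha=0.05, power=0.8, alternative="two-sided")
print(f"Samples per group: {n_per_group:.0f}")  # roughly 26 per group
```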

The Learning Loop: After a plan is generated, a scientist can review every section, leave ratings and comments, and submit structured corrections (for example: "change trehalose concentration in step 2 from 0.5M to 0.2M because 0.5M causes osmotic stress"). These corrections are stored by experiment domain. The next time anyone generates a plan for a similar hypothesis, those corrections are automatically retrieved and incorporated into the generation. The system improves with every review, without any retraining or fine-tuning.

The 10-Stage AI Pipeline

The entire process runs through a 10-stage pipeline, where each stage is a specialized agent:

  1. Parse Hypothesis: Extracts domain, organisms, techniques, and variables from the natural language input
  2. Retrieve Prior Feedback: Queries the feedback database for corrections from similar past experiments in the same domain
  3. Search Literature: Searches Tavily, OpenAlex, and CrossRef in parallel for relevant papers, protocols, and technical resources (a sketch of the parallel fan-out follows this list)
  4. Analyze Novelty: Compares the hypothesis against retrieved literature to classify it as novel, partially novel, or well-established
  5. Generate Protocol: Produces a detailed step-by-step protocol using retrieved literature as context, including durations, critical notes, and safety warnings
  6. Verify Protocol Grounding: Cross-references each protocol step against the source material and assigns HIGH, MEDIUM, or LOW grounding scores
  7. Generate Materials List: Creates a complete materials list with suppliers, catalog numbers, quantities, and costs
  8. Verify Catalog Numbers: Searches supplier websites to validate every catalog number and flags discrepancies
  9. Create Timeline: Generates a phased timeline with week ranges, task lists, and milestones
  10. Finalize Plan: Computes overall grounding scores, assembles metadata, and packages everything into the final output
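
To make stage 3 concrete, here is a minimal sketch of the parallel fan-out against the public OpenAlex and CrossRef APIs (the Tavily call, API keys, and error handling are omitted; this is not the production code):

```python
import asyncio
import httpx

async def search_openalex(client: httpx.AsyncClient, query: str) -> list[dict]:
    # OpenAlex full-text search over works
    r = await client.get("https://api.openalex.org/works",
                         params={"search": query, "per-page": 10})
    return r.json().get("results", [])

async def search_crossref(client: httpx.AsyncClient, query: str) -> list[dict]:
    # CrossRef bibliographic query
    r = await client.get("https://api.crossref.org/works",
                         params={"query": query, "rows": 10})
    return r.json().get("message", {}).get("items", [])

async def search_literature(query: str) -> dict[str, list[dict]]:
    async with httpx.AsyncClient(timeout=30) as client:
        openalex, crossref = await asyncio.gather(
            search_openalex(client, query),
            search_crossref(client, query),
        )
    return {"openalex": openalex, "crossref": crossref}
```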

The pipeline streams progress to the frontend in real time via Server-Sent Events, so the user sees each stage complete as it happens.
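
A stripped-down sketch of how such a streaming endpoint can be wired with FastAPI and sse-starlette (the route path, stage names, and the stub run_stage are placeholders for the real pipeline nodes):

```python
import asyncio
import json

from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()

# Illustrative subset of stage names; the real pipeline has ten.
STAGES = ["parse_hypothesis", "retrieve_feedback", "search_literature",
          "analyze_novelty", "generate_protocol"]

async def run_stage(name: str, hypothesis: str) -> dict:
    """Stand-in for invoking the corresponding pipeline node."""
    await asyncio.sleep(0.1)
    return {"stage": name, "ok": True}

@app.get("/api/plan/stream")
async def stream_plan(hypothesis: str):
    async def event_generator():
        for i, name in enumerate(STAGES, start=1):
            await run_stage(name, hypothesis)
            # Each completed stage is pushed to the client as an SSE event
            yield {"event": "stage_complete",
                   "data": json.dumps({"index": i, "stage": name})}
        yield {"event": "done", "data": json.dumps({"status": "complete"})}
    return EventSourceResponse(event_generator())
```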

How I Built It

Backend

The backend is written entirely in Python. The pipeline is orchestrated using LangGraph, which provides stateful multi-agent graph execution with conditional edges. Each of the 10 stages is a node in the graph. The API layer is built with FastAPI, which handles REST endpoints and SSE streaming. All LLM outputs are validated through Pydantic v2 schemas to ensure consistent data structure regardless of model variance.
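
A minimal two-node sketch of how such a graph is wired in LangGraph (node bodies are placeholders, not the actual Neucleus code; the real graph chains all ten stages):

```python
from typing import TypedDict

from langgraph.graph import StateGraph, END

class PipelineState(TypedDict, total=False):
    hypothesis: str
    parsed: dict
    prior_feedback: list
    # ... fields for literature, protocol, materials, and later stages

def parse_hypothesis(state: PipelineState) -> PipelineState:
    # Each node reads the shared state and returns only the fields it updates.
    return {"parsed": {"domain": "cell biology"}}  # placeholder

def retrieve_feedback(state: PipelineState) -> PipelineState:
    return {"prior_feedback": []}  # placeholder

graph = StateGraph(PipelineState)
graph.add_node("parse_hypothesis", parse_hypothesis)
graph.add_node("retrieve_feedback", retrieve_feedback)
graph.set_entry_point("parse_hypothesis")
graph.add_edge("parse_hypothesis", "retrieve_feedback")
graph.add_edge("retrieve_feedback", END)

pipeline = graph.compile()
result = pipeline.invoke(
    {"hypothesis": "Trehalose improves post-thaw viability in HeLa cryopreservation"})
```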

I built custom utilities for two recurring problems:

  • A JSON normalization module that handles the many ways LLMs return broken JSON (markdown-wrapped, partial, missing keys, trailing commas)
  • A retry wrapper with exponential backoff specifically for 429 rate-limit errors and empty responses from the inference API
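
The retry wrapper is conceptually simple. Here is a condensed sketch against an OpenAI-compatible chat endpoint (the URL, payload shape, and backoff constants are illustrative, not the exact production code):

```python
import asyncio
import random

import httpx

async def call_with_retry(client: httpx.AsyncClient, url: str,
                          payload: dict, max_retries: int = 4) -> str:
    """Retry on 429s and empty completions with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        resp = await client.post(url, json=payload)
        if resp.status_code == 429:
            await asyncio.sleep((2 ** attempt) + random.random())
            continue
        resp.raise_for_status()
        text = resp.json()["choices"][0]["message"]["content"]
        if text and text.strip():
            return text
        # Empty responses also trigger a retry
        await asyncio.sleep(2 ** attempt)
    raise RuntimeError(f"LLM call failed after {max_retries} attempts")
```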

The feedback store uses SQLite via aiosqlite for async database operations. Feedback is tagged by experiment domain so corrections are only applied to relevant future plans.
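
A condensed sketch of the domain-tagged store with aiosqlite (the table schema and path are illustrative):

```python
import aiosqlite

DB_PATH = "feedback.db"  # illustrative path

async def init_db() -> None:
    async with aiosqlite.connect(DB_PATH) as db:
        await db.execute("""
            CREATE TABLE IF NOT EXISTS feedback (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                domain TEXT NOT NULL,
                section TEXT NOT NULL,
                correction TEXT NOT NULL,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        await db.commit()

async def corrections_for_domain(domain: str) -> list[str]:
    # Only corrections from the same experiment domain are retrieved,
    # which is what keeps cryopreservation fixes out of CO2-fixation plans.
    async with aiosqlite.connect(DB_PATH) as db:
        cursor = await db.execute(
            "SELECT correction FROM feedback WHERE domain = ?", (domain,))
        rows = await cursor.fetchall()
    return [row[0] for row in rows]
```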

Frontend

The frontend is a dashboard-style application built with Next.js 16 (React 19) using the App Router. The UI has a landing page at the root route and the full application at /app. The dashboard features a collapsible sidebar, breadcrumb navigation, and panel-based layout where each section of the plan (Protocol, Materials, Budget, Timeline, Validation) has its own dedicated view.

Animations and transitions are handled by Framer Motion. The pipeline progress is shown as an animated 10-step vertical stepper with a live elapsed timer. The Scientist Review panel provides a tabbed form interface for rating, commenting, and correcting each section.

Styling

The UI uses Tailwind CSS v4 with a custom color palette called Graphite Coral (dark charcoal sidebar, coral/orange accents, white content area). Icons come from Lucide React.

Deployment

The backend is containerized with Docker and deployed on DigitalOcean App Platform. The frontend is deployed on Vercel.

Tech Stack

AI and Orchestration:

  • LangGraph and LangChain for multi-agent pipeline orchestration
  • Featherless.ai for LLM inference (OpenAI-compatible API with open-weight models)
  • Pydantic v2 for structured output validation

Backend:

  • Python 3.11
  • FastAPI with Uvicorn
  • SSE (Server-Sent Events) via sse-starlette for real-time progress streaming
  • SQLite with aiosqlite for async feedback storage
  • httpx for async HTTP requests
  • python-dotenv for environment management

Literature and Data Sources:

  • Tavily Search API for web search
  • OpenAlex API for academic paper metadata
  • CrossRef API for DOI resolution and citation data

Frontend:

  • Next.js 16 with React 19
  • TypeScript
  • Tailwind CSS v4
  • Framer Motion for animations
  • Lucide React for icons

Deployment:

  • Docker for backend containerization
  • DigitalOcean App Platform for backend hosting
  • Vercel for frontend hosting

Challenges I Ran Into

LLM output consistency was the biggest obstacle. Open-weight models accessed through Featherless.ai do not always return clean JSON. Sometimes the output is wrapped in markdown code blocks. Sometimes keys are missing. Sometimes the model returns a completely empty response. I tried solving this with prompt engineering alone and quickly realized it was not enough. I ended up building a multi-strategy JSON normalization utility that strips markdown fences, attempts to extract JSON from mixed text, handles partial objects, and falls back gracefully. On top of that, I added retry logic that specifically handles empty responses and rate limits. In a typical pipeline run, 2-3 stages retry at least once before succeeding.
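
A condensed sketch of the normalization approach (the production version has more strategies and also handles partial objects and missing keys):

```python
import json
import re

def normalize_llm_json(raw: str) -> dict | None:
    """Best-effort extraction of a JSON object from an LLM response."""
    if not raw or not raw.strip():
        return None
    text = raw.strip()
    # Strategy 1: strip markdown code fences like ```json ... ```
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text, flags=re.IGNORECASE)
    # Strategy 2: parse as-is
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strategy 3: pull the outermost {...} out of mixed prose/JSON
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        candidate = re.sub(r",\s*([}\]])", r"\1", match.group(0))  # drop trailing commas
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            pass
    return None  # callers fall back to a retry or a safe default
```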

Verification accuracy required careful categorization. When I verify catalog numbers against supplier websites via web search, the results are not always clean matches. Sometimes the search returns a product page that matches the description but has a different catalog number. Sometimes it returns a completely unrelated product. I had to build a four-tier classification system: VERIFIED (exact catalog match confirmed), CORRECTED (correct catalog found via search, different from what the LLM generated), PARTIALLY_VERIFIED (product found but catalog number unconfirmed), and UNVERIFIED (nothing found). Getting the verification agent to correctly distinguish between these four cases took significant prompt iteration.
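
In simplified form, the decision looks roughly like this (the hit structure is a stand-in; the real agent reasons over raw search results rather than pre-labeled matches):

```python
def classify_verification(proposed_catalog: str,
                          search_hits: list[dict]) -> tuple[str, str | None]:
    """Four-tier classification sketch. Each hit is assumed to look like
    {"catalog_number": str, "matches_description": bool}."""
    matching = [h for h in search_hits if h.get("matches_description")]
    if not matching:
        return "UNVERIFIED", None
    for hit in matching:
        if hit.get("catalog_number") == proposed_catalog:
            return "VERIFIED", proposed_catalog        # exact catalog match confirmed
    confirmed = next(
        (h["catalog_number"] for h in matching if h.get("catalog_number")), None)
    if confirmed:
        return "CORRECTED", confirmed                  # right product, different catalog number
    return "PARTIALLY_VERIFIED", None                  # product found, catalog unconfirmed
```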

Feedback contamination in the learning loop. The learning loop only works if corrections are applied to the right context. A fix for cryopreservation of HeLa cells should not affect a protocol for electrochemical CO2 fixation. I solved this by tagging all feedback by experiment domain and only retrieving feedback entries that match the current hypothesis. The corrections are injected as few-shot examples in the generation prompts rather than hard overrides, giving the model flexibility to adapt them.
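
A sketch of what that injection looks like at prompt-assembly time (the wording is illustrative; the [FEEDBACK_APPLIED] tag is what later shows up in the generated plan):

```python
def build_protocol_prompt(hypothesis: str, literature_context: str,
                          corrections: list[str]) -> str:
    """Inject prior corrections as few-shot guidance rather than hard overrides."""
    feedback_block = ""
    if corrections:
        examples = "\n".join(f"- {c}" for c in corrections)
        feedback_block = (
            "Scientists corrected earlier plans in this domain as follows. "
            "Adapt these lessons where they apply and tag adapted steps "
            "with [FEEDBACK_APPLIED]:\n" + examples + "\n\n"
        )
    return (
        f"{feedback_block}"
        f"Relevant literature:\n{literature_context}\n\n"
        f"Hypothesis: {hypothesis}\n"
        "Generate a step-by-step protocol as JSON."
    )
```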

Pipeline duration. The full pipeline takes 10-15 minutes because of sequential LLM calls with large context windows. I could not parallelize the stages because each depends on the output of the previous one (you cannot verify materials before generating them). Instead, I focused on making the wait feel productive by implementing real-time SSE streaming with an animated stepper and elapsed timer. Users consistently told me the wait felt acceptable because they could see exactly what was happening.

CORS and deployment configuration. Small but annoying. Getting the frontend on Vercel to communicate with the backend on DigitalOcean required careful CORS configuration, environment variable management across two platforms, and ensuring the Dockerfile correctly exposed the right port.

Accomplishments That I Am Proud Of

The learning loop works end-to-end, and I can prove it. I tested it by generating a cryopreservation protocol, submitting 6 specific corrections (trehalose concentration 0.5M to 0.2M, FBS catalog number fix, equipment budget from $32,900 to $2,500, timeline phase 1 from 2 weeks to 1 week, adding Calcein-AM/PI viability assay, sample size adjustment). Then I regenerated for the same hypothesis. Five out of six corrections were fully applied automatically, and the sixth was partially applied. The grounding score jumped from 47% to 61%, the budget dropped from $90,000 to $19,000, and the system added an entirely new protocol step for Calcein-AM/PI dual staining that was not in the original plan. All of this happened without being explicitly prompted. The [FEEDBACK_APPLIED] tags in the output confirm exactly which corrections were used.

Real supplier verification. Catalog numbers are not simply taken from the model's output. They are checked against actual supplier websites. In test runs, 70-85% of materials are verified or corrected, and unverified items are clearly flagged rather than presented with false confidence.

Hallucination mitigation is architectural, not cosmetic. RAG grounding, chain-of-thought prompting, mandatory citations, independent verification passes, and uncertainty tagging all work together. The system does not hide what it does not know. LOW grounding scores and UNVERIFIED statuses are features, not bugs.

A production-quality interface. This is not a Jupyter notebook or a bare API. It is a complete dashboard application with a landing page, animated pipeline visualization, collapsible sidebar, hover tooltips, sortable tables, and a structured review interface. It looks and feels like a real product.

What I Learned

Structured output from LLMs is a hard problem that does not get enough attention. Everyone talks about prompt engineering for better answers, but the real engineering challenge is getting the model to return valid, parseable JSON consistently. I learned to treat every LLM response as "probably valid" rather than "definitely valid" and to build defensive parsing at every layer.

Few-shot correction injection is more effective than I expected. The learning loop uses prior corrections as few-shot examples in the prompt. I initially worried the model would ignore them or apply them too literally. In practice, the model adapts corrections intelligently to the current context. When I corrected trehalose to 0.2M in one plan, the next plan did not just copy-paste the fix but also adjusted the calculations and added an explanation for why 0.2M was chosen.

Real-time feedback transforms user experience for long-running AI tasks. A 12-minute wait with a spinner feels broken. A 12-minute wait with a live 10-stage stepper showing which stage is active, which are complete, and how much time has elapsed feels like watching a build pipeline run. The investment in SSE streaming was worth every hour.

Verification is where trust comes from. Anyone can prompt an LLM to generate a protocol. But when I show a scientist that each step has a grounding score and each catalog number has been checked against the supplier website, that is when they start taking the output seriously. Generation without verification is a toy. Generation with verification is a tool.

What's Next for Neucleus

  • Multi-model pipeline: Using specialized models for different stages (a reasoning model for novelty analysis, a code model for statistical calculations, a general model for protocol generation) instead of routing everything through one model
  • PDF export: One-click export of the complete experiment plan as a formatted PDF ready for lab notebooks or grant applications
  • Collaborative review: Multiple scientists reviewing the same plan, with conflict resolution when corrections disagree
  • Fine-grained feedback matching: Currently feedback is matched by domain (e.g., "cell biology"). I want to match by specific techniques, organisms, and reagents for more precise correction retrieval
  • Lab inventory integration: Checking the generated materials list against what is already available in the lab before producing a procurement list
  • Confidence calibration: Training a lightweight classifier on accumulated feedback data to predict which protocol steps are most likely to be corrected, and proactively flagging them
