Results

Across the same 40 conversations, the continual setup outperformed the baseline with a higher average score (3.43 vs 3.13, +0.30), and the gap widened over time as skills accumulated, with the cumulative advantage reaching +0.30 by conversation 40. When skills were actually used, performance jumped substantially (4.43 vs 3.71, +0.71), with multiple first-use and reuse cases showing large individual gains: clear evidence of a real, compounding learning effect even at this small scale.

Inspiration

Customer support AI has a cold start problem. Every time an agent encounters a new issue, it reasons from scratch, burning expensive LLM tokens and making the customer wait. The next time someone asks the exact same question, it does it all over again. Humans don't work this way. A support rep who solves a tricky billing dispute on Monday doesn't re-derive the solution on Tuesday; they remember what worked.

We asked: what if an AI agent could do the same? Not just retrieve documents from a static knowledge base, but write its own playbooks from successful interactions, and get measurably better with every conversation.

What it does

Skills Cubed is a continual-learning MCP server that any AI agent can connect to. It exposes three tools via the Model Context Protocol:

  • Search Skills — Hybrid vector + keyword search over learned resolution playbooks. A Gemini Flash judge selects the best match semantically, not by an arbitrary score threshold.
  • Create Skill — After a successful resolution, extracts a structured playbook (with Do/Check/Say action steps) from the conversation transcript (a sketch of that structure follows this list).
  • Update Skill — When an agent deviates from an existing playbook and succeeds, refines the skill with the new approach.
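
A rough sketch of the playbook structure Create Skill produces, using Pydantic (the field names are illustrative, not our exact schema):

```python
from enum import Enum

from pydantic import BaseModel, Field


class ActionType(str, Enum):
    DO = "do"        # an action the agent performs (e.g., issue the refund)
    CHECK = "check"  # a verification step (e.g., confirm the order ID)
    SAY = "say"      # a customer-facing message template


class ActionStep(BaseModel):
    type: ActionType
    instruction: str  # what to do, check, or say at this step


class Skill(BaseModel):
    title: str                # e.g., "Resolve duplicate billing charge"
    intent: str               # issue category the skill applies to
    steps: list[ActionStep]   # the ordered Do/Check/Say playbook
    embedding: list[float] = Field(default_factory=list)  # 768-dim search vector
```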

The key insight: skills created from conversation N are immediately searchable for conversation N+1. There's no batch retraining, no reindexing pipeline. Neo4j auto-indexes on write, so knowledge becomes available in seconds.
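
Concretely, the write path is a single Cypher CREATE, and because Neo4j indexes the node on commit, the very next search can already return it. A minimal sketch with the official Python driver (the index and property names here are ours to illustrate, not canonical):

```python
from neo4j import GraphDatabase

# One-time setup (done separately), roughly:
#   CREATE VECTOR INDEX skill_vec IF NOT EXISTS FOR (s:Skill) ON (s.embedding)
#   OPTIONS {indexConfig: {`vector.dimensions`: 768,
#                          `vector.similarity_function`: 'cosine'}}
driver = GraphDatabase.driver("neo4j+s://<aura-host>", auth=("neo4j", "<password>"))

def create_skill(title: str, body: str, embedding: list[float]) -> None:
    # Write the skill; Neo4j indexes it on write -- no reindexing pipeline.
    with driver.session() as session:
        session.run(
            "CREATE (s:Skill {title: $title, body: $body, embedding: $embedding})",
            title=title, body=body, embedding=embedding,
        )

def search_skills(query_embedding: list[float], k: int = 5) -> list[tuple[str, float]]:
    # A skill written seconds ago is already a candidate here.
    with driver.session() as session:
        result = session.run(
            "CALL db.index.vector.queryNodes('skill_vec', $k, $qe) "
            "YIELD node, score RETURN node.title AS title, score",
            k=k, qe=query_embedding,
        )
        return [(r["title"], r["score"]) for r in result]
```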

The system tells a story in three beats:

  1. No skills — Agent reasons from scratch using Gemini. Slow, expensive.
  2. First encounter — Agent resolves a complex issue, creates a skill from it.
  3. Next similar query — Skill found instantly, Flash serves it. Fast, cheap, and the resolution quality improves because the playbook captures proven steps.

How we built it

Architecture

[Any MCP Client] → [Skills Cubed MCP Server] → [Neo4j (Hybrid Search)]
       ↑                    ↓                         ↑
       └──── resolution ←── Gemini Flash/Pro ──── skill CRUD

MCP Server — Built with FastMCP on FastAPI. Streamable HTTP transport, hosted on Render. Three tool handlers map directly to orchestration functions. Render gives us persistent containers (no serverless cold starts), 100-minute request timeouts for SSE compatibility, and unrestricted outbound ports for Neo4j Bolt connections.
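
The server skeleton is small. A minimal sketch of its shape (tool bodies elided; the exact transport string FastMCP accepts has varied a little across versions):

```python
from fastmcp import FastMCP

mcp = FastMCP("skills-cubed")

@mcp.tool()
def search_skills(query: str) -> dict:
    """Hybrid vector + keyword search over learned playbooks."""
    # Embed the query, run the hybrid Neo4j search, let the Flash judge pick.
    return {"skill": None}

@mcp.tool()
def create_skill(transcript: str) -> dict:
    """Extract a Do/Check/Say playbook from a resolved conversation."""
    # Gemini Pro extraction, then a single write to Neo4j.
    return {"created": True}

@mcp.tool()
def update_skill(skill_id: str, transcript: str) -> dict:
    """Refine an existing playbook after a successful deviation."""
    return {"updated": True}

if __name__ == "__main__":
    # Streamable HTTP so any networked MCP client can connect.
    mcp.run(transport="http", host="0.0.0.0", port=8000)
```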

Database — Neo4j Aura on GCP. Dual indexes: a 768-dimensional cosine vector index for semantic similarity and a BM25 fulltext index for keyword matching. Hybrid search merges both (0.7 vector + 0.3 keyword), with min-max normalization on BM25 scores. We don't have to do any re-indexing on updates because skills are searchable the moment they're written.
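
The fusion step itself is plain arithmetic. A sketch, assuming each branch returns a skill-id-to-score map (cosine scores already live in [0, 1]; BM25 scores are unbounded, hence the min-max squash):

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    # Squash unbounded BM25 scores into [0, 1] so they mix with cosine scores.
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_merge(vector_hits: dict[str, float],
                 keyword_hits: dict[str, float],
                 w_vec: float = 0.7, w_kw: float = 0.3) -> list[tuple[str, float]]:
    kw = minmax(keyword_hits)
    ids = set(vector_hits) | set(kw)
    merged = {i: w_vec * vector_hits.get(i, 0.0) + w_kw * kw.get(i, 0.0) for i in ids}
    # Best candidates first; the Flash judge then picks semantically among the top hits.
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```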

LLM Layer — Google Gemini powers everything:

  • Gemini Flash — Judge calls (routing queries to skills), resolution generation, evaluation
  • Gemini Pro — Skill extraction from conversations, skill refinement
  • Gemini Embedding (gemini-embedding-001) — 768-dim vectors for semantic search, L2-normalized at reduced dimensionality
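
The normalization detail matters: gemini-embedding-001 outputs are pre-normalized only at the full 3072 dimensions, so at 768 we re-normalize before cosine search. A sketch with the google-genai SDK:

```python
import math

from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

def embed(text: str) -> list[float]:
    resp = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(output_dimensionality=768),
    )
    vec = resp.embeddings[0].values
    # Truncated outputs are not unit-length, so L2-normalize before
    # storing in the cosine vector index.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]
```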

The two-tier strategy means cost drops over time: as skills accumulate, Flash handles more queries and Pro is called less frequently.

Google ADK Agent — We built a baseline agent using the Google Agent Development Kit (Skills-Google-ADK-Agent), powered by Gemini 2.0 Flash. It serves as the consumer-facing MCP client that connects to our server and uses the three tools to handle customer conversations. The ADK framework handles session management, tool routing, and agent lifecycle.
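
Wiring the ADK agent to the server takes only a few lines. An indicative sketch (the MCP connection classes and import paths have moved between ADK releases, so treat these names as assumptions):

```python
from google.adk.agents import LlmAgent
from google.adk.tools.mcp_tool.mcp_session_manager import StreamableHTTPConnectionParams
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset

support_agent = LlmAgent(
    model="gemini-2.0-flash",
    name="support_agent",
    instruction=(
        "Before reasoning from scratch, call search_skills. "
        "After a successful resolution, call create_skill or update_skill."
    ),
    tools=[
        MCPToolset(
            connection_params=StreamableHTTPConnectionParams(
                url="https://<our-render-host>/mcp",  # placeholder URL
            ),
        )
    ],
)
```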

Console UI — Built with Lovable (skills-cubed-console) to provide a visual interface for browsing the skills knowledge base, monitoring skill growth, and observing the agent's learning progress.

Evaluation Harness

We built a rigorous evaluation pipeline to prove the system actually works, grounded in a research-quality customer service dataset. Using the ABCD dataset (10K+ human-to-human customer service dialogues, 55 intent types), we run two phases on the same conversations:

  1. Baseline — Gemini resolves each conversation with no skill access. An LLM judge scores quality (1-5) against the human agent's ground truth resolution.
  2. Continual Learning — Same conversations, but now the agent searches for skills, creates new ones, and uses existing ones. Skills accumulate as it goes.
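
In outline, the harness is one loop run twice (the helper names below are illustrative):

```python
def run_phase(conversations, use_skills: bool) -> list[float]:
    scores = []
    for convo in conversations:
        if use_skills:
            skill = search_skills(convo.customer_query)    # None for early conversations
            resolution = resolve(convo, playbook=skill)
            if resolution.succeeded:
                create_or_update_skill(convo, resolution)  # knowledge compounds here
        else:
            resolution = resolve(convo, playbook=None)     # Gemini from scratch
        # Independent LLM judge: 1-5 against the human agent's ground truth.
        scores.append(judge(resolution, convo.ground_truth))
    return scores

baseline_scores = run_phase(abcd_conversations, use_skills=False)
continual_scores = run_phase(abcd_conversations, use_skills=True)
```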

The result is an improvement curve, which is the heart of the entire project. Judge scores start at baseline (what Gemini can do on its own, with zero learned knowledge) and rise as skills accumulate. Early conversations look identical to baseline because no skills exist yet. But as the agent resolves issues and writes playbooks, later conversations benefit from that accumulated knowledge. The curve bends upward.

Our core metric is resolution quality as scored by an independent judge against human expert ground truth. We're measuring whether continual learning makes the agent better at its job over time. This is the heart of our thesis and the theme of this hackathon: a self-reinforcing loop where every successful customer interaction makes the next similar interaction faster and more likely to succeed.

Built With

  • docker
  • fastapi
  • fastmcp
  • google-agent-development-kit-(adk)
  • google-gemini
  • lovable
  • matplotlib
  • neo4j
  • pydantic
  • pytest
  • python
  • render