Inspiration
- Long dialogs and large knowledge bases quickly blow past context limits. Raw model calls become expensive and non‑reproducible.
- We wanted a thin, auditable kernel around gpt‑oss that:
  - Packs only what matters under a strict token budget.
  - Retrieves persistent facts without managed services.
  - Produces structured, verifiable outputs with artifacts for judges.
What it does
- Packs context with a transparent policy: recency > pinned > salience > semantic similarity (+ BM25 lexical) > dedup.
- Stores and retrieves persistent “memories” locally (SQLite + local vector index). No external retriever required.
- Answers strictly from packed context; in “clarify” mode it asks a single clarifying question when context is insufficient.
- Validates structured outputs (JSON Schema via Ajv, with Zod fallback and a one‑pass repair loop).
- Saves artifacts (pack, prompt, validation, answer) to disk in demo mode for full auditability.
- Includes a terminal‑only one‑command demo: seeding → ingestion → packing → answering → artifacts → quick eval.
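The packing policy above (recency > pinned > salience > semantic + BM25, then dedup under a token budget) can be sketched roughly as follows. The names (`Candidate`, `POLICY_WEIGHTS`, `approxTokens`, `pack`) and the specific weights are illustrative assumptions, not the real `src/services/packer.ts` API:

```typescript
// Sketch of the packing policy: score candidates, then greedily fill a token budget.
interface Candidate {
  text: string;
  pinned: boolean;
  recency: number;   // 0..1, newer = higher
  salience: number;  // 0..1
  semantic: number;  // cosine similarity vs. the query, 0..1
  bm25: number;      // normalized lexical score, 0..1
}

// Assumed weights mirroring the stated priority: recency > pinned > salience > semantic (+ BM25).
const POLICY_WEIGHTS = { recency: 0.35, pinned: 0.3, salience: 0.2, semantic: 0.1, bm25: 0.05 };

// Rough token estimate (~4 chars per token), standing in for the tokenizer approximation.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

function score(c: Candidate): number {
  return (
    POLICY_WEIGHTS.recency * c.recency +
    POLICY_WEIGHTS.pinned * (c.pinned ? 1 : 0) +
    POLICY_WEIGHTS.salience * c.salience +
    POLICY_WEIGHTS.semantic * c.semantic +
    POLICY_WEIGHTS.bm25 * c.bm25
  );
}

function pack(candidates: Candidate[], budget: number): Candidate[] {
  const seen = new Set<string>();
  const picked: Candidate[] = [];
  let used = 0;
  for (const c of [...candidates].sort((a, b) => score(b) - score(a))) {
    if (seen.has(c.text)) continue;       // dedup identical snippets
    const cost = approxTokens(c.text);
    if (used + cost > budget) continue;   // never overshoot the budget
    seen.add(c.text);
    picked.push(c);
    used += cost;
  }
  return picked;
}
```

The greedy fill keeps the policy transparent: a candidate is either in the pack because its score and cost allowed it, or it is not, and the saved artifacts show which.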
How we built it
- Packing kernel in `src/services/packer.ts`:
  - Scores candidates by policy weights and budget; adds BM25 lexical scoring (`src/services/bm25.ts`).
  - Token budget approximation in `src/services/tokenizer.ts`.
- Local‑first memory and retrieval:
  - Metadata in SQLite (`src/services/sqliteStore.ts`).
  - Embeddings via `@xenova/transformers` (CPU, OSS) with a cosine index in `src/services/vectorStore.ts`.
  - Embeddings wrapper in `src/services/embeddings.ts`.
- Model integration:
  - Groq OpenAI‑compatible client to gpt‑oss (`src/services/groqClient.ts`), with a JSON‑only path and repair.
  - Routes: `src/routes/pack.ts`, `src/routes/answer.ts`.
- Tooling and scripts:
  - `scripts/demo-run.ts`, `scripts/ingest-docs.ts`.
  - Artifacts saved under `data/artifacts/<timestamp>/` in demo mode.
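The local vector store reduces to cosine similarity over embedding arrays. A minimal in‑memory sketch of that idea (the real `src/services/vectorStore.ts` persists vectors alongside the SQLite metadata; `VectorIndex`, `add`, and `topK` here are illustrative names):

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class VectorIndex {
  private entries: { id: string; vec: number[] }[] = [];

  add(id: string, vec: number[]): void {
    this.entries.push({ id, vec });
  }

  // Brute-force top-k: perfectly adequate for local-first memory at this scale.
  topK(query: number[], k: number): { id: string; score: number }[] {
    return this.entries
      .map((e) => ({ id: e.id, score: cosine(query, e.vec) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
  }
}
```

A brute‑force scan keeps the store dependency‑free; swapping in FAISS/LanceDB later (see "What's next") only changes this lookup, not the packing policy.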
Challenges we ran into
- Balancing recall vs budget: we fused semantic embeddings with BM25 to handle lexical phrasing without overshooting tokens.
- Preventing hallucinations while staying helpful: added clarify mode so “unknown” becomes a concise question, not a dead end.
- Structured outputs from open‑weight inference: added JSON Schema validation with a Zod repair step and explicit artifact logging.
- Keeping everything local‑first (no managed vector DB) while still performant and reproducible.
Accomplishments that we're proud of
- Deterministic, auditable packing with explicit policy and token budget.
- Local‑first memory + embeddings using only OSS components.
- End‑to‑end demo script that shows seeding, ingestion, packing, answering, artifacts, and a quick evaluation—entirely from the terminal.
- Clear compliance: `/health` reveals the active model (e.g., `openai/gpt-oss-20b`), and artifacts show exactly what the model saw.
What we learned
- A small, transparent “context+memory” kernel dramatically improves reliability and cost control for long‑horizon tasks.
- Lightweight lexical signals (BM25) complement semantic search and reduce “silent misses.”
- Saving artifacts is invaluable for debugging, judging, and reproducibility—especially with open‑weight models.
What's next for LongPack: Context & Memory for gpt‑oss
- Local inference mode (Ollama/vLLM) to qualify for “Best Local Agent” (no internet).
- Auto‑memory extraction after answers with dedup and retention caps.
- Stronger tokenizer/model‑aware budgeting, plus FAISS/LanceDB backend as an option.
- Reranker tuning and learned policy weights for different domains.
Built With
- fastify
- groq
- node.js
- transformers
- typescript
- zod