Inspiration

  • Long dialogs and large knowledge bases quickly blow past context limits. Raw model calls become expensive and non‑reproducible.
  • We wanted a thin, auditable kernel around gpt‑oss that:
    • Packs only what matters under a strict token budget.
    • Retrieves persistent facts without managed services.
    • Produces structured, verifiable outputs with artifacts for judges.

What it does

  • Packs context with a transparent policy: recency > pinned > salience > semantic similarity (+ BM25 lexical) > dedup.
  • Stores and retrieves persistent “memories” locally (SQLite + local vector index). No external retriever required.
  • Answers strictly from packed context; in “clarify” mode it asks a single clarifying question when context is insufficient.
  • Validates structured outputs (JSON Schema via Ajv, with Zod fallback and a one‑pass repair loop).
  • Saves artifacts (pack, prompt, validation, answer) to disk in demo mode for full auditability.
  • Includes a terminal‑only one‑command demo: seeding → ingestion → packing → answering → artifacts → quick eval.
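The packing policy above can be sketched as a greedy selection under a token budget. This is an illustrative sketch, not the real packer.ts API: the `Candidate` shape, the weights, and the ~4‑chars‑per‑token heuristic are assumptions for demonstration.

```typescript
// Hypothetical sketch of the packing policy:
// recency > pinned > salience > semantic (+ BM25 lexical) > dedup,
// greedily filled under a token budget. Weights are illustrative.

interface Candidate {
  text: string;
  recency: number;   // 0..1, newer is higher
  pinned: boolean;
  salience: number;  // 0..1
  semantic: number;  // cosine similarity, 0..1
  bm25: number;      // normalized lexical score, 0..1
}

// Cheap token estimate (~4 chars per token), a common budget heuristic.
const approxTokens = (s: string): number => Math.ceil(s.length / 4);

function score(c: Candidate): number {
  // Weight ordering mirrors the stated policy priority.
  return 4 * c.recency + 3 * (c.pinned ? 1 : 0) + 2 * c.salience
    + 1 * c.semantic + 1 * c.bm25;
}

function packContext(cands: Candidate[], budget: number): Candidate[] {
  const seen = new Set<string>();
  const packed: Candidate[] = [];
  let used = 0;
  for (const c of [...cands].sort((a, b) => score(b) - score(a))) {
    const key = c.text.trim().toLowerCase(); // naive exact-text dedup
    const cost = approxTokens(c.text);
    if (seen.has(key) || used + cost > budget) continue;
    seen.add(key);
    packed.push(c);
    used += cost;
  }
  return packed;
}
```

Greedy fill is deterministic for a fixed candidate set, which is what makes the pack auditable: the same inputs always produce the same context.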

How we built it

  • Packing kernel in src/services/packer.ts.
    • Scores candidates by policy weights under the token budget; fuses in BM25 lexical scores (src/services/bm25.ts).
    • Token‑budget approximation in src/services/tokenizer.ts.
  • Local‑first memory and retrieval:
    • Metadata in SQLite (src/services/sqliteStore.ts).
    • Embeddings via @xenova/transformers (CPU, OSS) with a cosine index in src/services/vectorStore.ts.
    • Embeddings wrapper in src/services/embeddings.ts.
  • Model integration:
    • Groq OpenAI‑compatible client to gpt‑oss (src/services/groqClient.ts), with a JSON‑only path and repair loop.
    • Routes: src/routes/pack.ts and src/routes/answer.ts.
  • Tooling and scripts:
    • scripts/demo-run.ts and scripts/ingest-docs.ts.
    • Artifacts saved under data/artifacts/<timestamp>/ in demo mode.
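The validate‑and‑repair flow can be sketched without dependencies. The real project validates with Ajv (JSON Schema) and falls back to Zod; here `validate` is a simplified stand‑in and `askModel` abstracts the gpt‑oss call, so both names are illustrative.

```typescript
// Dependency-free sketch of the one-pass validate-then-repair loop.
// `validate` stands in for Ajv/Zod; `askModel` stands in for the model client.

type Answer = { answer: string; sources: string[] };

function validate(raw: string): Answer | null {
  try {
    const obj = JSON.parse(raw);
    if (obj && typeof obj.answer === "string" && Array.isArray(obj.sources)) {
      return obj as Answer;
    }
  } catch {
    // invalid JSON falls through to null
  }
  return null;
}

function answerWithRepair(
  askModel: (prompt: string) => string,
  prompt: string,
): Answer {
  const first = askModel(prompt);
  const parsed = validate(first);
  if (parsed) return parsed;
  // One repair pass: echo the invalid output back with a schema reminder.
  const repaired = askModel(
    `Previous output was not valid {answer, sources} JSON. ` +
    `Return only corrected JSON:\n${first}`,
  );
  const reparsed = validate(repaired);
  if (!reparsed) throw new Error("validation failed after one repair pass");
  return reparsed;
}
```

Capping the loop at one repair pass keeps latency and cost bounded: a model that fails twice is surfaced as an error with artifacts, not retried indefinitely.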

Challenges we ran into

  • Balancing recall vs budget: we fused semantic embeddings with BM25 to handle lexical phrasing without overshooting tokens.
  • Preventing hallucinations while staying helpful: added clarify mode so “unknown” becomes a concise question, not a dead end.
  • Structured outputs from open‑weight inference: added JSON Schema validation with a Zod repair step and explicit artifact logging.
  • Keeping everything local‑first (no managed vector DB) while still performant and reproducible.
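The lexical side of that fusion is standard Okapi BM25. A minimal sketch, assuming whitespace/alphanumeric tokenization and the usual k1/b defaults (this is not the project's bm25.ts, just the shape of the signal):

```typescript
// Minimal BM25 scorer: the lexical signal fused with semantic similarity.
// k1 and b are the conventional Okapi BM25 parameters.

const tokenize = (s: string): string[] =>
  s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const docTokens = docs.map(tokenize);
  const avgLen = docTokens.reduce((sum, t) => sum + t.length, 0) / docs.length;
  const N = docs.length;
  return docTokens.map((tokens) => {
    let score = 0;
    for (const term of new Set(tokenize(query))) {
      const df = docTokens.filter((t) => t.includes(term)).length;
      if (df === 0) continue;
      const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5));
      const tf = tokens.filter((t) => t === term).length;
      // Length normalization keeps long documents from dominating.
      score += idf * (tf * (k1 + 1)) /
        (tf + k1 * (1 - b + (b * tokens.length) / avgLen));
    }
    return score;
  });
}
```

Because BM25 rewards exact term matches, it catches queries whose phrasing an embedding model maps far from the stored memory, which is exactly the "silent miss" failure mode.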

Accomplishments that we're proud of

  • Deterministic, auditable packing with explicit policy and token budget.
  • Local‑first memory + embeddings using only OSS components.
  • End‑to‑end demo script that shows seeding, ingestion, packing, answering, artifacts, and a quick evaluation—entirely from the terminal.
  • Clear compliance: /health reveals the active model (e.g., openai/gpt-oss-20b), and artifacts show exactly what the model saw.

What we learned

  • A small, transparent “context+memory” kernel dramatically improves reliability and cost control for long‑horizon tasks.
  • Lightweight lexical signals (BM25) complement semantic search and reduce “silent misses.”
  • Saving artifacts is invaluable for debugging, judging, and reproducibility—especially with open‑weight models.

What's next for LongPack: Context & Memory for gpt‑oss

  • Local inference mode (Ollama/vLLM) to qualify for “Best Local Agent” (no internet).
  • Auto‑memory extraction after answers with dedup and retention caps.
  • Stronger tokenizer/model‑aware budgeting, plus FAISS/LanceDB backend as an option.
  • Reranker tuning and learned policy weights for different domains.
