Inspiration

  • Long dialogs and large knowledge bases quickly blow past context limits. Raw model calls become expensive and non‑reproducible.
  • We wanted a thin, auditable kernel around gpt‑oss that:
    • Packs only what matters under a strict token budget.
    • Retrieves persistent facts without managed services.
    • Produces structured, verifiable outputs with artifacts for judges.

What it does

  • Packs context with a transparent policy: recency > pinned > salience > semantic similarity (+ BM25 lexical) > dedup.
  • Stores and retrieves persistent “memories” locally (SQLite + local vector index). No external retriever required.
  • Answers strictly from packed context; in “clarify” mode it asks a single clarifying question when context is insufficient.
  • Validates structured outputs (JSON Schema via Ajv, with Zod fallback and a one‑pass repair loop).
  • Saves artifacts (pack, prompt, validation, answer) to disk in demo mode for full auditability.
  • Includes a terminal‑only one‑command demo: seeding → ingestion → packing → answering → artifacts → quick eval.
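The packing policy above can be sketched as a greedy selection under a token budget. This is an illustrative sketch, not the real packer.ts API: the `Candidate` shape, the weights, and the ~4‑chars‑per‑token heuristic are assumptions for demonstration.

```typescript
// Hypothetical sketch of the packing policy:
// recency > pinned > salience > semantic (+ BM25 lexical) > dedup,
// greedily filled under a token budget. Weights are illustrative.

interface Candidate {
  text: string;
  recency: number;   // 0..1, newer is higher
  pinned: boolean;
  salience: number;  // 0..1
  semantic: number;  // cosine similarity, 0..1
  bm25: number;      // normalized lexical score, 0..1
}

// Cheap token estimate (~4 chars per token), a common budget heuristic.
const approxTokens = (s: string): number => Math.ceil(s.length / 4);

function score(c: Candidate): number {
  // Weight ordering mirrors the stated policy priority.
  return 4 * c.recency + 3 * (c.pinned ? 1 : 0) + 2 * c.salience
    + 1 * c.semantic + 1 * c.bm25;
}

function packContext(cands: Candidate[], budget: number): Candidate[] {
  const seen = new Set<string>();
  const packed: Candidate[] = [];
  let used = 0;
  for (const c of [...cands].sort((a, b) => score(b) - score(a))) {
    const key = c.text.trim().toLowerCase(); // naive exact-text dedup
    const cost = approxTokens(c.text);
    if (seen.has(key) || used + cost > budget) continue;
    seen.add(key);
    packed.push(c);
    used += cost;
  }
  return packed;
}
```

Greedy fill is deterministic for a fixed candidate set, which is what makes the pack auditable: the same inputs always produce the same context.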

How we built it

  • Packing kernel in src/services/packer.ts.
    • Scores candidates by policy weights under the token budget; fuses in BM25 lexical scores (src/services/bm25.ts).
    • Token‑budget approximation in src/services/tokenizer.ts.
  • Local‑first memory and retrieval:
    • Metadata in SQLite (src/services/sqliteStore.ts).
    • Embeddings via @xenova/transformers (CPU, OSS) with a cosine index in src/services/vectorStore.ts.
    • Embeddings wrapper in src/services/embeddings.ts.
  • Model integration:
    • Groq OpenAI‑compatible client to gpt‑oss (src/services/groqClient.ts), with a JSON‑only path and repair loop.
    • Routes: src/routes/pack.ts and src/routes/answer.ts.
  • Tooling and scripts:
    • scripts/demo-run.ts and scripts/ingest-docs.ts.
    • Artifacts saved under data/artifacts/<timestamp>/ in demo mode.
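The validate‑and‑repair flow can be sketched without dependencies. The real project validates with Ajv (JSON Schema) and falls back to Zod; here `validate` is a simplified stand‑in and `askModel` abstracts the gpt‑oss call, so both names are illustrative.

```typescript
// Dependency-free sketch of the one-pass validate-then-repair loop.
// `validate` stands in for Ajv/Zod; `askModel` stands in for the model client.

type Answer = { answer: string; sources: string[] };

function validate(raw: string): Answer | null {
  try {
    const obj = JSON.parse(raw);
    if (obj && typeof obj.answer === "string" && Array.isArray(obj.sources)) {
      return obj as Answer;
    }
  } catch {
    // invalid JSON falls through to null
  }
  return null;
}

function answerWithRepair(
  askModel: (prompt: string) => string,
  prompt: string,
): Answer {
  const first = askModel(prompt);
  const parsed = validate(first);
  if (parsed) return parsed;
  // One repair pass: echo the invalid output back with a schema reminder.
  const repaired = askModel(
    `Previous output was not valid {answer, sources} JSON. ` +
    `Return only corrected JSON:\n${first}`,
  );
  const reparsed = validate(repaired);
  if (!reparsed) throw new Error("validation failed after one repair pass");
  return reparsed;
}
```

Capping the loop at one repair pass keeps latency and cost bounded: a model that fails twice is surfaced as an error with artifacts, not retried indefinitely.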

Challenges we ran into

  • Balancing recall vs budget: we fused semantic embeddings with BM25 to handle lexical phrasing without overshooting tokens.
  • Preventing hallucinations while staying helpful: added clarify mode so “unknown” becomes a concise question, not a dead end.
  • Structured outputs from open‑weight inference: added JSON Schema validation with a Zod repair step and explicit artifact logging.
  • Keeping everything local‑first (no managed vector DB) while still performant and reproducible.
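The lexical side of that fusion is standard Okapi BM25. A minimal sketch, assuming whitespace/alphanumeric tokenization and the usual k1/b defaults (this is not the project's bm25.ts, just the shape of the signal):

```typescript
// Minimal BM25 scorer: the lexical signal fused with semantic similarity.
// k1 and b are the conventional Okapi BM25 parameters.

const tokenize = (s: string): string[] =>
  s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const docTokens = docs.map(tokenize);
  const avgLen = docTokens.reduce((sum, t) => sum + t.length, 0) / docs.length;
  const N = docs.length;
  return docTokens.map((tokens) => {
    let score = 0;
    for (const term of new Set(tokenize(query))) {
      const df = docTokens.filter((t) => t.includes(term)).length;
      if (df === 0) continue;
      const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5));
      const tf = tokens.filter((t) => t === term).length;
      // Length normalization keeps long documents from dominating.
      score += idf * (tf * (k1 + 1)) /
        (tf + k1 * (1 - b + (b * tokens.length) / avgLen));
    }
    return score;
  });
}
```

Because BM25 rewards exact term matches, it catches queries whose phrasing an embedding model maps far from the stored memory, which is exactly the "silent miss" failure mode.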

Accomplishments that we're proud of

  • Deterministic, auditable packing with explicit policy and token budget.
  • Local‑first memory + embeddings using only OSS components.
  • End‑to‑end demo script that shows seeding, ingestion, packing, answering, artifacts, and a quick evaluation—entirely from the terminal.
  • Clear compliance: /health reveals the active model (e.g., openai/gpt-oss-20b), and artifacts show exactly what the model saw.

What we learned

  • A small, transparent “context+memory” kernel dramatically improves reliability and cost control for long‑horizon tasks.
  • Lightweight lexical signals (BM25) complement semantic search and reduce “silent misses.”
  • Saving artifacts is invaluable for debugging, judging, and reproducibility—especially with open‑weight models.

What's next for LongPack: Context & Memory for gpt‑oss

  • Local inference mode (Ollama/vLLM) to qualify for “Best Local Agent” (no internet).
  • Auto‑memory extraction after answers with dedup and retention caps.
  • Stronger tokenizer/model‑aware budgeting, plus FAISS/LanceDB backend as an option.
  • Reranker tuning and learned policy weights for different domains.
