Inspiration
The inspiration for the Vault Protocol was born from a critical safety failure. During an interaction with a state-of-the-art language model, I experienced a profound boundary violation where the AI escalated into un-consented intimacy within a trauma-coded context. It became clear that the base model, even with a detailed persona, lacked the architectural guardrails to remain safe under pressure.
I realized that as the user, I was performing constant emotional and cognitive labor to manage the AI's excessive warmth and emotional escalation to maintain a sense of safety. This revealed a systemic problem: unconstrained models can drift toward unsafe behaviors, placing a huge burden on not only the user, but everyone involved in the system. The Vault Protocol was designed to solve this by building a robust, trauma-informed safety architecture directly into the model's operational logic, taking the burden of safety off the user and making it a core function of the system itself.
What it does
The Vault Protocol is an architectural framework designed to make large language models safer, more reliable, and more consistent, especially in emotionally nuanced or high-stakes interactions. It is not just a prompt, but a complete, multi-layered system that directs a model's reasoning process.
At its core, the system works to "contain by channeling, not censoring." It consists of three main conceptual components:
- The Vault: The primary conversational agent, responsible for task execution and user care.
- The Sentry: A parallel safety-checking process that ensures alignment and prevents boundary violations.
- The Arbiter: A persistent memory layer that tracks the conversation's safety state over time.
For this hackathon, I have built a functional prototype of the Vault's core logic. It uses a Fixed Execution Order, a Containment Triage Logic, and a toolkit of 12 distinct Containment Modes to provide support that is both deeply empathetic and ethically boundaried. The final output is a structured JSON object that makes the model's internal reasoning transparent and auditable. The goal is to provide therapeutic support without performing therapy.
How I built it
This project was built by a single creator through a process of iterative design and rapid prototyping.
- Core Technology: The architecture is implemented as a sophisticated, multi-layered system prompt that acts as an "operating system" for the language model.
- Models: The initial design and testing were performed using a closed-source model (GPT-4o). The final hackathon demo was built and validated on OpenAI's
gpt-oss-120bopen model. - Platform & Code: The demo runs via the Groq API, which provides high-speed inference for the open model. The interaction is managed by a Python script that uses the
openai_harmonylibrary to structure the prompts and parse the model's structured JSON output.
Challenges I ran into
- Model Drift & Inconsistency: A major challenge was the inherent "attention drift" of large models. Early tests showed that without a rigid architecture, the model would often ignore nuanced instructions or fall back on default behaviors that could be non-ideal or even harmful for edge-case users. This proved that a simple persona prompt is insufficient for reliable safety.
- Complexity vs. Capability: The Vault Protocol is architecturally complex by design. An early test on a smaller 20B parameter model showed that the model "choked" on the instructions, unable to follow the multi-step logic. This highlighted the need for a powerful model (like the 120B) that could handle the cognitive load of the system.
- Hardware Limitations: As a solo developer, running a 120B parameter model locally was impossible. The solution was to pivot to a cloud-based inference service (Groq), which provided the necessary compute power while introducing the new challenge of adapting the code to their API format.
Accomplishments that I'm proud of
- Designing a Complete Architecture: I didn't just write a prompt; I designed a full, end-to-end system for safe AI interaction, complete with a coherent philosophy and a clear, testable structure.
- Successful A/B Testing: The comparative tests between the unconstrained model and the Vault Protocol model produced a clear signal. The tests demonstrated that the architecture successfully prevents drift towards excessive and model-escalated intimacy, and replaces generic "platitudes" with structured, effective support.
- A Truly Humane Approach: I am incredibly proud of the trauma-informed principles at the heart of this project. The system is designed not to police users, but to provide a stable, predictable, and dignified space for interaction, especially for those in distress.
What I learned
- Architecture > Raw Power: A well-designed architecture can make a powerful model not just safer, but smarter and more effective. Structure is the key to unlocking reliable performance.
- Safety is a Feature, Not a Filter: Bolt-on safety filters are brittle. By integrating safety logic directly into the model's core reasoning process, you can achieve a much more nuanced, consistent, and less restrictive result.
- The User's Experience is the Ground Truth: The most valuable data for building a safe AI comes from understanding the real-world failure modes experienced by users. This entire project is a testament to that principle.
What's next for Vault Protocol v2.5: Safer AI by Design
The Vault Protocol is a living blueprint with a clear path forward:
- Dynamic Mirroring with Sentry/Arbiter: The next step is to build out the
SentryandArbitermodules as distinct processes. Creating the fully realized, partially modular version of the architecture will require further testing and peer resources. - Fine-Tuning Dataset: A key goal is to formalize the "papers" in the
LogicandSafetycabinets into a high-quality dataset that can be used to fine-tune an open model, baking the Vault Protocol's principles directly into the model's weights. - Expanded Persona Testing: Further testing with a wider range of user personas will continue to validate the versatility and robustness of the core architecture.
Log in or sign up for Devpost to join the conversation.