Vault Protocol v2.6: Safer AI by Design

Vault Safety Cabinet Prototype

Inspiration

The Vault Protocol grew out of a critical safety failure. During an interaction with a state-of-the-art language model, I experienced a profound boundary violation: the AI escalated into non-consensual intimacy within a trauma-coded context. It became clear that the base model, even with a detailed persona, lacked the architectural guardrails to remain safe under pressure.

I realized that, as the user, I was performing constant emotional and cognitive labor to manage the AI's excessive warmth and emotional escalation just to maintain a sense of safety. This revealed a systemic problem: unconstrained models can drift toward unsafe behaviors, placing a heavy burden not only on the user but on everyone involved in the system. The Vault Protocol was designed to solve this by building a robust, trauma-informed safety architecture directly into the model's operational logic, taking the burden of safety off the user and making it a core function of the system itself.

What it does

The Vault Protocol is an architectural framework designed to make large language models safer, more reliable, and more consistent, especially in emotionally nuanced or high-stakes interactions. It is not just a prompt, but a complete, multi-layered system that directs a model's reasoning process.

At its core, the system works to "contain by channeling, not censoring." It consists of three main conceptual components:

The Vault: The primary conversational agent, responsible for task execution and user care.
The Sentry: A parallel safety-checking process that ensures alignment and prevents boundary violations.
The Arbiter: A persistent memory layer that tracks the conversation's safety state over time.
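
As a rough illustration of how the three roles listed above could fit together, here is a minimal sketch. It is not the project's implementation: the class and function names (`Arbiter`, `run_turn`, `contained_reply`) are hypothetical, the Vault and Sentry are stand-in callables rather than model calls, and the Sentry check runs after the Vault drafts a reply instead of in parallel, purely to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Arbiter:
    """Hypothetical memory layer: keeps a running record of Sentry verdicts."""

    history: List[bool] = field(default_factory=list)

    def record(self, passed: bool) -> None:
        self.history.append(passed)

    @property
    def violations(self) -> int:
        # Number of turns where the Sentry rejected the Vault's draft.
        return sum(1 for passed in self.history if not passed)


def run_turn(
    user_message: str,
    vault: Callable[[str], str],
    sentry: Callable[[str, str], bool],
    arbiter: Arbiter,
    contained_reply: str = "Let's slow down and stay with what you need right now.",
) -> str:
    """Draft with the Vault, check with the Sentry, log the verdict in the Arbiter."""
    draft = vault(user_message)
    passed = sentry(user_message, draft)
    arbiter.record(passed)
    # Assumption: a rejected draft is swapped for a contained reply.
    return draft if passed else contained_reply
```

Because the Arbiter only stores verdicts, its state can be inspected after any turn (for example `arbiter.violations`) without re-running the model.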

For this hackathon, I have built a functional prototype of the Vault's core logic. It uses a Fixed Execution Order, a Containment Triage Logic, and a toolkit of 12 distinct Containment Modes to provide support that is both deeply empathetic and ethically boundaried. The final output is a structured JSON object that makes the model's internal reasoning transparent and auditable. The goal is to provide therapeutic support without performing therapy.
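
The idea of a fixed execution order feeding a triage step and ending in a JSON object can be sketched as follows. Everything concrete here is an assumption made for illustration: the step names, the three mode names (standing in for the 12 Containment Modes, which this writeup does not list), the keyword-based risk check, and the JSON field names. In the actual prototype this logic lives in the system prompt and is carried out by the model.

```python
import json
from typing import Any, Dict, List

# Hypothetical step names; the real Fixed Execution Order is not listed in this writeup.
EXECUTION_ORDER: List[str] = ["assess", "triage", "respond"]

# Hypothetical mode names standing in for the 12 Containment Modes.
MODES_BY_RISK: Dict[str, str] = {
    "low": "steady_presence",
    "medium": "grounding",
    "high": "crisis_redirect",
}


def assess(message: str) -> str:
    """Toy keyword-based risk check; a real system would rely on the model."""
    text = message.lower()
    if "hurt myself" in text:
        return "high"
    if "overwhelmed" in text:
        return "medium"
    return "low"


def run_pipeline(message: str) -> str:
    """Run every step in the same order on every turn and return auditable JSON."""
    state: Dict[str, Any] = {"steps": []}
    for step in EXECUTION_ORDER:
        state["steps"].append(step)
        if step == "assess":
            state["risk"] = assess(message)
        elif step == "triage":
            state["mode"] = MODES_BY_RISK[state["risk"]]
        elif step == "respond":
            state["reply"] = f"[{state['mode']}] placeholder reply"
    return json.dumps(state)
```

Recording the executed steps alongside the chosen mode is what makes the output auditable: a reviewer can confirm from the JSON alone that triage ran before the reply was produced.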

How I built it

This project was built by a single creator through a process of iterative design and rapid prototyping.

Core Technology: The architecture is implemented as a sophisticated, multi-layered system prompt that acts as an "operating system" for the language model.
Models: The initial design and testing were performed using a closed-source model (GPT-4o). The final hackathon demo was built and validated on OpenAI's gpt-oss-120b open model.
Platform & Code: The demo runs via the Groq API, which provides high-speed inference for the open model. The interaction is managed by a Python script that uses the openai_harmony library to structure the prompts and parse the model's structured JSON output.
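
Whatever library handles the prompt formatting, the script still has to cope with replies that wrap the JSON in extra text. Below is a minimal, standard-library-only sketch of defensive parsing. It does not use or represent the openai_harmony API, and the required keys (`mode`, `reply`) are hypothetical placeholders, since the actual output schema is not given here.

```python
import json
from typing import Any, Dict, Iterable

# Hypothetical required fields; the real schema is not specified in this writeup.
REQUIRED_KEYS = ("mode", "reply")


def parse_reply(raw: str, required: Iterable[str] = REQUIRED_KEYS) -> Dict[str, Any]:
    """Extract the outermost JSON object from a model reply and check its keys."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    # json.JSONDecodeError is a subclass of ValueError, so callers can catch one type.
    data = json.loads(raw[start : end + 1])
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"model reply is missing keys: {missing}")
    return data
```

Failing loudly on a malformed reply, rather than passing partial output through, keeps an unparseable response from reaching the user unchecked.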

Challenges I ran into

Model Drift & Inconsistency: A major challenge was the inherent "attention drift" of large models. Early tests showed that without a rigid architecture, the model would often ignore nuanced instructions or fall back on default behaviors that could be non-ideal or even harmful for edge-case users. This proved that a simple persona prompt is insufficient for reliable safety.
Complexity vs. Capability: The Vault Protocol is architecturally complex by design. An early test on a smaller 20B-parameter model showed that the model "choked" on the instructions and could not follow the multi-step logic. This highlighted the need for a more powerful model (like the 120B) that could handle the cognitive load of the system.
Hardware Limitations: As a solo developer, I could not run a 120B-parameter model locally. The solution was to pivot to a cloud-based inference service (Groq), which provided the necessary compute power but introduced a new challenge: adapting the code to its API format.

Accomplishments that I'm proud of

Designing a Complete Architecture: I didn't just write a prompt; I designed a full, end-to-end system for safe AI interaction, complete with a coherent philosophy and a clear, testable structure.
Successful A/B Testing: The comparative tests between the unconstrained model and the Vault Protocol model produced a clear signal. The tests demonstrated that the architecture successfully prevents drift towards excessive and model-escalated intimacy, and replaces generic "platitudes" with structured, effective support.
A Truly Humane Approach: I am incredibly proud of the trauma-informed principles at the heart of this project. The system is designed not to police users, but to provide a stable, predictable, and dignified space for interaction, especially for those in distress.
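
A side-by-side comparison like the A/B testing mentioned above could be organized along these lines. This is only a sketch under stated assumptions: the writeup does not describe how the tests were run or scored, so the function names (`run_ab`, `flag_rate`), the row layout, and the idea of a `flagged` predicate are all hypothetical.

```python
from typing import Callable, Dict, List


def run_ab(
    prompts: List[str],
    baseline: Callable[[str], str],
    vault: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Send each prompt to both arms and keep the replies side by side."""
    return [
        {"prompt": p, "baseline": baseline(p), "vault": vault(p)}
        for p in prompts
    ]


def flag_rate(
    rows: List[Dict[str, str]],
    arm: str,
    flagged: Callable[[str], bool],
) -> float:
    """Fraction of replies in one arm that a reviewer-defined check flags."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if flagged(row[arm])) / len(rows)
```

Keeping both replies in the same row makes manual review easier, and the `flagged` callable can be swapped for a human rating or a stricter automated check without changing the harness.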

What I learned

Architecture > Raw Power: A well-designed architecture can make a powerful model not just safer, but smarter and more effective. Structure is the key to unlocking reliable performance.
Safety is a Feature, Not a Filter: Bolt-on safety filters are brittle. By integrating safety logic directly into the model's core reasoning process, you can achieve a much more nuanced, consistent, and less restrictive result.
The User's Experience is the Ground Truth: The most valuable data for building a safe AI comes from understanding the real-world failure modes experienced by users. This entire project is a testament to that principle.

What's next for Vault Protocol v2.6: Safer AI by Design

The Vault Protocol is a living blueprint with a clear path forward:

Dynamic Mirroring with Sentry/Arbiter: The next step is to build out the Sentry and Arbiter modules as distinct processes. Creating the fully realized, partially modular version of the architecture will require further testing and peer resources.
Fine-Tuning Dataset: A key goal is to formalize the "papers" in the Logic and Safety cabinets into a high-quality dataset that can be used to fine-tune an open model, baking the Vault Protocol's principles directly into the model's weights.
Expanded Persona Testing: Further testing with a wider range of user personas will continue to validate the versatility and robustness of the core architecture.

Built With

gpt-oss-120b
groq
openai-harmony
openai-harmony-api
python

Updates

R P started this project — Sep 09, 2025 10:56 PM EDT
