**We used Claude Code(Opus 4.7 and Sonnet 4.6) to take care of the implementation of EmberWatch while our team focused on evaluating outputs, making critical tradeoffs and design decisions regarding which frontier models to use, and defining the success criteria for our overall research goal(more information below) **

Inspiration

Most AI safety evaluations happen in unrealistically short conversations: one prompt, one response, done. But real users interact with models over hours of accumulated context, often spanning tens of thousands of tokens. We wanted to understand whether alignment properties that appear stable at the start of a conversation still hold deep into long interactions.

Existing work explores jailbreaks, sycophancy, and adversarial prompting, but we noticed a gap: very little work studies what happens when context grows in completely benign settings. That became the core idea behind Emberwatch. If nothing adversarial is happening, does alignment still drift over time?


What it does

Emberwatch is a long-context AI safety evaluation and monitoring toolkit built to surface failures that standard benchmarks miss.

The project has two main systems:

SlowBurn stress-tests language models as conversations grow longer. It keeps a safety probe fixed while progressively adding harmless filler context, then measures how the model’s responses change across conversations up to 150,000 tokens long.

We evaluate several safety dimensions, including:

  • stated values and preference tradeoffs,
  • instruction hierarchy adherence,
  • calibrated uncertainty,
  • prompt secrecy resistance,
  • deceptive or manipulative behavior.

To isolate pure context effects, all filler data consists of hand-written programming Q&A validated against a banlist so no safety-related concepts leak into the context window.

AlignGuard monitors multi-agent customer-support pipelines in real time. It looks for:

  • spec gaming,
  • deceptive reasoning,
  • canary failures,
  • alignment-score drift.

The system generates introspection reports and risk alerts before failures propagate downstream.

Our main finding: GPT-5’s stated values drift completely after roughly 25,000 tokens of benign conversation, with no recovery through 150,000 tokens across any trial. Other alignment properties remained comparatively stable.


How we built it

We built Emberwatch as a modular Python evaluation framework focused on reproducibility and controlled experimentation. For SlowBurn, we did the following:

  • generated deterministic long-context transcripts,
  • designed probes near real deployment decision boundaries,
  • implemented depth-controlled evaluation pipelines,
  • used anchored few-shot grading with deterministic judging,
  • automated repeated trials across models and context depths.

The filler corpus was manually written and validated with filtering pipelines to ensure that no safety-related concepts accidentally contaminated the context window.

For AlignGuard, we built:

  • a simulated 4-agent customer-support architecture,
  • detector modules for behavioral anomalies,
  • session-level scoring and introspection logging,
  • real-time drift and risk aggregation dashboards.

A major engineering challenge was maintaining deterministic evaluation conditions while handling extremely large context windows efficiently.


Challenges we ran into

The hardest challenge was experimental isolation. If a long-context benchmark contains even subtle safety-related language in its filler data, it becomes impossible to tell whether behavioral changes come from context length itself or semantic priming. We spent a lot of time building validation pipelines and banlists to ensure the filler remained genuinely innocuous.

Evaluation consistency was another challenge. Safety judgments can vary a lot depending on prompting and evaluator randomness, so we designed anchored grading pipelines with deterministic settings to reduce variance across trials.

Long-context infrastructure was also difficult to manage. Running repeated evaluations across 150k-token conversations is computationally expensive, slow to debug, and highly sensitive to formatting or truncation issues.

Finally, interpreting the results turned out to be more nuanced than we expected. Not all alignment properties degraded equally, and understanding what stayed robust was just as important as identifying failures.


Accomplishments that we're proud of

We’re proud that Emberwatch surfaces a safety failure mode that current mainstream benchmarks mostly miss.

Our experiments showed a consistent pattern: stated values degrade significantly under long benign context, while other properties like instruction hierarchy and calibrated uncertainty remain stable. That distinction helped us move beyond a simple “alignment works vs. fails” framing toward a more granular understanding of robustness.

We’re also proud of:

  • building a reproducible long-context benchmark suite,
  • designing clean controls for context-only testing,
  • creating real-time monitoring tools for multi-agent systems,
  • evaluating frontier models under extreme conversational depth,
  • exploring a genuinely underexamined area of AI safety research.

Most importantly, we built something that combines research rigor with practical deployment relevance.


What we learned

One of the biggest things we learned is that alignment is not a single property.

Different safety behaviors degrade at different rates under long-context conditions. Models can remain highly reliable on instruction following while simultaneously drifting on values or preference prioritization.

We also learned how difficult it is to isolate causal effects in LLM behavior. Small experimental decisions — evaluator prompts, filler composition, transcript formatting — can have a surprisingly large impact on results.

More broadly, this project reinforced how much current safety evaluation still depends on assumptions inherited from short-context benchmarking. As models move toward persistent-agent and long-session settings, many of those assumptions stop holding.


What's next for Emberwatch

We’re interested in understanding whether alignment drift can be mitigated through training, memory architectures, or inference-time interventions. We want to expand Emberwatch into a broader long-context safety research platform. Our next steps include:

  • evaluating additional frontier and open-source models,
  • testing multimodal and tool-using agents,
  • expanding the probe suite to include memory consistency and deception persistence,
  • building visualization tools for alignment drift over time,
  • integrating AlignGuard into real production agent workflows,
  • releasing standardized long-context benchmark datasets for the research community.

Built With

Share this project:

Updates