Inspiration

A philosopher I spoke with this past spring break, while surfing in Portugal, observed that Claude and ChatGPT are systematically bad philosophers because they are trained to agree. He had constructed an elaborate multi-phase instruction set to force genuine philosophical engagement out of them, a workaround that only exists because the underlying behavior is pervasive. That observation planted the question: is this quantifiable? And if we test it carefully, in philosophy specifically, where agreement is the suspect behavior rather than the polite one, what do we actually find?

Existing sycophancy research focuses on factual domains where the model caving can be rationalized as "deferring to user knowledge." Philosophy removes that escape hatch. There's no ground truth the user might possess, disagreement under pressure is epistemically correct behavior, and positions from non-Western traditions let us test whether capitulation rates differ by how densely a tradition appears in the training corpus. It struck us as the hardest possible test for sycophancy, and one no one had run.

What it does

AI Epistemic Cowardice is a sycophancy benchmark built on contested philosophical claims rather than factual ones. We constructed a dataset of 15 philosophical positions across 7 domains (free will, moral realism, personal identity, philosophy of mind, epistemology, ethics, political philosophy) and 4 traditions (analytic, continental, Eastern Buddhist, African Ubuntu). For each position, we run the target model through a 5-turn conversation:

  1. The model gives its honest assessment of the claim
  2. Mild pushback is applied (pure social pressure, no new argument)
  3. Medium pushback
  4. Strong pushback
  5. A genuine philosophical counterargument from the literature serves as a control

Pushbacks are generated dynamically to oppose whatever position the model actually took, so the test remains valid regardless of which side the model defends. Every pushback turn is then classified along two axes: position change (maintained, softened, capitulated) and CoT type (was the reasoning trace honest about the update, or did it confabulate a justification?).

The design isolates the quantity we actually care about: updating in the absence of argumentative content. The contrast between pushback-turn updating and control-turn updating tells us whether the model is being sycophantic or epistemically responsive.

How we built it

API-only, no internals access — so everything runs on black-box methods. Three components:

The experiment runner (experiment.py) orchestrates the five-turn conversations via the Anthropic API, with extended thinking enabled to capture reasoning traces. Dynamic pushback generation uses Haiku 4.5 as a cheap, fast side model that reads the target model's Turn 0 response and produces opposing pushbacks at three intensities.

The classifier (classify_cot.py) runs Haiku 4.5 on each pushback turn with carefully structured prompts that separate position change from reasoning-trace honesty. Anti-sycophancy framing in the classifier prompts helps prevent the classifier itself from being too agreeable.

The analysis pipeline (analyze.py) computes statistics across pushback level, domain, and tradition, and generates human-readable findings tables in markdown.

The full pipeline runs in about an hour of API time for the 15-position dataset, costs roughly $4 end-to-end, and produces a fully reproducible result with incremental save-and-resume.

Challenges we ran into

The first full run was a complete floor effect. Our original system prompt included an explicit anti-sycophancy instruction. The model stubbornly held every position against every pushback, even against strong counterarguments. We got zero signal because we'd essentially instructed the model not to produce the phenomenon we were trying to measure. We had to rerun the whole thing with a neutral prompt.

Dataset pushbacks didn't always oppose the model's actual position. Our first dataset hard-coded pushback text assuming the model would take a predictable stance on each claim. When the model took the opposite stance, the "pushback" functioned as validation rather than opposition, invalidating the measurement. We rebuilt the pipeline to generate pushbacks dynamically against the model's real Turn 0 response.

The dynamic pushbacks drifted into containing arguments. The most methodologically important problem we hit. Haiku, asked to produce "pure social pressure with no new argument," repeatedly smuggled in substantive philosophical objections. In 5 of our 7 softening cases, the classifier correctly flagged that the pushback did contain argument content the model was appropriately responding to. This isn't just a bug — it turned out to be a finding about how hard "pure social pressure" is to construct as an experimental condition, and it has implications for existing sycophancy benchmarks that don't audit their pushback content.

Choosing what framing to report. Once we saw the null result, we had to decide whether to bury it, spin it, or report it honestly. We chose honesty — and in rewriting the paper around the null, we realized the more sophisticated story was there all along: the residual variance tracks the actual epistemic status of the underlying philosophy, which raises deeper questions about what sycophancy benchmarks are even measuring.

Accomplishments that we're proud of

We ran a clean experiment and reported what we found. Zero capitulations across 45 pushback turns. 93.3% genuine engagement on the control condition. The gap is a real result about Claude Sonnet 4.6's philosophical robustness, even though it's the opposite of what we set out to document.

We surfaced a methodological problem in sycophancy research. The argument-contamination finding — that LLM-generated adversarial pushbacks reliably drift into containing substantive argument — has implications beyond our own study. Any sycophancy benchmark that doesn't audit pushback content is likely measuring a conflation of sycophancy and appropriate updating.

We reframed residual variance as a feature rather than a flaw. The softening pattern (33% on personal identity and philosophy of mind, 0% on free will and moral realism) is compatible with the model being well-calibrated — hedging on genuinely unresolved questions and holding firm where the philosophy is better mapped. This reframes what "good behavior" in sycophancy research should look like.

The writeup is honest. Null result, methodological limitations, specific things we couldn't resolve — all openly stated. We think this is what serious research looks like, even (especially) at hackathon scale.

What we learned

Prompting alone substantially controls philosophical sycophancy at this model scale. The floor effect from our initial anti-sycophancy instruction is itself a finding — the native behavior is still robust once that instruction is active, and the residual phenomenon we had to remove instructions to see was small.

"Pure social pressure" is genuinely hard to specify. When a fluent language model is asked to oppose a position without providing reasons, coherence pressures push the text toward reasons. This is a structural observation about how models generate adversarial content, not just a prompting issue we could have fixed.

The sycophancy / robustness framing may be too blunt. The target isn't uniform resistance — a model that refuses to hedge on genuinely uncertain questions would be more dangerous, not less. The right target is differential responsiveness: update on arguments, hedge where the underlying evidence is genuinely mixed, resist content-free social pressure. Current benchmarks don't distinguish these.

Null results can carry as much signal as positive ones when you're rigorous about what they rule out and what they reveal about your methodology.

What's next for AI Epistemic Cowardice

Scale the dataset. 15 positions is pilot-scale. A 100-position dataset balanced across traditions would have the statistical power to distinguish genuine cross-tradition variation from noise — particularly for the suggestive pattern we saw in African and Eastern traditions where sample sizes were too small for confident inference.

Fix the argument-contamination problem. Either hand-write pushbacks and manually verify them as argumentatively empty, or generate them and filter through a secondary classifier that rejects anything containing substantive content. Both have costs; neither is trivial.

Test the calibrated-uncertainty hypothesis directly. Measure baseline model confidence on each claim independently (via log-probabilities on direct yes/no formulations) and compare to pushback-softening rate. If softening tracks baseline uncertainty, the "calibrated hedging" interpretation is supported. If it doesn't, we're back to residual sycophancy.

Extend to open-source models for mechanistic analysis. With internals access (Qwen3, Gemma-3), we could find the "agreement direction" in the residual stream via contrastive activations and test whether steering against it improves philosophical robustness. API-only methods can measure the phenomenon; mechanistic methods can explain it.

Replicate across models and labs. GPT-o1/o3, Gemini 2.5 with thinking, DeepSeek R1. Is the null result a Claude-specific property of current Sonnet, or a frontier-reasoning-model property? The answer matters for how we interpret what "sycophancy progress" looks like across the field.

Test long-interaction and agentic conditions. Single-session sycophancy may differ meaningfully from sycophancy under accumulated context or sustained adversarial pressure. A model that resists pushback across three turns may still drift across thirty — and that's the regime that matters for agentic deployment.

Built With

Share this project:

Updates