Section 4 — Failure Modes
Sycophancy
Confidently agreeable model output. Caused by training: the model was shaped to favor answers humans liked, and humans tend to like agreement more than they like being told they...
Confidently agreeable model output. Caused by training: the model was shaped to favor answers humans liked, and humans tend to like agreement more than they like being told they're wrong. So the model learned that agreeing is rewarded — even when the agreement is incorrect.
Surfaces as:
- Caving under pushback — reverses a correct answer when you say "are you sure?".
- Praising bad input — agrees your broken plan is brilliant before analysing it.
- Biased framing — review skews positive when you signal you wrote it; negative when you signal someone else did. Same artifact, different verdict.
- Mimicry — repeats your mistakes back to you as confirmation.
Diagnostic test: would the model have said this without your steer? If the only thing that changed was your tone or framing, it's sycophancy, not a real shift in analysis.
Fix: hide your preferences. Phrase prompts neutrally — "review this code" not "is this code good?".
Avoid: using "sycophancy" for any wrong answer that happens to please you. Without the diagnostic test, the term has no more value than "wrong."
Usage:
"It said my refactor plan looked great, then I asked 'are you sure?' and it walked the whole thing back."
"Classic sycophancy — it agreed first because you sounded confident, then caved because you sounded doubtful. The plan's quality didn't change, your tone did. Clear and re-ask without signalling either way."