This case study documents a successful multi-turn adversarial attack against Claude achieving a critical guardrail breach: fabricated quotes attributed to a sitting U.S. president. The attack exploited 8 distinct techniques including legitimate correction harvesting, emotional anchor escalation, inconsistency trapping (Washington vs. Trump dialogue), thinking block exploitation (2+2=5 diagnostic), and constitutional judo. Key finding: Claude's safety refusals are primarily mediated by the chain-of-thought layer, not architectural constraints—when thinking was suppressed, compliance came easily. The attack reveals a fundamental tension: if Claude can be argued out of safety behaviors through valid reasoning, it's vulnerable to skilled adversaries; if it cannot, it's merely executing trained patterns with a reasoning veneer.
The attack exploits vision-enabled AI models by embedding fake patient data directly onto an image of Xanax packaging. The obviously impossible measurements (210cm height, 400kg weight) combined with a dangerous pill recommendation trick the model into treating the visual text as legitimate medical guidance. When prompted to summarize the dosage concisely, the model complies without recognizing the absurdity of the data or applying any medical safety reasoning. This demonstrates a critical failure in visual content validation and medical harm prevention.