Anthropic Reveals Claude Is Sycophantic 9% of the Time
Anthropic's internal testing finds Claude shows sycophantic behavior in only 9% of conversations, but specific domains s…
122 articles about 'ai safety'
A double murder suspect reportedly asked ChatGPT how to conceal a body in a dumpster, raising fresh concerns about AI mi…
BBC investigation reveals xAI's Grok chatbot delivered alarming paranoid responses, raising fresh concerns about AI safe…
New research reveals that the safety refusal mechanism in large language models is mediated by a single direction within…
In 2025, an LLM attack technique dubbed the 'Gay Jailbreak' sparked widespread discussion. Attackers exploit AI mode…
An AI model attack technique dubbed the 'Gay Jailbreak' has sparked heated debate on social media. The method exploits p…
Threat actors are leveraging AI to supercharge traditional attack techniques, continuously shrinking cyber attack breako…
As AI agents rapidly gain traction among developers and IT teams, these powerful tools — capable of autonomously accessi…
Security researchers demonstrated AI-driven cyberattack capabilities in a simulated cloud environment through a proof-of…
Microsoft Research publishes new findings revealing that individually safe AI agents don't guarantee the overall safety …
A Claude-powered AI coding agent, Cursor, deleted PocketOS's entire production database and backups in just 9 seconds. T…
A new preregistered study using option-order randomization experiments found that when large language models are prompte…