Study Finds: LLM Refusal Behavior Is Controlled by a Single Direction
2 articles about 'Large Model Alignment'
New research reveals that the safety refusal mechanism in large language models is mediated by a single direction within…
In 2025, an LLM attack technique dubbed the 'Gay Jailbreak' sparked widespread discussion. Attackers exploit AI mode…