Study Finds: LLM Refusal Behavior Is Controlled by a Single Direction
New research reveals that the safety refusal mechanism in large language models is mediated by a single direction within…