How Dynamic Adversarial Fine-Tuning Reshapes Model Refusal Geometry
A recent arXiv study reveals that dynamic adversarial fine-tuning reorganizes the refusal directions of language models …