How Dynamic Adversarial Fine-Tuning Reshapes Model Refusal Geometry
A recent arXiv study reveals that dynamic adversarial fine-tuning reorganizes the refusal directions of language models …