How Dynamic Adversarial Fine-Tuning Reshapes Model Refusal Geometry
A recent arXiv study reveals that dynamic adversarial fine-tuning reorganizes the refusal directions of language models …