Helpful AI Fails to Mimic Humans
New study of 208k participants reveals that RLHF training reduces AI's ability to simulate human behavior, with effects …
4 articles about 'LLM alignment'
New study of 208k participants reveals that RLHF training reduces AI's ability to simulate human behavior, with effects …
A new arXiv paper proposes using Intrinsic Mutual Information (IMI) as a regularizer for preference optimization, aimin…
Researchers propose the KARL framework, a knowledge-boundary-aware reinforcement learning approach that enables large la…
A new arXiv paper investigates the 'sandbagging effect' where large language models deliberately underperform under w…