To improve the trade-off between KL and reward during RLHF, we leverage the ability to merge LLMs by weight averaging.
Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem