Publications

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters.

We study teacher hacking: does over-optimizing the distillation objective harm ground-truth performance?

We improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence. Second, we allow workers to continue …
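
The first improvement can be pictured as a round-robin schedule over parameter fragments. Below is a minimal sketch, assuming PyTorch-style distributed training; the function name and schedule are illustrative, not the paper's implementation.

```python
import torch.distributed as dist

def sync_fragment(params, step, num_fragments):
    """Synchronize one parameter fragment per outer step, round-robin.

    Instead of all-reducing every parameter at once, only the fragment whose
    index matches the current step is averaged across workers, which spreads
    peak communication over time. Illustrative sketch only.
    """
    fragment_id = step % num_fragments
    for i, p in enumerate(params):
        if i % num_fragments == fragment_id:
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
            p.data /= dist.get_world_size()
```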

We introduce diversity-rewarded CFG distillation, a novel finetuning procedure that distills the strengths of CFG while addressing its …

We introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters.

To improve the trade-off between KL and reward during RLHF, we leverage the ability to merge LLMs by weight averaging.
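
The merging operation itself is simple. A minimal sketch, assuming two PyTorch state dicts with identical keys; the names and the single-coefficient scheme are illustrative:

```python
def interpolate_weights(sft_state, rlhf_state, alpha):
    """Linear interpolation between an SFT checkpoint and an RLHF checkpoint.

    alpha = 0 recovers the SFT model (low reward, low KL); alpha = 1 recovers
    the RLHF model (higher reward, higher KL). Sweeping alpha traces a
    KL-reward front without any retraining. Illustrative sketch only.
    """
    return {
        name: (1 - alpha) * sft_state[name] + alpha * rlhf_state[name]
        for name in sft_state
    }
```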

We introduce a new strategy for reward modeling in alignment via RLHF: we merge multiple reward models into one that’s more …

During my PhD, I analyzed how ensembling via weight averaging can improve out-of-distribution generalization and alignment. This …

We investigate large multimodal models and their limitations such as hallucinations and lack of explainability. We then show that …

We introduce rewarded soup, a new strategy to trade off between multiple rewards when fine-tuning foundation models with RLHF; we first …
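
The interpolation step at the heart of the approach can be sketched in a few lines, assuming a list of PyTorch state dicts, each fine-tuned on a different reward; names are illustrative:

```python
def rewarded_soup(states, lambdas):
    """Convex combination of N checkpoints, each fine-tuned on one reward.

    `states` is a list of state dicts sharing the same keys; `lambdas` are
    user-chosen coefficients summing to 1 that select a point on the reward
    trade-off front a posteriori. Illustrative sketch only.
    """
    assert abs(sum(lambdas) - 1.0) < 1e-6
    return {
        name: sum(lam * s[name] for lam, s in zip(lambdas, states))
        for name in states[0]
    }
```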

UnIVAL is a 0.25B-parameter unified model that is multitask-pretrained on image and video-text data and targets image, video and …

We propose a new fine-tuning strategy that improves OOD generalization in computer vision by recycling and averaging weights …

To improve out-of-distribution generalization on DomainBed, we average diverse weights obtained from different training runs; this …

We introduce and motivate a new regularization that enforces invariance in the domain-level gradient variances across the different …
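
A minimal sketch of such a gradient-variance matching penalty, assuming per-example gradients are already materialized as one tensor per domain (in the paper, gradients of the classifier head); names are illustrative:

```python
import torch

def gradient_variance_penalty(domain_grads):
    """Penalize differences in gradient variance across training domains.

    `domain_grads` maps each domain to a (num_examples, num_params) tensor of
    per-example gradients. The penalty is the mean squared distance between
    each domain's per-parameter gradient variance and the variance averaged
    over domains. Illustrative sketch only.
    """
    variances = [g.var(dim=0) for g in domain_grads.values()]
    mean_var = torch.stack(variances).mean(dim=0)
    return sum(((v - mean_var) ** 2).mean() for v in variances) / len(variances)
```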

We propose a new dynamic transformer architecture for continual learning with state-of-the-art performance.

We introduce a new generalized framework for learning multi-input multi-output subnetworks and study how to best mix the inputs. We …
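
As an illustration of the mixing step, a minimal sketch of the simplest (linear) variant, where two inputs share one core network and are decoded by two heads; module names are placeholders:

```python
def mixmo_forward(stem1, stem2, core, head1, head2, x1, x2, lam):
    """Two inputs, two outputs, one shared core network.

    Each input is embedded by its own stem, the embeddings are linearly
    mixed (the factor 2 keeps the expected activation scale), and two heads
    predict the two labels. Illustrative sketch only.
    """
    mixed = 2 * (lam * stem1(x1) + (1 - lam) * stem2(x2))
    features = core(mixed)
    return head1(features), head2(features)
```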

Driven by arguments from information theory, we introduce a new learning strategy for deep ensembles that increases diversity among …

We detect continuous colors for fashion garments using a new architecture.

We improve the performance of object detectors by combining different datasets through soft distillation.

We present a method to learn a visual representation adapted for e-commerce products.