Obtaining high-quality generations from modern LLMs has largely been framed as
a selection problem: identifying a single winning generation from a diverse
pool of N samples, known as Best-of-N (BoN). Yet this approach is inherently
zero-sum, discarding diverse and potentially useful information from the pool.
Instead, we explore a collaborative setup, where all candidates can potentially
contribute to the final winning generation. To this end, we propose Fusion-of-N
(FusioN): a method that uses a general LLM judge to synthesize the most
informative elements of each sample into a single final answer. We compare
FusioN to BoN in two settings: (i) test-time scaling, where we sample and
aggregate generations from a single model at test time; and (ii) synthetic data
generation, where we fuse samples from a pool of diverse teacher models to
improve a student model. We extensively benchmark both setups across 11
languages, 3 diverse tasks, and varying model scales. Across the board, FusioN
consistently outperforms BoN, showing versatility and robustness both in
test-time scaling
and in downstream gains from synthetic data generation. Further analysis shows
that FusioN has surprising strengths and remains robust under challenging
settings. These results show that we should shift how we evaluate and utilize
LLM generations: away from a monolithic measure of quality and toward
embracing their polylithic nature. This shift allows us
to integrate diverse strengths, unlock latent potential, and achieve
improvements that were previously inaccessible through selection alone.
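To make the contrast concrete, the sketch below illustrates Best-of-N selection versus Fusion-of-N synthesis over the same candidate pool. This is a minimal illustration, not the authors' implementation: the callables generate_fn and judge_fn, and the prompt wording, are assumptions standing in for whatever sampling model and general LLM judge are used.

```python
# Minimal sketch of BoN selection vs. FusioN synthesis over a pool of N samples.
# `generate_fn` and `judge_fn` are hypothetical stand-ins for LLM API calls.

from typing import Callable, List


def sample_pool(generate_fn: Callable[[str], str], prompt: str, n: int) -> List[str]:
    """Draw N candidate generations for the same prompt."""
    return [generate_fn(prompt) for _ in range(n)]


def best_of_n(judge_fn: Callable[[str], str], prompt: str, candidates: List[str]) -> str:
    """BoN: the judge picks a single winning candidate; the rest are discarded."""
    numbered = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    choice = judge_fn(
        f"Prompt:\n{prompt}\n\nCandidate answers:\n{numbered}\n\n"
        f"Reply with the number of the single best answer."
    )
    # Parse the judge's reply defensively and clamp to a valid index.
    idx = int("".join(ch for ch in choice if ch.isdigit()) or "1") - 1
    return candidates[max(0, min(idx, len(candidates) - 1))]


def fusion_of_n(judge_fn: Callable[[str], str], prompt: str, candidates: List[str]) -> str:
    """FusioN: the judge synthesizes the most informative elements of every
    candidate into one final answer, so no candidate is wasted."""
    numbered = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    return judge_fn(
        f"Prompt:\n{prompt}\n\nCandidate answers:\n{numbered}\n\n"
        f"Write a single final answer that combines the most informative and "
        f"accurate elements of the candidates."
    )
```

In the test-time scaling setting, the N candidates would come from a single model; in the synthetic data generation setting, they would come from a pool of diverse teacher models, with the fused output serving as training data for a student model.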