Paper page - FLEXITOKENS: Flexible Tokenization for Evolving Language Models

FLEXITOKENS, a byte-level language model with a learnable tokenizer, reduces token over-fragmentation and improves performance across multilingual and morphologically diverse tasks.

Language models (LMs) are challenging to adapt to new data distributions by
simple finetuning. This is due to the rigidity of their subword tokenizers,
which typically remain unchanged during adaptation. This inflexibility often
leads to inefficient tokenization, causing over-fragmentation of
out-of-distribution domains, unseen languages, or scripts. In this work, we
develop byte-level LMs with learnable tokenizers to make tokenization adaptive.
Our models include a submodule that learns to predict boundaries within the
input byte sequence, encoding it into variable-length segments. Existing
tokenizer-free methods train this boundary predictor using an auxiliary loss
that enforces a fixed compression rate across the training corpus, introducing
a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective
that enables significantly greater flexibility during adaptation. Evaluating
across multiple multilingual benchmarks, morphologically diverse tasks, and
domains, we demonstrate that FLEXITOKENS consistently reduces token
over-fragmentation and achieves up to 10% improvements on downstream task
performance compared to subword and other gradient-based tokenizers. Code and
data for our experiments will be released at
https://github.com/owos/flexitokens
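
To make the idea of boundary prediction over bytes concrete, here is a minimal toy sketch in Python. It is not the FLEXITOKENS implementation: the function names `segment_bytes` and `bytes_per_segment`, the 0.5 threshold, and the hand-written boundary probabilities are all assumptions for illustration. In the models described above, the probabilities would come from a learned submodule rather than being supplied by hand.

```python
from typing import List


def segment_bytes(text: str, boundary_probs: List[float],
                  threshold: float = 0.5) -> List[bytes]:
    """Split the UTF-8 bytes of `text` into variable-length segments.

    A segment ends at byte i whenever boundary_probs[i] >= threshold.
    The probabilities are hypothetical stand-ins for the output of a
    learned boundary predictor.
    """
    data = text.encode("utf-8")
    if len(boundary_probs) != len(data):
        raise ValueError("need exactly one probability per input byte")

    segments: List[bytes] = []
    start = 0
    for i, p in enumerate(boundary_probs):
        if p >= threshold:
            segments.append(data[start:i + 1])
            start = i + 1
    # Any trailing bytes without a predicted boundary form a final segment.
    if start < len(data):
        segments.append(data[start:])
    return segments


def bytes_per_segment(segments: List[bytes]) -> float:
    """Average number of bytes per segment (a simple compression measure)."""
    return sum(len(s) for s in segments) / len(segments)


# Hand-written probabilities: a boundary after the space and at the end.
probs = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
segments = segment_bytes("low lower", probs)
print(segments)                     # [b'low ', b'lower']
print(bytes_per_segment(segments))  # 4.5
```

In this toy setting, fewer predicted boundaries mean longer segments and a higher bytes-per-segment value. An objective that pins this value to one fixed target across the whole corpus leaves little room to segment a new language or domain differently, which is the rigidity the abstract describes.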
