At SlatorCon Silicon Valley 2025, Cohere’s Multilingual Team Lead Kelly Marchisio delivered one of the most well-received presentations of the day: an accessible, behind-the-scenes look at how to build a multilingual large language model (LLM).
Marchisio put delegates in the shoes of a machine learning engineer at the frontier of AI innovation, grappling with questions such as: How do we build the best-performing LLM? How do we bake in multilinguality from the start? And how do we ensure the model interacts in ways that are genuinely useful?
Foundational model builder Cohere released its Command A flagship model in March 2025, followed by a specialized translation model, Command A Translate, in August 2025. The timing was fortuitous, allowing Marchisio to use the SlatorCon stage to lift the lid on the life cycle of the LLM’s development.
LLMs are often built as English-centric systems, and only later retrofitted with stronger multilingual capabilities. Cohere takes a different approach. “Multilinguality is core to what we do. We think about making our models multilingual throughout the entire training process,” Marchisio explained.
The idea is to pre-train the LLM across a range of languages so it can deliver strong multilingual performance in capabilities such as question answering, translation, and summarization.
Multilinguality brings unique challenges, however. The first and most obvious is deciding which languages to include. It’s not yet possible to support all languages in a single model, given constraints in size and data, and practical choices have to be made. Cohere selected 23 languages “to support the variety of languages that are used in global business contexts,” Marchisio said.
Feeding the LLM
Next up was one of AI’s most well-known challenges: obtaining huge amounts of training data. Marchisio’s team created a “training mixture” from public sources, annotator-created data, and synthetic generation.
Another hurdle for the Cohere team was tokenization. An LLM’s training text must be split into trainable tokens, whether words, subwords, characters, or bytes. A tokenizer handles this splitting, but unless it is optimized for all target languages, it can create major imbalances.
Sharing an example with the audience, Marchisio showed how the same phrase might be split into 11 tokens in English but 21 in Hindi. The consequences are not trivial. For users, more tokens mean higher costs, if billing is “per token.” For providers, more tokens increase compute time, making the model slower and more expensive to run.
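To make the imbalance concrete, the short sketch below counts tokens for the same idea expressed in English and Hindi. It uses an open multilingual tokenizer (xlm-roberta-base, via the Hugging Face transformers library) purely for illustration; Cohere’s own tokenizer, the example phrase, and the exact counts from Marchisio’s slide will differ.

```python
# Minimal sketch: count tokens for the same sentence in two languages.
# Uses an open multilingual tokenizer for illustration only; Cohere's own
# tokenizer and the phrase from Marchisio's slide will give different counts.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

samples = {
    "English": "How many tokens does this sentence need?",
    "Hindi": "इस वाक्य को कितने टोकन की आवश्यकता है?",
}

for language, text in samples.items():
    tokens = tokenizer.tokenize(text)
    print(f"{language}: {len(tokens)} tokens")
```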
With the tokenizer optimized, pre-training on Command A could begin. In this stage, data is distributed across servers, and attention mechanisms map relationships between tokens. The process is lengthy, stretching over many months.
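As a rough illustration of what it means for attention to “map relationships between tokens,” here is a textbook scaled dot-product attention function in NumPy. It is a minimal sketch of the standard transformer operation, not a description of Command A’s actual architecture.

```python
# Minimal sketch of scaled dot-product attention (Vaswani et al., 2017),
# the core operation that lets each token weigh every other token.
# Illustrative only; Command A's internals are not public in this detail.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (num_tokens, d) arrays of query, key, and value vectors."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # pairwise token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ V  # each token's output is a mixture of all value vectors

# Toy example: 4 tokens with 8-dimensional representations
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```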
At the Frontier
While Command A was pre-training, the Cohere team did not stay idle. Instead, they used this as the perfect moment to turn their energies toward open research questions, of which there are many.
“Because we are at the frontier of multilingual AI research, we face unanswered questions daily,” Marchisio pointed out.
One such question is “language confusion.” Imagine, Marchisio invited the audience, being a Korean user who types a math question in Korean into an LLM, hits enter, and gets the answer back in English.
“This is a real example that I have seen in the wild, and if you’re frequently a user of LLMs outside of English, you have probably come across this type of error,” Marchisio said.
LLMs may show language confusion at the line level, say, switching between Spanish and English for a line or two, or may pepper words from one language into a paragraph written in another. It is, Marchisio noted, “a pretty jarring user experience.”
The result of this exploration was a paper naming the problem, establishing a benchmark to evaluate these types of failures, and pinpointing mitigating techniques.
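A benchmark of this kind ultimately reduces to checking whether a reply is in the language the user expects. The sketch below shows one simple, line-level way to measure that, using the off-the-shelf langdetect package; it is a hedged illustration of the idea, not the methodology of the Cohere paper.

```python
# Minimal sketch of a line-level "language confusion" check: for each response,
# flag lines that are not in the language the user prompted in.
# Uses the off-the-shelf langdetect package for illustration; the actual
# benchmark's language identification and metrics may differ, and language
# ID on very short lines can be noisy.
from langdetect import detect

def line_level_confusion(response: str, expected_lang: str) -> float:
    """Return the fraction of non-empty lines NOT in the expected language."""
    lines = [line for line in response.splitlines() if line.strip()]
    if not lines:
        return 0.0
    wrong = sum(1 for line in lines if detect(line) != expected_lang)
    return wrong / len(lines)

# Toy example: a Korean prompt answered partly in English
reply = "답은 42입니다.\nHere is the step-by-step reasoning."
print(line_level_confusion(reply, expected_lang="ko"))  # 0.5: half the lines drift
```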
Efficiency at Work
Cohere, which added USD 100m to its latest funding round in September 2025, focuses on building LLMs for enterprise. The practical realities of deployment are therefore a major focus.
Command A is available via API, but when users need a local or private deployment, hardware becomes a much bigger consideration.
“Given the diversity of our customers worldwide, we observe variations in compute capabilities across different regions, and there is a need for Cohere to be very flexible on efficiency,” Marchisio said.
One way to make models easier to run on less powerful hardware is “quantization,” which stores the model’s numbers at lower precision, effectively shrinking it. But, as Marchisio explained, “nothing in life is free, so there is a cost to quantization.”
The team set out to explore what that cost looked like, focusing on how quantization affects quality across languages.
Their results challenged the prevailing view, largely shaped by automated benchmarks, that the effect of quantization is negligible. In fact, they showed that this process does cause quality loss that is noticeable to humans, with non-Latin script languages and complex tasks most affected.
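To see where that loss comes from, the sketch below shows quantization in its simplest form: rounding float32 weights to int8 with a single scale factor. It is an illustration of the general idea, not the specific schemes Cohere evaluated; the printout shows both the memory saving and the rounding error that accumulates into visible quality degradation.

```python
# Minimal sketch of weight quantization: store float32 weights as int8 plus a scale.
# Simple round-to-nearest symmetric quantization for illustration; production
# schemes (and those studied in the paper) are more sophisticated.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0  # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)
q, scale = quantize_int8(w)

print("memory: float32 =", w.nbytes // 2**20, "MiB, int8 =", q.nbytes // 2**20, "MiB")
print("mean absolute rounding error:", np.abs(w - dequantize(q, scale)).mean())
```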
Polishing the Model
With pre-training finally complete, Command A was ready for post-training, the stage where an LLM goes beyond being a mere text predictor and begins to interact in more useful, natural, and human-like ways.
The model was first shown examples of inputs and desired outputs so it learned to respond usefully. The Cohere team then took a distinctive approach, performing multiple rounds of “expert model” training.
The first round deepened the model’s skills specialization. “We had different teams that focused on different types of skills, coding, multilinguality, safety, instruction following, who tried to build their best ‘expert’ [version of the] model for that skill,” Marchisio said. “Then we merged the results to get a strong all-rounder model.”
In the second round, the same teams each improved the model’s helpfulness by giving feedback on which responses were most useful for their skill focus; all the resulting models were then again merged into one.
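In its simplest form, merging expert checkpoints means averaging their parameters. The sketch below shows plain uniform averaging in PyTorch as an assumption about the general technique; Cohere’s actual merging recipe (per-expert weighting, which layers are merged, and so on) is not public.

```python
# Minimal sketch of merging several "expert" checkpoints by averaging parameters.
# Plain uniform averaging for illustration; Cohere's real merging recipe differs.
import torch

def merge_experts(state_dicts):
    """Average a list of state dicts with identical keys and shapes."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Toy example: three tiny "experts" sharing the same architecture
experts = [torch.nn.Linear(8, 8) for _ in range(3)]
merged_weights = merge_experts([m.state_dict() for m in experts])

generalist = torch.nn.Linear(8, 8)
generalist.load_state_dict(merged_weights)
```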
A “polishing” stage involved jumping back and forth between a range of training techniques: labeled data, human-ranked answers, and real-time human judgments on usability. This produced several “finished” model versions.
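The “human-ranked answers” element of this polishing typically takes the form of preference training, where the model is pushed to score a preferred answer above a rejected one. The sketch below shows a generic pairwise (Bradley-Terry-style) preference loss to illustrate that family of techniques; it is not Cohere’s specific objective.

```python
# Minimal sketch of a pairwise preference loss: push the score of the
# human-preferred ("chosen") answer above the "rejected" one.
# Generic Bradley-Terry-style objective for illustration only; Cohere's
# post-training objectives are not described at this level of detail.
import torch
import torch.nn.functional as F

def preference_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(score_chosen - score_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy example: scores a reward model might assign to two answers per prompt
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
print(preference_loss(chosen, rejected))  # lower when chosen consistently outscores rejected
```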
Finally, an assessment was carried out to crown the best possible model, involving “dogfooding,” in which users inside the company gave feedback on real-world tasks, followed by formal human evaluations. The winning version became Command A.
“And to bring it full circle, the multilingual team carried out additional refinements on top to create Command A Translate,” Marchisio concluded.
So, what’s next? Marchisio said the cycle continues: training, tackling open problems, and applying new insights, all part of the ongoing work of advancing large language models. She pointed to multimodality, multilingual agents, and language consistency as three key focus areas. “We continue to think about these questions every day,” Marchisio said.