Meta’s Surprise Llama 4 Drop Exposes The Gap Between AI Ambition And Reality

Meta constructed the Llama 4 models using a mixture-of-experts (MoE) architecture, which is one way around the limitations of running huge AI models. Think of MoE like having a large team of specialized workers; instead of everyone working on every task, only the relevant specialists activate for a specific job.
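To make the team-of-specialists analogy concrete, here is a minimal toy sketch of top-1 expert routing in plain Python. It is an illustration only, not Meta's implementation: the expert functions, the router weights, and the sizes (`NUM_EXPERTS`, `DIM`) are all hypothetical stand-ins chosen for readability.

```python
import random

def make_expert(scale):
    # Each toy "expert" is just a function that scales its input.
    return lambda x: [scale * v for v in x]

def route_top1(x, router_weights):
    # Score every expert with a dot product, then pick the highest score.
    scores = [sum(w * v for w, v in zip(weights, x)) for weights in router_weights]
    return max(range(len(scores)), key=lambda i: scores[i])

def moe_forward(x, experts, router_weights):
    # Only the selected expert runs; the others are skipped entirely.
    idx = route_top1(x, router_weights)
    return idx, experts[idx](x)

random.seed(0)
NUM_EXPERTS, DIM = 16, 4  # hypothetical toy sizes, not Llama 4's real dimensions
experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]
router_weights = [
    [random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)
]

x = [0.5, -1.0, 2.0, 0.1]
chosen, output = moe_forward(x, experts, router_weights)
print(f"expert {chosen} of {NUM_EXPERTS} handled the input")
```

The point of the sketch is that the cost of a forward pass depends on the one expert that runs, not on how many experts exist.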

For example, Llama 4 Maverick has 400 billion total parameters, but only 17 billion of them are active at once across one of 128 experts. Likewise, Scout has 109 billion total parameters, but only 17 billion are active at once across one of 16 experts. This design can reduce the computation needed to run the model, since only a small portion of the neural network weights is active at any one time.
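A quick back-of-the-envelope calculation shows how small the active share is. The sketch below uses only the parameter counts quoted above; the dictionary layout and the `active_fraction` helper are illustrative names, not anything from Meta.

```python
# Parameter counts in billions, as quoted in the article.
MODELS = {
    "Llama 4 Maverick": {"total_b": 400, "active_b": 17, "experts": 128},
    "Llama 4 Scout": {"total_b": 109, "active_b": 17, "experts": 16},
}

def active_fraction(total_b, active_b):
    """Share of the model's parameters that are active at once."""
    return active_b / total_b

for name, spec in MODELS.items():
    share = active_fraction(spec["total_b"], spec["active_b"])
    print(f"{name}: {share:.1%} of parameters active at once")
```

By these figures, roughly 4 percent of Maverick's parameters and roughly 16 percent of Scout's are active at a time. That reduces computation per token, but all of the weights generally still have to be stored somewhere accessible, so the memory footprint does not shrink in the same way.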

Llama’s reality check arrives quickly

Current AI models have a relatively limited short-term memory. In AI, a context window plays roughly that role, determining how much information a model can process at once. AI language models like Llama typically process that information as chunks of data called tokens, which can be whole words or fragments of longer words. Large context windows allow AI models to process longer documents, larger code bases, and longer conversations.

Despite Meta’s promotion of Llama 4 Scout’s 10 million token context window, developers have so far found that using even a fraction of that amount is challenging due to memory limitations. Independent AI researcher Simon Willison reported on his blog that third-party services providing access, like Groq and Fireworks, limited Scout’s context to just 128,000 tokens. Another provider, Together AI, offered 328,000 tokens.
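To put those limits in proportion, the following sketch compares each reported provider limit with the advertised 10 million tokens and checks which providers could hold a prompt of a given size. The numbers come from the paragraph above; the helper names are hypothetical, and the limits may well have changed since they were reported.

```python
# Context limits in tokens, as reported in the article.
ADVERTISED_LIMIT = 10_000_000
PROVIDER_LIMITS = {
    "Groq": 128_000,
    "Fireworks": 128_000,
    "Together AI": 328_000,
}

def providers_that_fit(prompt_tokens, limits=PROVIDER_LIMITS):
    """Return the providers whose limit can hold the prompt, sorted by name."""
    return sorted(name for name, limit in limits.items() if prompt_tokens <= limit)

def share_of_advertised(limit, advertised=ADVERTISED_LIMIT):
    """Fraction of the advertised context window that a limit represents."""
    return limit / advertised

for name, limit in PROVIDER_LIMITS.items():
    share = share_of_advertised(limit)
    print(f"{name}: {limit:,} tokens ({share:.2%} of advertised)")

# A hypothetical 200,000-token prompt fits only the largest reported limit.
print(providers_that_fit(200_000))
```

Even the largest limit listed here is only a few percent of the headline figure.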

Evidence suggests accessing larger contexts requires immense resources. Willison pointed to Meta’s own example notebook (“build_with_llama_4”), which states that running a 1.4 million token context needs eight high-end Nvidia H100 GPUs.
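One reason long contexts are memory-hungry is that a standard transformer keeps cached keys and values for every token it has seen. The sketch below uses the textbook estimate for that cache. It is not Meta's accounting, it ignores the model weights and any Llama 4-specific attention optimizations, and the configuration values (`LAYERS`, `KV_HEADS`, `HEAD_DIM`, `BYTES_PER_VALUE`) are hypothetical placeholders rather than Scout's real settings.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Textbook key-value cache size for a standard transformer.

    The factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical configuration for illustration only, not Llama 4 Scout's.
LAYERS = 48
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # 16-bit values

for tokens in (128_000, 328_000, 1_400_000, 10_000_000):
    size = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
    print(f"{tokens:>10,} tokens -> {size / 1e9:,.1f} GB of cache")
```

Under this estimate the cache grows linearly with the number of tokens, so a context ten times longer needs ten times the cache memory, on top of whatever the weights already occupy.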

Willison documented his own testing troubles. When he asked Llama 4 Scout via the OpenRouter service to summarize a long online discussion (around 20,000 tokens), the result wasn’t useful. He described it as “complete junk output,” which devolved into repetitive loops.
