Beyond GPT Architecture: Why Google's Diffusion Approach Could Reshape LLM Deployment

Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more

Last month, along with a comprehensive suite of new AI tools and innovations, Google DeepMind unveiled Gemini Diffusion. This experimental research model uses a diffusion-based approach to generate text. Traditionally, large language models (LLMs) like GPT and Gemini itself have relied on autoregression, a step-by-step approach where each word is generated based on the previous one. Diffusion language models (DLMs), also known as diffusion-based large language models (dLLMs), leverage a method more commonly seen in image generation, starting with random noise and gradually refining it into a coherent output. This approach dramatically increases generation speed and can improve coherency and consistency.

Gemini Diffusion is currently available as an experimental demo; sign up for the waitlist here to get access.

(Editor’s note: We’ll be unpacking paradigm shifts like diffusion-based language models—and what it takes to run them in production—at VB Transform, June 24–25 in San Francisco, alongside Google DeepMind, LinkedIn and other enterprise AI leaders.)

Understanding diffusion vs. autoregression

Diffusion and autoregression are fundamentally different approaches. The autoregressive approach generates text sequentially, with tokens predicted one at a time. While this method ensures strong coherence and context tracking, it can be computationally intensive and slow, especially for long-form content.

Diffusion models, by contrast, begin with random noise, which is gradually denoised into a coherent output. When applied to language, the technique has several advantages. Blocks of text can be processed in parallel, potentially producing entire segments or sentences at a much higher rate.

Gemini Diffusion can reportedly generate 1,000-2,000 tokens per second. In contrast, Gemini 2.5 Flash has an average output speed of 272.4 tokens per second. Additionally, mistakes in generation can be corrected during the refining process, improving accuracy and reducing the number of hallucinations. There may be trade-offs in terms of fine-grained accuracy and token-level control; however, the increase in speed will be a game-changer for numerous applications.

How does diffusion-based text generation work?

During training, DLMs work by gradually corrupting a sentence with noise over many steps, until the original sentence is rendered entirely unrecognizable. The model is then trained to reverse this process, step by step, reconstructing the original sentence from increasingly noisy versions. Through the iterative refinement, it learns to model the entire distribution of plausible sentences in the training data.

While the specifics of Gemini Diffusion have not yet been disclosed, the typical training methodology for a diffusion model involves these key stages:

Forward diffusion: With each sample in the training dataset, noise is added progressively over multiple cycles (often 500 to 1,000) until it becomes indistinguishable from random noise.

Reverse diffusion: The model learns to reverse each step of the noising process, essentially learning how to “denoise” a corrupted sentence one stage at a time, eventually restoring the original structure.

This process is repeated millions of times with diverse samples and noise levels, enabling the model to learn a reliable denoising function.

Once trained, the model is capable of generating entirely new sentences. DLMs generally require a condition or input, such as a prompt, class label, or embedding, to guide the generation towards desired outcomes. The condition is injected into each step of the denoising process, which shapes an initial blob of noise into structured and coherent text.

Advantages and disadvantages of diffusion-based models

In an interview with VentureBeat, Brendan O’Donoghue, research scientist at Google DeepMind and one of the leads on the Gemini Diffusion project, elaborated on some of the advantages of diffusion-based techniques when compared to autoregression. According to O’Donoghue, the major advantages of diffusion techniques are the following:

Lower latencies: Diffusion models can produce a sequence of tokens in much less time than autoregressive models.

Adaptive computation: Diffusion models will converge to a sequence of tokens at different rates depending on the task’s difficulty. This allows the model to consume fewer resources (and have lower latencies) on easy tasks and more on harder ones.

Non-causal reasoning: Due to the bidirectional attention in the denoiser, tokens can attend to future tokens within the same generation block. This allows non-causal reasoning to take place and allows the model to make global edits within a block to produce more coherent text.

Iterative refinement / self-correction: The denoising process involves sampling, which can introduce errors just like in autoregressive models. However, unlike autoregressive models, the tokens are passed back into the denoiser, which then has an opportunity to correct the error.

O’Donoghue also noted the main disadvantages: “higher cost of serving and slightly higher time-to-first-token (TTFT), since autoregressive models will produce the first token right away. For diffusion, the first token can only appear when the entire sequence of tokens is ready.”

Performance benchmarks

Google says Gemini Diffusion’s performance is comparable to Gemini 2.0 Flash-Lite.

BenchmarkTypeGemini DiffusionGemini 2.0 Flash-LiteLiveCodeBench (v6)Code30.9%28.5%BigCodeBenchCode45.4%45.8%LBPP (v2)Code56.8%56.0%SWE-Bench Verified*Code22.9%28.5%HumanEvalCode89.6%90.2%MBPPCode76.0%75.8%GPQA DiamondScience40.4%56.5%AIME 2025Mathematics23.3%20.0%BIG-Bench Extra HardReasoning15.0%21.0%Global MMLU (Lite)Multilingual69.1%79.0%

* Non-agentic evaluation (single turn edit only), max prompt length of 32K.

The two models were compared using several benchmarks, with scores based on how many times the model produced the correct answer on the first try. Gemini Diffusion performed well in coding and mathematics tests, while Gemini 2.0 Flash-lite had the edge on reasoning, scientific knowledge, and multilingual capabilities.

As Gemini Diffusion evolves, there’s no reason to think that its performance won’t catch up with more established models. According to O’Donoghue, the gap between the two techniques is “essentially closed in terms of benchmark performance, at least at the relatively small sizes we have scaled up to. In fact, there may be some performance advantage for diffusion in some domains where non-local consistency is important, for example, coding and reasoning.”

Testing Gemini Diffusion

VentureBeat was granted access to the experimental demo. When putting Gemini Diffusion through its paces, the first thing we noticed was the speed. When running the suggested prompts provided by Google, including building interactive HTML apps like Xylophone and Planet Tac Toe, each request completed in under three seconds, with speeds ranging from 600 to 1,300 tokens per second.

To test its performance with a real-world application, we asked Gemini Diffusion to build a video chat interface with the following prompt:

Build an interface for a video chat application. It should have a preview window that accesses the camera on my device and displays its output. The interface should also have a sound level meter that measures the output from the device’s microphone in real time.

In less than two seconds, Gemini Diffusion created a working interface with a video preview and an audio meter.

Though this was not a complex implementation, it could be the start of an MVP that can be completed with a bit of further prompting. Note that Gemini 2.5 Flash also produced a working interface, albeit at a slightly slower pace (approximately seven seconds).

Gemini Diffusion also features “Instant Edit,” a mode where text or code can be pasted in and edited in real-time with minimal prompting. Instant Edit is effective for many types of text editing, including correcting grammar, updating text to target different reader personas, or adding SEO keywords. It is also useful for tasks such as refactoring code, adding new features to applications, or converting an existing codebase to a different language.

Enterprise use cases for DLMs

It’s safe to say that any application that requires a quick response time stands to benefit from DLM technology. This includes real-time and low-latency applications, such as conversational AI and chatbots, live transcription and translation, or IDE autocomplete and coding assistants.

According to O’Donoghue, with applications that leverage “inline editing, for example, taking a piece of text and making some changes in-place, diffusion models are applicable in ways autoregressive models aren’t.” DLMs also have an advantage with reason, math, and coding problems, due to “the non-causal reasoning afforded by the bidirectional attention.”

DLMs are still in their infancy; however, the technology can potentially transform how language models are built. Not only do they generate text at a much higher rate than autoregressive models, but their ability to go back and fix mistakes means that, eventually, they may also produce results with greater accuracy.

Gemini Diffusion enters a growing ecosystem of DLMs, with two notable examples being Mercury, developed by Inception Labs, and LLaDa, an open-source model from GSAI. Together, these models reflect the broader momentum behind diffusion-based language generation and offer a scalable, parallelizable alternative to traditional autoregressive architectures.

Daily insights on business use cases with VB Daily

If you want to impress your boss, VB Daily has you covered. We give you the inside scoop on what companies are doing with generative AI, from regulatory shifts to practical deployments, so you can share insights for maximum ROI.

Read our Privacy Policy

Thanks for subscribing. Check out more VB newsletters here.

An error occured.

Source link

What's Hot

LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions – Takara TLDR

AI Systems Can Be Fooled by Fake Dates, Giving Newer Content Unfair Visibility

The Alignment Waltz: Jointly Training Agents to Collaborate for Safety – Takara TLDR

Beyond GPT architecture: Why Google’s Diffusion approach could reshape LLM deployment

Will updating your AI agents help or hamper their performance? Raindrop's new tool Experiments tells you

When dirt meets data: ScottsMiracle-Gro saved $150M using AI

Nvidia researchers boost LLMs reasoning skills by getting them to 'think' during pre-training

The Rubin Names 2025 Art Prize, Research and Art Projects Grants

Kochi-Muziris Biennial Announces 66 Artists for December Exhibition

Instagram Launches ‘Rings’ Awards for Creators—With KAWS as a Judge

Museums Prepare to Close Their Doors as Government Shutdown Continues