VentureBeat AI

How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell

By Advanced AI Editor | June 5, 2025



Most people interested in generative AI likely already know that Large Language Models (LLMs) — like those behind ChatGPT, Anthropic’s Claude, and Google’s Gemini — are trained on massive datasets: trillions of words pulled from websites, books, codebases, and, increasingly, other media such as images, audio, and video. But why?

From this data, LLMs develop a statistical, generalized understanding of language, its patterns, and the world — encoded in the form of billions of parameters, or “settings,” in a network of artificial neurons (which are mathematical functions that transform input data into output signals).
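To make "parameters as settings" concrete, here is a minimal sketch of a single artificial neuron in Python; the specific inputs, weights, and bias are made up for illustration and are not drawn from the study.

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: a weighted sum of inputs passed
    through a nonlinearity (here, the logistic sigmoid)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# The weights and bias are the "parameters" the article refers to;
# training adjusts them so the network's outputs better match the data.
print(neuron([0.2, 0.7, 0.1], [1.5, -0.3, 0.8], bias=0.05))
```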

By being exposed to all this training data, LLMs learn to detect and generalize patterns, which are reflected in the parameters of their neurons. For instance, the word "apple" often appears near terms related to food, fruit, or trees, and sometimes computers. The model picks up that apples can be red, green, or yellow (or, more rarely, other colors when rotten or unusual), that the word is spelled "a-p-p-l-e" in English, and that apples are edible. This statistical knowledge influences how the model responds when a user enters a prompt, shaping the output it generates based on the associations it "learned" from the training data.

But a big question — even among AI researchers — remains: how much of an LLM’s training data is used to build generalized representations of concepts, and how much is instead memorized verbatim or stored in a way that is identical or nearly identical to the original data?

This is important not only for better understanding how LLMs operate — and when they go wrong — but also as model providers defend themselves in copyright infringement lawsuits brought by data creators and owners, such as artists and record labels. If LLMs are shown to reproduce significant portions of their training data verbatim, courts could be more likely to side with plaintiffs arguing that the models unlawfully copied protected material. If not — if the models are found to generate outputs based on generalized patterns rather than exact replication — developers may be able to continue scraping and training on copyrighted data under existing legal defenses such as fair use.

Now, we finally have an answer to the question of how much LLMs memorize versus generalize: a new study released this week from researchers at Meta, Google DeepMind, Cornell University, and NVIDIA finds that GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.

To understand what 3.6 bits means in practice:

A single bit is the smallest unit of digital data, representing either a 0 or a 1. Eight bits make up one byte.

Storing 3.6 bits allows for approximately 12.13 distinct values, as calculated by 2^3.6.

This is about the amount of information needed to choose one of 12 options—similar to selecting a month of the year or the outcome of a roll of a 12-sided die.

It is not enough to store even one English letter (which needs about 4.7 bits), but it is just enough to encode a character from a reduced set of 10 common English letters (which requires about 3.32 bits).

In bytes, 3.6 bits is 0.45 bytes—less than half the size of a typical character stored in ASCII (which uses 8 bits or 1 byte).
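Those figures are straightforward to verify; a quick back-of-the-envelope check in Python (no project-specific code assumed):

```python
import math

bits = 3.6
print(2 ** bits)       # ~12.13 distinct values representable in 3.6 bits
print(math.log2(26))   # ~4.70 bits needed to pick one of 26 English letters
print(math.log2(10))   # ~3.32 bits needed to pick one of 10 common letters
print(bits / 8)        # 0.45 bytes (an ASCII character uses 1 byte)
```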

This number is robust to reasonable architectural variations: different depths and widths produced similar results, and the estimate held steady across model sizes. Precision had only a modest effect, with full-precision models reaching slightly higher values (up to 3.83 bits per parameter).

More training data DOES NOT lead to more memorization — in fact, a model will be less likely to memorize any single data point

One key takeaway from the research is that models do not memorize more when trained on more data. Instead, a model’s fixed capacity is distributed across the dataset, meaning each individual datapoint receives less attention.

Jack Morris, the lead author, explained via the social network X that “training on more data will force models to memorize less per-sample.”

These findings may help ease concerns around large models memorizing copyrighted or sensitive content.

If memorization is limited and diluted across many examples, the likelihood of reproducing any one specific training example decreases. In essence, more training data leads to safer generalization behavior, not increased risk.
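One hedged way to read this numerically: if total capacity is fixed, the average memorization budget per training example shrinks as the dataset grows. The sketch below assumes the study's 3.6 bits-per-parameter figure and uses a hypothetical 1-billion-parameter model with made-up dataset sizes.

```python
BITS_PER_PARAM = 3.6  # capacity estimate reported in the study

def avg_bits_per_example(num_params: int, num_examples: int) -> float:
    """Fixed capacity spread evenly across the dataset (an idealization)."""
    return (num_params * BITS_PER_PARAM) / num_examples

# Hypothetical 1B-parameter model: per-example memorization budget
for n in (1_000_000, 100_000_000, 10_000_000_000):
    print(f"{n:.0e} examples -> {avg_bits_per_example(1_000_000_000, n):.2f} bits/example")
```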

How the researchers identified these findings

To precisely quantify how much language models memorize, the researchers used an unconventional but powerful approach: they trained transformer models on datasets composed of uniformly random bitstrings. Each of these bitstrings was sampled independently, ensuring that no patterns, structure, or redundancy existed across examples.

Because each sample is unique and devoid of shared features, any ability the model shows in reconstructing or identifying these strings during evaluation directly reflects how much information it retained—or memorized—during training.

The key reason for this setup was to completely eliminate the possibility of generalization. Unlike natural language—which is full of grammatical structure, semantic overlap, and repeating concepts—uniform random data contains no such information. Every example is essentially noise, with no statistical relationship to any other. In such a scenario, any performance by the model on test data must come purely from memorization of the training examples, since there is no distributional pattern to generalize from.
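As a rough illustration of this setup (not the authors' actual code), the snippet below generates a dataset of uniformly random bitstrings; the transformer, the training loop, and the paper's information-theoretic measurement are omitted.

```python
import random

def random_bitstring_dataset(num_examples: int, length: int, seed: int = 0):
    """Uniformly random bitstrings: each example is independent noise,
    so any test-time recall must come from memorization, not generalization."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(length)] for _ in range(num_examples)]

dataset = random_bitstring_dataset(num_examples=1000, length=64)
# A transformer would be trained to saturation on `dataset`; the information
# it retains about these strings, divided by its parameter count, gives an
# estimate in the spirit of the paper's bits-per-parameter measurement.
print(dataset[0])
```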

The authors argue their method is perhaps one of the only principled ways to decouple memorization from learning in practice, because when LLMs are trained on real language, even when they produce an output that matches the training data, it’s difficult to know whether they memorized the input or merely inferred the underlying structure from the patterns they’ve observed.

This method allows the researchers to map a direct relationship between the number of model parameters and the total information stored. By gradually increasing model size and training each variant to saturation, across hundreds of experiments on models ranging from 500K to 1.5 billion parameters, they observed consistent results: 3.6 bits memorized per parameter, which they report as a fundamental measure of LLM memory capacity.

The team applied their methodology to models trained on real-world datasets as well. When trained on text, models exhibited a balance of memorization and generalization.

Smaller datasets encouraged more memorization, but as dataset size increased, models shifted toward learning generalizable patterns. This transition was marked by a phenomenon known as “double descent,” where performance temporarily dips before improving once generalization kicks in.

The study also examined how model precision—comparing training in bfloat16 versus float32—affects memorization capacity. They observed a modest increase from 3.51 to 3.83 bits-per-parameter when switching to full 32-bit precision. However, this gain is far less than the doubling of available bits would suggest, implying diminishing returns from higher precision.

Unique data is more likely to be memorized

The paper proposes a scaling law that relates a model’s capacity and dataset size to the effectiveness of membership inference attacks.

These attacks attempt to determine whether a particular data point was part of a model’s training set. The research shows that such attacks become unreliable as dataset size grows, supporting the argument that large-scale training helps reduce privacy risk.
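For readers unfamiliar with membership inference, a common baseline attack (not the paper's exact formulation) simply thresholds the model's loss on a candidate example, since training-set members tend to have lower loss. A minimal sketch with hypothetical loss values:

```python
def loss_threshold_mia(loss: float, threshold: float) -> bool:
    """Baseline membership inference: flag an example as a suspected
    training-set member if the model's loss on it falls below a threshold."""
    return loss < threshold

# Hypothetical per-example losses; as datasets grow and per-example
# memorization is diluted, member and non-member losses overlap more,
# making this kind of attack unreliable.
print(loss_threshold_mia(loss=0.8, threshold=1.5))   # looks like a member
print(loss_threshold_mia(loss=2.3, threshold=1.5))   # looks like a non-member
```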

While the paper focuses on average-case behavior, some researchers have pointed out that certain types of data—such as highly unique or stylized writing—may still be more susceptible to memorization.

The authors acknowledge this limitation and emphasize that their method is designed to characterize general trends rather than edge cases.

Moving toward greater human understanding of LLM understanding

By introducing a principled and quantifiable definition of memorization, the study gives developers and researchers new tools for evaluating the behavior of language models. This helps not only with model transparency but also with compliance, privacy, and ethical standards in AI development. The findings suggest that more data—and not less—may be the safer path when training large-scale language models.

To put total model memorization in perspective:

A 500K-parameter model can memorize roughly 1.8 million bits, or 225 KB of data.

A 1.5 billion parameter model can hold about 5.4 billion bits, or 675 megabytes of raw information.

This is not comparable to typical file storage like images (e.g., a 3.6 MB uncompressed image is about 30 million bits), but it is significant when distributed across discrete textual patterns.
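The same arithmetic in code, assuming the study's 3.6 bits-per-parameter estimate:

```python
BITS_PER_PARAM = 3.6

def capacity_bytes(num_params: float) -> float:
    """Total memorization capacity implied by the study's estimate."""
    return num_params * BITS_PER_PARAM / 8  # 8 bits per byte

print(f"{capacity_bytes(500_000) / 1e3:.0f} KB")        # ~225 KB for a 500K-parameter model
print(f"{capacity_bytes(1_500_000_000) / 1e6:.0f} MB")  # ~675 MB for a 1.5B-parameter model
```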

I’m no lawyer or legal expert, but I fully expect such research to be cited in the numerous ongoing lawsuits between AI providers and data creators and rights owners.
