Advanced AI News
Access to AI Training Data Sparks Legal Questions

By Advanced AI Editor, June 20, 2025


(Source: MeshCube/Shutterstock)

As “vibe coding” goes mainstream, AI companies are rushing to build the biggest and most authoritative tech knowledge bases to train the next generation of AI copilots. But how will AI companies obtain these curated troves of valuable tech data? Recent moves by Stack Overflow and Reddit show how it might play out.

Vibe coding, or telling a coding copilot what you want and then sitting back while the AI generates code for you, is all the rage today. Searches for “vibe coding” are up 6,700% over the past 12 months, and even renowned technologists like Databricks CEO Ali Ghodsi rely on coding copilots.

“You’d even hear Ali himself tell you these days, ‘Look, I just mostly ask [Databricks] Assistant for what I need,’” said Databricks VP of Marketing Joel Minnick. “If the first attempt at the code doesn’t work, I just kind of give it the error code and tell it ‘try again,’ and it tries again, and now it’s right.”

The combination of huge swaths of sample code and the incredible learning power of large language models (LLMs) gives coding copilots their capabilities. What’s more, when questions arise over some technical topic, the Web’s vast array of discussion boards provides ample fodder for copilots to get even the small details correct.

The question then becomes: How do these coding copilots get access to the discussion boards to learn about the millions of tech tricks and edge cases? In some cases, the AI companies just take it without asking.

(Source: Mamun Sheikh/Shutterstock)

That is what Reddit, one of the most popular news aggregation and social media websites in the world, with 102 million daily active users, is accusing Anthropic of doing. On June 4, Reddit filed a lawsuit against Anthropic, accusing the AI company of scraping its website for content to train its AI models, in violation of Reddit’s data policy.

As Ali Azhar writes in a story on BigDATAwire’s sister publication, AIWire:

“Reddit claims that Anthropic accessed its platform more than 100,000 times since July 2024 to scrape user-generated content for AI training, in violation of Reddit’s terms of service. The platform also claims that Anthropic reportedly assured it had blocked its bots from accessing Reddit, but continued to do so anyway.”

Anthropic, which creates Claude, considered to be one of the top AI models for coding copilots, didn’t pay for the data it took from the Reddit website, Reddit claims. In comparison, Google and OpenAI have signed contracts with Reddit to gain access to user-generated data, with some restrictions to secure user privacy.

Another popular source for technical content is Stack Overflow, which is laser-focused on technical topics. Stack Overflow has about 29 million registered users and more than 100 million monthly users (most of whom are not registered). Its knowledge base, dubbed Stack Exchange, includes more than 24 million questions and about 36 million answers. If you have a specific question about how Kubernetes works–and really, who doesn’t these days?–then Stack Overflow is a great place to get an answer.

One day before the Reddit lawsuit was filed, Stack Overflow signed a deal with Snowflake to make its user-generated data available to users via the Snowflake Marketplace. Prashanth Chandrasekar, Stack Overflow’s CEO, said the move makes it easier for Snowflake users to get access to high-quality question-and-answer pairs curated by humans.

Prashanth Chandrasekar is the CEO of Stack Overflow.

“You’re getting immediate access to all the data,” Chandrasekar told BigDATAwire at the Snowflake Summit. “It’s pre-indexed and the latency of that is super low. And most important, it’s licensed.”

The Snowflake agreement is primarily for using Stack Overflow’s knowledge base for retrieval-augmented generation (RAG), as opposed to training AI models, Chandrasekar said, adding that Stack Overflow has different mechanisms for pure AI training. But the end goal is the same: helping customers build AI systems based on trusted, curated data.

“I think removing the friction to the user to realize the dream of AI systems in a company–I think that is the name of the game,” Chandrasekar said. “Now users, while they’re using Snowflake, can get access to our data versus having to wait for that company to strike something with us.”

Reddit and Stack Overflow are opposites in many ways, with the former being a bit of a wild, anything-goes place, and the latter known more for restraint and ruthless adherence to facts. But their recent moves show they have one thing in common: unauthorized access to their content will not be tolerated.

The nature of the World Wide Web has changed since its egalitarian beginnings late in the 20th century. Over the past 15 years, giant tech firms have hoovered up vast swaths of the Internet, first for targeted analytics and more recently to train AI models. Enclaves that have yet to be fully mined, like Reddit and Stack Overflow, are now working to ensure that any monetization is done according to their terms and conditions, which puts more control back in the hands of users.

Stack Overflow has taken steps to not only prevent its data from being scraped for AI purposes but also to prevent AI from infiltrating the knowledge base. For instance, it utilizes Cloudflare to authenticate that users are human. It also has a strict policy against allowing AI-generated answers on the site. Human curation is essential to Stack Overflow’s process.

(Source: Dennis Diatel/Shutterstock)

Signing deals with companies like Snowflake could be a boon for Stack Overflow, which has seen its website traffic decline and the number of questions asked on Stack Exchange decrease in recent years. About three-quarters of Stack Overflow’s revenue is from hosting private knowledge bases for enterprises, while only one-quarter is from advertising on the public Stack Exchange site, Chandrasekar said.

“I think the nature of the Internet has changed in the past couple of years, the social contract of people building websites, monetizing off ads based on traffic on the website,” he said. “We want to have relationships with everyone and be exposed in a way that we will go wherever the developer is, wherever the user is, wherever they want to be.”

The message to AI model builders and users is clear: If high-quality, human-sourced data is important to your endeavor, then you should be willing to pay the provider a fair sum, while simultaneously ensuring user privacy is maintained at all times. After all, it’s only money.

This article first appeared on our sister publication, BigDATAwire.

Tags:
AI model,AI training data,Anthropic,coding copilot,data licensing,data scraping,Databricks,Google,OpenAI,Prashanth Chandrasekar,Reddit,Snowflake,Stack Overflow,vibe coding

About the author: Alex Woodie

Alex Woodie has written about IT as a technology journalist for more than a decade. He brings extensive experience from the IBM midrange marketplace, including topics such as servers, ERP applications, programming, databases, security, high availability, storage, business intelligence, cloud, and mobile enablement. He resides in the San Diego area.


