Advanced AI News
Industry Applications

Access to AI Training Data Sparks Legal Questions

By Advanced AI Editor | June 20, 2025 | 6 Mins Read


(Source: MeshCube/Shutterstock)

As “vibe coding” goes mainstream, AI companies are rushing to build the biggest and most authoritative tech knowledge bases to train the next generation of AI copilots. But how will AI companies obtain these curated troves of valuable tech data? Recent moves by Stack Overflow and Reddit show how it might play out.

Vibe coding, or telling a coding copilot what you want and then sitting back while the AI generates code for you, is all the rage today. Searches for “vibe coding” are up 6,700% over the past 12 months, and even renowned technologists like Databricks CEO Ali Ghodsi rely on coding copilots.

“You’d even hear Ali himself tell you these days, ‘Look, I just mostly ask [Databricks] Assistant for what I need,” said Databricks VP of Marketing Joel Minnick. “If the first attempt at the code doesn’t work, I just kind of give it the error code and tell it ‘try again,’ and it tries again, and now it’s right.”

The combination of huge swaths of sample code and the incredible learning power of large language models (LLMs) gives coding copilots their capabilities. What’s more, when questions arise over some technical topic, the Web’s vast array of discussion boards provides ample fodder for copilots to get even the small details correct.

The question then becomes: How do these coding copilots get access to the discussion boards to learn about the millions of tech tricks and edge cases? In some cases, the AI companies just take it without asking.

(Source: Mamun Sheikh/Shutterstock)

That is what Reddit, which is one of the most popular news aggregation and social media websites in the world, with 102 million daily active users, is accusing Anthropic of doing. On June 4, Reddit filed a lawsuit against Anthropic, accusing the AI company of scraping its website for content to train its AI models, in violation of its data policy.

As Ali Azhar writes in a story on BigDATAwire’s sister publication, AIWire:

“Reddit claims that Anthropic accessed its platform more than 100,000 times since July 2024 to scrape user-generated content for AI training, in violation of Reddit’s terms of service. The platform also claims that Anthropic reportedly assured it had blocked its bots from accessing Reddit, but continued to do so anyway.”

Anthropic, which creates Claude, considered to be one of the top AI models for coding copilots, didn’t pay for the data it took from the Reddit website, Reddit claims. In comparison, Google and OpenAI have signed contracts with Reddit to gain access to user-generated data, with some restrictions to secure user privacy.

Another popular source for technical content is Stack Overflow, which is laser-focused on technical topics. Stack Overflow has about 29 million registered users and more than 100 million monthly users (most of whom are not registered). Its knowledge base, dubbed Stack Exchange, includes more than 24 million questions and about 36 million answers. If you have a specific question about how Kubernetes works–and really, who doesn’t these days?–then Stack Overflow is a great place to get an answer.

One day before the Reddit lawsuit was filed, Stack Overflow signed a deal with Snowflake to make its user-generated data available to users via the Snowflake Marketplace. Prashanth Chandrasekar, Stack Overflow’s CEO, said the move makes it easier for Snowflake users to get access to high-quality question-and-answer pairs curated by humans.

Prashanth Chandrasekar is the CEO of Stack Overflow.

“You’re getting immediate access to all the data,” Chandrasekar told BigDATAwire at the Snowflake Summit. “It’s pre-indexed and the latency of that is super low. And most important, it’s licensed.”

The Snowflake agreement primarily is to use Stack Overflow’s knowledge base for retrieval augmented generation (RAG), as opposed to training AI models, Chandrasekar said, adding that Stack Overflow has different mechanisms for pure AI training. But the end goal is the same: helping customers build AI systems based on trusted, curated data.

“I think removing the friction to the user to realize the dream of AI systems in a company–I think that is the name of the game,” Chandrasekar said. “Now users, while they’re using Snowflake, can get access to our data versus having to wait for that company to strike something with us.”

Reddit and Stack Overflow are opposites in many ways, with the former being a bit of a wild, anything-goes place, and the latter known more for restraint and ruthless adherence to facts. But their recent moves show they have one thing in common: unauthorized access to their content will not be tolerated.

The nature of the World Wide Web has changed since its egalitarian beginnings late in the 20th century. Over the past 15 years, giant tech firms have hoovered up vast swaths of the Internet, first for targeted analytics and more recently to train AI models. Enclaves that have yet to be fully mined, like Reddit and Stack Overflow, are now working to ensure that any monetization is done according to their terms and conditions, which puts more control back in the hands of users.

Stack Overflow has taken steps to not only prevent its data from being scraped for AI purposes but also to prevent AI from infiltrating the knowledge base. For instance, it utilizes Cloudflare to authenticate that users are human. It also has a strict policy against allowing AI-generated answers on the site. Human curation is essential to Stack Overflow’s process.

(Source: Dennis Diatel/Shutterstock)

Signing deals with companies like Snowflake could be a boon for Stack Overflow, which has seen its website traffic decline and the number of questions asked on Stack Exchange decrease in recent years. About three-quarters of Stack Overflow’s revenue is from hosting private knowledge bases for enterprises, while only one-quarter is from advertising on the public Stack Exchange site, Chandrasekar said.

“I think the nature of the Internet has changed in the past couple of years, the social contract of people building websites, monetizing off ads based on traffic on the website,” he said. “We want to have relationships with everyone and be exposed in a way that we will go wherever the developer is, wherever the user is, wherever they want to be.”

The message to AI model builders and users is clear: If high-quality, human-sourced data is important to your endeavor, then you should be willing to pay the provider a fair sum, while simultaneously ensuring user privacy is maintained at all times. After all, it’s only money.

This article first appeared on our sister publication, BigDATAwire.

Tags: AI model, AI training data, Anthropic, coding copilot, data licensing, data scraping, Databricks, Google, OpenAI, Prashanth Chandrasekar, Reddit, Snowflake, Stack Overflow, vibe coding

About the author: Alex Woodie

Alex Woodie has written about IT as a technology journalist for more than a decade. He brings extensive experience from the IBM midrange marketplace, including topics such as servers, ERP applications, programming, databases, security, high availability, storage, business intelligence, cloud, and mobile enablement. He resides in the San Diego area.


