Advanced AI News

Wikipedia offers AI developers its article data on Kaggle to stop automated scraping

By Advanced AI Editor | April 17, 2025


The Wikimedia Foundation, the organization behind Wikipedia, the internet’s largest free encyclopedia, is offering an artificial intelligence-ready dataset on Kaggle that’s aimed at dissuading AI companies and large language model trainers from scraping the website.

“Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content — making this ideal for training models, building features, and testing NLP pipelines,” Wikimedia said in the announcement on Wednesday.

Kaggle is a data science and machine learning community owned and governed by Google LLC that hosts datasets and data science challenges.

The dataset upload is available as of April 15 and includes high-quality elements such as abstracts, short descriptions, infobox key-value data, image links and segmented article sections. It excludes references and non-prose elements such as images and charts themselves.
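
As a rough illustration of what working with such structured records could look like, the minimal Python sketch below parses one record and flattens it into plain text. The field names (`name`, `abstract`, `description`, `infobox`, `sections`) and the one-JSON-object-per-line layout are assumptions made for this example, not the dataset's confirmed schema; check the dataset's documentation on Kaggle before relying on them.

```python
import io
import json

# Hypothetical record: the field names and the JSON Lines layout are
# assumptions for illustration, not the dataset's confirmed schema.
SAMPLE_JSONL = """\
{"name": "Example article", "abstract": "A short summary.", "description": "sample entry", "infobox": {"Type": "Demo"}, "sections": [{"title": "History", "text": "First paragraph."}, {"title": "Usage", "text": "Second paragraph."}]}
"""


def iter_articles(stream):
    """Yield one parsed article dict per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_training_text(article):
    """Join the abstract and the section texts into one plain-text string."""
    parts = [article.get("abstract", "")]
    for section in article.get("sections", []):
        parts.append(f"{section.get('title', '')}\n{section.get('text', '')}")
    # Skip empty pieces so missing fields do not leave blank paragraphs.
    return "\n\n".join(p for p in parts if p.strip())


articles = list(iter_articles(io.StringIO(SAMPLE_JSONL)))
text = to_training_text(articles[0])
print(text)
```

With a real download, the `io.StringIO` stand-in would be replaced by an open file handle, and the field names adjusted to match the published schema.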

Because the content is taken from Wikipedia, it’s licensed under a Creative Commons Attribution-ShareAlike license, an open license that allows sharing, adapting and remixing content as long as attribution is given and derivative works carry the same license. It is also licensed under the GNU Free Documentation License, or GFDL, although in some cases public domain or alternative licenses may apply.

“Kaggle is already a top place people go to find datasets, and there are few open datasets that have more impact than those hosted by the Wikimedia Foundation,” said Brenda Flynn, partnerships lead at Kaggle.

LLM developers depend heavily on data from the internet to train their models, and they often get it by scraping public-facing websites. Web scraping is the automated extraction of content, usually text and images, from websites. Scraping software can be aggressive, adding load to web servers well beyond normal human traffic.

That additional load is a costly performance hit for the web servers that have to bear it. The scraped data also must be reformatted before machine learning and AI workflows can use it as training data.

Wikimedia and Kaggle said in the joint announcement that the dataset is designed to make this scraping unnecessary: it should lower the burden on Wikimedia’s web servers while providing data that is already clean, pre-parsed and developer-friendly.

Kaggle is host to more than 461,000 freely accessible datasets for AI and machine learning covering a wide variety of topics. Wikipedia’s dataset will join datasets on health (such as diabetes and cancer), finance (such as credit card fraud and the stock market) and social sciences (such as social media trends and education). There’s even a dataset containing nutrition information on 80 cereal products and one about UFO sightings.

The new Wikipedia dataset is available in French and English language editions on Kaggle as an early beta release. Since this is an early release, Kaggle is welcoming feedback and discussions about the dataset from the community directly.

Image: SiliconANGLE/Microsoft Designer
