Wikipedia Offers AI Developers Its Article Data On Kaggle To Stop Automated Scraping

The Wikimedia Foundation, the organization behind the internet’s largest free encyclopedia Wikipedia, is offering an artificial intelligence-ready dataset on Kaggle that’s aimed at dissuading AI companies and large language model trainers from scraping the website.

“Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content — making this ideal for training models, building features, and testing NLP pipelines,” Wikimedia said in the announcement on Wednesday.

Kaggle is a data science and machine learning community owned and governed by Google LLC that hosts datasets and data science challenges.

The dataset upload is available as of April 15 and includes high-quality elements such as abstracts, short descriptions, infobox key-value data, image links and segmented article sections. It excludes references and non-prose elements such as images and charts themselves.
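As a rough illustration of what working with such structured records could look like, the Python sketch below reads JSON Lines records and joins the abstract and section text into one string per article. The field names (`name`, `description`, `abstract`, `infobox`, `sections`, `title`, `text`) and the one-record-per-line layout are assumptions made for this example, not the dataset's documented schema, and the sample records are invented.

```python
import io
import json

# Invented sample data. The field names are illustrative assumptions,
# not the documented schema of the Kaggle dataset.
SAMPLE_JSONL = "\n".join([
    json.dumps({
        "name": "Example article",
        "description": "A short description",
        "abstract": "An abstract paragraph.",
        "infobox": {"Founded": "2001"},
        "sections": [
            {"title": "History", "text": "First section text."},
            {"title": "Usage", "text": "Second section text."},
        ],
    }),
    json.dumps({"name": "Stub article", "abstract": "", "sections": []}),
])


def iter_records(stream):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_training_text(record):
    """Join the abstract and section bodies, skipping empty pieces."""
    parts = [record.get("abstract", "")]
    parts += [section.get("text", "") for section in record.get("sections", [])]
    return "\n\n".join(part for part in parts if part)


records = list(iter_records(io.StringIO(SAMPLE_JSONL)))
texts = [to_training_text(record) for record in records]
```

With real files, the `io.StringIO` stand-in would be replaced by an opened file handle, and the field names would need to be checked against the dataset's own documentation.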

Because the content is taken from Wikipedia, it’s freely licensed under the Creative Commons Attribution-ShareAlike 4.0 license, which allows sharing, adapting and remixing content with attribution. It is also licensed under the GNU Free Documentation License, or GFDL, although in some cases public domain or alternative licenses may apply.

“Kaggle is already a top place people go to find datasets, and there are few open datasets that have more impact than those hosted by the Wikimedia Foundation,” said Brenda Flynn, partnerships lead at Kaggle.

LLM developers depend heavily on data from the internet to train their models, and they typically obtain it by scraping public-facing websites. Web scraping is the automated extraction of content, usually text and images, from websites. The software involved can be aggressive, adding load to web servers above and beyond normal human traffic.

That additional load is a costly performance hit for the web servers that have to bear it. The scraped data also must be reformatted so that machine learning and AI workflows can use it for training data.

Wikimedia and Kaggle said in the joint announcement that the dataset is designed to short-circuit this scraping. The aim is not only to reduce the need for scraping and lower the burden on Wikimedia’s web servers, but also to provide clean, pre-parsed and developer-friendly data.

Kaggle is host to more than 461,000 freely accessible datasets for AI and machine learning covering a wide variety of topics. Wikipedia’s dataset will join datasets on health (such as diabetes and cancer), finance (such as credit card fraud and the stock market) and social sciences (such as social media trends and education). There’s even a dataset containing nutrition information on 80 cereal products and one about UFO sightings.

The new Wikipedia dataset is available in French and English language editions on Kaggle as an early beta release. Since this is an early release, Kaggle is welcoming feedback and discussions about the dataset from the community directly.

Image: SiliconANGLE/Microsoft Designer
