Robots.txt got some much-needed TLC last week, courtesy of Cloudflare’s latest update.
Cloudflare’s new Content Signals Policy effectively upgrades the decades-old honor system, adding a way for publishers to spell out how they do (and, perhaps more importantly, how they don’t) want AI crawlers to use their content once it’s scraped.
For publishers, that distinction matters because it shifts the robots.txt file from a blunt yes-or-no tool into a way of distinguishing between search, AI training and AI outputs. And that distinction goes to the heart of how their content is used, valued and potentially monetized.
The policy includes the option to signal that AI systems shouldn’t use their material for things like Google’s AI Overviews or inference.
Several publishers Digiday has spoken to over the last several months have at one point or another described the current robots.txt as “unfit for purpose.” This upgrade still doesn’t guarantee AI compliance, but it sets a new precedent for transparency: publishers can now spell out, in black and white, how they want AI crawlers to use their content – a move many have welcomed as long overdue.
And yet, none are blind to the glaringly obvious: without enforceability, the risk remains that AI platforms will still extract value from their work without compensation.
“The Policy separates out search, AI-train, and AI-crawl, which is a well-evolved understanding of how publishers should think about AI,” said Justin Wohl, vp of strategy for Aditude and former chief revenue officer for fact-checking site Snopes and TV Tropes.
Cloudflare’s policy distinguishes between different ways AI systems use content: ‘search,’ where material might be pulled into something like an AI Overview with the potential for attribution or referral; ‘train,’ where content is ingested to build the model itself, often without compensation; and ‘crawl,’ where bots systematically scrape pages. For publishers, separating these use cases matters because only one of them offers even the possibility of return, while the others risk extracting value without reward, noted Wohl.
“The Content Signals Policy is an increasingly necessary solution in that when Google is creating its AI Overviews, the bots are somewhat indistinguishable from humans as they navigate sites, and are going to cause publishers’ IVT scores to explode, if the user agents haven’t been identifiable and the scoring impacts of them mitigated by the companies measuring such things for advertisers,” added Wohl.
Five publishers Digiday spoke to for this article said the update to the robots.txt signals is a good start in letting publishers dictate how their data is used for search versus AI training. “That much-needed nuance is overdue and a genuinely positive step forward,” said Eric Hochberger, CEO and co-founder of Mediavine. “I’d love to see it go further to truly empower publishers to regain control over their content,” he added.
That’s something other initiatives, like the Really Simple Licensing (RSL) standard being developed by groups including Reddit, Fastly and news publishers, are working on. Whereas Cloudflare’s update is about giving publishers the ability to specify what they do allow AI crawlers to use their content for, RSL has created a standard for publishers to then set up AI remuneration – essentially royalties for whenever their content is scraped for retrieval-augmented generation (RAG).
Cloudflare will add the new policy language to robots.txt for customers that use it to manage their files, and is publishing tools for others who want to customize how crawlers use their content.
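In practice, the new signals sit alongside the familiar allow/disallow rules rather than replacing them. A rough sketch of what a content-signals-enabled robots.txt could look like – the directive and signal names here are based on Cloudflare’s published policy and should be treated as illustrative:

```
# Content signals declare how content may be used AFTER it is accessed:
#   search   - building a search index and showing links or snippets
#   ai-input - feeding content into AI answers (e.g. RAG, AI summaries)
#   ai-train - training or fine-tuning AI models
Content-Signal: search=yes, ai-input=no, ai-train=no

# Conventional access rules still control which bots may crawl at all
User-agent: *
Allow: /
```

The key design choice is that access (crawl rules) and usage (content signals) are expressed separately, which is exactly the search-versus-training distinction publishers have been asking for.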
Progress, but still an elephant in the room
For all the positives, neither RSL nor Cloudflare’s update addresses the elephant in the room: whether AI crawlers will actually honor these signals, especially the one publishers care about most – Google.
Google technically separates its search crawler (Googlebot) and its AI crawler (Google-Extended), but in practice they overlap. Even if a publisher blocks Google-Extended, their content can still show up in AI Overviews, because those are tied to Google Search. In other words, AI Overviews are bundled with the core search crawler, not treated as a separate opt-in. That has meant most publishers haven’t been able to opt out of Google’s AI crawler for fear of their search traffic being affected.
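That overlap is visible in robots.txt itself. A publisher can block Google-Extended while leaving Googlebot untouched, but because AI Overviews ride on the main Search crawl, doing so only opts content out of Gemini model training, not out of AI Overviews. A sketch of that partial opt-out:

```
# Opt out of Google's AI training crawler token (Gemini)
User-agent: Google-Extended
Disallow: /

# The main Search crawler stays allowed - and since AI Overviews
# are generated from the Search index, content can still surface there
User-agent: Googlebot
Allow: /
```

Fully opting out of AI Overviews would mean blocking Googlebot itself, which is the search-traffic trade-off most publishers won’t accept.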
“I think it [content signals policy] is an interesting idea. But I don’t see any indication that Google and others will follow it,” said a senior exec at a large news organization, who spoke on condition of anonymity. “Google has been pretty clear they see AI summaries as fair use.”
Earlier this month, media group Penske became the biggest publisher to sue Google specifically for allegedly harming its traffic with AI Overviews and for alleged illegal content scraping. Meanwhile, the tech giant is currently working out remedies with the DOJ in court to determine how it will rectify what has been deemed an illegal monopoly of its ad exchange and ad server.
“Publishers all should commonly be in alignment that AI and Search crawlers should be distinguishable and treated differently,” said Wohl. “I do hope that Google, perhaps via the Chrome team, will see the sensibility in this from the perspective of how their browser works and impacts downstream parties,” he added.
While publishers have welcomed Cloudflare’s update for the added clarity it brings, many acknowledge it’s just a stopgap: without guaranteed enforcement, the real risks from AI are only partially addressed. But it’s progress.
It sets an important legal precedent, said Paul Bannister, CRO of Raptive. “It puts in parameters that a good actor should follow and if they don’t, you can take [legal] action. You may not win, but you can take action. You can, of course, ignore legal stuff, but if you do, you’re taking a real risk that there can be issues there. So much of this is laying the groundwork for how this is all going to look. It’s a small step forward, but it pushes the ball in the right direction.”