Robots.txt lets website owners decide whether Google and other tech giants can scrape their online content. Most sites have allowed it because Google distributes so much valuable traffic in return.
Then, the AI wars began. It turns out all this content has been stored in datasets that are the foundation for training powerful AI models, including those from OpenAI, Google, Meta, and others. These models often answer user questions directly, sending less traffic back to websites, and the grand web bargain has begun to unravel.
Part of Google's response has been to launch a new tool that lets websites block the company from using their content for training AI models. It's called Google-Extended. It came out in September, and it's getting some pickup.
Data shared by Originality.ai shows the Google-Extended snippet is being used by about 10% of the top 1,000 websites, as of late March.
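Google documents Google-Extended as a standard user-agent token in robots.txt. A minimal, illustrative snippet for a site that wants to opt out entirely (the blanket `Disallow: /` here is an example; sites can scope it to specific paths):

```
# Block Google's AI-training crawler token (does not affect Googlebot/Search)
User-agent: Google-Extended
Disallow: /
```

Because Google-Extended is separate from the Googlebot token, adding this directive blocks use of a site's content for Google's AI models without removing the site from ordinary search crawling.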
The New York Times is on a warpath with companies that either tap online data for AI model training or compile this type of data for others to use in similar ways.
"Use of any device, tool, or process designed to data mine or scrape the content using automated means is prohibited without prior written permission," NYT states on its robots.txt page.
Prohibited uses include "the development of any software, machine learning, artificial intelligence (AI), and/or large language models (LLMs)," the publisher adds. A spokesperson for NYT declined to comment.
Google-Extended is blocked less than OpenAI's GPTBot
Other websites have also switched on Google-Extended, including CNN, BBC, Yelp, and Business Insider, the publisher of this story.
However, Google-Extended has had much less pickup than OpenAI's GPTBot, which is hovering at around 32% of the top 1,000 websites. CCBot, offered by Common Crawl, has also been switched on by more sites.
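GPTBot and CCBot are blocked the same way, via their own user-agent tokens in robots.txt. A sketch of the directives a publisher might add (again using a blanket `Disallow: /` for illustration):

```
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Block Common Crawl's crawler, whose archives feed many training datasets
User-agent: CCBot
Disallow: /
```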
BI asked Originality.ai CEO Jonathan Gillham why Google-Extended is being used less than other AI training data-blockers.
He said that if Google rolls out a generative AI search engine to the wider public, there's a risk that sites that have blocked the company's access to training data won't get picked up in AI-generated results.
"If a query is 'What is the best deep dish pizza in Chicago?' and a Pizza shop excludes Google's AI from using its website data to train on, then it will not have any knowledge of that restaurant and be unable to include it in its response," Gillham explained.
Google is testing an early version of genAI search through its Search Generative Experience, or SGE. It's unclear if the company will launch this fully in the future, or how different it will be from the traditional Google search engine.
Those decisions will go a long way toward deciding the future of the web in this new AI world.
Axel Springer, Business Insider's parent company, has a global deal to allow OpenAI to train its models on its media brands' reporting.