
Top websites block Google from training AI models on their data. Nowhere near as much as OpenAI, though.

Mar 15, 2024, 03:14 IST
Business Insider
Sundar Pichai on stage at Google IO 2023. Credit: Google
  • Google launched a new tool that lets publishers opt out of training Google's AI models.
  • More and more top-ranking websites are using it.

There's a grand bargain at the heart of the web: A small piece of code that has maintained order for decades.

Robots.txt lets website owners choose whether to let Google and other tech giants scrape their online content. Most sites have let Google do this because the company distributes so much valuable traffic.
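To make the mechanism concrete, here is a minimal Python sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page. It uses the standard library's urllib.robotparser. The file contents, the ExampleBot and SomeCrawler names, and the example.com URLs are invented for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is kept out of /private/,
# and a crawler calling itself ExampleBot is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("SomeCrawler", "ExampleBot"):
        for path in ("/news/story", "/private/notes"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

The file is a request, not an enforcement mechanism: it only works when crawlers choose to check it and honor the answer.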

Then, the AI wars began. It turns out that all this content has been stored in datasets that are the foundation for training powerful AI models, including those from OpenAI, Google, Meta, and others. These models often answer user questions directly, so less traffic may be distributed and the grand web bargain begins to unravel.

Part of Google's response has been to launch a new tool that lets websites block the company from using their content for training AI models. It's called Google-Extended. It came out in September 2023, and adoption is growing.
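In practice, Google-Extended is a user-agent token that a site names in its robots.txt file. The Python sketch below uses the standard library's urllib.robotparser and assumes a hypothetical site that disallows Google-Extended everywhere while saying nothing about Googlebot, Google's search crawler. The rules and the example.com URL are illustrative, not copied from any publisher.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: opt out of AI training via the Google-Extended token,
# with no rule for Googlebot, so search crawling stays allowed by default.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    print("Google-Extended:", can_crawl("Google-Extended", url))
    print("Googlebot:", can_crawl("Googlebot", url))
```

Under these assumed rules, the site opts out of AI training without removing itself from ordinary search crawling, which is the trade-off publishers are weighing.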

Data shared by Originality.ai shows the Google-Extended snippet is being used by about 10% of the top 1,000 websites, as of late March.

Use of code snippets that block tech companies from using online content for AI model training. Source: Originality.ai

The New York Times has enabled the Google-Extended blocker, according to a review of its robots.txt file. The publication, which is in a heated AI copyright battle with OpenAI, has also blocked that startup's access to its content.

The Times is also on a warpath against other companies that either tap online data for AI model training or compile this type of data for others to use in similar ways.

"Use of any device, tool, or process designed to data mine or scrape the content using automated means is prohibited without prior written permission," NYT states on its robots.txt page.

Prohibited uses include "the development of any software, machine learning, artificial intelligence (AI), and/or large language models (LLMs)," the publisher adds. A spokesperson for NYT declined to comment.

Google blocked less than OpenAI

Other websites have switched on the Google-Extended blocker too, including CNN, BBC, Yelp, and Business Insider, the publisher of this story.

However, far fewer sites block Google-Extended than block OpenAI's GPTBot, which is shut out by around 32% of the top 1,000 websites. CCBot, the crawler run by Common Crawl, is also blocked more often.
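A comparison like this can be approximated by checking each site's robots.txt for each crawler token and counting the share that block it. The Python sketch below shows one way to do that tally. It is not Originality.ai's method, which the data does not describe, and the four sample sites and their rules are invented, so the resulting shares are not real measurements.

```python
from urllib.robotparser import RobotFileParser

AI_TOKENS = ["GPTBot", "CCBot", "Google-Extended"]

# Invented robots.txt contents for four hypothetical sites.
SAMPLE_SITES = {
    "a.example": "User-agent: GPTBot\nUser-agent: CCBot\nDisallow: /",
    "b.example": "User-agent: GPTBot\nDisallow: /\n\nUser-agent: Google-Extended\nDisallow: /",
    "c.example": "",
    "d.example": "User-agent: GPTBot\nDisallow: /",
}


def blocks_token(robots_txt: str, token: str) -> bool:
    """Return True if robots_txt forbids `token` from fetching the site root.

    Assumption: a site that disallows every crawler with a wildcard rule
    also counts as blocking the token.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(token, "/")


def block_share(robots_by_site: dict, tokens=AI_TOKENS) -> dict:
    """Return, for each token, the fraction of sites that block it."""
    total = len(robots_by_site)
    return {
        token: sum(blocks_token(txt, token) for txt in robots_by_site.values()) / total
        for token in tokens
    }


if __name__ == "__main__":
    for token, share in block_share(SAMPLE_SITES).items():
        print(f"{token}: {share:.0%}")
```

A real survey would also need to fetch each file over the network and decide how to treat sites with missing or malformed robots.txt files, choices that can shift the percentages.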

BI asked Originality.ai CEO Jonathan Gillham why Google-Extended is being used less than other AI training data-blockers.

He said that if Google rolls out a generative AI search engine to the wider public, there's a risk that sites that have blocked the company's access to training data won't get picked up in AI-generated results.

"If a query is 'What is the best deep dish pizza in Chicago?' and a pizza shop excludes Google's AI from using its website data to train on, then it will not have any knowledge of that restaurant and be unable to include it in its response," Gillham explained.

Google is testing an early version of genAI search through its Search Generative Experience, or SGE. It's unclear if the company will launch this fully in the future, or how different it will be from the traditional Google search engine.

Those decisions will go a long way toward deciding the future of the web in this new AI world.

Axel Springer, Business Insider's parent company, has a global deal to allow OpenAI to train its models on its media brands' reporting.
