Prominent News Outlets Opt Out of Apple’s AI Training

Less than three months after Apple quietly debuted a tool for publishers to opt out of its AI training, a number of prominent news outlets and social platforms have taken the company up on it. WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED’s parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple’s AI training.

Applebot-Extended: A New Tool for Data Control

This new tool, Applebot-Extended, is an extension of Apple’s web-crawling bot that lets website owners tell Apple not to use their data for AI training. It gives publishers that control without forcing them to block the original Applebot outright, a step that would also affect how their content appears in Apple search products.

Blocking Applebot-Extended

Publishers can block Applebot-Extended by updating a text file on their websites known as robots.txt, which implements the Robots Exclusion Protocol. That file has governed how bots crawl the web for decades and is now central to the fight over AI training data. Many publishers have already updated their robots.txt files to block the bots of other major AI players, such as OpenAI and Anthropic.
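
In practice, opting out takes only a few lines. A minimal robots.txt entry, using the Applebot-Extended user agent that Apple documents alongside the analogous rules for OpenAI’s and Anthropic’s crawlers, looks like this:

    # Opt out of Apple's AI training while leaving search crawling intact
    User-agent: Applebot-Extended
    Disallow: /

    # Analogous opt-outs for OpenAI's and Anthropic's crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

Because the original Applebot is not named, Apple’s search indexing continues as before.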

Applebot-Extended is so new that relatively few websites block it yet. Recent analyses indicate that only around 6-7 percent of high-traffic websites, predominantly news and media outlets, are blocking Applebot-Extended. This suggests that many website owners either do not object to Apple’s AI training practices or are simply unaware of the option to block the bot.
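
Checks like those behind such analyses are straightforward to script. As a rough sketch in Python (example.com is a placeholder domain), the standard library can test whether a site’s robots.txt blocks the bot:

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the target site's robots.txt (placeholder domain)
    parser = RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()

    # can_fetch() evaluates the file's rules for the named user agent;
    # False means the site is blocking Apple's AI-training crawler
    blocked = not parser.can_fetch("Applebot-Extended", "https://example.com/")
    print("Applebot-Extended blocked:", blocked)

Running a check like this across a list of high-traffic domains produces block rates like the ones cited above.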

Strategic Decisions and Licensing Deals

A divide has emerged among news publishers over whether to block these bots. Some outlets state outright that they block AI scraping tools because no commercial agreement exists with the bots’ operators. Others have struck licensing deals that allow the bots in exchange for compensation.

For example, Condé Nast websites used to block OpenAI’s web crawlers but unblocked them after the company announced a partnership with OpenAI. BuzzFeed, on the other hand, blocks every AI web-crawling bot it can identify unless a partnership agreement is in place.

Challenges in Keeping Up with AI Agents

Because robots.txt must be edited manually and new AI agents debut constantly, keeping a block list up to date can be challenging. Services like Dark Visitors automatically update a client site’s robots.txt, catering primarily to publishers concerned about copyright.
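
A minimal sketch of that idea in Python is shown below; the crawler list is an illustrative placeholder rather than the curated, continuously updated feed such services actually maintain:

    # Illustrative, hand-maintained list of AI crawler user agents;
    # commercial services keep equivalents current behind an API
    AI_CRAWLERS = ["Applebot-Extended", "GPTBot", "ClaudeBot"]

    def render_block_rules(user_agents):
        """Render robots.txt rules disallowing each crawler site-wide."""
        return "\n\n".join(
            f"User-agent: {agent}\nDisallow: /" for agent in user_agents
        ) + "\n"

    # Append the generated rules to the site's existing robots.txt
    with open("robots.txt", "a", encoding="utf-8") as f:
        f.write("\n" + render_block_rules(AI_CRAWLERS))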

Legal and Commercial Implications

Some news outlets, like The New York Times, are critical of the opt-out nature of tools like Applebot-Extended. They argue that scraping or using content for commercial purposes without prior permission is prohibited by law and their terms of service.

It remains unclear whether Apple is close to finalizing deals with publishers. If such deals are struck, the consequences of any data licensing or sharing arrangements may be visible in robots.txt files even before public announcements.

As Jon Gillham from Originality AI notes, the battle for AI training data is playing out on a seemingly obscure text file, highlighting the significant impact of this technology on the future of the web.
