Chinese internet search giant Baidu appears to have started blocking Alphabet Inc.'s Google and Microsoft Corp.'s Bing search engines from retrieving content from the mainland company's Wikipedia-style service, a Post investigation has found.
A recent update to Baidu Baike's robots.txt (a file that tells search engine crawlers which web addresses, formally known as uniform resource locators, they can access on a site) has completely blocked the Googlebot and Bingbot crawlers from indexing content from the Chinese platform.
The update appears to have occurred sometime on August 8, according to records from the Wayback Machine, the Internet Archive's web archiving service. Earlier that day, Baidu Baike had blocked off only parts of its website, while still allowing Google and Bing to view and index its online repository of around 30 million entries.
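Baidu Baike's exact directives are not reproduced here, but a complete block of the two crawlers can be expressed in just a few lines of a robots.txt file; the following is a hypothetical illustration, not the actual contents of the platform's file:

    User-agent: Googlebot
    Disallow: /

    User-agent: Bingbot
    Disallow: /

A "Disallow: /" rule covers every path on the site, whereas the earlier, partial restrictions would instead have listed specific directories after each Disallow line.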
The move marks Beijing-based Baidu's stepped-up efforts to protect its online assets, as demand grows for vast amounts of data to train and build artificial intelligence (AI) models and applications.
This follows a move in July by Reddit, the US social news aggregation platform and forum, to block various search engines, except Google, from indexing its online posts and discussions. Google has a multimillion-dollar deal with Reddit that gives it the right to collect data from the social media platform to train its AI services.
Since OpenAI released ChatGPT on November 30, 2022, major search platforms Google and Microsoft have been looking to get more data to use in their own generative artificial intelligence systems. Photo: Shutterstock
Even Microsoft threatened last year to cut off access to the internet search data it licenses to rival search engines unless they stopped using it as the basis for chatbots and other generative AI (GenAI) services, Bloomberg reported.
By comparison, the Chinese version of the online encyclopedia Wikipedia currently has 1.43 million entries that are accessible to search engine crawlers.
Following Baidu Baike's robots.txt update, the Post checked Google and Bing on Friday and found that entries from the Wikipedia-like service, likely old cached content, were still appearing in results on the US search platforms.
Representatives for Baidu, Google and Microsoft did not immediately respond to requests for comment on Friday.
Nearly two years after the groundbreaking release of OpenAI's ChatGPT, many leading AI developers around the world have signed deals with content publishers to provide quality content for their GenAI projects.
GenAI refers to algorithms and services such as ChatGPT that are used to create new content, including audio, code, images, text, simulations and videos.
For example, OpenAI inked a deal with American news magazine Time in June, giving it access to all of the magazine’s archived content spanning more than 100 years of history.
This article originally appeared in the South China Morning Post (SCMP), the most authoritative news source on China and Asia for more than a century. For more SCMP articles, visit the SCMP app or follow SCMP on Facebook and Twitter. Copyright © 2024 South China Morning Post Publishers Ltd. All rights reserved.