Thursday, July 4, 2024

Amazon investigating Perplexity over scraping abuse allegations

Amazon’s cloud division has launched an investigation into Perplexity AI over whether the AI search startup is violating Amazon Web Services rules by scraping websites that have tried to prevent such scraping, WIRED reports.

An AWS spokesperson, who spoke to WIRED on the condition of anonymity, confirmed that the company is investigating Perplexity. WIRED previously found that the startup, which is backed by Jeff Bezos’ family fund and Nvidia and was recently valued at $3 billion, appears to be relying on content from scraped websites that had blocked access through the Robots Exclusion Protocol, a common web standard. The Robots Exclusion Protocol is not legally binding, but terms of service generally are.

The Robots Exclusion Protocol is a decades-old web standard that lets you place a plain text file (such as wired.com/robots.txt) on a domain to specify pages that automated bots and crawlers shouldn’t visit. Companies that use scrapers can choose to ignore the protocol, but most companies have traditionally respected it. An Amazon spokesperson told WIRED that AWS customers must follow the robots.txt standard when crawling websites.

“AWS’ terms of service prohibit customers from using our services for any illegal activity, and customers are responsible for complying with our terms and all applicable laws,” the spokesperson said in a statement.
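To make the robots.txt mechanism described above concrete, here is a minimal sketch of how a well-behaved crawler might check a site's rules before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt body, the `ExampleBot` and `OtherBot` user-agent names, and the `example.com` URLs are illustrative assumptions, not rules taken from any real site, and the helper names `build_parser` and `is_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: it blocks one named crawler everywhere
# and blocks all other crawlers only under /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Report whether the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for agent, url in [
        ("ExampleBot", "https://example.com/story/some-article"),
        ("OtherBot", "https://example.com/story/some-article"),
        ("OtherBot", "https://example.com/private/report"),
    ]:
        print(agent, url, is_allowed(parser, agent, url))
```

A check like this only reports what the file asks for. Nothing in the protocol stops a crawler from skipping the check or identifying itself under a different user agent, which is why compliance is voluntary.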

The investigation into Perplexity’s actions follows a June 11 Forbes report that the startup had allegedly stolen at least one article. WIRED’s investigation confirmed the activity and found further evidence of scraping abuse and theft by a system linked to Perplexity’s AI-powered search chatbot. Engineers at WIRED’s parent company, Condé Nast, block Perplexity’s crawlers on all of its websites using robots.txt files. But WIRED found that Perplexity had used an unpublished IP address (44.221.181.252) to access and scrape Condé Nast’s websites at least hundreds of times in the past three months.

Machines associated with Perplexity appear to be conducting extensive crawls of news websites that prohibit bots from accessing their content, and spokespeople for The Guardian, Forbes and The New York Times said they had also found the IP address on their servers multiple times.

WIRED traced the IP address to a virtual machine called an Elastic Compute Cloud (EC2) instance hosted on AWS, which launched an investigation after we asked whether using AWS infrastructure to scrape prohibited websites violates the company’s terms of service.

Last week, Perplexity CEO Aravind Srinivas initially responded to WIRED’s inquiry by saying the questions we posed to the company “reflect a deep-seated misunderstanding of Perplexity and how the Internet works.” Srinivas later told Fast Company that the covert IP address WIRED saw scraping Condé Nast’s websites and a test site we created was operated by a third-party company that provides web crawling and indexing services. He declined to name the company, citing non-disclosure agreements. Asked if he would tell the third party to stop crawling WIRED, Srinivas said, “It’s complicated.”
