Header Text - The Reddit vs Perplexity Lawsuit and The Future of Content Access

Recently, Reddit filed a lawsuit against Perplexity AI, alleging that the company had unlawfully scraped content from the site. On the surface, it may look like just another case of two tech companies going to court over bot activity, but it raises an important question: where is the line between open, fair content gathering and unauthorized data access? The court’s decision in the Reddit vs Perplexity case could have a direct impact on how website owners and Web Hosting providers manage and protect proprietary content, as well as on the ethical and legal boundaries of how AI models collect and use it.

KEY TAKEAWAYS

  • In Reddit vs Perplexity, Reddit claims that Perplexity knowingly bypassed no-crawl directives to scrape data indirectly via Google Search results.
  • Traditional crawlers index content to drive website traffic; AI crawlers gather content to generate answers, often not directing visitors to websites.
  • The ruling could redefine what ‘publicly available’ means and whether the same restrictions that apply to AI crawlers also apply to summarization tools.
  • The lawsuit argues that access permissions and restrictions are becoming legal boundaries.
  • As developments continue, we can expect more on data scraping regulations and how AI could potentially change fair use policies. 
  • With the right tools and web hosting, you can protect your website from data scraping and downtime.

What Started the Reddit vs Perplexity Lawsuit?

It seems there’s a lot of legal activity occurring in the tech world these days. With the ongoing Automattic vs WP Engine case, we now have another battle between two internet giants.

The Reddit vs Perplexity lawsuit started with claims that Perplexity accessed and scraped, without permission, Reddit data and content that wasn’t publicly available.

In the court documents for Reddit vs Perplexity, Reddit alleges that Perplexity’s AI crawlers bypassed its access controls to obtain content for its AI answer engine, using Google Search results as a backdoor. Reddit also named three data-scraping services, Oxylabs, AWMProxy, and SerpApi, as co-defendants.

Reddit maintains that, unlike licensed partners such as OpenAI, the creator of ChatGPT, which pays for access to its content, Perplexity and the three co-defendants allegedly concealed their bots’ identities and locations to circumvent anti-scraping directives.

Strip Banner Text - Reddit claims Perplexity used Google search results to bypass crawling rules

Reddit further alleges that because Perplexity’s AI scrapers couldn’t access the site’s data directly, they accessed it indirectly through Google Search results. Reddit stated that the number of Reddit citations in Perplexity’s answers increased nearly 40-fold, even after a cease-and-desist letter was sent.

To further back up its accusation, Reddit set up a hidden test post that could only be seen by Google’s crawlers for indexing purposes, not by any other users.

The lawsuit claims that this “hidden” content appeared in Perplexity’s AI-generated summaries within hours, supposedly demonstrating that Perplexity used data from Google Search results. Reddit contended it had discovered clear evidence of circumvention.

Perplexity denied the allegations, saying it didn’t scrape or store Reddit content and is being unfairly targeted. The company also framed Reddit’s lawsuit as an attempt to gain leverage in broader negotiations over how platforms will charge developers for providing AI with access to data.

In its public response, the company said, “We do not train on Reddit data. We cite it like a search engine would cite a webpage.”

The company also wrote a post on Reddit saying: “We summarize Reddit discussions, and we cite Reddit threads in answers, just as people share links to posts here all the time.”

As you can see from the above statements, Perplexity considers itself an AI-powered search and answer engine, not a data harvester. However, this is not the first time Perplexity has been accused of this behavior.

In an August blog post, the Content Delivery Network (CDN) provider Cloudflare said it had found evidence of “stealth crawling” by Perplexity. After receiving complaints from customers who had disallowed and blocked the public (declared) PerplexityBot crawlers, Cloudflare found that bots were ignoring no-crawl directives and that those customers’ content was still somehow being accessed.

Cloudflare observed that Perplexity was using undeclared crawlers with multiple IP addresses to hide their identities, and that these bots were not retrieving robots.txt files in order to circumvent the blocks, much like the claims Reddit is making.

AI Crawlers vs Search Engine Bots

The web has long relied on the principle of open access, where websites are crawled and indexed by search engines, and traffic is directed back to those sites. However, that seems to be changing, thanks to generative AI models that aren’t just crawling but consuming data on a colossal scale, with bot traffic now accounting for nearly 30% of all website traffic.

To give you a better understanding of the technical side of this case, it helps to know how the way AI crawlers collect data differs from how search engine bots crawl and index websites.

While they share a name, the difference between traditional search engine crawlers, like Googlebot, and AI crawlers (GPTBot, PerplexityBot) lies in what they were originally created for and the methods they apply.

Traditional bots are designed for indexing, enabling them to understand a page’s content and direct visitors to your website through Search Engine Results Pages (SERPs). They operate transparently, identifying themselves and generally respecting robots.txt files, which specify which of your pages can and can’t be indexed.
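For illustration, a minimal robots.txt for this traditional indexing model might look something like the sketch below. The /private/ path and sitemap URL are placeholders; your own directives will depend on your site’s structure:

    # Let all well-behaved crawlers index the site, but keep them out of a private area
    User-agent: *
    Disallow: /private/

    # Point crawlers to the sitemap for efficient indexing
    Sitemap: https://www.example.com/sitemap.xml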

On the other hand, AI crawlers are made to gather as much information as possible, pulling massive amounts of text, code, and structured data to train Large Language Models (LLMs). The models then use Natural Language Processing to generate answers directly in the chat when you ask a question, so you don’t necessarily have to visit the source website.

A Wikipedia report shows an 8% decline in human page views since 2024, attributed to generative AI providing answers, often based on Wikipedia’s content, directly rather than by directing people to the website.

Technically, traditional search bots are designed to be light on a website’s server resources. They make multiple small visits to check for updates, ensuring the indexed content is fresh and relevant.

On the other end of the spectrum, AI crawlers tend to be heavy-handed. They make fewer visits but collect significantly more data per request, often retrieving entire pages and related files in bulk to capture the content’s context as training data for LLMs, which uses up more bandwidth and server resources.

As you can imagine, aggressive crawling by AI bots puts strain on a site, causing slow load times or even complete crashes.

Lastly, while search bots obey robots.txt files and follow crawl directives, AI bots can hide their identities by rotating IP addresses through proxies and intentionally ignore or bypass those directives to access data. This is where the problem comes in.

Impact on Website Owners & Content Creators

As we’ve discussed, LLMs rely on human-generated content to understand and contextualize the data they use for answers. This means that high-quality web pages, articles, and comments are not just for improving search engine rankings and visibility; they are becoming increasingly valuable as LLM training data.

In Reddit vs Perplexity, if Reddit’s complaint is successful, it could pave the way for tighter enforcement of access rights, resulting in stricter limits on what third parties can use from your site without explicit consent, as well as increased data licensing.

If Perplexity wins, however, it could mean we’re looking at a more open interpretation of what constitutes “fair-use” content for training LLMs and access for big data AI analysis tools.

That distinction between summarizing and training on data is now the center of the debate and could potentially dictate how LLMs operate in the future. For online businesses, this case has direct implications for data protection and usage.

From a practical standpoint, website owners and SMEs should start looking at:

  • Whether your robots.txt files and access permissions are configured for how you want data to be used (see the sketch after this list).
  • How search engine snippets and site feeds could expose your content to specific AI tools.
  • The terms of service related to user-generated content and its use by third parties.
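On that first point, a robots.txt along the lines below asks AI training crawlers to stay out while leaving search indexing untouched. GPTBot, PerplexityBot, and CCBot are real, commonly blocked user agents, but the list here is an example rather than exhaustive, and compliance by the bots is voluntary:

    # Block common AI training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # Search engine bots remain free to index the site
    User-agent: Googlebot
    Allow: /
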
Strip Banner Text - AI crawlers gather data to train LLMs, not index site content

What Happens Next?

Reddit is seeking damages and an injunction to block Perplexity, as well as Oxylabs, SerpApi, and AWMProxy, from further scraping, and to permanently stop them from using or selling any previously scraped Reddit data.

A win for Reddit in the Reddit vs Perplexity case could allow content creators to dictate better terms under which their data is accessed and used, and even force other companies to the negotiating table.

Conversely, a loss could essentially open the floodgates to even more sophisticated, potentially underhanded scraping of publicly accessible content, regardless of restrictions.

The case is still in its early stages, but its outcome could shape how platforms and AI companies negotiate data access in the future. The stakes are high, as they encompass defining what constitutes unfair competition, copyright infringement, and benefiting from others’ work in the context of AI.

Courts will need to decide whether accessing Reddit via Google’s search results constitutes circumvention and whether summarization constitutes derivative use, while defining the boundaries between “fair use” and “theft” in the context of AI.

Potential Effects on Content Use & Data Access

This lawsuit is entering a legal grey area where regulations are only beginning to be defined, particularly in relation to anti-circumvention and emerging AI data-use standards. From a technical standpoint, the case revolves around access controls and the boundaries defining what is fair game for data retrieval.

It also raises questions about copyright infringement, the value of online content, and who should benefit from it. 

Some of the questions on data privacy and security that now need to be addressed are:

  • Can accessing data and content via Google’s search-indexed websites still count as “unauthorized”?
  • Does summarizing publicly available content, with citations, differ legally from gathering it for use in an AI model’s answers?
  • How much control over their content do website owners truly have once their pages are indexed by search engines, given what’s happening with AI crawling?

This also has a potential knock-on effect of harming content creators’ website monetization, as their intellectual property is used without their consent or compensation.

For example, following its report on declining human page views, Wikipedia published a blog post on November 10, 2025, asking AI developers to use its content responsibly and to help sustain the world’s go-to source for free, accurate information. The post also states that by using the paid Wikimedia Enterprise platform, developers can ensure content contributors are correctly attributed and financially supported, while using the content sustainably at scale without straining Wikipedia’s servers.

Most importantly, will it be a free-for-all, or will there be limits to what can and can’t be accessed and used by LLMs? If it’s the former, it could set a dangerous precedent for data gathering. We will have to wait and see.

Protecting Your Data with Hosted.com®

As you can see, relying solely on a basic robots.txt file is no longer enough to protect your site’s content. The first line of defense against AI crawler scraping is your hosting. At Hosted.com®, we have you covered.

Our Web and WordPress Hosting security includes advanced server-level and Web Application Firewalls (WAFs) that can identify and block harmful bot traffic and DDoS (Distributed Denial of Service) attacks.

Monitoring software checks traffic patterns and bandwidth consumption for suspicious activity, flagging and blocking it. Our servers automatically limit the number of requests an IP address can make per second.
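To give a feel for what per-IP request limiting involves, here is a hypothetical sketch using Apache’s mod_evasive module. The thresholds are assumptions chosen for illustration, and on Hosted.com this kind of limiting is handled for you at the server level:

    # Hypothetical example: per-IP request limiting with Apache's mod_evasive
    <IfModule mod_evasive20.c>
        # Block an IP that requests the same page more than 10 times in 1 second
        DOSPageCount 10
        DOSPageInterval 1
        # Block an IP that makes more than 100 requests site-wide in 1 second
        DOSSiteCount 100
        DOSSiteInterval 1
        # Keep the offending IP blocked for 60 seconds
        DOSBlockingPeriod 60
    </IfModule>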

If you want even more control, you can also add .htaccess rules to block or allow specific bots or IP addresses. This helps prevent bulk requests while maintaining a smooth user experience. Remember, legitimate search engine bots exhibit predictable behavior; AI crawlers, on the other hand, tend to make random, bulk requests.
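As a rough illustration (the user-agent list is an example, and a bot that hides its identity will not match it), a few lines of Apache mod_rewrite in .htaccess can refuse requests from named crawlers:

    # Return 403 Forbidden to requests whose User-Agent matches listed AI crawlers
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|PerplexityBot|CCBot) [NC]
    RewriteRule ^ - [F,L]

Because rules like these only catch bots that declare themselves, undeclared crawlers rotating through proxy IP addresses still call for server-level defenses such as a WAF.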

Our Web Hosting also helps mitigate the impact of aggressive AI crawlers by providing you with CageFS security, isolating your site from others on the server.

With Hosted.com®, you get the resources and infrastructure your site needs for maximum performance and stability, backed by expert support, so you can focus on growing your business.

Strip Banner Text - Keep your website data and content safe with Hosted.com [Learn More]

VIDEO: How to Find the Perfect Domain Name – AI Domain Name Generator

FAQS

What is the Reddit vs Perplexity lawsuit about?

Reddit is suing Perplexity AI, asserting that the company unlawfully accessed and reused Reddit content without authorization. The lawsuit claims Perplexity circumvented Reddit’s access restrictions by scraping data via Google Search results.

What does Perplexity AI do?

Perplexity is an AI-powered search and answer tool that provides human-like responses to user input in a chat format.

What does AI crawling for LLM training mean?

AI crawling for LLM training means automatically collecting vast amounts of human-generated content from websites, which is then used to teach AI models how to respond to queries in natural language.

What’s the difference between data scraping and summarizing?

Data scraping involves automatically collecting large amounts of content from websites, while summarizing means processing or paraphrasing visible content and citing its source.

How can I protect my content from AI crawlers?

Review your robots.txt file and access control settings to establish clear rules about what content bots can index and what they can’t. Use a web hosting service that includes features to block unwanted crawlers.

Other Blogs of Interest

Giving AI Access To Your Personal Data – The Risks Of Agentic AI

Big Data, AI & Data Analysis Tools – How To Make Them Work For Your Business

AI Cyber Attack Guide: The Halloween Version

5 AI Tools That Can Help Your Business

Exploring AI Domains – The Future of Web Addresses