AI, Crawlers and the Open Web

The rise of AI-driven apps on the Internet has reignited debates about online access, rights management, and digital fairness. From website protection to licensing models, stakeholders face complex technological, legal, and economic challenges when trying to govern the AI crawlers and bots that roam the Internet harvesting content. This blog post draws on our research and on several engagement sessions, in particular the joint session that DigitalMedusa and the Center for Journalism and Liberty at the Open Markets Institute held back in February.
Blocking Crawlers & Website Protection
Websites frequently deploy services like Cloudflare to shield their content from unwanted scraping and crawling. While this protects against excessive data extraction for legitimate reasons, a lack of transparency in blocking policies can inadvertently disrupt legitimate web interactions. Over-blocking may restrict access to secure communication channels, research, and journalistic activities. Noncommercial AI systems that depend on crawling can be especially affected.
One challenge lies in distinguishing AI crawlers from traditional web crawlers, particularly as AI models increasingly rely on large-scale data extraction. Policymakers and businesses should exercise caution in defining and enforcing crawler-related restrictions, to avoid unintended consequences for the Internet and for access to the web.
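To see why the distinction is fragile, consider how crawler detection is commonly done in practice: matching the User-Agent header against published AI crawler names. The sketch below assumes a small illustrative token list (GPTBot, ClaudeBot, CCBot, and Google-Extended are real published crawler tokens, but the list is not exhaustive), and any client can simply omit or spoof its user agent.

```python
# Sketch: naive User-Agent classification of AI crawlers.
# The token list is illustrative, not exhaustive, and a crawler
# can spoof or omit its User-Agent header entirely.

AI_CRAWLER_TOKENS = ("gptbot", "claudebot", "ccbot", "google-extended")

def looks_like_ai_crawler(user_agent: str) -> bool:
    """Return True if the user agent matches a known AI crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in AI_CRAWLER_TOKENS)

print(looks_like_ai_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)"))      # True
print(looks_like_ai_crawler("Mozilla/5.0 (Windows NT 10.0) Firefox/120"))  # False
```

Anything this heuristic misses gets treated as ordinary traffic, and anything it over-matches gets blocked, which is exactly the over-blocking risk described above.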
Opt-In vs. Opt-Out & User Controls
The debate over opt-in versus opt-out policies underscores the tension between user control and content discoverability. Opt-in and opt-out modalities are preferences that site operators set to signal to crawlers whether, and which parts of, their website may be crawled. Default settings influence the visibility of content online and on search engines, often favoring platforms with predefined access privileges.
We can’t restrict the use of robots.txt (the protocol that signals to a crawler which parts of a site may or may not be crawled) to a binary opt-in or opt-out. Instead we need to grant site operators and publishers a range of controls so that they can decide, proportionally, which AI crawler functions they want to restrict. Granular preferences are essential to aligning transparency and control, especially across search indexing, model training datasets, and retrieval-augmented generation (RAG) systems. However, enforcing such controls at scale remains an open question: without clear distinctions between user agents and between different AI crawler functions, implementing effective content preferences becomes nearly impossible.
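To make the granularity point concrete: robots.txt already lets an operator address individual user agents, so a site can, for instance, disallow a training crawler while leaving ordinary search crawlers untouched. A minimal sketch using Python's standard-library parser follows (GPTBot is OpenAI's published training crawler token; the URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that disallows GPTBot while allowing all other
# crawlers. The file only expresses a preference; compliance is
# voluntary on the crawler's side.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

Crucially, nothing in the protocol enforces these rules, and robots.txt addresses crawlers by name rather than by function, which is why distinguishing search indexing from training from RAG remains unsolved.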
AI, Content Access & Compensation
Search engines have long facilitated content discovery, benefiting both users and publishers. However, when AI models ingest content for training without compensation, publishers see diminishing returns. In this context, AI is seen as both an opportunity and a threat—capable of broadening reach while undermining traditional revenue models.
The solutions proposed so far are suboptimal and may have unintended consequences. Copyright alone may not be the ideal tool to address compensation gaps. Licensing negotiations introduce complexities around valuation, distribution, and decision-making. Who determines fair compensation? How is it distributed? These questions remain unresolved.
Legal & Regulatory Challenges
Fragmented copyright laws further complicate AI’s relationship with content access. While some jurisdictions offer broad exceptions for AI training, others impose strict constraints. Fair use interpretations remain unsettled in AI-generated content, leaving legal uncertainty for both rights holders and AI developers.
A growing concern is regulatory inconsistency: different jurisdictions impose divergent requirements, creating friction in a global digital ecosystem. Can multiple directives coexist without undermining enforcement efforts? Ensuring regulatory harmony while respecting local policies is an ongoing challenge. This fragmented approach can even lead to fragmented access to content on the Internet. Sometimes an entire country’s traffic is blocked just to prevent bot traffic from that country (as reported in Ars Technica).
Internet Evolution & Business Models
The evolution of the Internet has always prioritized convenience and access, but AI-driven content distribution might be reshaping digital economics. Smaller publishers struggle to compete with platforms that leverage AI for content aggregation. Meanwhile, journalists and researchers rely on web crawlers themselves, making stringent enforcement impractical.
As AI reshapes content monetization and distribution, job displacement and shifting traffic patterns raise concerns about sustainable business models for media and content creators.
Copyright & Content Control
Creative Commons operates within the copyright framework, offering legal tools for rights holders to control content use. However, broader issues persist:
- Can preference signals (e.g., “Do Not Scrape”) be made more effective without resorting to copyright enforcement?
- Could a noncommercial use model with attribution provide an alternative?
- Historically, ignoring scraping restrictions has enabled valuable archives (e.g., preserving government records).
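One concrete direction for stronger preference signals, offered here purely as an illustration, is to express the opt-out at the HTTP layer rather than through copyright claims. The W3C community draft known as the TDM Reservation Protocol (TDMRep) takes roughly this shape; the fragment below is a sketch based on that draft, and the policy URL is a placeholder:

```
HTTP/1.1 200 OK
Content-Type: text/html

tdm-reservation: 1
tdm-policy: https://example.com/tdm-policy.json
```

Here `tdm-reservation: 1` signals that text-and-data-mining rights are reserved, and the optional policy URL points to machine-readable licensing terms. Like robots.txt, this is a declaration, not an enforcement mechanism.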
While copyright enforcement aims to secure fair compensation, it may not be the most effective mechanism for governing AI’s use of content and protecting the open Internet.
Power Dynamics & Future Considerations
Crawler blocking influences who can access content, creating winners and losers in the digital ecosystem. The ability to enforce AI-related content controls remains uncertain—technical and legal barriers make enforcement difficult at scale.
Looking ahead, key questions persist:
- Can AI-driven solutions counteract AI-driven challenges in content access?
- Who decides how much control creators should have over AI training?
- Could rightsholding structures be used for control rather than fair use?
The answers will shape the future of online content governance and access to knowledge, determining whether AI remains a force for democratization and facilitating access or exacerbates existing power imbalances.





