Dumb AIs, Smart Censors: The Future of Web Fragmentation

There is a growing trend toward controlling AI systems' and bots' access to the Internet and web content by giving creators, website owners, and operators granular control over their content. The message is clear: creators, publishers, journalists, and users who create content on platforms want more say in how AI developers use their work. But how can that control be granted without sacrificing the open Internet and the open web?

Who Are We Controlling?

In these discussions, we often talk about AI bots and systems as if they are entities unto themselves. But there are people and tech companies behind these crawlers and bots: actors with different incentives, purposes, resources, and demographics. Bots' desires and behaviors reflect the intentions of real people.

When we regulate or block bots, we are not just controlling software—we are controlling the humans behind them: developers, researchers, educators, and users of the Internet and AI tools. 

How Much Control is Too Much Control?

At first glance, the surge in scraping and AI bot traffic appears to need correction. Some assert that granular control is the way to prevent exploitation and bring justice to creators, authors, and publishers. AI systems have scraped massive swaths of the web without consent, attribution, or oversight, and AI bot traffic has strained public knowledge portals like Wikipedia (Link). Creators and site operators are understandably frustrated: their work is being used to train commercial AI models, often without compensation, and the integrity of their networks is being jeopardized.

A Permission-only Internet?

But the remedy should not be granting site owners and creators total, granular control at every layer through regulation, policies, or technical standards. If that happens, we risk turning the Internet into a permission-only environment. A place where every link comes with a license. Where knowledge is no longer shared, but enclosed. Where AI systems are trained only on data that can be legally licensed, cutting out independent voices, marginalized creators, and open platforms. It becomes a world of walled gardens and pay-to-play access to information. And for those who cannot pay, or cannot obtain a license, access to the web will be diminished. Dare I say, we might even face some dumb AIs because their access to data has been cut off.

The Global Commons Is at Stake

Control, when treated as an absolute right, has a way of backfiring. Imagine a group of large, well-funded news organizations collaborating on a valuable public-interest project: a digital dictionary to preserve an endangered indigenous language. What happens when that same group then blocks all AI systems from accessing the dictionary? Suddenly, a resource that could have helped speakers, researchers, and language models preserve and revitalize that language is locked away. The intent might be protective, but the impact is exclusionary. Should site operators be able to block all AI use of publicly available government reports? Should a news outlet prevent an AI from summarizing an article for a person who cannot read it? What about copyright overreach and fair use? What about datasets hosted on public APIs that are suddenly off-limits because they weren't explicitly marked for AI use?

We forget that the web—for all its flaws—has always been or strived to be an open, remixable, and generative space. Control is not the same as justice. Sometimes, the insistence on total control serves the powerful far more than the vulnerable.

What Rules, Standards, and Policies Do We Need?

We do need rules and technical standards that address this issue. But they should be proportionate, purpose-driven, and aware of context. One of the most important principles those rules and standards should uphold is keeping the Internet open. The web has always been more than a delivery system for property: it is a global commons where people learn, build, and share. Over-regulating content use through granular AI permissions risks shifting us toward a future where public knowledge is fragmented and fenced off. If we create a world where using every piece of content, every line of code, and every service needs a license, we lose the spirit of openness that made the web possible in the first place. Excessive control mechanisms, no matter how well-intentioned, can end up benefiting the already powerful while locking out smaller players, public-interest projects, and underrepresented voices.

Instead of building fortress walls around content, or asserting copyright over content that might not even belong to us, we can build defaults that favor the open web and the Internet, transparent attribution systems, and ways for creators and online service providers to contribute to knowledge online without breaking the web.

The real question is not whether creators deserve a say. Of course they do. The question is whether the mechanisms we are building today will serve the future of knowledge and the Internet, or throttle it.

The Ongoing Conversations

Can we rescue AI, the Internet, and ultimately people from too much control, while still helping to keep the integrity of the networks?

Policy and technical conversations on this issue are happening in multiple venues. The EU AI Code of Practice has specific paragraphs about content control, web crawlers, and specific technical means such as robots.txt. There are also ongoing conversations in a W3C community group. The IETF's AI Preferences (AIPREF) working group will hold a meeting this week, from Tuesday (April 8) to Thursday (April 10), to discuss signaling AI preferences. You can participate remotely, and if you don't have time, there will be a recording. We should take this opportunity to learn the basics and begin developing a shared vocabulary around AI crawler preferences, without straying into the territory of policy and rules.
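For readers new to these mechanisms, robots.txt is today's de facto preference signal. Below is a minimal sketch of how a site operator might express per-crawler preferences, using real AI crawler user-agent tokens (GPTBot, Google-Extended, CCBot) but a hypothetical site layout; note that robots.txt is purely advisory, and compliance by crawlers is voluntary:

```
# Opt out of OpenAI's training crawler entirely
User-agent: GPTBot
Disallow: /

# Opt out of Google's AI-training uses (search crawling is unaffected)
User-agent: Google-Extended
Disallow: /

# Block Common Crawl's bot from one (hypothetical) section only
User-agent: CCBot
Disallow: /archive/

# Everyone else may crawl everything
User-agent: *
Allow: /
```

This all-or-nothing allow/disallow vocabulary is exactly what the standards conversations aim to refine into richer, machine-readable preferences.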
