Scraping Challenges and Open Standards

Following up on my recent post about Scrape wars, I wrote a longer piece for my company site. I'm reposting it here for reference.

We’ve talked before about how everything you write should work as a prompt. Your content should be explicitly structured, easy for AI agents to read, interpret, and reuse. Yet, despite clear advantages, in practice we’re often stuck using workarounds and hacks to access valuable information.

Right now, many AI agents still rely on scraping websites. Scraping is messy, unreliable, and frankly a bit of a nightmare to maintain. It creates an adversarial relationship with companies who increasingly employ tools like robots.txt files, CAPTCHAs, or IP restrictions to block automated access. On top of that, major AI providers like OpenAI and Google are introducing built-in search capabilities within their ecosystems. While these are helpful, they ultimately risk creating a new layer of dependence. If content can only be efficiently accessed through these proprietary AI engines, we risk locking ourselves into another digital silo controlled by private platforms.
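To make the robots.txt part concrete, here is a minimal sketch of how an agent could check a site's rules before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `example-agent` user agent, and the URLs are made up for illustration; a real agent would download the file from the site it wants to read.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.org/news/1", "https://example.org/private/x"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "example-agent", url))
```

This only covers the polite half of the problem: robots.txt is advisory, and it says nothing about CAPTCHAs or IP blocks.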

There is a simpler, proven, and immediately available solution: RSS. Providing your content via RSS feeds allows AI agents direct, structured access without complicated scraping. Our agents, for example, are already using structured XML reports from the Italian Parliament to effectively monitor parliamentary sessions. This is an ideal case of structured openness. Agents such as our Parliamentary Reporter Agent and the automated Assembly Report Agent thrive precisely because these datasets are publicly available, clearly structured, and easily machine-readable.
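As a rough illustration of how little code structured access needs, the sketch below reads the items of an RSS 2.0 feed with Python's standard library. The feed content is invented for the example and is not one of the parliamentary feeds mentioned above; the function name `parse_rss_items` is also just an assumption for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the example runs without network access.
SAMPLE_FEED = """\
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.org/</link>
    <description>Made-up feed used for illustration</description>
    <item>
      <title>First post</title>
      <link>https://example.org/posts/1</link>
      <pubDate>Mon, 06 Jan 2025 10:00:00 GMT</pubDate>
      <description>Summary of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.org/posts/2</link>
      <pubDate>Tue, 07 Jan 2025 10:00:00 GMT</pubDate>
      <description>Summary of the second post.</description>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(feed_xml: str) -> list[dict[str, str]]:
    """Extract title, link, date, and summary from every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext() returns the default when a field is missing.
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "summary": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_FEED):
        print(entry["published"], entry["title"], entry["link"])
```

There is no HTML parsing and no guessing at page layouts: the fields an agent needs are already named in the feed.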

However, the reality isn’t always so positive. Other important legislative and governmental sites impose seemingly arbitrary restrictions. We regularly encounter ministries and other government websites that block access to automated tools or restrict access based on geographic location, even though their content is explicitly intended as public information. These decisions push us back into pointless workarounds or simply cut off access entirely, which is unacceptable when dealing with public information.

When considering concerns around giving AI models access to content, it’s essential to distinguish clearly between two different use cases. One is scraping or downloading massive amounts of data to train LLMs (this understandably raises concerns around copyright, control, and proper attribution). The other, entirely different and increasingly crucial, is allowing AI agents access to content purely to provide immediate, useful services to users. In these scenarios, the AI is acting much like a traditional user, simply reading and delivering relevant, timely information rather than training on vast archives.

Building on the simplicity of RSS, we can take this concept further with more advanced open standards, such as MCP (Model Context Protocol). Imagine a self-discovery mechanism similar to RSS feeds, but designed to handle richer, more complex datasets. MCP could offer AI agents direct ways to discover, interpret, and process deeper levels of information effortlessly, without the current challenges of scraping or the risk of vendor lock-in.
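For comparison, this is roughly what self-discovery already looks like for feeds: pages advertise them with `<link rel="alternate">` tags in the HTML head, and an agent can find them without guessing URLs. The sketch below shows only this existing feed autodiscovery convention, not MCP itself; the sample page, the class name `FeedLinkFinder`, and the function `discover_feeds` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# MIME types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

# Hypothetical page, for illustration only.
SAMPLE_HTML = """\
<html>
  <head>
    <title>Example site</title>
    <link rel="stylesheet" href="/style.css">
    <link rel="alternate" type="application/rss+xml"
          title="Example Feed" href="https://example.org/feed.xml">
  </head>
  <body><p>Hello</p></body>
</html>
"""


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that point to feeds."""

    def __init__(self) -> None:
        super().__init__()
        self.feeds: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None for bare attributes, so normalise them.
        attributes = {name: (value or "") for name, value in attrs}
        rels = attributes.get("rel", "").lower().split()
        feed_type = attributes.get("type", "").lower()
        href = attributes.get("href", "")
        if "alternate" in rels and feed_type in FEED_TYPES and href:
            self.feeds.append(href)


def discover_feeds(html: str) -> list[str]:
    """Return the feed URLs advertised in the given HTML document."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds


if __name__ == "__main__":
    print(discover_feeds(SAMPLE_HTML))
```

A richer standard would need an equivalent of this step: one predictable place where a site says what it offers to machines.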

Of course, valid concerns exist about data protection and theft at scale (curiously the same concerns appeared back in the early RSS days, and even when the printing press first emerged… yet we survived). But if our primary goal is genuinely to share ideas and foster transparency, deliberately restricting access to information contradicts our intentions. Public information should remain public, open, and machine-readable.

Let’s avoid creating unnecessary barriers or new digital silos. Instead, let’s embrace standards like RSS and MCP, making sure AI agents are our partners, not adversaries, in building a more transparent and connected digital landscape.
