Scrape wars

There’s a lot of scraping going on these days.

It looks like most AI applications that need to access content online are resorting to scraping web pages.

Many AI agents we’ve been working on rely on having some sort of access to online content. Of course, we started with a simple RSS aggregator: it’s clean, it’s efficient, it’s a rock-solid foundation for any application.
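
For concreteness, here's a minimal sketch of that kind of feed polling in Python, assuming the feedparser library; the feed URLs are placeholders, not anything from our actual setup:

```python
import feedparser  # pip install feedparser

# Placeholder list of feeds the aggregator would poll.
FEEDS = [
    "https://example.com/feed.xml",
    "https://example.org/rss",
]

def poll_feeds(feed_urls):
    """Fetch each feed and yield (title, link) for every entry."""
    for url in feed_urls:
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            yield entry.get("title", ""), entry.get("link", "")

for title, link in poll_feeds(FEEDS):
    print(f"{title} -> {link}")
```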

But not all sites have feeds. Though more do than one would think: many sites have feeds but don't advertise them, and in some cases the feed is simply a feature of the CMS, not a deliberate decision by the publisher.
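
Those unadvertised feeds are usually still discoverable, because most CMSes emit a <link rel="alternate"> tag in the page head. A rough sketch of that autodiscovery, assuming requests and BeautifulSoup (the URL is a placeholder):

```python
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

def discover_feeds(page_url):
    """Return feed URLs advertised via <link rel="alternate"> in the page."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    feeds = []
    for link in soup.find_all("link", rel="alternate"):
        if link.get("type", "").lower() in FEED_TYPES:
            feeds.append(requests.compat.urljoin(page_url, link.get("href", "")))
    return feeds

print(discover_feeds("https://example.com/"))  # placeholder URL
```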

But for those sites without feeds… well, we scrape them (and drop the content into a feed that we manage, using the aggregator as the central repository).
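
The "drop it into a feed" step is the easy part. Here's a rough sketch that wraps scraped items in a minimal RSS 2.0 document using only Python's standard library; the item data is made up for illustration:

```python
import xml.etree.ElementTree as ET

def items_to_rss(title, site_url, items):
    """Wrap scraped items (dicts with 'title' and 'link') in minimal RSS 2.0."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = f"Scraped feed for {site_url}"
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")

# Made-up items standing in for scraped content.
print(items_to_rss("Example", "https://example.com/", [
    {"title": "A post", "link": "https://example.com/a-post"},
]))
```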

Some sites don’t want us to scrape them and put up a fight. In most cases, we scrape them anyway.

If most publications were publishing feeds, we wouldn’t have to do this. They would control what is shared and what is not. Everyone would be happy.

Meanwhile, all my sites are getting tons of traffic from places like Boydton and Des Moines, where the big server farms sit and where swarms of bots scrape the web from. They waste lots of resources (theirs and mine) instead of just polling my perfectly up-to-date RSS feeds.
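
Polling done right is nearly free on both ends. A sketch of a polite conditional GET with requests, assuming the server honors ETag / Last-Modified (the URL is a placeholder):

```python
import requests

def poll_feed(url, etag=None, last_modified=None):
    """Fetch a feed only if it changed; returns (body_or_None, etag, last_modified)."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:  # unchanged: almost nothing transferred
        return None, etag, last_modified
    return (resp.text,
            resp.headers.get("ETag", etag),
            resp.headers.get("Last-Modified", last_modified))

body, etag, modified = poll_feed("https://example.com/feed.xml")  # placeholder
```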

PS: I wrote this post on Wordland. Refreshing.
