Paolo Valdemarin Weblog: RssDistiller How To

Updated: 29-11-2002; 3:58:29 PM.

RssDistiller How To

Here's how I usually create filters for RssDistiller.

Note: of course, you need to have Radio UserLand and RssDistiller installed for this to work.

open the page that you want to scrape in a browser window;
choose "view source" from your browser menu (how to do this depends on which OS and browser you are using; right-clicking on the page will probably bring up a menu with such a command) to get the html source code for the page in a separate window;
now in the original page select the first words of the first news item that you want to pick up and copy them;
in the source code window run a search for the copied text; this will bring you to the point in the html code where the item begins;
now, this is the tricky part: you need to find two repeating elements that will let RssDistiller properly parse the page and "understand" where each news item begins and where it ends; we'll get back to this point soon;
open the RssDistiller "Add feed" page in a new browser window;
fill in the form:
- Target page - copy and paste here the url of the page that you want to scrape
- Channel title - Choose a name for the channel
- Channel description - Insert a short description for this new channel
- Rss file name - This is the name for the rss file; avoid spaces and non-alphanumeric characters such as / ? %. You don't need to add .xml at the end of the name
- Refresh every - This number defines how frequently the site will be checked for updates
- Save channel to disk - Select "Yes" to have your feed saved to disk. "No" is only needed if you want to merge several feeds.
- Filter - enclosePath is the filter that we need to extract content from a page. HasChanged only generates a news item when a page changes (useful if the site does not change often and you cannot create a filter for it)
- Proceed to filter setup
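
The "search for the copied text" step can also be done with a few lines of Python instead of the browser's source window. This is only a sketch: the function name find_in_source and the tiny page_source string are made up for illustration, and nothing here is part of RssDistiller itself.

```python
def find_in_source(source: str, snippet: str, context: int = 60) -> str:
    """Return the part of `source` surrounding the first occurrence of `snippet`."""
    pos = source.find(snippet)
    if pos == -1:
        raise ValueError(f"{snippet!r} not found in page source")
    # Show some html before and after the match, so the repeating
    # tags around the news item become visible.
    start = max(0, pos - context)
    end = min(len(source), pos + len(snippet) + context)
    return source[start:end]


# Hypothetical page source, modeled on a simple news table.
page_source = (
    '<table cellpadding="3" cellspacing="3">\n'
    '<tr><td bgcolor="ivory">'
    '<font size="5">News Number one</font><br>\n'
    'Text of the first news.</td></tr>\n'
    '</table>'
)

surroundings = find_in_source(page_source, "News Number one", context=40)
print(surroundings)
```

The printed fragment shows which tags sit right before and after the copied words, which is where candidate delimiters usually come from.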

This is where it can get tricky: we must find the four magic words that will create our feed. The four magic words are:

ignore text before
start pattern
end pattern
ignore text after

The first and the fourth magic words are somewhat optional; the second and the third are the really important ones.

Let's assume that the page we want to parse contains something like this:

News Number one
Text of the first news. The quick brown fox jumps over the lazy dog

News Number two
Text of the first news. The quick brown fox jumps over the lazy dog

News Number three
Text of the first news. The quick brown fox jumps over the lazy dog

The portion of the html source for this table (which you found by searching for the beginning of the news item) will look something like this:

<table cellpadding="3" cellspacing="3">
    <tr>
        <td bgcolor="ivory">
            <font size="5">News Number one</font><br>
            Text of the first news. The quick brown fox jumps over the lazy dog
        </td>
    </tr>
    <tr>
        <td bgcolor="ivory">
            <font size="5">News Number two</font><br>
            Text of the first news. The quick brown fox jumps over the lazy dog
        </td>
    </tr>
    <tr>
        <td bgcolor="ivory">
            <font size="5">News Number three</font><br>
            Text of the first news. The quick brown fox jumps over the lazy dog
        </td>
    </tr>
</table>

Now, let's try to figure out our magic words.

First, let's find the start pattern and the end pattern.

Here's my suggestion:

As you can see, each news item begins and ends with this same pattern. Also note that when the rss feed is created, the start and end patterns are not included in the feed, so with this approach we remove the table tags that might cause some disruption when viewing the feed in your aggregator.
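
To see what a start/end pattern pair does, here is a rough Python sketch. It assumes the start pattern is the opening <td bgcolor="ivory"> tag and the end pattern is </td>, which is one plausible choice for a table like the sample. The function split_items is hypothetical and only imitates the idea; it is not how RssDistiller is actually implemented.

```python
from typing import List


def split_items(html: str, start_pattern: str, end_pattern: str) -> List[str]:
    """Collect the text found between each start/end pattern pair."""
    items = []
    pos = 0
    while True:
        start = html.find(start_pattern, pos)
        if start == -1:
            break
        # Skip past the start pattern so it is not part of the item.
        start += len(start_pattern)
        end = html.find(end_pattern, start)
        if end == -1:
            break
        items.append(html[start:end].strip())
        pos = end + len(end_pattern)
    return items


# Shortened sample table with two items.
sample = """<table cellpadding="3" cellspacing="3">
    <tr>
        <td bgcolor="ivory">
            <font size="5">News Number one</font><br>
            Text of the first news.
        </td>
    </tr>
    <tr>
        <td bgcolor="ivory">
            <font size="5">News Number two</font><br>
            Text of the second news.
        </td>
    </tr>
</table>"""

# Assumed patterns: the opening and closing cell tags.
items = split_items(sample, '<td bgcolor="ivory">', "</td>")
for item in items:
    print(item)
    print("---")
```

Each extracted item keeps the inner markup (the font tag and the text) but none of the table cell tags, which matches the point that the patterns themselves stay out of the feed.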

Now, for Ignore text before and Ignore text after, my candidates are:

This means that all the code coming before the table opening and after the table is closed will be ignored, making the filter faster and avoiding other contents that you might not like to have in your feed.
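
The effect of the two "ignore" words can be sketched in Python as a simple trim. The assumptions here are mine: the opening table tag is used as Ignore text before, </table> as Ignore text after, and the markers themselves are thrown away along with everything outside them. The function trim_page and the sample page are hypothetical and do not come from RssDistiller.

```python
def trim_page(html: str, ignore_before: str, ignore_after: str) -> str:
    """Keep only the text between the two markers (markers excluded)."""
    begin = html.find(ignore_before)
    # If the first marker is missing, keep the page from its beginning.
    begin = 0 if begin == -1 else begin + len(ignore_before)
    finish = html.find(ignore_after, begin)
    # If the second marker is missing, keep the page up to its end.
    if finish == -1:
        finish = len(html)
    return html[begin:finish]


# Hypothetical page with a menu and a footer around the news table.
page = """<html><body>
<div>Site menu | Home | About</div>
<table cellpadding="3" cellspacing="3">
    <tr>
        <td bgcolor="ivory">
            <font size="5">News Number one</font><br>
            Text of the first news.
        </td>
    </tr>
</table>
<div>Copyright notice</div>
</body></html>"""

trimmed = trim_page(page, '<table cellpadding="3" cellspacing="3">', "</table>")
print(trimmed)
```

Only the rows of the table are left, so a later search for start and end patterns has less text to go through and cannot pick up the menu or the footer by mistake.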

As for the other tags that fall inside the code selected by ignore before/after but outside the start/end patterns (for example the <tr> tags), they will simply be dropped by the distiller.

With this approach you will be able to extract feeds from most database-driven web sites since, by design, they generate repeating patterns that you can use as delimiters.

Finally, something else to look for in html code is comments: quite often programmers include debugging tags in their pages, such as:

These are great to use as delimiters in your filters.
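
As an illustration of comments used as delimiters, here is a small Python sketch. The comment texts "begin item" and "end item" are invented for this example (real pages will use whatever their programmers wrote), and items_between_comments is a hypothetical helper, not an RssDistiller function.

```python
import re
from typing import List


def items_between_comments(html: str, start_comment: str, end_comment: str) -> List[str]:
    """Return the text between each pair of delimiter comments."""
    # re.escape keeps characters such as "!" and "-" literal;
    # the non-greedy group stops at the nearest end comment.
    pattern = re.escape(start_comment) + r"(.*?)" + re.escape(end_comment)
    return [match.strip() for match in re.findall(pattern, html, flags=re.DOTALL)]


# Hypothetical page whose items are wrapped in debugging comments.
page = """<div>
<!-- begin item -->
<b>News Number one</b> Text of the first news.
<!-- end item -->
<!-- begin item -->
<b>News Number two</b> Text of the second news.
<!-- end item -->
</div>"""

items = items_between_comments(page, "<!-- begin item -->", "<!-- end item -->")
for item in items:
    print(item)
```

Comments like these tend to survive redesigns of the visible layout, so a filter built on them may need less refining than one built on formatting tags.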

At this point all you have to do is save the filter and subscribe to it. Your last stop is your news aggregator, to see what your new feed looks like.

You will probably have to refine it the first time: just edit the feed and make some slight changes to see how it reacts. Unsubscribing and re-subscribing to the feed should refresh the view in your aggregator.

That's all folks, enjoy RssDistiller and just write to me if you need any assistance or just to tell me which part of the web you are distilling.