This is where it can get tricky: we must find the four magic words that will create our feed. The four magic words are:
- ignore text before
- start pattern
- end pattern
- ignore text after
The first and the fourth magic words are more or less optional; the second and the third are the really important ones.
Let's assume that the page we want to parse contains something like this:
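One way to picture a filter is as a small record with two required fields and two optional ones. The sketch below is only an illustration in Python: the class name FeedFilter and its field names are made up here and are not part of RssDistiller.

```python
from dataclasses import dataclass


@dataclass
class FeedFilter:
    """Hypothetical container for the four magic words."""
    start_pattern: str            # required: marks where a news item begins
    end_pattern: str              # required: marks where a news item ends
    ignore_text_before: str = ""  # optional: empty means "ignore nothing"
    ignore_text_after: str = ""   # optional: empty means "ignore nothing"


# A filter can be built from the two important words alone.
minimal = FeedFilter(start_pattern="<td>", end_pattern="</td>")
print(minimal)
```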
News Number one
Text of the first news. The quick brown fox jumps over the lazy dog

News Number two
Text of the first news. The quick brown fox jumps over the lazy dog

News Number three
Text of the first news. The quick brown fox jumps over the lazy dog
The portion of the HTML source for this table (which you can find by searching the source for the beginning of a news item) will look something like this:
<table cellpadding="3" cellspacing="3">
<tr>
<td bgcolor="ivory">
<font size="5">News Number one</font></br>
Text of the first news. The quick brown fox jumps over the lazy dog
</td>
</tr>
<tr>
<td bgcolor="ivory">
<font size="5">News Number two</font></br>
Text of the first news. The quick brown fox jumps over the lazy dog
</td>
</tr>
<tr>
<td bgcolor="ivory">
<font size="5">News Number three</font></br>
Text of the first news. The quick brown fox jumps over the lazy dog
</td>
</tr>
</table>
Now, let's try to figure out our magic words.
First, let's find the start pattern and the end pattern.
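Here's my suggestion: <td bgcolor="ivory"> as the start pattern and </td> as the end pattern. They appear in the source as follows: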
<table cellpadding="3" cellspacing="3">
<tr>
<td bgcolor="ivory">
<font size="5">News Number one</font></br>
Text of the first news. The quick brown fox jumps over the lazy dog
</td>
</tr>
<tr>
<td bgcolor="ivory">
<font size="5">News Number two</font></br>
Text of the first news. The quick brown fox jumps over the lazy dog
</td>
</tr>
<tr>
<td bgcolor="ivory">
<font size="5">News Number three</font></br>
Text of the first news. The quick brown fox jumps over the lazy dog
</td>
</tr>
</table>
As you can see, each news item begins and ends with this same pattern. Also consider that while creating the RSS feed, the start and end patterns will not be included in the feed, so with this approach we are removing the table tags that might create some disruption when viewing the feed in your aggregator.
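To make the mechanics concrete, here is a minimal Python sketch of what a start/end pattern pair does. It is not RssDistiller's actual code: the function name extract_items is made up for illustration, and it assumes <td bgcolor="ivory"> as the start pattern and </td> as the end pattern.

```python
def extract_items(html, start_pattern, end_pattern):
    """Return the text found between each start/end pattern pair.

    The patterns themselves are left out of the result.
    """
    items = []
    pos = 0
    while True:
        start = html.find(start_pattern, pos)
        if start == -1:
            break                      # no more items
        start += len(start_pattern)    # skip the start pattern itself
        end = html.find(end_pattern, start)
        if end == -1:
            break                      # unterminated item: stop here
        items.append(html[start:end].strip())
        pos = end + len(end_pattern)   # continue after the end pattern
    return items


SAMPLE = """<table cellpadding="3" cellspacing="3">
<tr>
<td bgcolor="ivory">
<font size="5">News Number one</font></br>
Text of the first news. The quick brown fox jumps over the lazy dog
</td>
</tr>
<tr>
<td bgcolor="ivory">
<font size="5">News Number two</font></br>
Text of the first news. The quick brown fox jumps over the lazy dog
</td>
</tr>
<tr>
<td bgcolor="ivory">
<font size="5">News Number three</font></br>
Text of the first news. The quick brown fox jumps over the lazy dog
</td>
</tr>
</table>"""

items = extract_items(SAMPLE, '<td bgcolor="ivory">', '</td>')
for item in items:
    print(item)
    print("-----")
```

Each extracted item keeps the <font> title and the text, while the <td> tags used as patterns are gone.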
Now, for ignore text before and ignore text after, my candidates are the opening <table cellpadding="3" cellspacing="3"> tag and the closing </table> tag:
<table cellpadding="3" cellspacing="3">
<tr>
<td bgcolor="ivory">
<font size="5">News Number one</font></br>
Text of the first news. The quick brown fox jumps over the lazy dog
</td>
</tr>
<tr>
<td bgcolor="ivory">
<font size="5">News Number two</font></br>
Text of the first news. The quick brown fox jumps over the lazy dog
</td>
</tr>
<tr>
<td bgcolor="ivory">
<font size="5">News Number three</font></br>
Text of the first news. The quick brown fox jumps over the lazy dog
</td>
</tr>
</table>
This means that all the code coming before the table opening and after the table is closed will be ignored, making the filter faster and keeping out other content that you might not want in your feed.
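The trimming step can be sketched in a few lines of Python. This is only an assumption about how such a step might work, not RssDistiller's implementation: the name trim_page is hypothetical, and whether the markers themselves are kept or dropped is a guess (here the opening marker is kept and the closing one is dropped).

```python
def trim_page(html, ignore_before="", ignore_after=""):
    """Keep only the region between the two markers (either may be empty)."""
    if ignore_before:
        start = html.find(ignore_before)
        if start != -1:
            html = html[start:]   # drop everything before the marker
    if ignore_after:
        end = html.find(ignore_after)
        if end != -1:
            html = html[:end]     # drop the marker and everything after it
    return html


# A made-up page: the header and footer are not part of the news table.
PAGE = (
    '<html><body><h1>Site header</h1>\n'
    '<table cellpadding="3" cellspacing="3">\n'
    '<tr>\n<td bgcolor="ivory">\n'
    '<font size="5">News Number one</font></br>\n'
    'Text of the first news.\n'
    '</td>\n</tr>\n'
    '</table>\n'
    '<p>Site footer</p></body></html>'
)

region = trim_page(PAGE, '<table cellpadding="3" cellspacing="3">', '</table>')
print(region)
```

Since both markers are optional, calling trim_page(PAGE) with no markers returns the page unchanged, which matches the idea that the first and fourth magic words can be left out.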
As for the other tags that fall inside the region selected by ignore text before/after but outside the start/end patterns (for example the <tr> tags), they will simply be dropped by the distiller.
With this approach you will be able to extract feeds from most database-driven web sites: since their pages are generated by the same code for every item, they tend to repeat the same markup around each one, which gives you patterns to use in your filters.
Finally, something else to look for in the HTML code is comments: quite often programmers leave debugging markers in their pages, such as:
<!------News Start Here------->
These are just great to use as delimiters in your filters.
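If the page source is long, a quick way to spot such comments is to list them all and pick the useful ones by eye. The Python sketch below does that with the standard re module; the function name find_comments and the sample page (including its footer comment) are invented for the example.

```python
import re


def find_comments(html):
    """Return every HTML comment in the page, in document order."""
    # Non-greedy match from "<!--" to the nearest "-->"; DOTALL lets a
    # comment span several lines.
    return re.findall(r"<!--.*?-->", html, flags=re.DOTALL)


# Made-up page fragment containing the comment shown above.
PAGE = "<p>menu</p><!------News Start Here------->\n<p>item</p><!-- footer -->"

for comment in find_comments(PAGE):
    print(comment)
```

Any comment that shows up right before or after the news items is a candidate for one of the four magic words.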
At this point all you have to do is save the filter and subscribe to it. Your last stop is your news aggregator, to see what your new feed looks like.
You will probably have to refine it the first time: just edit the feed and make some slight changes to see how it reacts. Unsubscribing and re-subscribing to the feed should refresh the view in your aggregator.
That's all folks, enjoy RssDistiller and just write to me if you need any assistance or just to tell me which part of the web you are distilling.
paolo/