Screen scraping your way into RSS
<?php
// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));

// Get content items
preg_match_all('/<div class="contentitem">([^`]*?)<\/div>/', $data, $matches);

As I said, the next step is to retrieve the individual pieces of
information, but first let's make a start on our feed by setting
the appropriate header (text/xml) and
printing the channel information, etc.

// Begin feed
header("Content-Type: text/xml; charset=ISO-8859-1");
echo '<?xml version="1.0" encoding="ISO-8859-1" ?>' . "\n";
?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:admin="http://webns.net/mvcb/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
    <title>PHPit Latest Content</title>
    <description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
    <link>http://www.phpit.net</link>
    <language>en-us</language>

Now it's time to loop through the items and print their RSS XML. We first loop through each item and get all the information we need, using more regular expressions and preg_match(). After that, the RSS for the item is printed.
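When item text is printed into the feed, characters such as & have to be escaped, and strip_tags() on its own does not do that: it removes markup but leaves a bare & in place, which makes the XML invalid. Below is a minimal sketch of one way to guard against this. The helper name rss_text() and the sample title are assumptions made for illustration and are not part of the original script; htmlspecialchars() is the built-in PHP function.

```php
<?php
// Hypothetical helper, not part of the original script: strip markup,
// trim whitespace, then escape the characters XML treats as special.
function rss_text($value)
{
    $plain = trim(strip_tags($value));
    // ENT_QUOTES escapes both double and single quotes; the charset
    // matches the ISO-8859-1 encoding declared in the feed header.
    return htmlspecialchars($plain, ENT_QUOTES, 'ISO-8859-1');
}

// Sample title invented for illustration.
echo "<title>" . rss_text(' <a href="/x">Forms & Validation</a> ') . "</title>\n";
```

Used this way, the helper could replace the plain strip_tags() calls wherever a field is echoed between XML tags. The CDATA section does not need it, since its content is not parsed as markup.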
<?php
// Loop through each content item
foreach ($matches[0] as $match) {
    // First, get title
    preg_match('/">([^`]*?)<\/a><\/h3>/', $match, $temp);
    $title = trim(strip_tags($temp[1]));

    // Second, get url
    preg_match('/<a href="([^`]*?)">/', $match, $temp);
    $url = trim($temp[1]);

    // Third, get text
    preg_match('/<p>([^`]*?)<span class="byline">/', $match, $temp);
    $text = trim($temp[1]);

    // Fourth, and finally, get author
    preg_match('/<span class="byline">By ([^`]*?)<\/span>/', $match, $temp);
    $author = trim($temp[1]);

    // Echo RSS XML
    echo "<item>\n";
    echo "  <title>" . strip_tags($title) . "</title>\n";
    echo "  <link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
    echo "  <description>" . strip_tags($text) . "</description>\n";
    echo "  <content:encoded><![CDATA[\n";
    echo $text . "\n";
    echo "  ]]></content:encoded>\n";
    echo "  <dc:creator>" . strip_tags($author) . "</dc:creator>\n";
    echo "</item>\n";
}
?>

And finally, the RSS file is closed off.
</channel>
</rss>

That's all. If you put all the code together, like in the demo script, then you'll have a working RSS feed.

Conclusion

In this tutorial I have shown you how to create an RSS feed from a website that does not have an RSS feed of its own yet. Though the regular expression is different for each website, the principle is exactly the same.

One thing I should mention is that you shouldn't immediately screen scrape a website's content. E-mail them first about an RSS feed. Who knows, they might set one up themselves, and that would be even better.
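To illustrate the point that only the regular expressions change from site to site, here is a rough sketch that keeps the two-step principle (split the page into items, then pull fields out of each item) and takes the patterns as parameters. The function name scrape_items(), the sample markup, and the patterns are all assumptions invented for this example; a real site would need its own patterns.

```php
<?php
// Hypothetical helper: split a page into items with one pattern, then
// pull named fields out of each item with further patterns.
function scrape_items($html, $itemPattern, array $fieldPatterns)
{
    $items = array();
    preg_match_all($itemPattern, $html, $blocks);

    // $blocks[1] holds the first capture group of every item match.
    foreach ($blocks[1] as $block) {
        $item = array();
        foreach ($fieldPatterns as $name => $pattern) {
            // preg_match() returns 1 on a match; fall back to an empty
            // string so a layout change does not raise a notice.
            $item[$name] = preg_match($pattern, $block, $m) ? trim($m[1]) : '';
        }
        $items[] = $item;
    }
    return $items;
}

// Sample markup for an imaginary site, invented for illustration.
$html = '<ul>'
      . '<li class="news"><a href="/n/1">First post</a> <em>Jane</em></li>'
      . '<li class="news"><a href="/n/2">Second post</a> <em>Sam</em></li>'
      . '</ul>';

$items = scrape_items(
    $html,
    '/<li class="news">(.*?)<\/li>/s',
    array(
        'title'  => '/<a href="[^"]*">(.*?)<\/a>/s',
        'url'    => '/<a href="([^"]*)">/',
        'author' => '/<em>(.*?)<\/em>/s',
    )
);

print_r($items);
```

Each entry in $items is an array with title, url and author keys, which can then be printed as RSS in the same way as in the loop shown earlier.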