fopen()
PHP allows URLs in place of file names in the fopen() function. When you specify a URL using a protocol such as http, PHP looks for a wrapper implementing that protocol and, if one is found, uses it to retrieve the content of the URL and make it available through the returned file pointer. You can operate on this file pointer with fread() and fclose(), just as you would with a regular file.
Consider the following PHP code:
<?php
$url = "http://www.cs.ucsd.edu/~elkan/134A/";
if ($fp = fopen($url, "r")) {
    // Read the page in chunks of up to 8192 bytes and print each one.
    while ($buf = fread($fp, 8192)) {
        print $buf;
    }
    fclose($fp);
} else {
    print("Cannot retrieve $url\n");
}
?>
This is exactly the code in http://www.cs.ucsd.edu/~ddahlstr/cse134a/index2.php. If you load it, you'll see it retrieves the course Web page under Professor Elkan's directory and prints it as output.
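Whether fopen() accepts a URL depends on how PHP is configured on your server. The sketch below is one way to check, not the only one: stream_get_wrappers() lists the registered wrappers, and the allow_url_fopen setting controls whether URL wrappers may be used with fopen(). The helper names wrapper_registered() and url_fopen_enabled() are made up for this illustration.

```php
<?php
// Hypothetical helper: is a wrapper for this scheme registered at all?
function wrapper_registered($scheme) {
    return in_array($scheme, stream_get_wrappers(), true);
}

// Hypothetical helper: is the allow_url_fopen setting switched on?
function url_fopen_enabled() {
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

if (wrapper_registered('http') && url_fopen_enabled()) {
    print "fopen() should be able to open http URLs here\n";
} else {
    print "fopen() cannot open http URLs with this configuration\n";
}
```

If the second message appears, the retrieval code above will take its "Cannot retrieve" branch no matter which URL you give it.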
Again displaying my penchant for regular expressions, I recommend you use a regexp function such as preg_match() to extract the content you're interested in from Web pages.
Consider now the following code:
<?php
$url = "http://www.cs.ucsd.edu/~elkan/134A/";
if ($fp = fopen($url, "r")) {
    // Accumulate the whole page in $buf before matching against it.
    $buf = "";
    while (!feof($fp)) {
        $buf .= fread($fp, 8192);
    }
    fclose($fp);
    if (preg_match('/<TITLE>([^<]*)<\/TITLE>/i', $buf, $matches)) {
        print "Title: $matches[1] <BR>\n";
    }
    if (preg_match('/<meta name="Author" content="(.*)">/i', $buf, $matches)) {
        print "Author: $matches[1] <BR>\n";
    }
    if (preg_match('/Most recently updated on (.*) by/i', $buf, $matches)) {
        print "Date: $matches[1] <BR>\n";
    }
} else {
    print("Cannot retrieve $url\n");
}
?>
As you can see, the code retrieves the class Web page and attempts to match three patterns in it to find the title, author, and date of the document. This code is in http://www.cs.ucsd.edu/~ddahlstr/cse134a/summarize.php. When I ran it, the output looked like this:
Title: CSE 134A Fall 2002<BR>
Author: Charles Elkan<BR>
Date: October 24, 2002<BR>
I determined how to identify these pieces of information by looking at the structure of the document, just as you will have to do with the news sites you'll be retrieving.
There are a few things to point out in case you haven't used Perl-style regular expressions before. The first argument to preg_match() is a pattern; the second is the string (in this case $buf) in which it will look for matches; the third is an array (I called it $matches).
When the pattern matches, the text matched by the entire pattern goes into $matches[0], and the text matched by each parenthesized sub-pattern goes into the rest of the array in order. In our case each pattern had one parenthesized sub-pattern, so the part matching it went into $matches[1], which is what we printed out.
Notice also the trailing i at the end of the pattern; this flag makes the pattern case-insensitive. That is especially helpful here, since HTML tag and attribute names are themselves case-insensitive.
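To see how $matches is filled in without fetching anything over the network, here is a small sketch that runs the title pattern against a literal string. The sample markup and the function name extract_title() are made up for this illustration.

```php
<?php
// Returns the text of the first <title> element, or null if there is none.
function extract_title($html) {
    if (preg_match('/<TITLE>([^<]*)<\/TITLE>/i', $html, $matches)) {
        // $matches[0] holds the full matched text, tags included;
        // $matches[1] holds only the parenthesized sub-pattern.
        return $matches[1];
    }
    return null;
}

// Hypothetical sample markup standing in for a retrieved page.
$html = "<html><head><title>Sample Page</title></head></html>";

// The i flag lets the upper-case pattern match the lower-case tags.
print extract_title($html) . "\n";   // prints "Sample Page"
```

Returning null when nothing matches lets the caller tell a missing title apart from an empty one.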
Unix's cron is a daemon (a persistent process that provides a service) that executes commands scheduled by users. A scheduled command can be once-only or recurrent. Each user can (but is not required to) have a specially formatted file called a crontab that specifies which commands to run when.
To edit your crontab (and create it if necessary), run the command crontab -e. For details on the crontab format I recommend you read the manual by running man crontab, but in brief each line has six whitespace-delimited fields: minute, hour, day of the month, month of the year, day of the week, and command. For example, the entry
30 8,20 * * * rm -rf $HOME/.netscape/cache
would remove the Netscape browser's cache in your home directory twice every day of every week and every month, at 08:30 and 20:30. Also,
0 0 31 10 * cast spell
would run the command cast spell on Halloween at midnight.
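If the six-field layout is hard to picture, the following sketch splits an entry into named fields. It is a simplification that assumes a plain six-field line (no comments, blank lines, or environment settings), and the function name parse_crontab_line() and the field names are made up for this illustration.

```php
<?php
// Splits one crontab entry into its six fields, or returns null if the
// line has fewer than six. The limit of 6 passed to preg_split() keeps
// the spaces inside the command from being split any further.
function parse_crontab_line($line) {
    $parts = preg_split('/\s+/', trim($line), 6);
    if (count($parts) < 6) {
        return null;
    }
    return array_combine(
        array('minute', 'hour', 'day_of_month', 'month', 'day_of_week', 'command'),
        $parts
    );
}

// Single quotes keep PHP from treating $HOME as one of its own variables.
$entry = parse_crontab_line('30 8,20 * * * rm -rf $HOME/.netscape/cache');
print $entry['hour'] . "\n";      // prints "8,20"
print $entry['command'] . "\n";   // prints "rm -rf $HOME/.netscape/cache"
```

Everything after the fifth field belongs to the command, which is why cast spell in the second example is a single field even though it contains a space.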
If you need more help understanding crontab, I found this tutorial with a quick search on Google.