Build a PHP Link Scraper with cURL
We're going to do WHAT?
You heard me! We're going to build a robot that scrapes links from web
pages and dumps them in a database. Then it reads those links from the
database and follows them,
scraping up the links on those pages, and so on ad infinitum (or until
your server times out or your database fills up, whichever comes
first).
I actually built this a few years ago because
I had grandiose visions of becoming the next Google. Clearly, that did
not happen, mostly because my localhost, database, and
bandwidth are not infinite. Yet this little robot has quite interesting
applications and uses if you really have the time to play with and
fine-tune it. I did not really explore those
options but I encourage you to do so. To begin, let's have a look at the groundwork.
The cURL Component
cURL (or "client for
URLs") is a command-line tool for getting or sending files using URL
syntax. It was first released in the late 1990s by Daniel Stenberg as a way to
transfer files via protocols such as HTTP, FTP, Gopher, and many
others, via a command-line interface. Since then, many more
contributors have participated in further developing cURL, and the tool
is used widely today.
As an example, the following command is a basic way to retrieve a page from example.com with cURL:
curl http://www.example.com/
Using cURL with PHP
PHP is one of the languages that provide full support for cURL. (The PHP manual lists all the PHP functions you can use for cURL.)
Luckily, PHP also enables you to use cURL without invoking the command
line, making it much easier to use cURL while the server is executing.
The example below demonstrates how to retrieve the example.com home page
and save it to a text file using cURL and PHP.
<?php
$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");
// Write the response body to the open file and leave out the response headers.
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
The Link Scraper
For the link scraper, you will use cURL to get the content of the page
you are looking for, and then you will use PHP's DOM classes to grab the links
and insert them into your database. I assume you can build the database
from the information below; it is really simple stuff.
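For readers who would like a starting point for that table, here is a minimal sketch. The column names url, gathered_from, and visited come from the queries used in this article; the id column, the column types, and the use of an in-memory SQLite database through PDO are my assumptions, chosen only so the sketch runs without a database server. The article's own code talks to MySQL.

```php
<?php
// Assumption: in-memory SQLite keeps this sketch self-contained; the article uses MySQL.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Column names come from the article's queries; the types and the id column are guesses.
$db->exec(
    "CREATE TABLE links (
        id INTEGER PRIMARY KEY,
        url TEXT NOT NULL,
        gathered_from TEXT,
        visited INTEGER NOT NULL DEFAULT 0
    )"
);

// Seed the table with a starting page for the robot to visit first.
$seed = $db->prepare("INSERT INTO links (url, gathered_from) VALUES (?, ?)");
$seed->execute(array('http://www.example.com/', ''));

// The same selection the scraper performs: every URL not yet visited.
$unvisited = $db->query("SELECT url FROM links WHERE visited != 1")
                ->fetchAll(PDO::FETCH_COLUMN);
```

The robot needs at least one seed row like this, otherwise its first query returns nothing and it has nowhere to start.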
// Select every link that has not been visited yet.
$result = mysql_query("SELECT url FROM links WHERE visited != 1");
if ($result)
{
while ($row = mysql_fetch_array($result))
{
$target_url = $row['url'];
$userAgent = 'ScraperBot';
First, grab each unvisited URL from the database table inside a simple while loop, and pick the user agent string the robot will identify itself with.
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
After initializing cURL, you use curl_setopt() to set the User-Agent header of the HTTP request, and then tell cURL which page you are hoping to retrieve.
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
You've set a few more options with curl_setopt().
This time, you made sure that an HTTP error makes the request fail
instead of returning an error page, that redirects are followed (with
the Referer header filled in automatically), that the page comes back
as a string instead of being printed, and that each request gives up
after 20 seconds. PHP's default script execution limit is 30 seconds,
but if you run this from your localhost you should be able to raise or
remove that limit.
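One way to lift the script time limit on a machine you control is sketched below. It uses PHP's built-in set_time_limit() and ini_get(); whether the call is honored depends on your configuration, since some hosts disable it, so treat this as an assumption about your setup rather than a guarantee.

```php
<?php
// Lift PHP's execution time limit for this script only (0 means no limit).
set_time_limit(0);

// Read the setting back to confirm the change took effect.
$limit = ini_get('max_execution_time');
```

This only affects how long PHP lets the script run. The 20-second cURL timeout for each individual request still applies.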
$html = curl_exec($ch);
// curl_exec() returns false on failure when CURLOPT_RETURNTRANSFER is set.
if ($html === false)
{
echo "ERROR NUMBER: ".curl_errno($ch);
echo "ERROR: ".curl_error($ch);
exit;
}
Grab the actual page by executing the cURL request with curl_exec(), which applies all the options you configured. If an error occurs, its number and description are available from curl_errno() and curl_error(), respectively. Obviously, if such an error exists, you exit the script.
$dom = new DOMDocument();
@$dom->loadHTML($html);
Next, you load the HTML (that you grabbed from the remote server) into a DOMDocument object. The @ suppresses the warnings PHP would otherwise raise about malformed markup, which is common on real pages.
$xpath = new DOMXPath($dom);
$href = $xpath->evaluate("/html/body//a");
Use XPath to grab all the links on the page.
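To see the DOM and XPath steps work without touching the network, here is a small self-contained sketch. The sample HTML string is made up for illustration, and I narrowed the expression to a[@href] so that anchors without an href are skipped; that narrowing is my addition, not part of the scraper above.

```php
<?php
// Hypothetical sample page standing in for the HTML that cURL would return.
$html = '<html><body><p><a href="http://www.example.com/a">A</a></p>'
      . '<div><a href="/b">B</a><a name="anchor">no href</a></div></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides warnings about imperfect markup

$xpath = new DOMXPath($dom);
// Same idea as /html/body//a, narrowed to anchors that actually carry an href.
$anchors = $xpath->query('/html/body//a[@href]');

// Collect the href of each anchor in document order.
$links = array();
foreach ($anchors as $anchor) {
    $links[] = $anchor->getAttribute('href');
}
```

Note that the second link is relative. A robot that follows links would need to resolve it against the URL of the page it came from before requesting it.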
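// Escape values before placing them in SQL; scraped hrefs can contain quotes.
$from = mysql_real_escape_string($target_url);
for ($i = 0; $i < $href->length; $i++) {
$data = $href->item($i);
$url = mysql_real_escape_string($data->getAttribute('href'));
$insert = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$from')";
mysql_query($insert) or die('Error, insert query failed');
echo "Successful Link Harvest: ".$url."\n";
}
// Mark the source page as visited so the next run skips it.
mysql_query("UPDATE links SET visited = 1 WHERE url = '$from'");
curl_close($ch);
}
}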
Dump all the links into the database, as well as the URL they are
gathered from, just so you never go back there again. A more intelligent
system might have a separate table for URLs already visited, as well as a
normalized relationship between the two. Going a step further than just
grabbing the links enables you to harvest images or entire HTML
documents as well. This is kind of where you start when building a
search engine. Although I now know of better ways, they aren't half as
much fun.
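As a rough illustration of harvesting images, the same DOM and XPath approach can collect image sources instead of links. The sample HTML and the //img[@src] expression are my own choices for this sketch; storing the results is left out.

```php
<?php
// Hypothetical sample page; in the scraper this would be the string returned by curl_exec().
$html = '<html><body><img src="/logo.png" alt="Logo">'
      . '<p><img src="http://www.example.com/photo.jpg"></p>'
      . '<img alt="no source"></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides warnings about imperfect markup

$xpath = new DOMXPath($dom);

// Collect the src attribute of every image that has one, in document order.
$images = array();
foreach ($xpath->query('//img[@src]') as $img) {
    $images[] = $img->getAttribute('src');
}
```

From here, each source could go into its own table alongside the page it was found on, much like the links.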
Creating your own search engine may seem naively ambitious, but I hope
this little bit of code inspired you a bit. If so, I implore you to
harvest information and content from other places responsibly.