Creating a good bot

This is the PHP code I use to retrieve the contents of robots.txt and check URLs against the rules it specifies.

After getting spidered to death by lots of bots (both good and bad), I felt the need to make sure that my own bot was a good one, so that other webmasters wouldn't have the same problem with me. There are lots of resources for creating a robots.txt file, but not so many for creating a good bot.

Let webmasters know who you are

Use a custom user agent string by adding the following line at the start of your PHP script:

  ini_set('user_agent','YourUserAgentString');

A good format for your user agent string is:

  YourBotName/Version (http://www.yourwebsite.com/whatyourbotdoes.php)

The string is broken down into 3 parts:

YourBotName
Make this something simple, unique and specific to your website. Avoid any special characters. For example the bot used here on the Edge is JugglingEdgeBot.

Version
Keep track of a version number. If you change what your bot does, update the version number. This lets webmasters know to check your page for updates.

http://www.yourwebsite.com/whatyourbotdoes.php
Create and maintain a page that explains what your bot does. For an example see my page for JugglingEdgeBot. Include version history and instructions on how to control your bot.
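As a minimal sketch of the three parts above put together: BuildUserAgent is a hypothetical helper name, and the bot name, version number and URL are placeholders to swap for your own.

```php
<?php
// Hypothetical helper: builds "YourBotName/Version (InfoURL)"
function BuildUserAgent(string $BotName, string $Version, string $InfoURL): string
{
    return sprintf('%s/%s (%s)', $BotName, $Version, $InfoURL);
}

// Placeholder values: replace with your own bot name, version and info page
$UserAgent = BuildUserAgent('MySiteBot', '1.2', 'http://www.yourwebsite.com/whatyourbotdoes.php');

// Sent with file_get_contents() and other requests made through PHP's HTTP stream wrapper
ini_set('user_agent', $UserAgent);
```

Note that when checking robots.txt rules you would normally match on the bot name alone (MySiteBot here), not the full string, since that is what webmasters put on their User-agent lines.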

Honour robots.txt

Good bots follow the instructions specified in a robots.txt file.

First check whether a robots.txt file exists and, if it does, get its contents. If you are checking multiple links it is better to fetch robots.txt separately from checking a URL against the rules in the file. You only need to grab robots.txt once per domain, not once per link.


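// Fetch robots.txt for the host of $URL, returned as an array of lines
  function GetRobotsTxt($URL)
  {
    $ParsedURL=parse_url($URL);
    $RobotsURL='http://'.$ParsedURL['host'].'/robots.txt';

// @ suppresses the warning raised when robots.txt is missing (404)
    $Contents=@file_get_contents($RobotsURL);
    if($Contents===false) // no robots.txt, or it could not be fetched
      return [];

    return explode("\n",$Contents);
  }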

Use the following function to parse the contents of the robots.txt file. It returns true if your bot may fetch the URL and false if it may not.


// Original PHP code by Chirp Internet: www.chirp.com.au
// Adapted to include 404 and Allow directive checking by Eric at LinkUp.com
// Simpler handling of 404, separation of loading robots.txt file & other 
// minor tweaks by Orinoco http://jugglingedge.com
// Please acknowledge use of this code by including this header.

  function AllowRobotIn($URL,$RobotsTxt,$UserAgent)
  {
    if(empty($RobotsTxt)) // Site has no robots.txt file, ok to continue
      return true;

// Escape user agent name for use in regexp just in case
    $UserAgent=preg_quote($UserAgent,'/');

// Get list of rules that apply to us
    $Rules=[];
    $Applies=false;
    foreach($RobotsTxt as $Line)
      {
// skip blanks & comments
        if(trim($Line)=='' || ltrim($Line)[0]=='#')
          continue;

        if(preg_match('/^\s*User-agent:\s*(.*)/i',$Line,$Match))
          {
// Found start of a User-agent block, check if
// it applies to all bots, or our specific bot
            $Applies=preg_match("/(\*|$UserAgent)/i",$Match[1]);
            continue;
          }

        if($Applies)
          {
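// Add Allow/Disallow rules to our $Rules array, skipping malformed lines
// and other directives such as Sitemap or Crawl-delay
            $Parts=explode(':',$Line,2);
            $Type=strtolower(trim($Parts[0])); // Allow or Disallow
            if(count($Parts)==2 && ($Type=='allow' || $Type=='disallow'))
              $Rules[]=['Type'=>$Type,'Match'=>preg_quote(trim($Parts[1]),'/')];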
          }
      }


// Check URL against our list of rules

    $ParsedURL=parse_url($URL);
    if(empty($ParsedURL['path'])) // e.g. http://jugglingedge.com has no path
      $ParsedURL['path']='/';

    $Allowed=true;
    $MaxLength=0;
    foreach($Rules as $Rule)
      {
        if(preg_match('/^'.$Rule['Match'].'/',$ParsedURL['path']))
          {
// Specified rule applies to the URL we are checking
// Longer rules > Shorter rules
// Allow > Disallow if rules same length

            $ThisLength=strlen($Rule['Match']);
            if($MaxLength<$ThisLength)
              {
                $Allowed=($Rule['Type']=='allow');
                $MaxLength=$ThisLength;
              }
            elseif($MaxLength==$ThisLength && $Rule['Type']=='allow')
              {
                $Allowed=true;
              }
          }
      }

    return $Allowed;
  }
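To make the precedence in the second half of the function concrete (longer rules beat shorter ones, and Allow beats Disallow when the lengths are equal), here is a simplified standalone sketch. IsPathAllowed and the sample rules are hypothetical, and it uses plain prefix matching on the path instead of the regular expressions above.

```php
<?php
// Hypothetical, simplified version of the rule check in AllowRobotIn.
// $Rules is a list of ['Type'=>'allow'|'disallow', 'Path'=>path prefix].
function IsPathAllowed(array $Rules, string $Path): bool
{
    $Allowed = true;  // no matching rule means the path is allowed
    $MaxLength = 0;
    foreach ($Rules as $Rule) {
        // Skip empty rules and rules that are not a prefix of the path
        if ($Rule['Path'] === '' || strpos($Path, $Rule['Path']) !== 0) {
            continue;
        }
        $ThisLength = strlen($Rule['Path']);
        if ($ThisLength > $MaxLength) {
            // Longer (more specific) rule overrides the previous decision
            $Allowed = ($Rule['Type'] === 'allow');
            $MaxLength = $ThisLength;
        } elseif ($ThisLength === $MaxLength && $Rule['Type'] === 'allow') {
            // Same length: Allow wins over Disallow
            $Allowed = true;
        }
    }
    return $Allowed;
}

// Sample rules, as if read from a robots.txt block that applies to our bot
$Rules = [
    ['Type' => 'disallow', 'Path' => '/forum/'],
    ['Type' => 'allow',    'Path' => '/forum/public/'],
];
```

With these sample rules, /forum/private/1.php is blocked by the Disallow rule, /forum/public/1.php is let through by the longer Allow rule, and /clubs.php matches nothing so it is allowed.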

Then to check a single link:


  $URL='http://jugglingedge.com';
  $UserAgent='MySiteBot';
  $RobotsTxt=GetRobotsTxt($URL);

  if(AllowRobotIn($URL,$RobotsTxt,$UserAgent))
    {
// do your stuff
    }

Or to check multiple links:


  $Links=[];
  $Links[]='http://jugglingedge.com';
  $Links[]='http://jugglingedge.com/forum.php';
  $Links[]='http://jugglingedge.com/clubs.php';
  $Links[]='http://jugglingedge.com/events.php';
  $Links[]='http://jugglingedge.com/records.php';

// Sort to ensure all urls from the same domain are checked together
// to reduce repeat calls for robots.txt
  sort($Links);

  $UserAgent='MySiteBot';
  $PrevLink='';       // last link checked
  $PrevRobotsURL='';  // robots.txt URL of the last host checked

  foreach($Links as $ThisLink)
    {
      if($ThisLink!=$PrevLink) // don't check the same page twice
        {
          $PrevLink=$ThisLink;

// Check robots.txt
          $ParsedURL=parse_url($ThisLink);
          $RobotsURL='http://'.$ParsedURL['host'].'/robots.txt';

// only check new robots.txt if host is different from the last one
          if($RobotsURL!=$PrevRobotsURL)
            {
              $RobotsTxt=GetRobotsTxt($ThisLink);
              $PrevRobotsURL=$RobotsURL;
            }


          if(AllowRobotIn($ThisLink,$RobotsTxt,$UserAgent))
            {
// do your stuff
            }
        }
    }
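If sorting the links is not convenient, another way to fetch robots.txt only once per domain is to keep the results in an array keyed by host. This is only a sketch: GetCachedRobotsTxt is a hypothetical helper, and the fetch function is passed in as a parameter (in real use it would be something like GetRobotsTxt) so the example can run here with a fake one and no network access.

```php
<?php
// Hypothetical helper: returns the robots.txt lines for the host of $URL,
// calling $Fetch at most once per host and remembering the result in $Cache.
function GetCachedRobotsTxt(string $URL, callable $Fetch, array &$Cache): array
{
    $Host = strtolower((string) parse_url($URL, PHP_URL_HOST));
    if (!array_key_exists($Host, $Cache)) {
        $Cache[$Host] = $Fetch($URL);
    }
    return $Cache[$Host];
}

// Fake fetcher for demonstration: counts how often it is called
$Calls = 0;
$FakeFetch = function (string $URL) use (&$Calls): array {
    $Calls++;
    return ['User-agent: *', 'Disallow: /private/'];
};

$Cache = [];
GetCachedRobotsTxt('http://example.com/a.php', $FakeFetch, $Cache);
GetCachedRobotsTxt('http://example.com/b.php', $FakeFetch, $Cache);
GetCachedRobotsTxt('http://example.org/', $FakeFetch, $Cache);
// $Calls is now 2: one fetch for example.com, one for example.org
```

The trade-off is memory: the cache holds one robots.txt per host for the lifetime of the script, whereas the sorted loop above only ever holds the current one.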
