Protecting Content And Handling Robots
Site Simulator
Our simulator will have a configuration file, a header, a navigation bar, and a content page. There will also be a validation page, used as a trap for robots we do not want. We will create the five files in one directory, which will look like this:
sim/config.php
sim/index.php
sim/header.php
sim/navigation.php
sim/validate.php
On a real site, config.php and header.php would be placed outside the web root, and the site structure would look something like this:
/config.php
/header.php
/navigation.php
/validate.php
/public_html/index.php
where public_html is the web root in RedHat Linux.
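As a rough sketch of how a page inside public_html could locate files kept one level above the web root, assuming the layout shown above and PHP 7 or later. The helper name private_file() is hypothetical and not part of the original code:

```php
<?php
// Hypothetical helper: build the path to a file stored one level above
// the web root (for example /config.php next to /public_html).
function private_file(string $webRoot, string $name): string
{
    // basename() drops any directory part, so a name such as
    // "../../etc/passwd" cannot point outside the parent directory.
    return rtrim(dirname($webRoot), "/") . "/" . basename($name);
}

// Possible usage in public_html/index.php (an assumption, not from the article):
// include private_file(__DIR__, "config.php");
```

Keeping the path logic in one place means the pages do not have to hard-code "../" in every include.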
To stay focused on the subject, we will use a simplified configuration and leave config.php and validate.php empty for now. Our HTML output starts in header.php, which is included in every page.
header.php
<html>
<head><title>Site simulator</title>
<style type="text/css">
<!--
TD {
FONT-SIZE: 10pt; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif
}
--></style>
</head>
<body leftmargin="50" topmargin="50" marginwidth="50" marginheight="50">
The actual simulation job is done in navigation.php, which contains a short script that generates the links to the next and previous pages and builds the navigation bar.
navigation.php
<table border="0" cellspacing="1" cellpadding="3" width="100%">
<tr>
<td width="50%"><a href="index.php">Home</a> </td>
<td width="50%">
<?php
//$page comes from the query string (register_globals is assumed to be on)
$i=$page+1;
if($page>1){
    echo "<a href=\"index.php?page=".($page-1)."\">&lt;&lt;back</a> |";
}
echo "<b><font color=\"#FF0000\">$page</font></b>| ";
echo "<a href=\"index.php?page=$i\">next&gt;&gt;</a>";
?>
</td>
</tr>
</table>
<hr noshade size="1">
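The script above reads $page straight from the request. As a hedged alternative for installations where register_globals is off, the following sketch reads the page number from an array such as $_GET and builds the same links. The function names current_page() and nav_links() are hypothetical, and PHP 7 or later is assumed:

```php
<?php
// Hypothetical helper: read the page number from a request array such as
// $_GET and reduce it to a non-negative integer.
function current_page(array $query): int
{
    $page = isset($query["page"]) ? (int) $query["page"] : 0;
    return max(0, $page);
}

// Hypothetical helper: build the back / current / next links that
// navigation.php prints.
function nav_links(int $page): string
{
    $html = "";
    if ($page > 1) {
        $html .= "<a href=\"index.php?page=" . ($page - 1) . "\">&lt;&lt;back</a> | ";
    }
    $html .= "<b>" . $page . "</b> | ";
    $html .= "<a href=\"index.php?page=" . ($page + 1) . "\">next&gt;&gt;</a>";
    return $html;
}

// Possible usage inside navigation.php (an assumption):
// echo nav_links(current_page($_GET));
```

Casting to an integer also keeps arbitrary text in the page parameter from being echoed back into the HTML.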
For now, index.php only includes config.php, header.php, and navigation.php, followed by the closing HTML tags.
index.php
<?php
/****** index.php ********/
include("config.php");
include("header.php");
include("navigation.php");
?>
</body>
</html>
At this point we have an empty config.php and validate.php, plus header.php, navigation.php, and index.php with the code above, all five in the sim/ directory. Open /sim/index.php and make sure that the navigation works. You will see two links, Home and next. You can click next endlessly, and the page number increments each time.
We want to restrict how many pages a visitor can browse, so we have to count hits. For this and for later work we will need session variables and some environment variables. Open config.php and add this code, which is explained in the comments:
<?php
//start session
session_start();

//register session variable. We will need hits counter
session_register("COUNTER");
//For this script I assume that register_globals is set to on
//as alternative, you may use super global $_SESSION
//then you MUST NOT use session_register()

//our counter will count different values so we should set array
//we will specify array variables when we need them
//so at this point we will just tell that counter is array
if(!isset($COUNTER)){
    $COUNTER=array();
}
//if you use $_SESSION you can skip the above code
//and use $_SESSION instead of $COUNTER in the rest of the code

//after entering the code on validation page count is unset
//isset() keeps an empty submission from matching a code that was never set
if($submit){
    if(isset($COUNTER["code"]) && $validcode==$COUNTER["code"]){
        unset($COUNTER["count"]);
    }
}

//limit specifies how many pages can be viewed before validation is called
//in our simulator, we will allow to view 20 pages
//then the visitor will have to enter validation code (or login/register)
$limit=20;

//here we will set array keys and assign values
//$count will be used to count hits until limit is reached
//$sessioncount will count total hits during session
$count=$COUNTER["count"]+1;
$sessioncount=$COUNTER["sessioncount"]+1;
$COUNTER["count"]=$count;
$COUNTER["sessioncount"]=$sessioncount;

//if count exceeds limit, we set the URL where to return after validation
//and validation code (unless you choose redirect to login/registration page)
//we will generate validation code from seconds and minutes
//using date() function
//you can use your own way to generate the validation code
//we use $validate variable to make something (not) happen
//depending on whether we are on validation page or elsewhere
if($count>$limit){
    $COUNTER["code"]=date("si");
    //if we are not on validation page redirect to validation page
    if(!$validate){
        $COUNTER["back"]=$_SERVER["REQUEST_URI"];
        header("Location: validate.php");
        //stop here so the rest of the page is not sent with the redirect
        exit;
    }
}
?>
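The comments above mention $_SESSION as an alternative. Since session_register() and register_globals were removed in PHP 5.4, here is a minimal sketch of the same counting logic as a plain function that works on an array, assuming PHP 7 or later. The name count_hit() is hypothetical and not part of the original code:

```php
<?php
// Hypothetical helper: update a counter array the way config.php does and
// report whether the visitor should be sent to the validation page.
function count_hit(array $counter, int $limit): array
{
    // "count" is reset after validation, "sessioncount" never is.
    $counter["count"] = ($counter["count"] ?? 0) + 1;
    $counter["sessioncount"] = ($counter["sessioncount"] ?? 0) + 1;
    $needsValidation = $counter["count"] > $limit;
    return [$counter, $needsValidation];
}

// Possible usage in config.php (an assumption):
// session_start();
// [$c, $over] = count_hit($_SESSION["COUNTER"] ?? [], 20);
// $_SESSION["COUNTER"] = $c;
// if ($over && empty($validate)) { header("Location: validate.php"); exit; }
```

Keeping the arithmetic separate from session_start() and header() makes it possible to check the limit behaviour without a browser.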
Now let’s check how it works. Click the next link 20 times and you should be redirected to our still-empty validate.php.
Open validate.php and paste the following code:
<?php
$validate=1;
include("config.php");
include("header.php");
?>
<table border="0" cellspacing="1" cellpadding="3" width="100%">
<tr>
<td align="center">
<h3>Enter the code to continue browsing</h3>
</td>
</tr>
<tr>
<td align="center">
<form name="form1" method="post" action="<?php echo htmlspecialchars($COUNTER["back"]); ?>">
<?php echo "Code: <b><font color=#FF0000>".$COUNTER["code"]."</font></b>"; ?>
<input type="text" name="validcode">
<input type="submit" name="submit" value="OK">
</form>
</td>
</tr>
</table>
<hr noshade size="1">
</body>
</html>
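The comments in config.php note that you can generate the validation code in your own way. One possible approach, sketched here under the assumption of PHP 7 or later, uses a random four-digit code and a strict comparison. The function names make_code() and code_matches() are hypothetical:

```php
<?php
// Hypothetical helper: a random 4-digit code, as an alternative to date("si").
function make_code(): string
{
    return str_pad((string) random_int(0, 9999), 4, "0", STR_PAD_LEFT);
}

// Hypothetical helper: compare the submitted code with the stored one.
function code_matches(string $stored, string $submitted): bool
{
    // An empty stored code never matches, so a blank form cannot pass
    // before any code has been generated.
    return $stored !== "" && hash_equals($stored, trim($submitted));
}
```

Unlike the loose == comparison, hash_equals() compares the strings character by character, so "512" is not accepted for "0512".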
Reload the page or close all browser windows and try again. After clicking next or back more than 20 times, the visitor is redirected and requested to enter the validation code.
It is time to fill our index.php with “content”. A real site’s content would be retrieved from a database, but for our purpose we will use short instructions on how to use the simulator, plus some environment and session variables for debugging.
Now replace your index.php code with this:
<?php
include("config.php");
include("header.php");
include("navigation.php");
?>
<table border="0" cellspacing="1" cellpadding="3">
<tr>
<td><h2>
<?php if($page>0){ ?>
Page <b><font color="#FF0000"><?php echo $page; ?></font></b>
<?php }elseif($page==0){ ?>
<b><font color="#FF0000">Homepage </font></b> of the multi-million-page site simulator
<?php } ?>
</h2></td>
</tr>
<tr>
<td>
<p>Click the next or back link to browse. When the count limit of <b><?php echo $limit; ?></b> pages is reached, you will be redirected to the validation page and asked to enter the validation code. On a real site you can direct a user to a login/registration page after the limit is reached. To set the number of pages a user can browse within a session, change the <b>$limit</b> value in the <b>config.php</b> file. </p>
<p>Some environment variable values, the count limit and the <b>$COUNTER</b> values are shown in the table below. We also use the <b>print_r()</b> function at the bottom of the page to see the counter array values.</p>
</td>
</tr>
</table>
<hr noshade size="1">
<table border="0" cellspacing="1" cellpadding="3" align="center" bgcolor="#FFFFFF">
<tr bgcolor="#EEEEEE">
<td>Your session id is:</td>
<td><?php echo session_id(); ?></td>
</tr>
<tr bgcolor="#EEEEEE">
<td>Your user agent is:</td>
<td><?php echo htmlspecialchars($_SERVER["HTTP_USER_AGENT"]); ?></td>
</tr>
<tr bgcolor="#EEEEEE">
<td>Your ip address is:</td>
<td><?php echo $_SERVER["REMOTE_ADDR"]; ?></td>
</tr>
<tr bgcolor="#EEEEEE">
<td>This page URL is:</td>
<td><?php echo htmlspecialchars($_SERVER["HTTP_HOST"].$_SERVER["REQUEST_URI"]); ?></td>
</tr>
<tr bgcolor="#EEEEEE">
<td>Count limit: </td>
<td><b><?php echo $limit; ?></b></td>
</tr>
<tr bgcolor="#EEEEEE">
<td>Count value:</td>
<td><font color="#FF0000"><b><?php echo $COUNTER["count"]; ?></b></font> </td>
</tr>
<tr bgcolor="#EEEEEE">
<td>Session count:</td>
<td><?php echo $sessioncount; ?></td>
</tr>
</table>
<hr noshade size="1">
<?php print_r($COUNTER); ?>
</body>
</html>
See the working sample on http://www.alxg.net/sim/.
Download all the above code as a zip package from http://www.alxg.net/sim/sim.zip.
