PHP: Attempting to Block Site Leechers/Crawlers
Someone put forward a question on the forum regarding how to block leechers and I thought that my answer might make a good post as well.
There is no foolproof way of going about it, because a script can pretend to be a valid browser perfectly well (cURL helps you do exactly that). You will have to use several different methods together to fight them and reduce the illegitimate crawlers significantly. Other than the basic idea of banning the IP, which you get from REMOTE_ADDR and HTTP_X_FORWARDED_FOR, here are some suggestions.
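For the basic IP ban, here is a rough sketch of how the two server variables could be read. The function names candidate_ips() and is_banned() are made up for this post, and the code assumes PHP 7.1 or later. Keep in mind that HTTP_X_FORWARDED_FOR is sent by the client and may hold a comma-separated chain, so it can be faked; REMOTE_ADDR is the address of whoever connected to your server directly.

```php
<?php
// Hypothetical helper: collect every address the request claims to come from.
function candidate_ips(array $server): array
{
    $ips = [];
    // Client-supplied header, possibly "client, proxy1, proxy2".
    if (!empty($server['HTTP_X_FORWARDED_FOR'])) {
        foreach (explode(',', $server['HTTP_X_FORWARDED_FOR']) as $part) {
            $part = trim($part);
            // Skip anything that is not a syntactically valid IP.
            if (filter_var($part, FILTER_VALIDATE_IP) !== false) {
                $ips[] = $part;
            }
        }
    }
    // Address of the directly connected peer.
    if (!empty($server['REMOTE_ADDR'])) {
        $ips[] = $server['REMOTE_ADDR'];
    }
    return array_values(array_unique($ips));
}

// Hypothetical helper: true when any candidate address is on the ban list.
function is_banned(array $server, array $banned): bool
{
    return count(array_intersect(candidate_ips($server), $banned)) > 0;
}
```

In a real page you would pass $_SERVER as the first argument and load the ban list from wherever you keep it.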
Method 1:
If there are too many requests from the same IP, try to find the ISP or organization that owns it; the GeoIP Organization and ISP packages give you a database for looking up an IP and seeing its owner. If the IP belongs to a valid search engine (SE), and you want it to crawl your site, let it go ahead. You can also put some effort into building a list of valid SE IPs that overrides all crawler detection.
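The override list of valid SE IPs could be kept as CIDR ranges and checked before any other detection runs. This is only a sketch: ipv4_in_cidr() and is_allowlisted_crawler() are names I made up, it handles IPv4 only, it assumes a 64-bit PHP 7.1+ build, and the ranges in the usage note are documentation placeholders, not real search engine addresses.

```php
<?php
// True when an IPv4 address falls inside a CIDR block such as "192.0.2.0/24".
function ipv4_in_cidr(string $ip, string $cidr): bool
{
    // A bare address without "/bits" is treated as a /32.
    [$subnet, $bits] = explode('/', $cidr) + [1 => '32'];
    $ipLong = ip2long($ip);
    $subnetLong = ip2long($subnet);
    $bits = (int) $bits;
    if ($ipLong === false || $subnetLong === false || $bits < 0 || $bits > 32) {
        return false;
    }
    // Build the network mask; a /0 mask matches everything.
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;
    return ($ipLong & $mask) === ($subnetLong & $mask);
}

// True when the address is inside any range on the allow list.
function is_allowlisted_crawler(string $ip, array $allowedCidrs): bool
{
    foreach ($allowedCidrs as $cidr) {
        if (ipv4_in_cidr($ip, $cidr)) {
            return true;
        }
    }
    return false;
}
```

For example, is_allowlisted_crawler($ip, ['192.0.2.0/24', '198.51.100.0/24']) would be called first, and the other checks skipped when it returns true.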
Method 2:
On the first request to the site, send a cookie, redirect, and check whether the cookie comes back. If not, then redirect to a page asking the user to enable cookies. Scripts generally lack that ability.
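The cookie test boils down to a small decision on each request. Below is a sketch of that decision only; the function name cookie_check_step(), the cookie name ck_probe and the query marker ck_tried are all assumptions of mine.

```php
<?php
// Decide what to do with a request under the cookie test.
// Returns 'ok', 'set_and_redirect' or 'ask_enable_cookies'.
function cookie_check_step(array $cookies, array $query): string
{
    // The probe cookie came back, so the client handles cookies.
    if (isset($cookies['ck_probe'])) {
        return 'ok';
    }
    // The marker shows we already tried to set the cookie once.
    if (isset($query['ck_tried'])) {
        return 'ask_enable_cookies';
    }
    // First visit: set the cookie and redirect with the marker.
    return 'set_and_redirect';
}
```

In a page you would call it with $_COOKIE and $_GET. For 'set_and_redirect' you would call setcookie('ck_probe', '1') and then send a Location header back to the same URL with ck_tried=1 appended; for 'ask_enable_cookies' you would redirect to your "please enable cookies" page.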
Method 3:
Use JavaScript to set a cookie and then try to read it; if there is no cookie, ask the user to enable JavaScript. Generic scripts won't be able to process JavaScript unless someone is writing code specifically for your site.
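One way the JavaScript variant might look: PHP prints a tiny script that sets the cookie, and later requests are checked for it. The names js_probe_page() and passed_js_probe() are invented for this sketch, and the token is assumed to be a short cookie-safe string (hex, for example) that you generate per session.

```php
<?php
// Build the snippet that sets the probe cookie from JavaScript.
function js_probe_page(string $cookieName, string $token): string
{
    // json_encode() yields safely quoted JavaScript string literals.
    $name = json_encode($cookieName);
    $value = json_encode($token);
    // The reload happens only if the cookie stuck, to avoid an endless loop.
    $js = "var n = {$name}, v = {$value};"
        . "document.cookie = n + '=' + v + '; path=/';"
        . "if (document.cookie.indexOf(n + '=') !== -1) { location.reload(); }";
    return "<script>{$js}</script>"
         . "<noscript>Please enable JavaScript to view this site.</noscript>";
}

// True when the request carries the cookie with the expected token.
function passed_js_probe(array $cookies, string $cookieName, string $token): bool
{
    return isset($cookies[$cookieName])
        && hash_equals($token, (string) $cookies[$cookieName]);
}
```

A page would echo js_probe_page() whenever passed_js_probe($_COOKIE, ...) is false, and serve the real content otherwise.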
Method 4:
If there are too many requests, show a CAPTCHA image that is not too straightforward to solve. If there is no valid input, at least block that IP from going any further. Even if a lot of users share that IP, you can show each of them a CAPTCHA and validate their session_id to browse your site even though the IP is the same. It is a little nuisance, but worth it if you have a severe problem.
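"Too many requests" needs some kind of counter. Here is a sketch using a sliding time window per IP; record_hit() and needs_captcha() are hypothetical names, and where you store the timestamps (database, cache, file) is up to you. The CAPTCHA itself is not shown.

```php
<?php
// Drop hits that fell out of the window and add the current one.
function record_hit(array $hits, int $now, int $windowSeconds): array
{
    $recent = array_filter($hits, function ($t) use ($now, $windowSeconds) {
        return $t > $now - $windowSeconds;
    });
    $recent[] = $now;
    return array_values($recent);
}

// Ask for a CAPTCHA when the IP is over the limit, unless this
// particular session has already solved one.
function needs_captcha(array $hits, int $maxHits, bool $sessionPassedCaptcha): bool
{
    return !$sessionPassedCaptcha && count($hits) > $maxHits;
}
```

The $sessionPassedCaptcha flag is what lets several real users behind the same IP keep browsing: each one solves the CAPTCHA once and you remember that in their session.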
Method 5:
Always check that a User-Agent header is sent; simple scripts written by newbies (and there are many) forget to send it.
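In PHP the header shows up as HTTP_USER_AGENT, so the check is a one-liner. The function name has_user_agent() is just for illustration.

```php
<?php
// True when the request carries a non-empty User-Agent header.
function has_user_agent(array $server): bool
{
    return isset($server['HTTP_USER_AGENT'])
        && trim((string) $server['HTTP_USER_AGENT']) !== '';
}
```

Call it as has_user_agent($_SERVER) and treat a false result as one more strike against the request.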
Method 6:
After the first entry, each request should logically contain an HTTP_REFERER, so check that too. Newbies forget to send that as well.
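To tell the first entry from later requests you need to remember that the visitor has been seen, for instance in the session. A sketch, where the function name and the seen_before session key are my own inventions:

```php
<?php
// True when a returning visitor sends no referer at all.
function referer_missing_after_entry(array $server, array $session): bool
{
    $seenBefore = !empty($session['seen_before']);
    $hasReferer = !empty($server['HTTP_REFERER']);
    return $seenBefore && !$hasReferer;
}
```

You would pass $_SERVER and $_SESSION, and set $_SESSION['seen_before'] = true at the end of every request. Some browsers and privacy tools strip the referer, so this is better counted as a warning sign than used as an outright block.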
Method 7:
If the same IP keeps generating different session_ids, you have a crawler on your hands.
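Spotting this means recording which session IDs each IP has used. A minimal sketch with an in-memory array standing in for whatever storage you use; note_session() and looks_like_crawler() are hypothetical names, and the threshold is something you would have to tune.

```php
<?php
// Remember that this IP has used this session ID.
function note_session(array $seen, string $ip, string $sessionId): array
{
    $seen[$ip][$sessionId] = true;
    return $seen;
}

// True when one IP has gone through more distinct sessions than allowed.
function looks_like_crawler(array $seen, string $ip, int $maxSessions): bool
{
    return isset($seen[$ip]) && count($seen[$ip]) > $maxSessions;
}
```

Keep the limit generous, since an office or ISP proxy can put many genuine visitors, each with their own session, behind a single IP.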
OK, that's all I can think of right now. Some of the above might not make sense to some, but if all of them are used effectively in combination with each other, you've got a great crawl-blocking system on your hands. I would appreciate it if people could suggest other methods as well.

July 21st, 2008 at 3:52 pm
Method 8:
If you want to make sure your pages can still be bookmarked correctly, blocking requests that lack your site's referrer will do harm. Instead, check whether a referrer exists, and block the request only if it is not yours.
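A sketch of that rule in PHP, assuming PHP 7.1 or later; the function name referer_allowed() and the host example.com are placeholders:

```php
<?php
// Allow requests with no referer (bookmarks, typed URLs) and requests
// whose referer host is our own; reject any other referer.
function referer_allowed(?string $referer, string $ownHost): bool
{
    if ($referer === null || $referer === '') {
        return true;
    }
    $host = parse_url($referer, PHP_URL_HOST);
    if (!is_string($host)) {
        // Malformed referer or one without a host part.
        return false;
    }
    return strcasecmp($host, $ownHost) === 0;
}
```

Called as referer_allowed($_SERVER['HTTP_REFERER'] ?? null, 'example.com'). The comparison is exact, so www.example.com and example.com count as different hosts. Applied to every page, the rule would also turn away visitors who follow a link from another site, so it probably suits downloads and images better than ordinary pages.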
September 17th, 2008 at 10:30 pm
Cool Thanks For the Tips!