
    PHP: Attempting to Block Site Leechers/Crawlers

    March 18th, 2007

    Someone put forward a question on the forum regarding how to block leechers and I thought that my answer might make a good post as well.

    There is no foolproof way of going about it, because a script can pretend perfectly well to be a valid browser (CURL helps you do exactly that). You will have to use several different methods together to fight them and reduce the illegal crawlers significantly. Other than the basic idea of banning the IP obtained from REMOTE_ADDR and HTTP_X_FORWARDED_FOR (keep in mind that the latter is a request header and can be faked by the client), here are some suggestions.
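
As a rough illustration of the basic IP ban mentioned above, here is a minimal sketch. The function names and the ban list are made up for this example, and the array passed in is assumed to look like $_SERVER.

```php
<?php
// Collect every address a request claims to come from.
// $server is assumed to have the same shape as $_SERVER.
function candidate_ips(array $server): array
{
    $ips = [];
    if (!empty($server['REMOTE_ADDR'])) {
        $ips[] = $server['REMOTE_ADDR'];
    }
    if (!empty($server['HTTP_X_FORWARDED_FOR'])) {
        // The header may carry a comma-separated chain of proxies.
        foreach (explode(',', $server['HTTP_X_FORWARDED_FOR']) as $ip) {
            $ips[] = trim($ip);
        }
    }
    // Drop anything that is not a syntactically valid IP address.
    $valid = array_filter($ips, function ($ip) {
        return filter_var($ip, FILTER_VALIDATE_IP) !== false;
    });
    return array_values(array_unique($valid));
}

// True if any of the claimed addresses is on the (hypothetical) ban list.
function is_banned(array $server, array $bannedIps): bool
{
    return count(array_intersect(candidate_ips($server), $bannedIps)) > 0;
}
```

In a real page you would call is_banned($_SERVER, $yourBanList) at the top of the script and send a 403 when it returns true.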

    Method 1:
    If there are too many requests from the same IP, try to find out which ISP or organization owns it. The GeoIP Organization and ISP packages, for example, give you a database in which you can look up an IP and see its owner. If the IP belongs to a valid search engine (and provided you want it to crawl your site), let it go ahead. You can also put some effort into building a list of valid search engine IPs that overrides all crawler detection.
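
Method 1 could be sketched roughly like this. It does not use GeoIP at all: it assumes you already have a hand-made list of allowed search engine ranges in CIDR notation, handles IPv4 only, and assumes a 64-bit PHP build. The function names, the limits and the ranges in the usage note are all invented for the example.

```php
<?php
// IPv4-only check whether $ip falls inside $cidr (e.g. "192.0.2.0/24").
function ip_in_cidr(string $ip, string $cidr): bool
{
    if (strpos($cidr, '/') === false) {
        return $ip === $cidr;
    }
    list($subnet, $bits) = explode('/', $cidr, 2);
    $ipLong = ip2long($ip);
    $subnetLong = ip2long($subnet);
    $bits = (int) $bits;
    if ($ipLong === false || $subnetLong === false || $bits < 0 || $bits > 32) {
        return false;
    }
    // Build the network mask (assumes 64-bit integers).
    $mask = (-1 << (32 - $bits)) & 0xFFFFFFFF;
    return ($ipLong & $mask) === ($subnetLong & $mask);
}

// $hits is a list of UNIX timestamps of earlier requests from $ip.
// Allowed ranges override the rate check completely.
function is_suspicious(string $ip, array $hits, int $now, array $allowedCidrs, int $limit = 60, int $window = 60): bool
{
    foreach ($allowedCidrs as $cidr) {
        if (ip_in_cidr($ip, $cidr)) {
            return false;
        }
    }
    $recent = array_filter($hits, function ($t) use ($now, $window) {
        return $t > $now - $window;
    });
    return count($recent) > $limit;
}
```

For example, is_suspicious('203.0.113.9', $hits, time(), ['192.0.2.0/24']) flags the IP once it has made more than 60 requests in the last 60 seconds, unless it sits in the allowed range.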

    Method 2:
    On the first request to the site, send a cookie, redirect, and check whether the cookie comes back. If not, then redirect to a page asking the user to enable cookies. Simple scripts generally lack the ability to handle cookies.
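
Here is one way the cookie round trip could look. The cookie name ck_test, the query flag ck_checked and the page /enable-cookies.php are placeholders I picked for the sketch. The decision is kept in a separate function so it can be tested without a web server.

```php
<?php
// Decide what to do based on the cookies and query string of the request.
function cookie_check_action(array $cookies, array $query): string
{
    if (isset($cookies['ck_test'])) {
        return 'allow';                    // cookie came back, client handles cookies
    }
    if (isset($query['ck_checked'])) {
        return 'ask_enable_cookies';       // we already redirected once, still no cookie
    }
    return 'set_cookie_and_redirect';      // first visit
}

// Apply the decision in a real request. Call this before any output is sent.
function enforce_cookie_check(): void
{
    switch (cookie_check_action($_COOKIE, $_GET)) {
        case 'set_cookie_and_redirect':
            setcookie('ck_test', '1', 0, '/');
            $uri = $_SERVER['REQUEST_URI'];
            $sep = strpos($uri, '?') === false ? '?' : '&';
            header('Location: ' . $uri . $sep . 'ck_checked=1');
            exit;
        case 'ask_enable_cookies':
            header('Location: /enable-cookies.php'); // hypothetical page
            exit;
    }
}
```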

    Method 3:
    Use JavaScript to set a cookie and then try to read it on the next request. If there is no cookie, ask the user to enable JavaScript. Generic scripts won't be able to process JavaScript, unless someone is writing code specifically for your site.
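
A possible sketch of the JavaScript cookie check follows. Tying the cookie value to the visitor's IP with an HMAC is my own addition so that the value cannot simply be hard-coded by a script. The cookie name js_ok, the function names and the secret are placeholders.

```php
<?php
// Token the browser has to echo back, derived from the visitor's IP
// and a server-side secret (the secret is an assumption of this sketch).
function js_token(string $ip, string $secret): string
{
    return hash_hmac('sha256', $ip, $secret);
}

// HTML to print on the first page: JavaScript sets the cookie,
// visitors without JavaScript see the notice instead.
function js_cookie_snippet(string $token): string
{
    return '<script>document.cookie = "js_ok=' . $token . '; path=/";</script>'
         . '<noscript>Please enable JavaScript to continue.</noscript>';
}

// On later requests: did the cookie come back with the right value?
function passed_js_check(array $cookies, string $token): bool
{
    return isset($cookies['js_ok'])
        && is_string($cookies['js_ok'])
        && hash_equals($token, $cookies['js_ok']);
}
```

The token is plain hexadecimal, so it is safe to print inside the script tag as is.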

    Method 4:
    If there are too many requests, show a CAPTCHA image that is not too straightforward. If there is no valid input, at least block that IP from going ahead. Even if a lot of users share that IP, you can show each of them the CAPTCHA and mark their session_id as validated to browse your site, even though the IP is the same. It is a little nuisance, but worth it if you have a severe problem.
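
The session part of Method 4 might look something like this. Generating the CAPTCHA image itself is left out. The session keys captcha_expected and captcha_passed, the function names and the threshold of 100 are all assumptions for illustration, and the arrays are assumed to look like $_SESSION.

```php
<?php
// Decide whether this visitor has to solve a CAPTCHA.
// A session that already passed is let through even if its IP is busy.
function captcha_gate(int $hitsFromIp, array $session, int $threshold = 100): string
{
    if (!empty($session['captcha_passed'])) {
        return 'allow';
    }
    return $hitsFromIp > $threshold ? 'show_captcha' : 'allow';
}

// Compare the submitted answer with the text stored when the image was made.
// The expected text is discarded either way, so each image allows one attempt.
function check_captcha_answer(array &$session, string $answer): bool
{
    $expected = $session['captcha_expected'] ?? '';
    $ok = $expected !== ''
        && hash_equals(strtolower($expected), strtolower(trim($answer)));
    unset($session['captcha_expected']);
    if ($ok) {
        $session['captcha_passed'] = true;
    }
    return $ok;
}
```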

    Method 5:
    Always check that a User-Agent header is sent. Simple scripts written by newbies (and there are many) forget to send it.
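
The User-Agent check is about as simple as it gets. A small sketch (the function name is made up, and the array is assumed to look like $_SERVER):

```php
<?php
// True if the request carried a non-empty User-Agent header.
function has_user_agent(array $server): bool
{
    return isset($server['HTTP_USER_AGENT'])
        && is_string($server['HTTP_USER_AGENT'])
        && trim($server['HTTP_USER_AGENT']) !== '';
}
```

In a page you could then do something like: if (!has_user_agent($_SERVER)) { http_response_code(403); exit; }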

    Method 6:
    After the first entry, each request should logically contain an HTTP_REFERER, so check that too. Newbies forget to send that as well.
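
A referer check could be sketched as below. How you know that a request is the first one (for example a flag in the session) is up to you, so it is passed in as a parameter. The function name is hypothetical. Some real browsers strip the referer, so it is probably wiser to treat a failure as one signal among several than as proof.

```php
<?php
// After the first request, expect a referer that points back at our own host.
function referer_looks_ok(array $server, bool $isFirstRequest, string $ownHost): bool
{
    if ($isFirstRequest) {
        return true;                       // nothing to compare against yet
    }
    if (empty($server['HTTP_REFERER']) || !is_string($server['HTTP_REFERER'])) {
        return false;
    }
    $host = parse_url($server['HTTP_REFERER'], PHP_URL_HOST);
    return is_string($host) && strcasecmp($host, $ownHost) === 0;
}
```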

    Method 7:
    If the same IP keeps generating different session_ids, you have a crawler on your hands.
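
Counting session IDs per IP might be done along these lines. The sketch keeps everything in a plain array. On a real site the data would have to live somewhere shared between requests, such as a database table. The function names and the limit of 20 are assumptions, and since many real users can sit behind one proxy IP, the limit should not be too low.

```php
<?php
// Remember that $sessionId was seen from $ip and return how many
// distinct session IDs that IP has used so far.
function record_session(array &$seen, string $ip, string $sessionId): int
{
    $seen[$ip][$sessionId] = true;
    return count($seen[$ip]);
}

// True once an IP has used more distinct session IDs than $limit.
function too_many_sessions(array $seen, string $ip, int $limit = 20): bool
{
    return isset($seen[$ip]) && count($seen[$ip]) > $limit;
}
```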

    OK, that's all I can think of right now. Some of the above might not make sense to everyone, but if all of them are used effectively in combination, you've got a great crawl-blocking system on your hands. I would appreciate it if people could suggest other methods as well.



    Google, who is the loser?

    March 1st, 2007

    Over the years I have seen Google emerge out of nothing and rule the planet when it comes to search. But I have also seen lots of junk in my searches, and I have seen thousands of requests for Search Engine Optimization and ads by Search Engine Optimizers claiming to put your site on top of the search results for X amount of dollars in Y number of days.

    Search Engine Optimization is now probably one of the biggest industries on the internet. My understanding is that probably the BEST content on the internet comes from small personal sites. For instance, take a look at my blog: no bullshit, pure content, BUT I will never be able to score high on search engines. Why? I DON'T KNOW!

    Isn't it all about content? Honestly, NO! It's more about which optimization techniques you follow on your site and which keywords you target. Keyword targeting is earning a lot of people a lot of money, and a lot of crap sites are getting loads of hits just because of it, while probably a lot of great sites with far more relevant information on that keyword end up on page 10, 11 or whatever.

    So what does that mean? It means that if you want a real internet presence, your intentions are all good, and you are a guy with principles who doesn't want to use black hat methods, I have bad news for you. That ain't gonna happen! At least not the way things are going.

    Take for instance the Google Bomb concept: it works and it works well. Page cloaking is another well-known technique. Getting these methods to work for you will cost you $$$$, and some SEO expert is going to get rich. There are also magic ebooks out there which tell you techniques to get on top of search engines at a ‘small’ price.

    OK, hold on... I thought the internet was supposed to be an equal opportunity domain where everyone is equal, BUT while Google has helped us A LOT, it has also created a division in internet society. Even on the internet, the rich guys with pots of money for Search Engine Optimization and Marketing get to the top and the not so rich are always in the shadows. Google probably supports the ‘secret’ black hat community, because they don't seem to be doing much about it. Let's take the example of Sponsored Ads in Google: have they ever filtered what is actually shown in that right column? For some searches I see the same ad on the right side, something like “Looking for XYZ?”, with eBay selling me PHP and God knows what.

    The point is that Google has, knowingly or unknowingly, put all the good webmasters in a fix: “What should we do to be at least marginally visible on Google?” Instead of concentrating on content, products or services, they are more worried about whether it's even worth the effort. Google should probably rethink its strategy and decide whether it wants to focus on the 1% of powerful people on the internet or the 99% of not so powerful people out there.

    Probably one of the best things that happened on the internet, indexing-wise, was www.dmoz.org: “hand pick the sites that should be included”. Now I am not saying Google should hand pick everything, BUT at least they should be checking what is actually showing for the top searched keywords. I also have another suggestion: there are so many keywords where they have nothing to show in that money column of theirs. What if they started to show some good sites in those spots for free? They could probably hand pick, let's say, 1000 sites each month and rotate them through those spots. But anyway, the point is: where are we going with Google? By WE I mean the not so powerful lot on the internet, the honest webmasters! We are the losers in the game, because we believe that our content, service or product will get us to the top, but that's dreamland. I hope Google can squeeze us in there somewhere on their money pages.



    Latest CURL Issues with PHP (apache segmentation faults)

    March 1st, 2007

    Over the last couple of days I came across issues with CURL 7.15.3, PHP 4.4.5 and Apache.

    There was this code I had to work on which was written by someone else. The guy was reusing the CURL handle more than once and I was like... eh? BUT interestingly, that code was working a few days ago; somehow, during an update to Apache etc., it stopped working. And the strangest behaviour was observed: just imagine my surprise when a function wasn't returning control to the main script. Then I decided to check the Apache error logs and there it was... a segmentation fault, with details:

    glibc detected *** double free or corruption

    Now here is what was happening: a curl handle was created, some POST data was sent, and then the same handle was used again to send some more POST data to a different page. The first request went through, but the second one errored out. Now ideally the handle shouldn't be reused in the first place. But earlier, for some reason, it was working fine and had been working great for a while.

    Now if you are experiencing anything similar, just stop reusing the handle: create a new one for every request and use that. That's the way it should be done and that's the way you should do it. For instance, the CURL wrapper class I have provided here does the same: it always creates a new handle, and nothing based on that class has ever broken down. So watch how you use your curl handles!
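
To illustrate the "new handle for every request" approach, here is a minimal sketch. It is not the wrapper class mentioned above, just a stand-alone function with a name I made up, written for a current PHP version with the curl extension loaded.

```php
<?php
// Send one POST request on a handle that is created here and never reused.
// Returns ['body' => string|null, 'error' => string|null].
function post_with_fresh_handle(string $url, array $fields, int $timeout = 10): array
{
    $ch = curl_init($url);                 // brand new handle for this request
    if ($ch === false) {
        return ['body' => null, 'error' => 'could not create curl handle'];
    }
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,    // return the response instead of printing it
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
    ]);
    $body = curl_exec($ch);
    $result = [
        'body'  => $body === false ? null : $body,
        'error' => $body === false ? curl_error($ch) : null,
    ];
    // From PHP 8.0 on the handle is freed when $ch goes out of scope;
    // older versions need the explicit close.
    if (PHP_VERSION_ID < 80000) {
        curl_close($ch);
    }
    return $result;
}
```

Two POSTs to two different pages then simply become two calls to post_with_fresh_handle(), each with its own URL and fields, and no handle is shared between them.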