Solved

Stop BOTs from hitting my <a href links

Posted on 2008-10-09
14
943 Views
Last Modified: 2008-11-10
I understand the robots.txt file that people use to prevent bots from accessing certain pages on a site, but I don't want to keep bots away from my pages. In fact, I want them to visit all my pages and see all the reciprocal links I use on one of my sites... but just NOT actually click them as though they were a user.

Is there a way to prevent a bot from pretending to be a user and clicking through a text link on my website?

I have (for example) text links like sponsored by abc company where abc company is a hyperlink to abc company's website.  I also have a text hyperlink that says CLICK HERE IF YOU WANT TO REPORT THIS USER on a blog section of my site.  Every day I get at least 6 emails from the script that is executed when someone clicks the REPORT THIS USER link.  I don't want it to be a form where I validate a human (captcha) to report the user, because nobody will take the time.

In addition, when these bots click my SPONSORED BY links, it messes up the click-through statistics I'm tracking.

Any ideas?

I've found an interesting HTTP_REFERER snippet that someone said they use, but I don't know how to get it to work.
0
Comment
Question by:day6
  • 4
  • 3
  • 3
  • +2
14 Comments
 
LVL 63

Assisted Solution

by:Zvonko
Zvonko earned 100 total points
ID: 22681877
Can you change your links from href to onClick?
So if your link is like this:

<a href="user.htm" >REPORT THIS USER</a>

Change it to:

<a href="#"  onClick="window.location.href='user.htm';" >REPORT THIS USER</a>
0
 
LVL 63

Expert Comment

by:Zvonko
ID: 22681888
Of course, do this only for the links that you want to protect. If you change all links to script, you risk locking out users who have scripting turned off.

0
 
LVL 36

Expert Comment

by:SidFishes
ID: 22681937
The answer is that you can limit the problem but not get rid of it entirely, as the user agent -can- be spoofed.

Using some modified code from Ben Nadel (http://www.bennadel.com/index.cfm?dax=blog:67.view), you could try something like this:
<cfscript>
// Here, we are using short-circuit evaluation on the
// IF statement with the most popular search engines at the top of the
// list. This will help us minimize the amount of time that it takes to
// evaluate the list.
if (
	(NOT Len(CGI.http_user_agent)) OR
	FindNoCase( "Slurp", CGI.http_user_agent ) OR
	FindNoCase( "Googlebot", CGI.http_user_agent ) OR
	FindNoCase( "BecomeBot", CGI.http_user_agent ) OR
	FindNoCase( "msnbot", CGI.http_user_agent ) OR
	FindNoCase( "Mediapartners-Google", CGI.http_user_agent ) OR
	FindNoCase( "ZyBorg", CGI.http_user_agent ) OR
	FindNoCase( "RufusBot", CGI.http_user_agent ) OR
	FindNoCase( "EMonitor", CGI.http_user_agent ) OR
	FindNoCase( "researchbot", CGI.http_user_agent ) OR
	FindNoCase( "IP2MapBot", CGI.http_user_agent ) OR
	FindNoCase( "GigaBot", CGI.http_user_agent ) OR
	FindNoCase( "Jeeves", CGI.http_user_agent ) OR
	FindNoCase( "Exabot", CGI.http_user_agent ) OR
	FindNoCase( "SBIder", CGI.http_user_agent ) OR
	FindNoCase( "findlinks", CGI.http_user_agent ) OR
	FindNoCase( "YahooSeeker", CGI.http_user_agent ) OR
	FindNoCase( "MMCrawler", CGI.http_user_agent ) OR
	FindNoCase( "MJ12bot", CGI.http_user_agent ) OR
	FindNoCase( "OutfoxBot", CGI.http_user_agent ) OR
	FindNoCase( "jBrowser", CGI.http_user_agent ) OR
	FindNoCase( "ZiggsBot", CGI.http_user_agent ) OR
	FindNoCase( "Java", CGI.http_user_agent ) OR
	FindNoCase( "PMAFind", CGI.http_user_agent ) OR
	FindNoCase( "Blogbeat", CGI.http_user_agent ) OR
	FindNoCase( "TurnitinBot", CGI.http_user_agent ) OR
	FindNoCase( "ConveraCrawler", CGI.http_user_agent ) OR
	FindNoCase( "Ocelli", CGI.http_user_agent ) OR
	FindNoCase( "Labhoo", CGI.http_user_agent ) OR
	FindNoCase( "Validator", CGI.http_user_agent ) OR
	FindNoCase( "sproose", CGI.http_user_agent ) OR
	FindNoCase( "oBot", CGI.http_user_agent ) OR
	FindNoCase( "MyFamilyBot", CGI.http_user_agent ) OR
	FindNoCase( "Girafabot", CGI.http_user_agent ) OR
	FindNoCase( "aipbot", CGI.http_user_agent ) OR
	FindNoCase( "ia_archiver", CGI.http_user_agent ) OR
	FindNoCase( "Snapbot", CGI.http_user_agent ) OR
	FindNoCase( "Larbin", CGI.http_user_agent ) OR
	FindNoCase( "psycheclone", CGI.http_user_agent ) OR
	FindNoCase( "ColdFusion", CGI.http_user_agent )
){
	link = false;
} else {
	link = true;
}
</cfscript>

<cfif link eq true>
<a href="#">Click Here</a>
<cfelse>
No Link
</cfif>


0
 
LVL 36

Expert Comment

by:SidFishes
ID: 22681954
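btw if you want to test how it works, just add a line to the list that matches your own browser, so that you get the "No Link" branch:

or FindNoCase( "Firefox", CGI.http_user_agent )

for Firefox

or FindNoCase( "MSIE", CGI.http_user_agent )

for IE. Note that "mozilla" would match almost every browser, IE included, and a bare "IE" also matches any user agent that merely contains those two letters.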
0
 
LVL 63

Assisted Solution

by:Zvonko
Zvonko earned 100 total points
ID: 22682054
Or in RegExp form:

<cfscript>
// Same bot list as a single case-insensitive regular expression.
// An empty user agent is treated as a bot, as in the FindNoCase version.
if (
	(NOT Len(CGI.http_user_agent)) OR
	ReFindNoCase("Slurp|Googlebot|BecomeBot|msnbot|Mediapartners-Google|ZyBorg|RufusBot|EMonitor|researchbot|IP2MapBot|GigaBot|Jeeves|Exabot|SBIder|findlinks|YahooSeeker|MMCrawler|MJ12bot|OutfoxBot|jBrowser|ZiggsBot|Java|PMAFind|Blogbeat|TurnitinBot|ConveraCrawler|Ocelli|Labhoo|Validator|sproose|oBot|MyFamilyBot|Girafabot|aipbot|ia_archiver|Snapbot|Larbin|psycheclone|ColdFusion", CGI.http_user_agent )
){
	link = false;
} else {
	link = true;
}
</cfscript>

<!--- Debug output: shows the user agent string that was tested --->
<cfdump var="#CGI.http_user_agent#" >

<cfif link eq true>
<a href="#">Click Here</a>
<cfelse>
No Link
</cfif>


0
 
LVL 1

Author Comment

by:day6
ID: 22682701
I see the regexp solution, but does this still allow bots/spiders to know that I have reciprocal links to sites? I understand that in SEO, reciprocal links (links to higher-traffic sites and from their sites to mine) can help the rankings.

With the JavaScript onClick approach, does the bot ignore the link, and does it still see the destination URL in my raw code and take note of where the link goes? Or does the bot have to actually follow the link to a resolving page to verify it as a good link?
0
 
LVL 36

Expert Comment

by:SidFishes
ID: 22683049
What this code does is set a local variable "link" based on whether it finds a known bot user agent. You then use this variable in the cfif code to decide whether to show a link or not. The bot can't -ignore- the link because it isn't even on the page that is sent to the bot.

afaik, there's no way to have it both ways... you either allow them to follow links or you don't give them the links to follow. You can't say to a bot "here's a link, but don't follow it, just index it"... that's not how they're built.

You can, however, decide which links to hide from the bots by putting those links inside the cfif blocks.
0

 
LVL 10

Assisted Solution

by:Mr_Nil
Mr_Nil earned 400 total points
ID: 22688300
Putting rel="nofollow" on your "REPORT THIS" links should stop the major search engines from following the link.

Otherwise, at the top of the script that the link points to, you could check the user agent and, if it is a known bot, skip sending the email, updating the database, etc.
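
A minimal sketch of that second idea, written in JavaScript for illustration rather than CFML. The names `isKnownBot` and `handleReport` are hypothetical, the pattern is a shortened subset of the bot list posted earlier in this thread, and `sendEmail` stands in for whatever the real script does. A bot that sends a browser-like user agent will still get through.

```javascript
// Hypothetical helper: true when the user agent is empty or matches a known crawler.
// The names are a small subset of the list posted earlier in this thread.
const BOT_PATTERN = /Slurp|Googlebot|msnbot|Mediapartners-Google|ia_archiver|MJ12bot/i;

function isKnownBot(userAgent) {
  if (!userAgent) {
    return true; // no user agent at all is treated as a bot
  }
  return BOT_PATTERN.test(userAgent);
}

// Hypothetical handler for the REPORT THIS USER script.
// sendEmail is passed in so the side effect is easy to swap out.
function handleReport(userAgent, sendEmail) {
  if (isKnownBot(userAgent)) {
    return "ignored"; // skip the email and any database changes
  }
  sendEmail();
  return "reported";
}

let sent = 0;
console.log(handleReport("Mozilla/5.0 (compatible; Googlebot/2.1)", () => sent++)); // ignored
console.log(handleReport("Mozilla/5.0 (Windows; U) Firefox/3.0", () => sent++)); // reported
```

The point of putting the check in the target script, rather than only hiding the link, is that it still applies when a bot requests a URL it stored on an earlier visit.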
0
 
LVL 1

Author Comment

by:day6
ID: 22690343
Where would I put the rel="nofollow" syntax?
0
 
LVL 7

Expert Comment

by:black0ps
ID: 22690962
Maybe I'm wrong about this, but don't most bots avoid following links like that? I thought bots index the links and then go to them directly later on, without "clicking" the link.
0
 
LVL 10

Accepted Solution

by:
Mr_Nil earned 400 total points
ID: 22691064
@day6 <a href="blah" rel="nofollow">

@black0ps I'm also not sure, but what would stop them from following a link like this? A bot wouldn't be able to differentiate between index.cfm?action=display&contentid=1 and index.cfm?action=report&contentid=1. It would just look at the URL and try to index the content behind it.
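
To illustrate the point, here is a rough JavaScript sketch of how a crawler might collect links from a page. It is a simplification under stated assumptions: `extractLinks` is a hypothetical helper, and real crawlers use a proper HTML parser rather than a regular expression. It shows that the only thing separating the two URLs is the rel="nofollow" hint, which a crawler is free to ignore.

```javascript
// Hypothetical, simplified link collector; a regular expression is enough for this illustration.
function extractLinks(html) {
  const links = [];
  const anchor = /<a\s+([^>]*)>/gi;
  let match;
  while ((match = anchor.exec(html)) !== null) {
    const attrs = match[1];
    const href = /href="([^"]*)"/i.exec(attrs);
    if (!href) continue; // anchor without an href, nothing to follow
    links.push({
      url: href[1],
      nofollow: /rel="[^"]*\bnofollow\b[^"]*"/i.test(attrs),
    });
  }
  return links;
}

const page =
  '<a href="index.cfm?action=display&contentid=1">Read</a> ' +
  '<a href="index.cfm?action=report&contentid=1" rel="nofollow">REPORT THIS USER</a>';

const links = extractLinks(page);
// A crawler that honors rel="nofollow" queues only the display URL.
const polite = links.filter((l) => !l.nofollow).map((l) => l.url);
// A crawler that ignores the hint queues both; nothing in the URL marks one as an action.
const rude = links.map((l) => l.url);
console.log(polite);
console.log(rude);
```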


0
 
LVL 7

Expert Comment

by:black0ps
ID: 22691278
Have you tried putting in nocache meta tags? The bot wouldn't take a snapshot of the page. Or are you trying to prevent the page from being accessed altogether by bots to prevent a counter or something?
0
 
LVL 1

Author Comment

by:day6
ID: 22745964
None of these suggestions has stopped the bots from activating my script. I've used the onClick JavaScript, the bot-filter CFIF script, and even rel="nofollow", and it simply doesn't stop it from happening. The reason I know that it is a bot is that it comes at the exact same time each day and hits multiple links on multiple pages within a minute or two of each other. It actually cached a URL with variables in it that made my script run for a specific record in my database which no longer exists.

I've beefed up my code with each of these ideas and nothing works. I even wrote an output to show what browser is hitting my script, and it simply reads Mozilla/4.0 each time, along with a few other browser types. I presume it's the bot's way of spoofing my script.

Any other ideas?

The NOCACHE meta tag isn't going to stop this since I want spiders to crawl my pages, but just not activate my scripts that are attached to the links on them.
0
 
LVL 1

Author Comment

by:day6
ID: 22775408
I really need help here... the suggestions are not stopping the bots from hitting the link and/or hitting the script the link goes to after they have cached the path. I wrote a CFIF statement to prevent a bot from running the script, but it doesn't stop it.

I really don't want to implement captcha every time a bot exploits a script I have on my site.
0
