Solved

ISA Scheduled Cache content download ignoring remote robots.txt

Posted on 2005-04-08
Last Modified: 2010-04-09
We have set up a scheduled content download for a local news site that people access often (while they should be working).

Last week the admin of this web site set up some form of security in order to ban IPs "attacking" their site. He explained that the software is detecting a connection called "Fetch API Request", which is downloading content and ignoring the instructions given in robots.txt.

I have traced the problem to the ISA scheduled caching. I disabled the setting that tells it to cache content even when it doesn't get HTTP status code 200, but we still got banned.

Next, I found a setting in the scheduled caching task that instructs the service to keep the content download within the URL domain. I enabled it, but we still got banned.

I tried looking for any references to this robots.txt standard for ISA, but didn't find anything.
The usual "Screw the standards, let's do it our own way", I guess...

I also found an event log error for this particular schedule, about an attempt to access an administrative page of Apache.
I could disable the schedule, but this may happen in the future with other sites.

(Sorry if this category isn't exactly related, but this is the only one with a reference to ISA.)

Question by:miklesw
3 Comments
 
LVL 12

Expert Comment

by:srikrishnak
ID: 13740542
Hmm... I don't think robots.txt is a standard. You probably want to have a look at it and comply with the "Terms & Conditions"...
 
LVL 1

Author Comment

by:miklesw
ID: 13743041
Let's say robots.txt instructs me not to go to www.x.com/a/, so I tell ISA myself not to download that. I'll still have a problem if, for example, www.x.com/b/ is disallowed next month.

PS: this robots thing is on the W3C site.
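
To illustrate the point about not wanting to maintain the exclusions by hand: a client that honors robots.txt re-reads the file and filters its own URL list, so a newly disallowed path is picked up automatically. Below is a minimal sketch in Python using the standard `urllib.robotparser` module. The robots.txt content, the `ExampleFetcher` user agent, and the URL list are made up for illustration; nothing here is an ISA feature or setting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real site's file is not shown in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /a/
Disallow: /b/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "http://www.x.com/index.html",
    "http://www.x.com/a/page.html",
    "http://www.x.com/b/page.html",
]
print(allowed_urls(parser, "ExampleFetcher", candidates))
```

If the site later adds another `Disallow` line, the same filter excludes those URLs on the next run without anyone editing a manual list.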
 
LVL 35

Accepted Solution

by:
Bembi earned 1000 total points
ID: 13762533
Not quite sure why you want to use the scheduled download for that; I think this functionality is more used to sync static sites. Robots.txt was mainly introduced for web-spider software, to avoid indexing unrelated or old content. Like some of the spiders, I would not be surprised if ISA ignores it.

As you said, this is a news site, which usually has a very short TTL and may change often, so I think the normal cache functionality of ISA would do what you want. Whenever a site is viewed, ISA stores the content in the cache and does not download it again during the first 50% of the TTL, if the site itself has not changed. Also, entries in robots.txt do not affect the status code as long as the site is still available. Robots.txt has nothing to do with whether a site is available or not; it simply says that robots (spiders) should not index the site anymore.
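
The "first 50% of the TTL" rule above can be sketched roughly as follows. This is only an illustration of the rule as described in this answer, not ISA's actual implementation; the `CacheEntry` type, the `needs_revalidation` function, and the `fresh_fraction` parameter are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    stored_at: float  # time the object was cached, in seconds
    ttl: float        # time-to-live assigned to the object, in seconds


def needs_revalidation(entry: CacheEntry, now: float, fresh_fraction: float = 0.5) -> bool:
    """Return True once the entry is older than the given fraction of its TTL.

    Assumption: during the first `fresh_fraction` of the TTL the cached copy
    is served as-is; after that the cache checks with the origin server.
    """
    age = now - entry.stored_at
    return age >= fresh_fraction * entry.ttl


entry = CacheEntry(stored_at=0.0, ttl=600.0)
print(needs_revalidation(entry, now=200.0))  # still inside the first half of the TTL
print(needs_revalidation(entry, now=400.0))  # past the first half of the TTL
```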

What the content provider should do is enable a redirection for the pages that should not be accessed anymore. This delivers a different status code, which can be recognized by ISA.

On your side, you may have to clear the ISA cache once you have made changes, to avoid ISA requesting updates for the cached content.


Question has a verified solution.
