ISA Scheduled Cache content download ignoring remote robots.txt

We have set up a scheduled content download for a local news site that people access often (while they should be working)...

Last week the admin of this web site set up some form of security to ban IPs "attacking" their site. He explained that the software is detecting a connection called "Fetch API Request" which downloads content while ignoring the instructions given by robots.txt.
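For context: robots.txt is a plain-text file at the site root that tells automated clients which paths they may fetch, and a compliant client checks it before downloading. A minimal sketch using Python's standard `urllib.robotparser` (the rules, site URL, and user-agent string here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real client would download it
# from http://<site>/robots.txt instead of parsing a literal string.
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant fetcher asks before every download:
print(rp.can_fetch("MyFetcher/1.0", "http://news.example/admin/login"))  # False
print(rp.can_fetch("MyFetcher/1.0", "http://news.example/index.html"))   # True
```

A client like ISA's scheduled download that never performs this check is exactly what the site admin's software flags as a misbehaving bot.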

I have identified the problem as coming from the scheduled ISA caching, and I disabled the setting that tells it to cache content even when it doesn't get HTTP status code 200... but we still got banned.

Next, I found a setting in the scheduled caching task instructing the service to keep content downloads within the URL's domain. I enabled it, but we still got banned.

I tried looking for any references to the robots.txt standard for ISA, but didn't find anything.
The usual "screw the standards, let's do it our own way", I guess...

I also found an event log error for this particular schedule: it was trying to access an administrative page of Apache.
I could disable the schedule, but this may happen again in the future with other sites.

(Sorry if this category isn't exactly related, but it's the only one with a reference to ISA.)

Hmm... I don't think robots.txt is a standard. You probably want to have a look at it and comply with the site's terms and conditions...
mikleswAuthor Commented:
If, let's say, robots.txt instructs me not to go somewhere, and I tell ISA myself not to download that, I'll still have a problem if something else is disallowed next month, for example...

P.S. This robots thing is on the W3C site.
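The author's point stands: robots.txt can change at any time, so a static exclusion list configured by hand in ISA will drift out of date. A compliant scheduled fetcher re-reads robots.txt at the start of every run and filters its page list against the current rules. A sketch (the base URL, page list, and user-agent string are hypothetical):

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def allowed_pages(rp, base_url, pages, user_agent="ScheduledFetcher/1.0"):
    """Keep only the pages the currently loaded robots.txt permits."""
    return [p for p in pages if rp.can_fetch(user_agent, urljoin(base_url, p))]

# On each scheduled run, re-read robots.txt first so that rules added
# since the last run (e.g. a newly disallowed path) take effect:
#   rp = RobotFileParser(urljoin(base_url, "/robots.txt"))
#   rp.read()
#   pages_to_fetch = allowed_pages(rp, base_url, page_list)
```

Because the rules are fetched fresh each run, no manual reconfiguration is needed when the site admin adds a new Disallow line.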
I'm not quite sure why you want to use a scheduled download for this; I think that functionality is mostly used to sync static sites. robots.txt was mainly introduced for web-spider software, to keep them from indexing unrelated or old content. Like some of the spiders out there, I would not be surprised if ISA ignores it.

As you said, this is a news site, which usually has a very short TTL and may change often, so I think the normal cache functionality of ISA would do what you want. Whenever a site is viewed, ISA stores the content in the cache, and it is not downloaded again during the first 50% of the TTL if the site itself has not changed. Also, entries in robots.txt do not affect the status code as long as the site is still available. robots.txt has nothing to do with whether a site is available or not; it simply says that robots (spiders) should no longer index the site.
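The 50%-of-TTL behavior described above boils down to a simple freshness check. A sketch of that rule (this is an assumed model for illustration; ISA's actual cache logic is not documented here):

```python
import time

def should_revalidate(fetched_at, ttl_seconds, now=None):
    """Assumed model of the rule above: serve straight from cache while
    the cached object's age is under 50% of its TTL; once past that,
    check the origin server for an updated copy."""
    now = time.time() if now is None else now
    age = now - fetched_at
    return age >= 0.5 * ttl_seconds
```

For a news page with, say, a 100-second TTL, the cached copy is served as-is for the first 50 seconds and revalidated afterwards, which keeps the number of origin requests low without any scheduled download job.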

What the content provider should do is enable a redirect for the pages that should no longer be accessed. That delivers a different status code, which can be recognized by ISA.

On your side, you may have to clear the ISA cache after making changes, so that ISA stops requesting updates for the cached content.
