Ben Conner

asked on

Robots.txt filtering based on time of day?

Hi,

I host some websites which have been getting hammered by Bingbot.  I had to block them in the robots.txt file but hoped there might be a directive where I could let them in based on time of day--say at night only or something.  I didn't see anything like this in it.  

As a runner-up, I was thinking of having 2 versions of robots.txt, where one would get swapped out for the other at the appropriate time each day.  I don't know how often search engine bots recheck this type of stuff.

Anyone have a feel for this?  I don't deal with it often enough to have a clue.

Thanks!

--Ben
Dr. Klahn

Does the robots.txt ask bingbot to use a crawl delay?

Sitemap: https://www.thepackis.back/gtgsitemap.xml
# ----------
# -- bingbot, microsoft indexer
User-agent: Bingbot
Disallow: /errors/
Disallow: /graphics/
Crawl-delay: 10


Have a scheduled task running to replace the file. If a bot is at all trustworthy, it should consult robots.txt again BEFORE crawling.
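Something like the sketch below, run hourly from cron, would do the swap. It's only a rough illustration: the docroot path, the hours, and the robots-day.txt / robots-night.txt file names are placeholder assumptions you'd adapt to your own setup.

#!/usr/bin/env python3
# Sketch: copy a "night" or "day" robots.txt into place based on the local hour.
# Assumes robots-day.txt and robots-night.txt already exist in the web root;
# schedule this hourly via cron or another task scheduler.
import shutil
from datetime import datetime
from pathlib import Path

WEB_ROOT = Path("/var/www/example.com")    # placeholder: point this at your docroot
NIGHT_START, NIGHT_END = 22, 6             # allow crawling 22:00-06:00 local time

def pick_source(now: datetime) -> Path:
    night = now.hour >= NIGHT_START or now.hour < NIGHT_END
    return WEB_ROOT / ("robots-night.txt" if night else "robots-day.txt")

if __name__ == "__main__":
    src = pick_source(datetime.now())
    shutil.copyfile(src, WEB_ROOT / "robots.txt")

Keep in mind that crawlers cache robots.txt, often for hours, so a swap only takes effect after the bot's next re-fetch.
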
You mentioned, "I had to block them in the robots.txt," which will never work.

The entire robots.txt system is broken.

Bots either ignore robots.txt or read the disallowed directories and search those first, looking for data to steal or ways to craft attacks.

Tip: First step to solving all site problems... Remove robots.txt from your sites.

The only way you can actually throttle connections is with iptables (never robots.txt), and iptables has its own problems.
Real Fix: Tune your sites for high throughput.

The test I run on all sites, before I deploy them live... is something like this...

net13 # h2speed https://davidfavor.com/
h2load -ph2c -t8 -c8 -m8 -n8000 https://davidfavor.com/
finished in 412.12ms, 19409.49 req/s, 217.72MB/s
requests: 8000 total, 8000 started, 8000 done, 7999 succeeded, 1 failed, 1 errored, 0 timeout
status codes: 7999 2xx, 0 3xx, 0 4xx, 0 5xx
Requests per second: 19,409.49
Requests per minute: 1,164,569.4
Requests per hour  : 69,874,164



Then tune the site until it produces a minimum throughput of 1,000,000+ reqs/minute in a localhost test.

This is for both static sites + WordPress sites.

If a site can survive 1,000,000+ reqs/minute, you'll never notice any crawler effects on your site.

If you're unfamiliar with site tuning, open a 2nd question asking about how to approach site tuning, providing a clickable URL for testing.
Why should robots.txt have to be a static file?
You could create a PHP script and run that whenever robots.txt is requested...
Or, for that matter, any CGI or similar script can be used.

By specifying * as disallowed, you give no hint about which paths you don't want scanned.
Most scanners are reasonably simple to recognize (and blacklist if needed).
Not many systems request pages in bulk.

The "blacklisted" systems can then be selectively provided with pages & links.
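To illustrate the idea: the suggestion above mentions PHP, but here is a rough Python CGI sketch of the same approach. The rules, the hours, and the server mapping of /robots.txt to the script are all assumptions, not anything from this thread.

#!/usr/bin/env python3
# CGI sketch: generate robots.txt on the fly so the rules can vary by time of day.
# Assumes the web server routes requests for /robots.txt to this script
# (via a rewrite or alias rule); the rules and hours are placeholders.
from datetime import datetime

OPEN_RULES = """User-agent: Bingbot
Crawl-delay: 10
Disallow: /errors/
"""

CLOSED_RULES = """User-agent: Bingbot
Disallow: /
"""

def rules_for(now: datetime) -> str:
    # Let Bingbot crawl only between 22:00 and 06:00 local time.
    night = now.hour >= 22 or now.hour < 6
    return OPEN_RULES if night else CLOSED_RULES

if __name__ == "__main__":
    # Minimal CGI response: header, blank line, then the body.
    print("Content-Type: text/plain")
    print()
    print(rules_for(datetime.now()), end="")

The same script could also look at the HTTP_USER_AGENT CGI environment variable and serve stricter rules to bots you have decided to blacklist.
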
Ben Conner

ASKER

Hi David, Dr Klahn,

I actually misstated the original problem slightly.  While I created the server and host it, technically the content is managed by another company.  The problem is they implemented their solution as a client-server model, where queries coming in here reach out across the Internet to their client's LAN for things like inventory, etc.  Since it isn't my code (or area of responsibility), the most I can do is touch something like robots.txt or suggest something to them to mitigate the issue.

I'll try the crawl-delay directive and see if they honor it.  If not, I can always implement the disallow again.

--Ben
ASKER CERTIFIED SOLUTION
Anthony Garcia

This solution is only available to Experts Exchange members.
Yes, Bingbot was the main culprit in this case.
I agree that some search engines are intentionally clueless.  Those that are can't get here any longer.
Thanks to all who responded.  I can also use the scheduled task approach with this and other items in the future.

Interesting insights!

--Ben