How to disallow URLs with a specific query string in robots.txt

I would like to add an entry in my robots.txt file to disallow URLs with the value "nocach" in the query string.

For example:
http://www.mywebsite.com/products/cages,7246035?animalidx=8&nocache=174259&ipp=16
mike99c Asked:

Dave Baldwin (Fixer of Problems) Commented:
robots.txt is not the place to do that. robots.txt only tells 'good bots' what they may crawl on your web site. It does not actually control access; it is merely a suggestion that the 'bad bots' totally ignore.

What is the problem that you are having?
mike99c (Author) Commented:
Our server CPU has hit the roof and I traced it to the Bing search engine robot. It appears to be crawling pages on a particular e-commerce website by following the product listing filter links, which generates a large number of page permutations.

I have since made sure that pages generated as a result of clicking a filter have the following meta tag:

<meta name="robots" content="noindex,nofollow,noarchive" />

I did this about 2 days ago but it has had no effect, so as a last resort I want to disallow these types of pages within the robots.txt file. I assume that Bing is one of the "good" bots so hopefully this directive will be respected.
Dave Baldwin (Fixer of Problems) Commented:
Here's the descriptive page for robots.txt: http://www.robotstxt.org/robotstxt.html  I think you may have to 'disallow' the entire page.  I had to 'disallow' my calendar and a database directory for those same reasons, too many permutations.
mike99c (Author) Commented:
Thanks but I have already seen this page. But going back to my specific question, would the following directive disallow any URL which contains "nocach" in the query string?

User-agent: *
Disallow: /*nocach=
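
For crawlers that document the `*` and `$` pattern extensions (Google's help page describes them), the matching can be approximated as below. This is only a rough sketch, not any crawler's actual implementation; `rule_to_regex` and `is_blocked` are hypothetical helper names. One thing it makes visible: a rule ending in `nocach=` would not match the example URL, because the parameter there is spelled `nocache=`.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule with '*' and '$' extensions into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with "any run of characters".
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(rule: str, path_and_query: str) -> bool:
    # Rules are matched from the start of the path, hence match() and not search().
    return rule_to_regex(rule).match(path_and_query) is not None

url = "/products/cages,7246035?animalidx=8&nocache=174259&ipp=16"
print(is_blocked("/*nocach=", url))   # False: the URL contains "nocache=", not "nocach="
print(is_blocked("/*nocache=", url))  # True
print(is_blocked("/*nocach", url))    # True: no "=" in the rule, so the prefix matches
```

Under this interpretation, `Disallow: /*nocache=` (or `Disallow: /*nocach` without the equals sign) would be the form that covers the example URL.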
Dave Baldwin (Fixer of Problems) Commented:
No, I don't think so.  From that page...
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
mike99c (Author) Commented:
https://support.google.com/webmasters/answer/6062596?hl=en&ref_topic=6061961

If you expand the section "Pattern matching rules...", wildcard matching does appear to be accepted by Google at least, but I accept this may not be conventional.
Dave Baldwin (Fixer of Problems) Commented:
This page http://blogs.bing.com/webmaster/2012/05/03/to-crawl-or-not-to-crawl-that-is-bingbots-question/ says that Bing will honor a 'Crawl-delay' directive to slow down its crawl rate. Maybe that will help. It also mentions that Bing Webmaster Tools has 'Crawl' settings.
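
To get a feel for how a Crawl-delay entry reads to a parser, here is a small sketch using Python's standard `urllib.robotparser`. The delay of 10 and the `/calendar/` path are made-up illustrative values, and how any real crawler interprets the number is up to that crawler. As far as I know this standard-library parser does plain prefix matching and does not implement the `*` wildcard extension in paths.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; the values are assumptions, not recommendations.
robots_txt = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow: /calendar/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The delay declared for the bingbot group.
bing_delay = parser.crawl_delay("bingbot")
# Agents that fall through to the '*' group have no delay declared here.
other_delay = parser.crawl_delay("SomeOtherBot")
# Prefix match against the '*' group's Disallow rule.
calendar_allowed = parser.can_fetch("SomeOtherBot", "/calendar/2015-01")

print(bing_delay, other_delay, calendar_allowed)
```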
tigermatt Commented:
[Disclaimer: I typed this out earlier, and neglected to "submit" it]

I have since made sure that pages generated as a result of clicking a filter have the following meta tag:

<meta name="robots" content="noindex,nofollow,noarchive" />
It is worth stopping to recall the semantics of the "nofollow" directive in the robots meta tag. It is intended to tell the search engine's reputation algorithm (e.g. PageRank in the case of Google) that the reputation of a page should not be passed to the URLs embedded in that page. This is useful, for example, when a page carries user-supplied content, where an adversary could otherwise gain reputation by planting links on a high-reputation site.

While a crawler should honour this request, it does NOT imply that the search engine cannot continue to explore the entire web graph and simply figure out later where reputation should flow and where it should not. The option is not really named too well. It is unclear how crawlers interpret the directive, with Wikipedia claiming that Bing follows "nofollow" links but simply does not assign rep to them (but this has no independent source I could find).

While you can probably disallow these URLs through robots.txt, a better way would be to hint to the crawler that the pages reachable at all variations of a given URL are members of the same equivalence class, so it needs to crawl the page only once rather than hit your server repeatedly for every accessible permutation. For this purpose, check out the canonical URL tag (rel="canonical", Google's term), and also look at the webmaster tools provided by Google, Bing, et al. (e.g. Bing provides a "URL normalization" feature).
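
A minimal sketch of the canonical URL idea, assuming that `nocache` and `ipp` are the parameters that do not change which product listing is shown (that split is a guess based on the example URL, and `animalidx` is assumed to be meaningful). The helper names `canonical_url` and `canonical_link_tag` are hypothetical.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these parameters only bust caches or paginate, so every
# variation of them points at the same underlying page.
NON_CANONICAL_PARAMS = {"nocache", "ipp"}

def canonical_url(url: str) -> str:
    """Drop the non-canonical query parameters and any fragment from a URL."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in NON_CANONICAL_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def canonical_link_tag(url: str) -> str:
    """Build the tag to place in the <head> of every variation of the page."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}" />'

example = "http://www.mywebsite.com/products/cages,7246035?animalidx=8&nocache=174259&ipp=16"
print(canonical_link_tag(example))
```

Every filtered variation of the listing would then emit the same canonical address, which is the "same equivalence class" hint described above.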

As always with crawlers, their underlying algorithms are in a constant state of flux, so your mileage may vary...
Lucas Bishop (Click Tracker) Commented:
In Bing Webmaster Tools (BWT) you can specify URL parameters to be ignored. You can also specify times when the Bingbot crawl rate should be reduced.

Based on the issues you outlined above, I'd expect that configuring both of these features (and/or adding a crawl-delay to your robots.txt) would help reduce your server load.
mike99c (Author) Commented:
OK, it was good to get all the feedback, which I gratefully applied. However, it did not seem to make much difference overall. In the end I had to adjust the scripts so that if the URL had "nocach" in the query string, I returned a 403 Forbidden page. This helped reduce the CPU load.
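
The thread does not say what language the site's scripts are written in, so the following is only a sketch of the same idea as a minimal Python WSGI application. It assumes the parameter is named `nocache`, as in the example URL; `has_blocked_param` and `app` are hypothetical names.

```python
from urllib.parse import parse_qs

# Assumption: the parameter name as it appears in the example URL.
BLOCKED_PARAM = "nocache"

def has_blocked_param(query_string: str) -> bool:
    """True if the query string carries the blocked parameter, even with an empty value."""
    return BLOCKED_PARAM in parse_qs(query_string, keep_blank_values=True)

def app(environ, start_response):
    # Reject the request before any expensive page generation happens.
    if has_blocked_param(environ.get("QUERY_STRING", "")):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"product listing would be rendered here"]
```

The check runs before the listing is built, so a blocked request costs very little CPU.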
Search Engine Optimization (SEO)


Question has a verified solution.
