Googlebot requesting non-existent pages built from appended directory names

Posted on 2011-05-09
Last Modified: 2013-12-08
I was looking through my server request logs and see that Google's bots are requesting all kinds of bad URLs on my site. It appears to be appending my directory names to each other.

For example, the requested URLs look like two of my directories joined into one path (something like /dir1/dir2/).

All of the directories appended to each other are top-level directories.

Any idea why this is happening and how I can fix it?
Question by:MFredin
    LVL 12

    Expert Comment

    by:Amick
    Although it isn't strictly necessary if you're willing to put up with whatever Google does, there are ways to control how robots crawl your site.

    Google recommends using a sitemap. If you aren't using one, they're described in Google's sitemap documentation.

    A robots.txt file may be useful too; Google publishes documentation on that as well.

    There are many tools available to help you automate the creation of sitemap and robots.txt files, including Google's own links to sitemap generators and robots.txt information.
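    In case it helps to see what Google expects, a sitemap is just an XML file listing the URLs you want crawled. A minimal sketch, with hypothetical example.com URLs:

        <?xml version="1.0" encoding="UTF-8"?>
        <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
          <!-- one <url> entry per real page; the bogus appended paths simply are not listed -->
          <url>
            <loc>http://www.example.com/dir1/</loc>
          </url>
          <url>
            <loc>http://www.example.com/dir2/</loc>
          </url>
        </urlset>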
    LVL 82

    Expert Comment

    by:Dave Baldwin
    Like @Amick said, Google Sitemaps are the best way. If the links are coming from another site for some reason, you won't be able to do much about it. Do a Google search to see whether they are picking up those links somewhere else.
    LVL 23

    Accepted Solution

    by:Tiggerito
    This is often caused by badly written links.

    In this case I would suspect some of your links are of the form "directory1/" and not "/directory1/".

    The first form is a relative link: if the crawler is already inside a directory, the link is appended to that directory, producing exactly the doubled paths in your example. The second form has a leading /, which indicates the link is relative to the root of the website.
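    To illustrate (the page and directory names here are hypothetical): if the page /dir2/index.html contains both kinds of link, a crawler resolves them differently:

        <!-- on the page /dir2/index.html -->
        <a href="dir1/">bad</a>    <!-- relative: resolves to /dir2/dir1/ -->
        <a href="/dir1/">good</a>  <!-- root-relative: resolves to /dir1/ -->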

    If you can't spot the bad links, I'd suggest you use a link checker (like Xenu's Link Sleuth) to try and find them.
    LVL 29

    Assisted Solution


    1. Google Sitemaps are NOT a solution to this problem.

    Google sitemaps do not say "this is a complete list of web pages and no other pages should be indexed". What they say is: "I know your spiders will discover all the pages on my site, but to help you, here is a list of some of the pages present on the site. And although I have tried to make this list as exhaustive as possible, I would not be surprised if your spiders discovered other pages."
    As a matter of fact, this last sentence is very important, since it allows you to use several sitemaps when needed.

    2. The solution is robots.txt

    These loops are probably caused by some error in the site code (I know; I have one like it on a site that I cannot solve). If you cannot find the faulty code, then the only solution is to rely on robots.txt.
    Again, robots.txt does not say "it is forbidden to do anything other than what is written here"; it says "if you are a nice, well-educated spider/robot, you will do as written here".
    So a nice spider, i.e. the kind that counts, might still index a page on another site that links to the "wrong pages" on your site; when it follows such a link, though, it consults your robots.txt and, if robots.txt says not to index those paths, it will not.
    So you should build a list of all the wrong 2-level (or is it 3-level?) directories that should not be explored, and place it in your robots.txt, as sketched below.
    Note that this will not remove the pages already listed in the search engines...
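    As a sketch (the directory names are hypothetical placeholders; list your own bad combinations instead):

        User-agent: *
        # Block the bogus appended paths, not the real top-level directories
        Disallow: /dir1/dir2/
        Disallow: /dir2/dir1/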

    3. Consider using the rewrites 301

    If it is simple to map all of your wrong paths back to the correct ones with a regexp mod_rewrite rule in .htaccess, all of this could be solved at once: a 301 rewrite would clean things up for all spiders (even ill-behaved ones).
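    A minimal sketch of such a rule for an Apache .htaccess, assuming (hypothetically) that the real top-level directories are dir1, dir2 and dir3, and that the correct page is the one named by the inner path:

        RewriteEngine On
        # If one top-level directory name appears nested under another,
        # strip the outer one and redirect permanently (301).
        RewriteRule ^(?:dir1|dir2|dir3)/((?:dir1|dir2|dir3)/.*)$ /$1 [R=301,L]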
    LVL 29

    Expert Comment

    I think the thread proposes solutions which should be evaluated.

    Setting mine aside (I obviously have some bias!), I think the best contribution is Tiggerito's #a35735050, because it points to a probable cause of the problem, thus curing the problem rather than its symptoms.
    LVL 142

    Expert Comment

    by:Guy Hengel [angelIII / a3]
    This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.
    LVL 29

    Expert Comment

    Thx Angel
