textex

asked on

Need .htaccess help!

I am having canonical/duplicate-content issues with BING for a few of my sites. I need help with my .htaccess file! I am using the two examples below. For whatever reason, neither is working properly, since BING is not listening when I tell it that I don't want the https or non-www versions of my site recognized.

Options +FollowSymLinks
RewriteEngine on
RewriteCond %{HTTP_HOST} ^example.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]

RewriteEngine On
RewriteCond %{HTTP_HOST} ^example.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
Tony McCreath

They look fine to me.

Do any commands in your .htaccess file work? Maybe it is not enabled.
Does your client have proper DNS for example.com when you access it in its browser with http://example.com/ ?
David Johnson, CD
do you have a robots.txt?
ASKER CERTIFIED SOLUTION
gr8gonzo

1 - Your .htaccess seems to properly redirect http://mydomain to http://www.mydomain.
2 - It does not address the https side, however (not sure it really matters), for which you might find some clues at http://www.askapache.com/htaccess/http-https-rewriterule-redirect.html and in the sketch after this list.
3 - There are several possible causes of non-www pages being indexed:
-- maybe they were indexed before you put the .htaccess redirect in place --> not sure about Bing Webmaster Tools, but with Google's you can ask Google to remove them
-- (most probable) some pages on your site still link to non-www addresses --> use the program Xenu to check that (http://home.snafu.de/tilman/xenulink.html)
-- (should have no effect) some incoming links still point to non-www addresses
-- do you have a sitemap (or urllist) in xml or txt format? is it linked to from within your robots.txt? --> edit the sitemap, or generate a new one, either with Xenu or some other tool (my preferred being GSiteCrawler, http://gsitecrawler.com/), and check that no non-www link is part of the map; then update Google Webmaster Tools and Bing Webmaster Tools with the updated maps
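As a rough sketch of how the https side could be folded into the same .htaccess (this assumes mod_rewrite is enabled, that your server populates the standard %{HTTPS} variable, and that the https site is served from the same document root; example.com is a placeholder for your own domain):

Options +FollowSymLinks
RewriteEngine On

# If the request came in over https, send it to the plain http www version
RewriteCond %{HTTPS} =on
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]

# If the request came in on the bare (non-www) host, send it to the www version
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]

On setups where %{HTTPS} is not available you would test %{SERVER_PORT} ^443$ instead, as in the snippet further down this thread.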
textex

ASKER

I am still confused.

Where does this code go?
RewriteEngine on
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

Also, what would the code be to prevent non-www versions as well as pages with added parameters? (We don't use any, but I do see a few pages indexed with some funky stuff added.)
That code would go in your main .htaccess file (which is usually in your site's document root). To add support for preventing non-www versions, just add another condition and rule below it:

RewriteEngine on

# Serve robots_ssl.txt in place of robots.txt for https (port 443) requests
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

# Do the same for requests arriving on the non-www hostname
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule ^robots\.txt$ robots_ssl.txt [L]
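For completeness, robots_ssl.txt is just a second robots file sitting next to your normal robots.txt in the document root; a minimal sketch of its contents (assuming the goal is to block crawling entirely for the https and non-www hostnames) would be:

# robots_ssl.txt - served in place of robots.txt for https and non-www requests
User-agent: *
Disallow: /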
textex

ASKER

I am specifically having these duplicate content issues with BING. I got an email from BING stating 'Just a quick check here, but the crawlers don't follow htaccess files. They reference robot.txt files. The htaccess file is a server-level file to tell the server how to handle certain requests made of it - like 301 redirects, custom 404 pages, etc.'.

Seems to me like he is telling me I need to edit the robots.txt file to eliminate the issue of the https and non-www versions of my sites being indexed. I already have .htaccess set up to handle all this but apparently, that's not the answer.
It never was the answer; use robots.txt, sitemap.xml, or individual meta tags on the pages, i.e.
<meta name="robots" content="noindex, follow"/>
The spider will not index that page but will still follow its links and crawl the rest of the pages on your website.

A good tutorial is at: http://www.metatags.org/meta_name_robots
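If you go the robots.txt route, a minimal example would look like the following (the disallowed path and sitemap URL are placeholders, not your real values):

User-agent: *
Disallow: /private/
Sitemap: http://www.mysite.com/sitemap.xml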
textex

ASKER

Right, but I don't have an https or non-www version. So how can I add that code to the pages?
> I don't have an https or non-www version

I have no idea what you mean by a non-www version or an https version. You have public data and private data; only include the public html pages in the robots.txt, and you can specifically state what pages NOT to index.

There is only one robots.txt file (or sitemap.xml); these are only used by site crawlers that follow the robots.txt specification.

Check your robots.txt @ https://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449&from=35237&rd=1
textex

ASKER

Let me be more specific. In BING webmaster tools, it is showing the following homepages as being indexed: http://www.mysite.com (this is what I want), https://www.mysite.com and http://mysite.com. All of our internal links point to http://www.mysite.com. I am really not sure how BING picked up the other versions but I need to make them disappear. I signed up for experts-exchange to get answers from experts. I don't know much about programming and need a specific example that I can use for my fix. Or, send me your email and we can contract (hope I am not violating TOS).
perhaps on the bing webmaster tools page you can exclude https://yoursitename.com
textex

ASKER

No...can't find any way to do that.
301s via .htaccess can be a solution, as they stop the robots getting to the pages in the first place. Note you will have to wait some time after implementing a 301 before the robots will crawl and then later update their index. This would be my preferred method as it enforces the use of the URL you want.

robots.txt may work, but it's not great at getting already-indexed stuff removed. Again, you will have to wait to see if it works. It also does not help search engines realise the different URLs actually represent the same information. The pages you block will give you no credit if linked to.

The canonical tag is a good solution if you can't do 301s. Have every page use the canonical tag to indicate that the non-secure www URL is the official version.

Bing probably picked up those other versions because someone else linked to them or you linked to them at some point in the past.
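As an illustration, the canonical tag goes in the <head> of every page and points at the preferred address (the URL below is just a placeholder for each page's www version):

<link rel="canonical" href="http://www.mysite.com/some-page.html"/>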
> indexed: http://www.mysite.com (this is what I want), https://www.mysite.com and http://mysite.com. All of our internal links point to http://www.mysite.com. I am really not sure how BING picked up the other versions but I need to make them disappear.

You can't do anything to prevent this except telling them (Bing's admins) to remove these links.
mysite.com versus www.mysite.com is a DNS issue, not a webserver issue; it means that if you have listed mysite.com in DNS, the site can be reached that way.
Why do you care about mysite.com? It's just a shorter way to write the hostname, usually.

Regarding https vs. http: this is not a content issue but just a protocol issue; if you disable https on your webserver, it will never respond to https:// requests.

So, all in all, I don't see anything to do at your site except providing a proper robots.txt in your webserver's DocumentRoot directory (the one accessed via http://www.mysite.com/robots.txt).