Removing bad chars from URL with .htaccess

Posted on 2011-05-06
Last Modified: 2012-05-11
Hi Experts,

I am having a problem on one of my websites. Googlebot keeps picking up URLs like this: /life/mortgage-life-%E2%80%8Binsurance

When the actual URL is this: life/mortgage-life-insurance

Those characters decode to something that definitely isn't in any of my source files, so I am assuming the link comes from an external site. Where it comes from doesn't really bother me; what I need is an .htaccess mod_rewrite rule that removes those bad characters from the URL.

What I have come up with so far (by googling) is attached. It removes the characters from the URL, but it doesn't put the rest of the URL back in, so the redirect goes to life/mortgage-life-

I need it to keep the "insurance" on the end and remove only %E2%80%8B.

How can I do this? I have tried a few regex creators, but none seem to be able to do it.

Many thanks!

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)\%e2%80%8b(.*)\ HTTP/ [NC]
RewriteRule ^.*$ /%1 [R=301,L]


Question by:temmygray

    Expert Comment


    I have no clue how Apache works, BUT I see your line #1 is not exactly what you are looking for: in *nix, %e2%80%8b is not equal to %E2%80%8B

    Have you tried it with the right case?


    Expert Comment

    by:käµfm³d 👽

    "BUT I see your line #1 is not exactly what you are looking for : in *nix, %e2%80%8b is not equal to %E2%80%8B"
    The "NC" flag stands for "no case", so the match ignores case   ; )

    I might be missing something, but can you try adding the "no escape" (NE) flag?
    RewriteRule ^.*$ /%1 [R=301,NE,L]



    Accepted Solution

    Ooops! I see now that I am indeed missing something  = )

    Ignore the previous comment. You have two capture groups, but you are only referring to one of them in your replacement, namely "%1". Try adding the second group, "%2", to the replacement (below).

    Each set of parentheses acts as its own capture group. Groups are numbered sequentially, starting from 1, going from left to right. In your condition above, the group to the left of "\%e2%80%8b" is group 1 (%1); the group to the right is group 2 (%2).
    RewriteRule ^.*$ /%1%2 [R=301,L]
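
For reference, the condition and the corrected rule together might look like the sketch below. It assumes mod_rewrite is enabled and that the .htaccess file sits in the site root. The root-relative target /%1%2 is an assumption, since the redirect target used in the original snippet is not shown in this thread. %E2%80%8B is the percent-encoded UTF-8 form of U+200B, a zero-width space, which is why it is invisible in the linking page.

```apache
# Minimal sketch, not a drop-in config: assumes mod_rewrite is available
# and that a root-relative redirect target is acceptable for this site.
RewriteEngine On

# THE_REQUEST holds the raw request line, for example:
#   GET /life/mortgage-life-%E2%80%8Binsurance HTTP/1.1
# %1 = the path before the bad sequence, %2 = the path after it.
# [NC] makes the match case-insensitive, so %e2%80%8b also matches %E2%80%8B.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)\%e2%80%8b(.*)\ HTTP/ [NC]

# Rejoin both halves and send a permanent redirect to the cleaned URL.
RewriteRule ^.*$ /%1%2 [R=301,L]
```

With the example request above, %1 is "life/mortgage-life-" and %2 is "insurance", so the redirect goes to /life/mortgage-life-insurance. Because the first (.*) is greedy, a URL containing the sequence more than once should lose one occurrence per redirect until it is clean. If the path can contain other percent-encoded characters, adding the NE flag should keep mod_rewrite from escaping them a second time in the redirect.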



    Author Closing Comment

    Perfect. Thanks!

    Expert Comment

    by:käµfm³d 👽
    NP. Glad to help  = )
