davehamer

Regular expression to fix html errors

Hi,

I'm having to edit another person's HTML code (hundreds of files), and in many places &nbsp; is written as just &nbsp, which can cause some browsers to render it as text. Also, every bare ampersand "&" should be written as &amp; to be HTML 4.01 (and XHTML) compliant, I think.

I'm using TextPad and I just need a simple regex that will let me replace an instance of "&nbsp" that ISN'T "&nbsp;" with "&nbsp;". And also instances of JUST "&" (possibly with letters before and after it, if written incorrectly) with &amp;

Obviously this will need two regexes, but it's simpler to ask in one question. I just can't seem to get my head round regexes, since most websites seem to have disgustingly complicated ways of explaining them. I've tried http://www.regular-expressions.info/ which seems useful if you have twenty hours of time to learn, which unfortunately I don't :)

So thanks in advance,

Dave
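
For the second half of the question (bare ampersands), here is a minimal sketch in Python rather than TextPad syntax. It assumes an engine with negative lookahead, which Python's standard `re` module provides; the function name `escape_bare_ampersands` is made up for illustration. It treats `&` as "bare" unless it starts something shaped like a complete entity (`&name;`, `&#160;` or `&#xA0;`).

```python
import re

# "&" that is NOT followed by a complete entity body ending in ";".
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(html: str) -> str:
    """Replace every bare '&' with '&amp;', leaving complete entities alone."""
    return BARE_AMP.sub("&amp;", html)

if __name__ == "__main__":
    print(escape_bare_ampersands("Tom & Jerry, R&D, 5&nbsp;kg, &amp; done"))
```

Order matters if both fixes are applied: add the missing semicolons to "&nbsp" first, otherwise this pattern would turn "&nbsp" into "&amp;nbsp". It also cannot tell a real entity from text that merely looks like one (for example "AT&T;"), so the results are worth spot-checking.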
Bob Learned

Start with:

Replace &nbsp[^;] with &nbsp;

Bob
davehamer

ASKER

That unfortunately doesn't work;

I already tried that one after reading the examples; however, in TextPad it selects the &nbsp and the following character.

For example, "&nbsp<img" with that replace would become "&nbsp;img".

Perhaps there is a command that I am missing to "save" the end character so that it can be used in the replacement?

I've upped the points to 80; I'm sure that this is a simple answer tho.

Dave.
Try this:

&nbsp(?![;])

Bob
Doesn't match anything this time :(
Where are you running this from?  I tested it in VB.NET, but it should still be a valid expression.

I tested the expression with the specific case of &nbsp<img and got &nbsp;<img.

Bob
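
The lookahead version can be checked outside TextPad. A minimal sketch in Python (an assumption; the thread only mentions VB.NET and TextPad), using the standard `re` module, which supports `(?!...)`. The function name is illustrative only. Since the pattern found nothing in TextPad 4.7.3, it seems likely that its engine does not support lookahead.

```python
import re

def add_missing_semicolon(html: str) -> str:
    # (?!;) is a negative lookahead: it asserts that the next character
    # is not ";" without consuming it, so the "<" in "&nbsp<img" is kept.
    return re.sub(r"&nbsp(?!;)", "&nbsp;", html)

if __name__ == "__main__":
    print(add_missing_semicolon("&nbsp<img"))   # &nbsp;<img
    print(add_missing_semicolon("&nbsp;<img"))  # unchanged: &nbsp;<img
```

Because the lookahead consumes nothing, this also matches "&nbsp" at the very end of a line or of the file.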
Hi Bob,

Thanks for your replies; I'm using TextPad 4.7.3 as specified in the original question. The regex engine in this piece of software should be the same as any other, AFAIK.

It is available as a trial download from:

http://www.textpad.com/download/index.html#downloads

Thanks;

Dave
Right, I have a few Regular Expression questions in play, and I just got a little confused.

Bob
BTW, not all Regular Expression engines are the same.

Bob
This is confusing, because I checked the Posix option for Regular Expressions in preferences, and I looked up that the '?!' is a negative lookahead expression character, and still it doesn't find anything.

(Scratching head)

Bob
ASKER CERTIFIED SOLUTION
Bob Learned
This solution is only available to members.
Thanks for your further help;

I've improved the expression further:

By using:

\(&nbsp\)\([^;]\)

This still selects the next char as well, but I can then use the replacement syntax of:

\1;\2
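
The same capture-and-reinsert idea, sketched in Python for comparison. This is a translation, not TextPad syntax: Python writes groups as `( )` where the expression above uses `\( \)`, and the function name is made up for illustration.

```python
import re

def fix_with_groups(html: str) -> str:
    # Group 1 captures "&nbsp"; group 2 captures the single character
    # after it (anything except ";"). The replacement puts both back
    # with a ";" in between.
    return re.sub(r"(&nbsp)([^;])", r"\1;\2", html)

if __name__ == "__main__":
    print(fix_with_groups("&nbsp<img"))  # &nbsp;<img
```

Because group 2 consumes a character, two broken entities in a row ("&nbsp&nbsp<") are only half fixed in a single pass: the second "&" is swallowed as the "following character" of the first match.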

The only problem is that the expression will ONLY match &nbsp that is followed by ANOTHER character. Unfortunately it won't match line breaks, so if a line just contains &nbsp (which it does, cos this guy is a muppet) it won't be matched. I suppose I could use a second regex to match those ones (the simplest being "&nbsp\n"). Maybe you can build this into a single regex? I don't know, because I'm still new to this.

Thanks for your help so far Bob, hopefully we can get this one kicked in the head; putting points up to 100 for when we get a completed answer.

Dave
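
One way the two cases might be folded into a single expression is to let the second group match either a non-";" character or the end of a line. A minimal sketch in Python, offered as an assumption: whether TextPad 4.7.3 accepts an equivalent alternation is untested here, and the function name is illustrative only.

```python
import re

def fix_nbsp_including_eol(html: str) -> str:
    # ([^;]|$) matches either one character that is not ";" or,
    # failing that, the end of a line (re.MULTILINE makes "$" match
    # before each newline as well as at the end of the text).
    return re.sub(r"&nbsp([^;]|$)", r"&nbsp;\1", html, flags=re.MULTILINE)

if __name__ == "__main__":
    sample = "first&nbsp\n&nbsp;ok\nlast&nbsp"
    print(fix_nbsp_including_eol(sample))
```

When the end-of-line branch matches, the group is empty, so the replacement simply appends the ";".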
I am fresh out of ideas, sorry :(

Bob
Since no-one else has contributed a correct answer, I will award the points to Bob, but with a lower grade due to a slightly incomplete answer.

Ty;
Dave