davehamer
asked on
Regular expression to fix html errors
Hi,
I'm having to edit another person's HTML code (hundreds of files) and so many times "&nbsp;" is written as just "&nbsp" - which can cause some browsers to render it as text. Also all singular ampersands "&" should be written as "&amp;" to be HTML 4.01 compliant (and XHTML) - I think.
I'm using TextPad and I just need a simple regex that will let me replace an instance of "&nbsp" that ISN'T "&nbsp;" with "&nbsp;". And also instances of JUST "&" (possibly with letters before and after it, if written incorrectly) with "&amp;".
Obviously this will need two regexes, but it's simpler to ask in one question. I just can't seem to get my head round regexes, since most websites seem to have disgustingly complicated ways of explaining them. I've tried http://www.regular-expressions.info/ which seems useful if you have twenty hours of time to learn, which unfortunately I don't :)
So thanks in advance,
Dave
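For the bare-ampersand half of the question, here is one possible sketch. It is written in Python rather than TextPad, so the lookahead syntax and the name `escape_bare_ampersands` are assumptions for illustration only. It treats a "&" as bare unless it starts something shaped like a named, decimal, or hex character reference ending in ";". This is a heuristic: it does not check that the entity name actually exists.

```python
import re

# A "&" is left alone only when it begins something shaped like an entity:
# a name (&amp;), a decimal reference (&#160;) or a hex reference (&#xA0;).
bare_amp = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(html):
    """Replace every bare '&' with '&amp;' (hypothetical helper)."""
    return bare_amp.sub("&amp;", html)

print(escape_bare_ampersands("Tom & Jerry &amp; friends"))
```

Note that an unterminated "&nbsp" also counts as a bare ampersand here and would become "&amp;nbsp", so the missing semicolons would need to be fixed first.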
ASKER
That unfortunately doesn't work;
I already tried that one from reading the examples; however, in TextPad it selects the "&nbsp" and the following character.
For example, "&nbsp<img" with that replace would become "&nbsp;img".
Perhaps there is a command that I am missing to "save" the end character so that it can be used in the replacement?
I've upped the points to 80; I'm sure that this is a simple answer tho.
Dave.
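The effect described here can be reproduced outside TextPad. A minimal sketch using Python's `re` module (an assumption; TextPad's own syntax differs) shows why the character goes missing and how a capture group "saves" it for the replacement:

```python
import re

sample = "&nbsp<img"

# [^;] is part of the match, so replacing the whole match drops the "<".
lossy = re.sub(r"&nbsp[^;]", "&nbsp;", sample)

# Capturing the following character lets the replacement write it back.
kept = re.sub(r"&nbsp([^;])", r"&nbsp;\1", sample)

print(lossy)  # &nbsp;img
print(kept)   # &nbsp;<img
```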
Try this:
&nbsp(?![;])
Bob
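For reference, this is how a negative lookahead behaves in an engine that supports it. The sketch assumes Python's `re` module and a hypothetical helper name `fix_nbsp`; it says nothing about whether TextPad accepts the same syntax.

```python
import re

# (?!;) is a zero-width negative lookahead: it inspects the next character
# without consuming it, so nothing after "&nbsp" is lost in the replacement.
missing_semicolon = re.compile(r"&nbsp(?!;)")

def fix_nbsp(text):
    """Append ';' to every '&nbsp' that lacks one (hypothetical helper)."""
    return missing_semicolon.sub("&nbsp;", text)

print(fix_nbsp("&nbsp<img"))
```

Because the lookahead consumes nothing, the same pattern also handles "&nbsp" at the end of a line or of the file.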
ASKER
Doesn't match anything this time :(
Where are you running this from? I tested it in VB.NET, but it should still be a valid expression.
I tested the expression with the specific case of &nbsp<img and got <img.
Bob
ASKER
Hi Bob,
Thanks for your replies; I'm using TextPad 4.7.3 as specified in the original question. The regex engine in this piece of software should be the same as any other, AFAIK.
It is available as a trial download from:
http://www.textpad.com/download/index.html#downloads
Thanks;
Dave
Right, I have a few Regular Expression questions in play, and I just got a little confused.
Bob
BTW, not all Regular Expression engines are the same.
Bob
This is confusing, because I checked the Posix option for Regular Expressions in preferences, and I looked up that the '?!' is a negative lookahead expression character, and still it doesn't find anything.
(Scratching head)
Bob
ASKER
Thanks for your further help;
I've improved the expression further:
By using :
\(&nbsp\)\([^;]\)
This still selects the next char as well, but I can then use the replacement syntax of:
\1;\2
The only problem is that the expression will ONLY match "&nbsp" that is followed by ANOTHER character. Unfortunately it won't match line breaks, so if a line just contains "&nbsp" (which it does cos this guy is a muppet) it won't be matched. I suppose I could use a second regex to match those ones (the simplest being "&nbsp\n") - maybe you can build this into a single regex? I don't know because I'm still new to this.
Thanks for your help so far Bob, hopefully we can get this one kicked in the head; putting points up to 100 for when we get a completed answer.
Dave
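One way the two cases might be folded into a single expression is to let the second group match either a non-";" character or the end of the line. The sketch below assumes Python's `re` module, where `$` with `re.MULTILINE` matches before each newline. Newline is excluded from the class on purpose, to mimic a class that does not match line breaks. The helper name `fix_nbsp` is hypothetical, and TextPad's syntax would differ.

```python
import re

# Group 1 is the unterminated entity. Group 2 is either one character that is
# neither ";" nor a newline, or the empty string at the end of a line.
pattern = re.compile(r"(&nbsp)([^;\n]|$)", re.MULTILINE)

def fix_nbsp(text):
    """Insert the missing ';' and write back whatever group 2 consumed."""
    return pattern.sub(r"\1;\2", text)

print(fix_nbsp("&nbsp<img\n&nbsp\n&nbsp;ok"))
```

Because group 2 consumes a character, two adjacent entities such as "&nbsp&nbsp" are only half fixed in one pass, so running the replacement a second time would be needed for those.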
I am fresh out of ideas, sorry :(
Bob
ASKER
Since no-one else has contributed a correct answer, I will submit the points to Bob but with a lower grade due to a slightly incomplete answer.
Ty;
Dave
Replace &nbsp[^;] with &nbsp;
Bob