sirbounty
need to remove html characters using regex
The text changes sometimes daily, and I don't have the guarantee that the html codes will be the same, but presumably there's a regex pattern that can eliminate the need for filtering all this data out manually?
Here's an example of what my data is producing.
I only want the text, but would probably like to replace the new paragraph with a vbnewline or something so that it 'looks' appropriate...
/FONT></H2>
<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH ARE EXPECTED AND CA
E FIRE TO SPREAD.</FONT></H4>
<H4><FONT COLOR="#FF0000">SUNDAY IS NOT A&NBSP;BURN DAY, AS&NBSP;PRESCRIBE
WINNETT COUNTY CODE OF ORDINANCES,&NBSP;CHAPTER 46,&NBSP;FIRE PREVENTION A
ETY.</FONT> &NBSP;</H4>
<H4>
<P></P>
<H3></H3>
<H5></H5>
<H5 CLASS="RIGHTALIGN"></H5>
<FONT COLOR="#000000">THE BURN NOTICE IS UPDATED DAILY AND DETERMINED BY
^C
To remove all tags, you can simply replace every tag instance with an empty string. Bear in mind that incomplete tags (like the ones at the start of your posting) will cause problems.
Regular Expression: \<[^\>]*?\>
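A minimal VBScript sketch of that suggestion, in case it helps. The variable names and the sample string are only illustrative, and the pattern is written without the backslashes, which are not needed before < and >. Note that Global has to be switched on, otherwise only the first tag is replaced.

```vbscript
Option Explicit

Dim objRegEx, strSample
Set objRegEx = New RegExp

' A "<", then any run of characters that are not ">", then the closing ">".
objRegEx.Pattern = "<[^>]*>"
' Global defaults to False, which would replace only the first tag found.
objRegEx.Global = True

' Sample adapted from the question. The leading "/FONT>" is an incomplete
' tag (its "<" was cut off), so the pattern cannot match it.
strSample = "/FONT></H2><H4><FONT COLOR=""#0000FF"">WIND GUSTS</FONT></H4>"

WScript.Echo objRegEx.Replace(strSample, "")
```

Run under cscript, this should leave "/FONT>WIND GUSTS": every complete tag is gone, but the incomplete fragment at the start survives.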
ASKER
Not sure about that...I used it in vbscript this way...
objRegex.Pattern = "'/<[^>]*>|&[^;\s]*;/i"
strBodyData = objRegEx.Replace(strBodyData, "")
and it still produced the following:
/FONT></H2>
<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH ARE EXPECTED AND CAN CAUS
E FIRE TO SPREAD.</FONT></H4>
<H4><FONT COLOR="#FF0000">SUNDAY IS NOT A&NBSP;BURN DAY, AS&NBSP;PRESCRIBED BY G
WINNETT COUNTY CODE OF ORDINANCES,&NBSP;CHAPTER 46,&NBSP;FIRE PREVENTION AND SAF
ETY.</FONT> &NBSP;</H4>
<H4>
<P><FONT COLOR="#3366FF"></FONT></P>
<H3><FONT COLOR="#FF0000"></FONT></H3>
<H5><FONT COLOR="#0000FF"></FONT></H5>
<H5 CLASS="RIGHTALIGN"><FONT COLOR="#0000FF"></FONT></H5>
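One likely reason nothing was removed: the VBScript RegExp object takes the bare expression in Pattern, so the leading '/ and the trailing /i (delimiter-and-flag syntax borrowed from languages such as PHP or JavaScript) are treated as literal characters that must appear in the text. The sketch below assumes the intent was a case-insensitive replace of every match. It is a guess at the cause and is not taken from the accepted answer.

```vbscript
Option Explicit

Dim objRegEx, strBodyData
Set objRegEx = New RegExp

' Pattern holds only the expression: no quote, no slashes, no flags.
objRegEx.Pattern = "<[^>]*>|&[^;\s]*;"
objRegEx.IgnoreCase = True   ' takes the place of the trailing /i
objRegEx.Global = True       ' replace every match, not just the first

' Illustrative sample line, shortened from the data in the question.
strBodyData = "<H4>SUNDAY IS NOT A&NBSP;BURN DAY</H4>"
strBodyData = objRegEx.Replace(strBodyData, "")

WScript.Echo strBodyData
```

This should print "SUNDAY IS NOT ABURN DAY". The words run together because the entity is replaced with nothing, so replacing entities with a space in a separate pass may read better.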
ASKER CERTIFIED SOLUTION
ASKER
Hmm - that last one works except for the leading /FONT (which I can simply replace).
Thanx!
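For the line-break wish in the original question, one possible approach is to turn closing block tags into vbNewLine before stripping everything else. This is only a sketch: which tags count as a paragraph end (here </P> and </H1> to </H6>) is an assumption, and the sample string is made up from the posted data.

```vbscript
Option Explicit

Dim objRegEx, strBodyData
Set objRegEx = New RegExp
objRegEx.Global = True
objRegEx.IgnoreCase = True

strBodyData = "<H4>WIND GUSTS UP TO&NBSP;30 MPH</H4>" & _
              "<H4>SUNDAY IS NOT A&NBSP;BURN DAY</H4>"

' 1. Closing paragraph and heading tags become line breaks.
objRegEx.Pattern = "</(P|H[1-6])>"
strBodyData = objRegEx.Replace(strBodyData, vbNewLine)

' 2. Non-breaking spaces become ordinary spaces.
objRegEx.Pattern = "&nbsp;"
strBodyData = objRegEx.Replace(strBodyData, " ")

' 3. Whatever still looks like a tag or an entity is dropped.
objRegEx.Pattern = "<[^>]*>|&[^;\s]*;"
strBodyData = objRegEx.Replace(strBodyData, "")

WScript.Echo strBodyData
```

Empty elements such as <P></P> would each leave a blank line with this ordering, so collapsing repeated line breaks afterwards may be worthwhile.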
/FONT> doesn't get removed because it is an incomplete tag.
</FONT> should get replaced though.
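If the leading fragment shows up regularly, it could be removed with a second, anchored pattern instead of a hard-coded Replace. A sketch, assuming the input never legitimately starts with plain text containing a ">" (such text would be removed too):

```vbscript
Option Explicit

Dim objRegEx, strBodyData
Set objRegEx = New RegExp

strBodyData = "/FONT></H2><H4>WIND GUSTS</H4>"

' Leading fragment: from the start of the input, characters that are
' neither "<" nor ">", up to and including the first ">".
objRegEx.Pattern = "^[^<>]*>"
objRegEx.Global = False
strBodyData = objRegEx.Replace(strBodyData, "")

' Then the usual removal of complete tags and entities.
objRegEx.Pattern = "<[^>]*>|&[^;\s]*;"
objRegEx.Global = True
strBodyData = objRegEx.Replace(strBodyData, "")

WScript.Echo strBodyData
```

With the sample above this should print just "WIND GUSTS".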
ASKER
If you have a moment - could you interpret that for me?
Trying to learn regex's but having difficulty digesting them...
Complete Expression: <[^>]*>|&[^;\s]*;
<[^>]*> - Matches all tags.
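The [^>] bit matches any character except >, so each match stops at the first closing bracket rather than running on to a later > and swallowing the text in between.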
| - Or, alternatively
&[^;\s]*; - Matches all entities such as or < etc...
Sorry, my first entity example got transformed by EE.
It should have read (hoping that EE transforms them so they look correct):
Matches all entities such as &nbsp; or &lt; etc...
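To see the two alternatives at work, the matches can be listed with Execute instead of being replaced. A small sketch with a made-up input string:

```vbscript
Option Explicit

Dim objRegEx, colMatches, objMatch
Set objRegEx = New RegExp
objRegEx.Pattern = "<[^>]*>|&[^;\s]*;"
objRegEx.Global = True   ' collect every match, not only the first

' The tail "A & B; C" holds an ampersand that is not an entity.
Set colMatches = objRegEx.Execute("<H4>CHAPTER 46,&NBSP;FIRE</H4> A & B; C")

For Each objMatch In colMatches
    WScript.Echo objMatch.Value
Next
```

This should list <H4>, &NBSP; and </H4>. The "& B;" at the end is not matched, because \s inside the character class stops the entity alternative at the first space, so a stray ampersand followed later by a semicolon is left alone.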