Need to remove HTML characters using regex

The text changes, sometimes daily, and I have no guarantee that the HTML codes will stay the same, but presumably there's a regex pattern that can eliminate the need to filter all this data out manually?
Here's an example of what my data is producing.
I only want the text, but would probably like to replace each new paragraph with a vbNewLine or something so that it 'looks' appropriate...
/FONT></H2>
<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH ARE EXPECTED AND CA
E FIRE TO SPREAD.</FONT></H4>
<H4><FONT COLOR="#FF0000">SUNDAY IS NOT A&NBSP;BURN DAY, AS&NBSP;PRESCRIBE
WINNETT COUNTY CODE OF ORDINANCES,&NBSP;CHAPTER 46,&NBSP;FIRE PREVENTION A
ETY.</FONT> &NBSP;</H4>
<H4>
<P></P>
<H3></H3>
<H5></H5>
<H5 CLASS="RIGHTALIGN"></H5>
<FONT COLOR="#000000">THE BURN NOTICE IS UPDATED DAILY AND DETERMINED BY
^C

sirbounty
Asked:
phpmonkeyCommented:
This should do the trick
preg_replace('/<[^>]*>|&[^;\s]*;/i', '', $source);

 
numberkruncherCommented:
To remove all tags you can simply replace every tag match with an empty string. Bear in mind that incomplete tags (like the ones at the start of your posting) will cause problems.

Regular Expression:  \<[^\>]*?\>

 
sirbountyAuthor Commented:
Not sure about that...I used it in vbscript this way...

objRegex.Pattern = "'/<[^>]*>|&[^;\s]*;/i"
strBodyData = objRegEx.Replace(strBodyData, "")

and it still produced the following:
/FONT></H2>
<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH ARE EXPECTED AND CAN CAUS
E FIRE TO SPREAD.</FONT></H4>
<H4><FONT COLOR="#FF0000">SUNDAY IS NOT A&NBSP;BURN DAY, AS&NBSP;PRESCRIBED BY G
WINNETT COUNTY CODE OF ORDINANCES,&NBSP;CHAPTER 46,&NBSP;FIRE PREVENTION AND SAF
ETY.</FONT> &NBSP;</H4>
<H4>
<P><FONT COLOR="#3366FF"></FONT></P>
<H3><FONT COLOR="#FF0000"></FONT></H3>
<H5><FONT COLOR="#0000FF"></FONT></H5>
<H5 CLASS="RIGHTALIGN"><FONT COLOR="#0000FF"></FONT></H5>

 
numberkruncherCommented:
Try:
objRegex.Pattern = "<[^>]*>|&[^;\s]*;"
strBodyData = objRegEx.Replace(strBodyData, "")
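
To illustrate why dropping the PHP-style wrapper matters, here is a minimal Python sketch (Python rather than the thread's VBScript, purely so the behaviour can be run and checked). It assumes that an engine with no notion of pattern delimiters treats the quote, the slashes and the trailing "i" as literal characters, which is how Python's re module behaves. The sample string is a shortened line from the question.

```python
import re

sample = '<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH</FONT></H4>'

# PHP-style pattern pasted verbatim: the leading '/ and trailing /i are
# literal text here, so neither alternative can match the sample.
php_style = re.compile(r"'/<[^>]*>|&[^;\s]*;/i")

# Bare pattern, as suggested in the thread.
bare = re.compile(r"<[^>]*>|&[^;\s]*;")

unchanged = php_style.sub("", sample)
stripped = bare.sub("", sample)

print(unchanged == sample)  # True: nothing was removed
print(stripped)             # WIND GUSTS UP TO30 MPH
```

Note that deleting the entity outright joins "TO" and "30"; replacing non-breaking spaces with a plain space avoids that. In VBScript the RegExp object's Global property also has to be True for Replace to act on every match rather than only the first; the thread does not show that setting.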

 
sirbountyAuthor Commented:
Hmm - that last one works except for the leading /FONT (which I can simply replace).
Thanx!
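
The original question also asked for paragraph breaks to become line breaks. One possible approach, sketched in Python under the assumption that closing paragraph and heading tags (and br tags) mark the places where a newline is wanted, is to convert those first and strip everything else afterwards. The function name html_to_text and the whitespace clean-up steps are illustrative additions, not something proposed in the thread.

```python
import re

def html_to_text(html: str) -> str:
    # Turn closing paragraph/heading tags and <br> into line breaks first.
    text = re.sub(r"</(?:p|h[1-6])\s*>|<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    # Replace non-breaking-space entities with a plain space so words stay apart.
    text = re.sub(r"&nbsp;", " ", text, flags=re.IGNORECASE)
    # Remove the remaining tags and entities with the pattern from the thread.
    text = re.sub(r"<[^>]*>|&[^;\s]*;", "", text)
    # Collapse runs of spaces/tabs, then collapse blank lines into one newline.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()

sample = ('<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH</FONT></H4>\n'
          '<H4><FONT COLOR="#FF0000">SUNDAY IS NOT A&NBSP;BURN DAY</FONT> &NBSP;</H4>\n'
          '<P></P>')

result = html_to_text(sample)
print(result)
```

In VBScript the same idea would mean running a first Replace with vbNewLine as the replacement string before the tag-stripping Replace.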
 
numberkruncherCommented:

/FONT> doesn't get removed because it is an incomplete tag: it has no opening <, so the <[^>]*> part of the pattern never starts a match there.
 
</FONT> should get replaced though.
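
If the input can start in the middle of a tag, as in the sample, one hedged option is to strip a leading fragment before applying the main pattern. The Python sketch below assumes that any text before the first "<" that ends in ">" is the tail of a cut-off tag; LEADING_FRAGMENT is a hypothetical helper pattern, not part of the thread's solution, and it would also eat legitimate text that contains a literal ">" before the first tag.

```python
import re

TAG_OR_ENTITY = re.compile(r"<[^>]*>|&[^;\s]*;")

# Hypothetical extra step: drop a leading run that ends in ">" but never
# opened with "<", i.e. the tail of a tag cut off at the start of the input.
LEADING_FRAGMENT = re.compile(r"^[^<>]*>")

sample = '/FONT></H2>\n<H4>WIND GUSTS</H4>'

without_fix = TAG_OR_ENTITY.sub("", sample)
with_fix = TAG_OR_ENTITY.sub("", LEADING_FRAGMENT.sub("", sample))

print(repr(without_fix))  # '/FONT>\nWIND GUSTS'
print(repr(with_fix))     # '\nWIND GUSTS'
```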

 
sirbountyAuthor Commented:
If you have a moment - could you interpret that for me?
Trying to learn regex's but having difficulty digesting them...
 
numberkruncherCommented:

Complete Expression:  <[^>]*>|&[^;\s]*;
 
 
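<[^>]*>     - Matches all complete tags.
              [^>]* matches any run of characters other than >,
              so each match stops at the first closing bracket
              instead of running on to the last > in the input.
 
|           - Or, alternatively
 
&[^;\s]*;   - Matches all entities such as &nbsp;  or &lt;   etc...
              [^;\s]* excludes whitespace, so a bare & followed by
              a space is left alone.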
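
The breakdown above can be checked with a small Python sketch (Python is used only for illustration; the sample sentence is invented). It lists what the combined pattern matches and, for contrast, what a greedy "<.*>" would match if the [^>] class were replaced by a dot.

```python
import re

pattern = re.compile(r"<[^>]*>|&[^;\s]*;")

sample = "Fish & chips <B>cost</B> &lt;&nbsp;5"

# The bare "&" is skipped because the character after it is whitespace.
matches = pattern.findall(sample)
print(matches)  # ['<B>', '</B>', '&lt;', '&nbsp;']

# With a greedy dot instead of [^>], one match swallows both tags
# and the text between them.
greedy = re.findall(r"<.*>", "<B>cost</B>")
print(greedy)   # ['<B>cost</B>']
```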

 
numberkruncherCommented:
Sorry, my first entity example got transformed by EE.

It should have read (hoping that EE transforms them so they look correct):

Matches all entities such as  &nbsp; or &lt;   etc...
