• Status: Solved

Need to remove HTML tags and entities using regex

The text changes, sometimes daily, and I have no guarantee that the HTML codes will stay the same, but presumably there's a regex pattern that can eliminate the need to filter all this out manually?
Here's an example of what my data is producing.
I only want the text, but would probably like to replace each new paragraph with a vbNewLine or something so that it 'looks' appropriate...
/FONT></H2>
<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH ARE EXPECTED AND CA
E FIRE TO SPREAD.</FONT></H4>
<H4><FONT COLOR="#FF0000">SUNDAY IS NOT A&NBSP;BURN DAY, AS&NBSP;PRESCRIBE
WINNETT COUNTY CODE OF ORDINANCES,&NBSP;CHAPTER 46,&NBSP;FIRE PREVENTION A
ETY.</FONT> &NBSP;</H4>
<H4>
<P></P>
<H3></H3>
<H5></H5>
<H5 CLASS="RIGHTALIGN"></H5>
<FONT COLOR="#000000">THE BURN NOTICE IS UPDATED DAILY AND DETERMINED BY

Asked by: sirbounty
 
phpmonkey Commented:
This should do the trick (PHP):
$source = preg_replace('/<[^>]*>|&[^;\s]*;/i', '', $source);

 
numberkruncher Commented:
To remove all tags, replace every match of the pattern below with an empty string. Bear in mind that incomplete tags (like the one at the start of your posting) will cause problems.

Regular Expression:  \<[^\>]*?\>

 
sirbounty (Author) Commented:
Not sure about that... I used it in VBScript this way:

objRegex.Pattern = "'/<[^>]*>|&[^;\s]*;/i"
strBodyData = objRegEx.Replace(strBodyData, "")

and it still produced the following:
/FONT></H2>
<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH ARE EXPECTED AND CAN CAUS
E FIRE TO SPREAD.</FONT></H4>
<H4><FONT COLOR="#FF0000">SUNDAY IS NOT A&NBSP;BURN DAY, AS&NBSP;PRESCRIBED BY G
WINNETT COUNTY CODE OF ORDINANCES,&NBSP;CHAPTER 46,&NBSP;FIRE PREVENTION AND SAF
ETY.</FONT> &NBSP;</H4>
<H4>
<P><FONT COLOR="#3366FF"></FONT></P>
<H3><FONT COLOR="#FF0000"></FONT></H3>
<H5><FONT COLOR="#0000FF"></FONT></H5>
<H5 CLASS="RIGHTALIGN"><FONT COLOR="#0000FF"></FONT></H5>

 
numberkruncher Commented:
Try:
objRegex.Pattern = "<[^>]*>|&[^;\s]*;"
strBodyData = objRegEx.Replace(strBodyData, "")
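
For anyone wondering why the earlier attempt removed nothing: the leading '/ and trailing /i are PHP delimiter and flag syntax, and once pasted into a Pattern string they become literal characters that must be matched. Below is a small sketch of that effect. Python's re module is used only as a stand-in engine, since the pattern is the same, and the sample string is an abbreviated line from the question.

```python
import re

sample = '<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH</FONT></H4>'

# Pattern copied with the PHP delimiters left in: the engine now looks for
# a literal '/ before each tag, or a literal /i after each entity.
with_delimiters = r"'/<[^>]*>|&[^;\s]*;/i"

# Same pattern with the delimiters stripped, as in the post above.
bare = r"<[^>]*>|&[^;\s]*;"

unchanged = re.sub(with_delimiters, "", sample)
cleaned = re.sub(bare, "", sample)

print(unchanged == sample)  # True: the delimiter version removes nothing
print(cleaned)              # WIND GUSTS UP TO30 MPH
```

In VBScript the flags are properties rather than pattern suffixes: objRegex.IgnoreCase = True stands in for /i, and objRegex.Global = True is needed for Replace to act on every match instead of only the first. Presumably Global is already set elsewhere in the script.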

 
sirbounty (Author) Commented:
Hmm - that last one works except for the leading /FONT (which I can simply replace).
Thanx!
 
numberkruncher Commented:

/FONT> doesn't get removed because it is an incomplete tag.
 
</FONT> should get replaced though.
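
One possible way to deal with that leading fragment without hard-coding /FONT> is to strip everything up to the first > when no < comes before it. This is a Python sketch only: the helper name strip_leading_fragment is made up for illustration, and it assumes the truncated tag can only occur at the very start of the text.

```python
import re

def strip_leading_fragment(text: str) -> str:
    """Drop a truncated tag at the start of text, e.g. '/FONT>' left over
    when the capture began in the middle of '</FONT>'."""
    # From the start: any run of characters that are neither < nor >,
    # followed by a >. A complete tag begins with <, so it never matches.
    return re.sub(r"\A[^<>]*>", "", text, count=1)

print(strip_leading_fragment("/FONT></H2>\n<H4>WIND GUSTS</H4>"))
print(strip_leading_fragment("<H4>WIND GUSTS</H4>"))
```

The caveat is that plain text containing a stray > before the first tag would be cut as well. The same idea should carry over to VBScript as the pattern ^[^<>]*> with Global left off, though that is untested here.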

 
sirbounty (Author) Commented:
If you have a moment, could you interpret that for me?
I'm trying to learn regexes but having difficulty digesting them...
 
numberkruncher Commented:

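Complete Expression:  <[^>]*>|&[^;\s]*;
 
 
<[^>]*>     - Matches all tags.
              [^>] means "any character except >", so each match stops at
              the first closing >. A greedy .* in its place would run on to
              the last > on the line and replace nearly all of the input.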
 
|           - Or, alternatively
 
&[^;\s]*;   - Matches all entities such as &nbsp;  or &lt;   etc...
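
To see the two halves of the alternation at work, here is a short sketch that lists what each branch matches on a line from the question. Python is only a stand-in engine here, and the sample line and variable names are illustrative.

```python
import re

line = '<H4><FONT COLOR="#FF0000">SUNDAY IS NOT A&NBSP;BURN DAY</FONT> &NBSP;</H4>'

tag = r"<[^>]*>"         # "<", then anything that is not ">", then ">"
entity = r"&[^;\s]*;"    # "&", then anything but ";" or whitespace, then ";"

print(re.findall(tag, line))
# ['<H4>', '<FONT COLOR="#FF0000">', '</FONT>', '</H4>']
print(re.findall(entity, line))
# ['&NBSP;', '&NBSP;']

# A greedy ".*" instead of "[^>]*" runs to the LAST ">" on the line,
# so a single match swallows the text between the tags as well.
print(re.findall(r"<.*>", line) == [line])  # True
```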

 
numberkruncher Commented:
Sorry, my first entity example got transformed by EE.

It should have read (hoping that EE transforms them so they look correct):

Matches all entities such as  &nbsp; or &lt;   etc...
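
One side effect worth noting: deleting &nbsp; outright glues words together (TO&NBSP;30 becomes TO30), and the pattern does not address the original wish for line breaks between paragraphs. Below is a rough sketch of one way to handle both. The function name html_to_text, the list of block-level tags, and the choice to discard empty lines are all assumptions made for illustration.

```python
import re

# Closing tags treated as paragraph ends (an assumed list, adjust as needed).
BLOCK_END = re.compile(r"</(?:p|h[1-6]|div)\s*>|<br\s*/?>", re.IGNORECASE)
NBSP = re.compile(r"&nbsp;", re.IGNORECASE)
MARKUP = re.compile(r"<[^>]*>|&[^;\s]*;")

def html_to_text(html: str) -> str:
    text = BLOCK_END.sub("\n", html)      # 1. paragraph ends become newlines
    text = NBSP.sub(" ", text)            # 2. non-breaking spaces become spaces
    text = MARKUP.sub("", text)           # 3. drop remaining tags and entities
    # Tidy up: trim each line and discard the ones left empty.
    lines = [ln.strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln)

sample = (
    '<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH</FONT></H4>\n'
    '<H4><FONT COLOR="#FF0000">SUNDAY IS NOT A&NBSP;BURN DAY</FONT> &NBSP;</H4>\n'
    '<P></P>\n'
)
print(html_to_text(sample))
# WIND GUSTS UP TO 30 MPH
# SUNDAY IS NOT A BURN DAY
```

In VBScript the same three passes could be done as three separate Replace calls on the RegExp object, using vbNewLine as the replacement string in the first pass.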

Question has a verified solution.
