Solved

Need to remove HTML characters using regex

Posted on 2009-04-07
Last Modified: 2012-05-06
The text changes, sometimes daily, and I have no guarantee that the HTML codes will stay the same, but presumably there's a regex pattern that can eliminate the need to filter all this data out manually?
Here's an example of what my data is producing.
I only want the text, but would probably like to replace each new paragraph with a vbNewLine or something so that it 'looks' appropriate...
/FONT></H2>
<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH ARE EXPECTED AND CA
E FIRE TO SPREAD.</FONT></H4>
<H4><FONT COLOR="#FF0000">SUNDAY IS NOT A&NBSP;BURN DAY, AS&NBSP;PRESCRIBE
WINNETT COUNTY CODE OF ORDINANCES,&NBSP;CHAPTER 46,&NBSP;FIRE PREVENTION A
ETY.</FONT> &NBSP;</H4>
<H4>
<P></P>
<H3></H3>
<H5></H5>
<H5 CLASS="RIGHTALIGN"></H5>
<FONT COLOR="#000000">THE BURN NOTICE IS UPDATED DAILY AND DETERMINED BY
^C
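
For the paragraph-to-newline part of the question, one possible approach is to turn the closing tags of block elements into line breaks before stripping everything else. The following is a minimal sketch in Python rather than VBScript. The helper name `html_to_text`, the choice of `</P>` and `</H1>` to `</H6>` as paragraph markers, and the decision to turn `&NBSP;` into a plain space are all assumptions made for illustration, not something taken from the data source.

```python
import re

# Assumption: closing paragraph and heading tags mark the end of a visual block.
BLOCK_END = re.compile(r"</(?:p|h[1-6])\s*>", re.IGNORECASE)
# Any other complete tag.
TAG = re.compile(r"<[^>]*>")
# Non-breaking space entity, in either case (the sample data uses &NBSP;).
NBSP = re.compile(r"&nbsp;", re.IGNORECASE)


def html_to_text(html):
    """Hypothetical helper: keep the text, one line per paragraph or heading."""
    text = BLOCK_END.sub("\n", html)   # block ends become line breaks
    text = TAG.sub("", text)           # drop every remaining complete tag
    text = NBSP.sub(" ", text)         # non-breaking spaces become plain spaces
    # Trim each line and drop the empty ones left behind by empty elements.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


sample = (
    '<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH</FONT></H4>\n'
    "<P></P>\n"
    "<H4>SUNDAY IS NOT A&NBSP;BURN DAY</H4>"
)
result = html_to_text(sample)
print(result)
```

In VBScript the same idea would presumably be two `Replace` calls on the RegExp object, the first with `vbNewLine` as the replacement text.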

Question by:sirbounty
9 Comments
 
LVL 4

Expert Comment

by:phpmonkey
ID: 24088832
This should do the trick (preg_replace returns the cleaned string, so assign the result):
$source = preg_replace('/<[^>]*>|&[^;\s]*;/i', '', $source);

 
LVL 13

Expert Comment

by:numberkruncher
ID: 24088884
To remove all tags, you can simply replace every tag match with an empty string. Bear in mind that incomplete tags (like the one at the start of your posting) will cause problems.

Regular Expression:  \<[^\>]*?\>

 
LVL 67

Author Comment

by:sirbounty
ID: 24088891
Not sure about that...I used it in vbscript this way...

objRegex.Pattern = "'/<[^>]*>|&[^;\s]*;/i"
strBodyData = objRegEx.Replace(strBodyData, "")

and it still produced the following:
/FONT></H2>
<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH ARE EXPECTED AND CAN CAUS
E FIRE TO SPREAD.</FONT></H4>
<H4><FONT COLOR="#FF0000">SUNDAY IS NOT A&NBSP;BURN DAY, AS&NBSP;PRESCRIBED BY G
WINNETT COUNTY CODE OF ORDINANCES,&NBSP;CHAPTER 46,&NBSP;FIRE PREVENTION AND SAF
ETY.</FONT> &NBSP;</H4>
<H4>
<P><FONT COLOR="#3366FF"></FONT></P>
<H3><FONT COLOR="#FF0000"></FONT></H3>
<H5><FONT COLOR="#0000FF"></FONT></H5>
<H5 CLASS="RIGHTALIGN"><FONT COLOR="#0000FF"></FONT></H5>

LVL 13

Accepted Solution

by:
numberkruncher earned 300 total points
ID: 24088905
Try:
objRegex.Pattern = "<[^>]*>|&[^;\s]*;"
strBodyData = objRegEx.Replace(strBodyData, "")
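
To see why this version works where the earlier attempt did not, here is a small Python sketch using the same two patterns on one line of the sample data. Python's `re` module stands in for the VBScript RegExp object here, which is an assumption of convenience; both take the bare pattern without PHP's `/.../i` delimiters.

```python
import re

# The accepted pattern, written without PHP-style delimiters.
STRIP = re.compile(r"<[^>]*>|&[^;\s]*;")

sample = '<H4><FONT COLOR="#0000FF">WIND GUSTS UP TO&NBSP;30 MPH</FONT></H4>'

# re.sub replaces every match, the role Global = True plays in VBScript.
cleaned = STRIP.sub("", sample)
print(cleaned)

# With the PHP delimiters left inside the pattern, the first alternative
# needs a literal '/ before a tag and the second needs a literal /i after
# an entity. Neither occurs in the data, so nothing is replaced.
with_delims = re.compile(r"'/<[^>]*>|&[^;\s]*;/i")
unchanged = with_delims.sub("", sample)
print(unchanged == sample)
```

Two things to watch: in VBScript the RegExp object's Global property has to be True for Replace to touch every match (the snippet above does not show that line, so presumably it was already set elsewhere in the script), and replacing entities with an empty string glues the neighbouring words together ("TO30"), so replacing them with a space may read better.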

 
LVL 67

Author Comment

by:sirbounty
ID: 24088926
Hmm - that last one works except for the leading /FONT (which I can simply replace).
Thanx!
 
LVL 13

Expert Comment

by:numberkruncher
ID: 24088940

/FONT> doesn't get removed because it is an incomplete tag: its opening < is missing, so the pattern has nothing to start the match on.
 
</FONT> should get replaced though.
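
If the input can start in the middle of a tag, one option is to drop that leading fragment before applying the main pattern. The sketch below is in Python, and the rule it uses (any text before the first `<` that ends in `>` is treated as a cut-off tag) is an assumption, as is the helper name `strip_html`.

```python
import re

STRIP = re.compile(r"<[^>]*>|&[^;\s]*;")
# Assumption: text at the very start that reaches a '>' before any '<'
# is the tail of a tag whose opening '<' was cut off.
LEADING_FRAGMENT = re.compile(r"\A[^<>]*>")


def strip_html(text):
    """Hypothetical helper: remove a leading tag fragment, then tags and entities."""
    text = LEADING_FRAGMENT.sub("", text, count=1)
    return STRIP.sub("", text)


sample = "/FONT></H2>\n<H4>WIND GUSTS</H4>"
print(repr(STRIP.sub("", sample)))   # the fragment survives the main pattern
print(repr(strip_html(sample)))      # the fragment is removed first
```

The caveat is that body text containing a literal > before the first tag would be cut as well, so this only suits input like the sample, where text always sits inside tags.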

 
LVL 67

Author Comment

by:sirbounty
ID: 24089029
If you have a moment, could you interpret that for me?
Trying to learn regexes but having difficulty digesting them...
 
LVL 13

Expert Comment

by:numberkruncher
ID: 24089082

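Complete Expression:  <[^>]*>|&[^;\s]*;
 
 
<[^>]*>     - Matches each complete tag.
              [^>]* means "any run of characters other than >", so a
              match stops at the first > instead of running on to the
              last > in the input.
 
|           - Or, alternatively
 
&[^;\s]*;   - Matches entities such as &nbsp; or &lt; etc.: an &, then any
              characters other than ; or whitespace, then the closing ;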
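
The breakdown above can be checked piece by piece. This Python sketch runs each alternative on its own and, for comparison, a greedy `<.*>` variant that is not part of the thread's solution but shows what the `[^>]` guards against. The sample strings are made up for illustration.

```python
import re

TAGS = re.compile(r"<[^>]*>")        # first alternative
ENTITIES = re.compile(r"&[^;\s]*;")  # second alternative
GREEDY = re.compile(r"<.*>")         # what [^>] protects against

sample = "<B>WIND&NBSP;GUSTS</B> AT&T"

tags = TAGS.findall(sample)          # each tag is matched separately
entities = ENTITIES.findall(sample)  # the & in AT&T has no closing ; so it is left alone
greedy = GREEDY.findall(sample)      # one match from the first < to the last >

# The \s in [^;\s] stops a bare ampersand from pairing with a later semicolon.
bare = ENTITIES.findall("fish & chips; peas")

print(tags, entities, greedy, bare)
```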

 
LVL 13

Expert Comment

by:numberkruncher
ID: 24089103
Sorry, my first entity example got transformed by EE.

It should have read (hoping that EE transforms them so they look correct):

Matches all entities such as  &nbsp; or &lt;   etc...

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
powershell script 9 81
need the count of duplicate records in a column 12 48
Subtraction v Hex2Dec in vbscript 6 35
VBA Script to return Folder File and Folder count 8 40
Well hello again!  Glad to see you've made it this far without giving up.  In this, the fourth installment of my popular series, I'm going to cover functions and subroutines, what they are, and why they are useful.  Just in case you stumbled onto th…
As most anyone who uses or has come across them can attest to, regular expressions (regex) are a complicated bit of magic. Packed so succinctly within their cryptic syntax lies a great deal of power. It's not the "take over the world" kind of power,…
Learn how to match and substitute tagged data using PHP regular expressions. Demonstrated on Windows 7, but also applies to other operating systems. Demonstrated technique applies to PHP (all versions) and Firefox, but very similar techniques will w…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…

732 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question