Solved

Regular expression pattern match to weed out words with HTML character codes & entities

Posted on 2016-10-29
Last Modified: 2016-11-18
I have a file that is riddled with words containing HTML character codes & entities such as: Müller, Étre or "Flying—High" -- however, there are also a lot of junk ones that need to be corrected, such as Teenage��������s or &sup1dsjÝhccbß

I'm using a Perl script, so what would be a good regular expression to match these words with HTML codes & entities? Thanks
Question by:hadrons
7 Comments
 
LVL 26

Expert Comment

by:wilcoxon
ID: 41865196
If you just want to match words with escape sequences in the middle then this should work:
m{((?:\w+(?:&(?:#\d+|[a-z]+);)+)+\w*)}


That will match a word with one or more escapes (either in a row or separated by letters).  It will not match words with the escape at the beginning of the word.
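A quick way to check that behaviour is to run the pattern over a few sample words. This is a minimal sketch: the sample strings and their raw entity spellings (`&uuml;`, `&mdash;`, `&Eacute;`) are assumptions for illustration, since the question does not show the raw text.

```perl
use strict;
use warnings;

# Pattern from the comment above, unchanged.
my $re = qr{((?:\w+(?:&(?:#\d+|[a-z]+);)+)+\w*)};

# Hypothetical sample words; the raw entity spellings are assumed.
my @samples = ('M&uuml;ller', 'Flying&mdash;High', 'caf&#233;', '&Eacute;tre', 'plain');

for my $word (@samples) {
    if ($word =~ $re) {
        print "match:    $word -> $1\n";
    }
    else {
        print "no match: $word\n";
    }
}
```

The first three samples match in full; `&Eacute;tre` does not, because the pattern requires at least one word character before the first escape.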
 
LVL 25

Expert Comment

by:Dr. Klahn
ID: 41865213
I am not sure that there is a simple solution to this one.  But here's a start.  The numeric entities expressed in decimal can be matched by

&\#[0-9]+;


Unfortunately this doesn't match numeric entities expressed in hex, and it also matches entities you probably want to retain that are expressed in numeric form rather than by the HTML name.  For example, it matches the numeric form of the euro symbol, and you would probably not want to throw that character out.

&euro; <== interchangeable with ==> &#8364;
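Both limitations can be seen with a short test. This is a sketch only; the three input strings are illustrative spellings of the euro sign (decimal, named, and hex), not data from the question.

```perl
use strict;
use warnings;

# Decimal-only pattern from the comment above.
my $decimal = qr{&\#[0-9]+;};

# Hypothetical inputs: decimal, named, and hex forms of the euro sign.
for my $text ('&#8364;', '&euro;', '&#x20AC;') {
    printf "%-10s %s\n", $text, ($text =~ $decimal ? 'matched' : 'not matched');
}
```

Only the decimal form is matched, so the wanted euro character is flagged when written as `&#8364;`, while the hex form goes unnoticed.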


If this is a one-off situation, it might be easier to go through the file with a text editor to knock out the obvious and frequently unwanted entities, then see how the result looks.
 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 41865245
I had forgotten that you can have hex.  This should also pick up hex (and is corrected to allow numbers in entity names):
m{((?:\w+(?:&(?:#(?:[0-9]+|x[0-9a-f]+)|[a-z]\w*);)+)+\w*)}i


Isn't a semi-colon required to end an escape sequence?  Your "bad" example has &sup1dsj without a closing semi-colon.

If you want to keep some and remove others, there is no really easy way.  The simplest would be either a list of ones to keep or a hash map indicating which to keep and which to get rid of (but then you have to process them one by one).
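The keep-list idea could look roughly like the sketch below. The contents of `%keep`, the helper name `suspect_entities`, and the sample sentence are all hypothetical; the pattern assumes every reference ends in a semicolon and that hex references carry an `x` prefix.

```perl
use strict;
use warnings;

# Hypothetical keep-list: entity bodies (the text between '&' and ';') to leave alone.
my %keep = map { $_ => 1 } ('euro', '#8364', 'amp', 'quot');

# Return the entities in a string that are not on the keep-list.
sub suspect_entities {
    my ($text) = @_;
    my @found;
    # Decimal, hex (x-prefixed), or named references, each ending in ';'.
    while ($text =~ /&(#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);/gi) {
        push @found, "&$1;" unless $keep{$1};
    }
    return @found;
}

print join(', ', suspect_entities('5 &euro; for M&uuml;ller &amp; caf&#233;')), "\n";
```

Each entity is still processed one by one, as noted above, but the hash lookup keeps the keep-or-flag decision in one place.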
 
LVL 22

Expert Comment

by:Kim Walker
ID: 41865442
I have a file that is riddled with words containing HTML character codes...
If you only have one file, why not use a text editor for a one-time conversion? One of my favorite code editors, EditPad Pro, can convert HTML character entities into Unicode characters. A free trial version of EditPad Pro is available that is fully functional except for the spell checker. I'm sure there are others out there that might be free.
 

Author Comment

by:hadrons
ID: 41865540
Hi, Kim, it's actually a bunch of files, and they're ones I work with on a regular basis, but that info was unimportant to the question itself, so I just simplified the narrative - however, thanks for the EditPad Pro suggestion - it's something that could be useful in the future.

Dr. Klahn, I did use an expression similar to what you suggested: \&\#[0-9A-Za-z]+?; with various other expressions, but it's somewhat limited.

But I'll follow up on wilcoxon's suggestion - it looks promising, but I have to wait until Monday. Thanks to all for the feedback so far; Mike
 
LVL 22

Expert Comment

by:Kim Walker
ID: 41865595
Are you trying to convert the entities to their Unicode characters or are you trying to strip them out? You will not be able to convert them using a regex alone. But you can search for the regex match and then perform a lookup/replace from a reference table.
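A minimal sketch of that match-then-lookup approach is shown here. The `%named` table is a deliberately tiny, hypothetical reference table and `decode_entities_sketch` is an illustrative name; numeric references are converted directly with `chr`, and unknown names are left untouched.

```perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical, deliberately tiny reference table of named entities.
my %named = (
    uuml   => "\x{FC}",      # u with diaeresis
    Eacute => "\x{C9}",      # E with acute accent
    mdash  => "\x{2014}",    # em dash
    euro   => "\x{20AC}",    # euro sign
);

# Replace each recognised reference with its character; leave unknown names as-is.
sub decode_entities_sketch {
    my ($text) = @_;
    $text =~ s{&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));}{
        defined $1 ? chr(hex $1)
      : defined $2 ? chr($2)
      : exists $named{$3} ? $named{$3}
      : "&$3;"
    }ge;
    return $text;
}

print decode_entities_sketch('M&uuml;ller paid &#8364;5 &bogus;'), "\n";
```

For real work, the CPAN module HTML::Entities provides a `decode_entities` function backed by a complete table, which is likely a better choice than maintaining a table by hand.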
 

Author Comment

by:hadrons
ID: 41865785
The match is needed to isolate these words so I can grep them out and look them over.
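For that kind of review pass, something along these lines might serve as a starting point. It is a sketch under stated assumptions: the `$entity` and `$word` patterns and the `entity_words` helper are illustrative, the pattern also allows a word to start with an entity, and the sample lines stand in for real file contents.

```perl
use strict;
use warnings;

# Sketch of an entity matcher: decimal, x-prefixed hex, or named, ending in ';'.
my $entity = qr{&(?:#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);}i;

# A "word" here is any run of word characters containing at least one entity.
my $word = qr{((?:\w*$entity)+\w*)};

# Return every entity-bearing word found in a line.
sub entity_words {
    my ($line) = @_;
    my @hits;
    push @hits, $1 while $line =~ /$word/g;
    return @hits;
}

# Hypothetical sample lines; a real script would read them from the files instead.
my @lines = ('See &Eacute;tre and M&uuml;ller', 'nothing here', 'caf&#233; &#x20AC;5');

for my $i (0 .. $#lines) {
    printf "%d: %s\n", $i + 1, $_ for entity_words($lines[$i]);
}
```

Printing the line number next to each hit makes it easier to go back to the source file when a word needs correcting.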

Question has a verified solution.
