Regex for HTML exclusion.

Earlier, an expert helped me with a regex designed to find whole words in amongst HTML, using the expression:

\b(WordBeingSought)\b(?=[^>]*<)

However, in the scenario below the HTML is poorly formed and the words "Managing Director" are split across HTML elements.
Can regex adapt to find this, or is it beyond its ability?

Managing Dire</span><span style="font-family:Arial; font-size:11pt">c</span><span style="font-family:Arial; font-size:11pt">tor</span>
dgloveruk Asked:
arnold Commented:
One option is to strip out the following pattern before running your detection:
\<\/span\>\<span style[^>]*\>

This should reduce the line you posted to:

Managing Director</span>
Before your other expression runs.
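
A minimal sketch of that pre-processing step, written in Python rather than the ASP.NET used in this thread (the same expressions should carry over to .NET's Regex class, but that is not tested here). The function name merge_adjacent_spans is made up for illustration, and allowing optional whitespace between the two tags is an assumption. The [^>]* keeps each match from running past the end of the opening tag.

```python
import re

# A closing </span> directly followed by an opening <span ...> tag.
# \s* tolerates whitespace between the two tags (an assumption);
# [^>]* stops at the first '>' so only one opening tag is consumed.
SPAN_BOUNDARY = re.compile(r"</span>\s*<span\b[^>]*>", re.IGNORECASE)

def merge_adjacent_spans(html):
    """Remove every '</span><span ...>' boundary so split words become contiguous."""
    return SPAN_BOUNDARY.sub("", html)

line = ('Managing Dire</span><span style="font-family:Arial; font-size:11pt">c</span>'
        '<span style="font-family:Arial; font-size:11pt">tor</span>')
cleaned = merge_adjacent_spans(line)
print(cleaned)  # Managing Director</span>

# The original whole-word expression from the question now finds the phrase.
print(bool(re.search(r"\b(Managing Director)\b(?=[^>]*<)", cleaned)))  # True
```

Note that this throws away the style attribute of every span it merges, which only matters if adjacent spans actually differ in formatting.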

In which environment (language or tool) is the regular expression being used?
ozo Commented:
M(?:<[^>]*>)*a(?:<[^>]*>)*n(?:<[^>]*>)*a(?:<[^>]*>)*g(?:<[^>]*>)*i(?:<[^>]*>)*n(?:<[^>]*>)*g(?:<[^>]*>)*\s+(?:<[^>]*>)*D(?:<[^>]*>)*i(?:<[^>]*>)*r(?:<[^>]*>)*e(?:<[^>]*>)*c(?:<[^>]*>)*t(?:<[^>]*>)*o(?:<[^>]*>)*r

But it may be less unwieldy to remove the tags first, especially if you need to deal with things like
<IMG SRC = "foo.gif" ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
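
The long pattern above does not have to be typed by hand. Here is a rough sketch, in Python for illustration (the thread itself uses ASP.NET), of generating it from any search phrase. The name tag_tolerant_pattern is hypothetical. The generated expression shares the limitations just listed: it will be confused by '>' inside attribute values, by comments, and by script content.

```python
import re

# Zero or more tags, exactly as in the hand-written pattern.
TAGS = r"(?:<[^>]*>)*"

def tag_tolerant_pattern(phrase):
    """Build a regex that matches `phrase` even when tags split its letters."""
    words = []
    for word in phrase.split():
        # Allow any number of tags between consecutive letters of a word.
        words.append(TAGS.join(re.escape(ch) for ch in word))
    # Between words: optional tags, whitespace, optional tags.
    return (TAGS + r"\s+" + TAGS).join(words)

sample = ('Managing Dire</span><span style="font-family:Arial; font-size:11pt">c</span>'
          '<span style="font-family:Arial; font-size:11pt">tor</span>')
match = re.search(tag_tolerant_pattern("Managing Director"), sample)
print(re.sub(r"<[^>]*>", "", match.group(0)))  # Managing Director
```

Escaping each character with re.escape keeps search terms that contain regex metacharacters (a dot or a plus sign, say) from changing the meaning of the pattern.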
dgloveruk (Author) Commented:
Hi Tom, Ozo,
Thank you for your suggestions.
Unfortunately because I need to replace the detected words with markup and keep the original html document intact I cannot strip the markup out.
I can try what you have suggested, Ozo. I think I can generate a pattern as needed that would do what you describe.
The other option, which I could try if this fails, involves the 3rd party product I am using, called Aspose, which turns Word documents into HTML; I think it has the ability to find and format words before they become HTML. I preferred the regex solution since I assumed it would be faster than Aspose, and Aspose may in fact suffer from the same problem of being unable to find words that are split across formats.
Regards,
arnold Commented:
So you want the pattern match to include results such as

Managing Dire</span><span style="font-family:Arial; font-size:11pt">c</span><span style="font-family:Arial; font-size:11pt">tor
Managing Dire</span><span style="font-family:Arial; font-size:11pt">ctor
M</span><span style="font-family:Arial; font-size:11pt">anaging Dire</span><span style="font-family:Arial; font-size:11pt">ctor
and the other variations of that?
dgloveruk (Author) Commented:
Exactly, Arnold. In theory, if I just replaced that with a new <span style="NewStyle">Managing Director</span>, I could highlight the words as needed.
There is a risk of breaking the markup, e.g. in a scenario similar to the one you posed:
Managing Dire</span><span style="font-family:Arial; font-size:11pt">cto</span>r
you would end up replacing one opening tag and two closing tags, leaving the HTML invalid, but I don't think it would cause a rendering issue, which is the main thing.
I could always count the spans in the matched text after the original match and then insert the new markup in a way that keeps the span tags in balance.
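
One way the span-counting idea could look, sketched in Python for illustration (the thread uses ASP.NET). The names unbalanced_tags and balanced_replacement are hypothetical, as is the class="hit" highlight tag. The sketch assumes simple tags only: it does not treat self-closing or void elements such as <br/> specially. It finds the tags in the matched text that have no partner inside the match and re-emits them after the highlight, so the surrounding document keeps the same open/close balance.

```python
import re

# Matches an opening or closing tag and captures the slash and the tag name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def unbalanced_tags(fragment):
    """Return (closers, openers): tags in `fragment` with no partner inside it."""
    closers = []   # closing tags whose opening tag lies before the fragment
    openers = []   # stack of (name, tag) for opening tags still open in the fragment
    for m in TAG.finditer(fragment):
        name = m.group(2).lower()
        if m.group(1) != "/":
            openers.append((name, m.group(0)))
        elif openers and openers[-1][0] == name:
            openers.pop()                 # pair closed inside the fragment
        else:
            closers.append(m.group(0))    # closes something opened earlier
    return closers, [tag for _, tag in openers]

def balanced_replacement(fragment, phrase, start_tag, end_tag):
    """Highlight `phrase`, then re-emit the tags the surrounding HTML depends on."""
    closers, openers = unbalanced_tags(fragment)
    return start_tag + phrase + end_tag + "".join(closers) + "".join(openers)

fragment = 'Managing Dire</span><span style="font-family:Arial; font-size:11pt">cto</span>r'
print(balanced_replacement(fragment, "Managing Director", '<span class="hit">', "</span>"))
# <span class="hit">Managing Director</span></span>
```

In the example, the first </span> closes a span opened before the match, so it is kept; the inner <span ...>cto</span> pair is self-contained and is dropped along with its formatting.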
arnold Commented:
That is easier said than done. The difficulty in this case is that you do not know which portion of "Managing Director" is contiguous.

What scripting/programming language is this being used in?

$pattern = the closing span followed by the opening span; if whitespace is possible between the closing and opening span, a \s* between them may be needed.
The pattern match is a composite starting on ([a-z]+((?:$pattern)*[a-z]+)*)
The second step validates that the extracted text is "Managing Director": assign the matched text to a variable, strip all HTML tags from it, and compare the result to make sure it matches.


Are you doing a pattern match or a substitution?
arnold Commented:
Oh, and to avoid this situation in the future, add comments around the replacement text:
<!-- start managing director -->Managing director<!-- end managing director -->
This way it is much simpler in the future.
ozo Commented:
"Unfortunately because I need to replace the detected words with markup and keep the original html document intact I cannot strip the markup out."
How do you replace the detected words with markup and keep the original html document intact if the detected words contain markup?
dgloveruk (Author) Commented:
Arnold: Can you explain that last comment about the commented markup? I do not understand what it helps with :)  I am using ASP.NET, so I can work with the strings a fair bit, if needed, prior to outputting.
I am essentially replacing what is matched with a simpler <span style="...">Managing Director</span>

Ozo: I can't keep the original markup intact with the way I was proposing, although I am less worried about losing formatting in the original markup than about disrupting something.
I have been using the approach you introduced, Ozo, to dynamically create the patterns in the style you suggested. It is an ugly pattern, but it does seem to be working.

I will do some more testing on Ozo's pattern to see if it works 99% of the time, which is good enough really.

Thanks guys, great options!
arnold Commented:
Using comments, you can tag a replaceable section, so that when you run the search and replace you can use the markers to identify the section to be replaced or modified.
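
A small sketch of the marker idea, in Python for illustration (the thread uses ASP.NET). The function names wrap_with_markers and replace_marked are hypothetical; the marker format follows the "start ..." / "end ..." comments shown earlier. Once a highlight has been written between markers, a later pass only has to look for the markers, whatever markup sits between them.

```python
import re

def wrap_with_markers(label, fragment):
    """Surround `fragment` with start/end HTML comments carrying `label`."""
    return "<!-- start " + label + " -->" + fragment + "<!-- end " + label + " -->"

def replace_marked(html, label, new_fragment):
    """Replace whatever currently sits between the markers for `label`."""
    pattern = re.compile(
        r"(<!-- start " + re.escape(label) + r" -->).*?(<!-- end " + re.escape(label) + r" -->)",
        re.DOTALL,
    )
    # A function is used as the replacement so backslashes or group references
    # in new_fragment are inserted literally.
    return pattern.sub(lambda m: m.group(1) + new_fragment + m.group(2), html)

doc = "<p>" + wrap_with_markers("managing director", "<span>Managing director</span>") + "</p>"
print(replace_marked(doc, "managing director", "<b>Managing Director</b>"))
# <p><!-- start managing director --><b>Managing Director</b><!-- end managing director --></p>
```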

What, if anything, do the sub-spans within "Managing Director" contribute?

The spans are not uniquely identifiable, so none of them could be targeted to change its color, style, or position.
They all seem to have the same font and size assignment.


ASP.NET is the source; what are you using to find and replace?
dgloveruk (Author) Commented:
Hi Arnold,
The subsections don't do anything useful at all, but I cannot control their presence, which is also why I am not worried about overwriting their contents. As you probably know, when you write a document in Microsoft Word it leaves obsolete formatting all over the place. When you save the doc as HTML, you get an HTML version that includes the obsolete formatting. It is this output that I am working with and have to highlight matches within, so, as you can appreciate, my options for tidying the markup before this point are limited. There are about a quarter of a million .doc documents, which change over time, that my software converts into HTML as a user requests them.

I was using ASP.NET to find and replace, like so:
     For Each word As String In Searchbuilder.FreeTextTermsQuoted
         ' pattern is the tag-tolerant expression generated for this word, in the style ozo suggested
         highlightedAnswer = Regex.Replace(highlightedAnswer, pattern, FormattedSpanStart & word & FormattedSpanEnd)
     Next

where pattern is the long generated pattern that ozo suggested, and FormattedSpanStart and FormattedSpanEnd are the new opening span tag (with its formatting) and the closing tag. highlightedAnswer is the HTML document.
Searchbuilder.FreeTextTermsQuoted is the collection of strings that need highlighting.

Hopefully that explains my situation better?
arnold Commented:
My advice would be to strip out this pattern first:
\<\/span\>\<span [a-zA-Z0-9=;:\ \"\-]*\>
After running it, any of the combinations above should be left as contiguous text.
dgloveruk (Author) Commented:
Thanks Arnold, I'll do that, hopefully the output will be cleaner for it!
Regards,
arnold Commented:
An alternative you could explore is converting the .doc to PDF and then running pdf2html, to see whether that process produces cleaner HTML.

Have you looked at using ASP.NET to convert the .doc to PDF and then accessing the PDF structure?
dgloveruk (Author) Commented:
I could try that fairly easily. As I mentioned earlier, I am using Aspose.Words for .NET, which can convert between word-processing formats in any direction. I am not sure it won't carry the same split formatting over into the PDF, but I can try!
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.