Morgan Fuentez asked:

Help to construct a regular expression.

Hello all, I hope things at this time are going OK for you and your families. I am working on a project, and I need some help with regular expressions.  I am trying to create a regular expression that can remove certain types of HTML tags, another one that will ignore them and lastly, one that selects both types. I have made some attempts. However, there is always something missing. I am learning more about regular expressions in the process.

From here on out, I will refer to regular expressions as regex. I want one regex that catches all words surrounded by three types of tags. I only need these three: <em>, <span>, and <strong>. The <span> is the only one with attributes, e.g. <span style="color: #3a9ee3;">. I also need to select the closing tags for each. In the example, I have highlighted what I want to select. However, I don't think I am doing this in the best way.

At times I will need an inverted version of the request above: I need to select words that are not wrapped in any tag. Coming up with the correct regex for this has been more difficult. I always seem to select something I do not want, such as an unexpected tag, a semicolon, or something else near the word. I sometimes need to select opening and closing parentheses. I have not once been able to make the closing one work.

I need the last regex to select the words between the tags and those without any tag wrapped around. I also need the tags selected for the ones that have them.

Could somebody help construct this type of regex?

Here is a link to test at: https://regex101.com/r/5VUKsi/1

NVIT

Like this?
(<span.+>.+<\/span>|<em>.+<\/em>|<strong>.+<\/strong>)

ASKER

Hi NVIT, thanks for the reply. The regex you posted starts at the first <span> and grabs everything in between. That is not what I want to select.
Hi Morgan,

Tweaking NVIT's answer as follows might do part of what you want:
(<span.+?>.+?<\/span>|<em>.+?<\/em>|<strong>.+?<\/strong>)

That doesn't handle everything you've asked for, but it's a start.
Here it is on regex101:
  https://regex101.com/r/720Ujy/1

I would expect that doing such things properly yourself could be very hard, to cater for all possible cases, so usually people would use modules which have already been written.

Is this for PHP or what?
P.S. You could change the '+' to '*' to cater for the possibility of 0 chars which match '.', like this:
(<span.*?>.*?<\/span>|<em>.*?<\/em>|<strong>.*?<\/strong>)


"Is this for PHP or what?"
Hey tel2, this is for an application we built in Xojo. We are using a web-like interface with other plugins. I am just fixing this web part. The only thing missing is the correct regex. It is a big help to see so much less syntax do so much. I was trying to make the regex capture only the tags around the word of choice. This word gets populated dynamically. In this case I want the word Alligator in any form (proper case, lower case, plural), e.g. alligator, alligators. I do not want a match when a different word is inside the tags. So in the example, Lion surrounded by tags should not get highlighted. Do I have to modify every place I see a .*? This is super helpful. Thank you :)
Hi Morgan,
"Do I have to modify every place I see a .*?"
Modify it how?  What are you referring to?
I have already changed the 4 instances of:
  .+?
to:
  .*?
in my last post.
I thought it was .+? that was making it OK to select "anything" between the tags. I thought that if I made it specific I could get what I want. However, it does not work. This is what I tried.
(<span.*?>[Aa]lligator(s)?<\/span>|<em>[Aa]lligator(s)?<\/em>|<strong>[Aa]lligator(s)?<\/strong>)

Any reason you want to use a RegEx and not an HTML parser that will allow you to deal with the document in a structured way?

There are some really great tools out there that will allow you to parse HTML and manipulate the document.

What is the use case?
If this is in a browser, you can use JavaScript to parse the HTML with this kind of processing. It is recommended not to use RegExp to parse HTML.

https://jsfiddle.net/mplungjan/1nb6tx3L/

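// `str` is assumed to hold the HTML fragment as a string, wrapped in a single root element (e.g. one <p>).
const partial = document.createElement('div');
partial.innerHTML = str;
// Text content of every <span> in the fragment.
const spanContent = [...partial.querySelectorAll('span')].map(span => span.textContent);
console.log(spanContent);
// Text nodes (nodeType 3) that are direct children of the root element, i.e. text not wrapped in any tag.
const nonSpanContent = [...partial.firstChild.childNodes].filter(node => node.nodeType === 3).map(node => node.textContent.trim());
console.log(nonSpanContent);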

"I thought it was .+? that was making it OK to select "anything" between the tags. I thought that if I made it specific I could get what I want. However, it does not work. This is what I tried."
'.' (dot) matches any 1 character.
'+' matches 1 or more of the preceding item (here, the '.').
'*' matches 0 or more of the preceding item.
'?' after a '+' or '*' makes the greedy match a minimal match.

I added the '?'s to NVIT's solution to make the matches match the least they can (i.e. minimal), otherwise they'll match as much as they can (i.e. greedy).

I then changed the '+' to '*' so that if there was ever nothing at that position then it would still match, but with your current test data I don't think you need that.
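
To make the greedy/minimal difference concrete, here is a small sketch in JavaScript. The sample strings are made up for illustration and are not the asker's real data; the same behaviour should apply in other regex flavours, but this has only been reasoned through for JavaScript.

```javascript
// Made-up sample input for illustration; not the asker's real data.
const html = '<em>Alligator</em> and <em>Lion</em>';

// Greedy: .+ takes as much as it can, so a single match runs to the LAST </em>.
const greedy = html.match(/<em>.+<\/em>/g);

// Minimal (lazy): .+? stops at the FIRST </em> that lets the match succeed.
const lazy = html.match(/<em>.+?<\/em>/g);

console.log(greedy); // [ '<em>Alligator</em> and <em>Lion</em>' ]
console.log(lazy);   // [ '<em>Alligator</em>', '<em>Lion</em>' ]

// '+' needs at least one character between the tags, '*' does not.
const emptyTag = '<em></em>';
const plusResult = emptyTag.match(/<em>.+?<\/em>/g); // null (no match)
const starResult = emptyTag.match(/<em>.*?<\/em>/g); // [ '<em></em>' ]
console.log(plusResult, starResult);
```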

Since 's' is a single character, you can abbreviate this:
    [Aa]lligator(s)?
to this:
    [Aa]lligators?
if you want to.
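
A quick way to see the difference between the two spellings is to look at what each one captures. This is only a sketch in JavaScript with a made-up sample string; the extra group is harmless, it just adds one more entry to every match result.

```javascript
// Made-up sample text for illustration.
const text = 'Alligators and an alligator';

// (s)? is a capture group, so the result holds the full match plus one capture.
const withGroup = /[Aa]lligator(s)?/.exec(text);

// s? applies the '?' directly to the single character 's'; nothing is captured.
const withoutGroup = /[Aa]lligators?/.exec(text);

console.log(withGroup[0], withGroup.length);       // 'Alligators' 2
console.log(withGroup[1]);                         // 's'
console.log(withoutGroup[0], withoutGroup.length); // 'Alligators' 1
```

If the parentheses are wanted for grouping only, a non-capturing group, (?:s)?, matches the same text without the extra capture.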
"Any reason you want to use a RegEx and not an HTML parser..."
Hi Julian Hansen, yes, the reason is that this is built on top of HTML that uses a plugin. It is as simple as making changes to the current unparsed HTML. When we mess with JavaScript, it causes significant delays in how things work natively. The idea I am asking for help with is clean and fast. Using the regex I have come up with, I can see this is totally possible. I just need the correct regex. What is excellent is that a script can run it more than once. So if I come up with the correct regex to select the word surrounded by tags (also selecting the tags), and the regex that selects only the word not surrounded by tags, it will be great. Thanks so much for the reply and the excellent question.
Hello Michel Plungjan, I hope all is well. I wish I could just fire up jQuery to accomplish this. As I explained in the message above, this is a special case where we need a solution that works with the HTML parser already being used.
"It is recommended to not use RegExp to parse HTML"
I did not know that. I will keep that in mind for future projects. Thanks for the reply and your time.

Hi again Morgan,

Did you see my last post?  Did you understand it?

Does this do what you want?
(<(span).+?><(strong|em)>(.+?)<\/\3><\/\2>)

https://regex101.com/r/dn8Xqx/1

Or this, which also captures the entire (<strong>...</strong>) or (<em>...</em>) tags.
(<(span).+?>(<(strong|em)>(.+?)<\/\4>)<\/\2>)

https://regex101.com/r/osEQw7/1

Or here's that last one written slightly simpler:
(<span.+?>(<(strong|em)>(.+?)<\/\3>)<\/span>)

https://regex101.com/r/4fAnxk/1

If the above don't do what you want, how are they not meeting your requirements?

All of the above solutions assume that every <span> tag contains either a <strong> or an <em> tag (not both), and no other tags.
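
For anyone wondering what the \3 is doing in the last pattern: it is a backreference, so the closing tag has to repeat whatever (strong|em) matched in the opening tag. Here is a small sketch in JavaScript; the HTML strings are invented for illustration and only follow the assumption stated above.

```javascript
// The "slightly simpler" pattern from this post, with the global flag added.
const nested = /(<span.+?>(<(strong|em)>(.+?)<\/\3>)<\/span>)/g;

// Invented sample: each <span> holds exactly one <strong> or <em>.
const html =
  '<p><span style="color: #3a9ee3;"><strong>Alligator</strong></span> and ' +
  '<span style="color: #3a9ee3;"><em>Lion</em></span></p>';

// Group 3 is the inner tag name, group 4 is the text between the inner tags.
const found = [...html.matchAll(nested)].map(m => ({ tag: m[3], word: m[4] }));
console.log(found);
// [ { tag: 'strong', word: 'Alligator' }, { tag: 'em', word: 'Lion' } ]

// Mismatched inner tags: \3 must equal 'strong', so </em> does not close it.
const mismatched = '<span class="a"><strong>x</em></span>';
const mismatchResult = mismatched.match(nested);
console.log(mismatchResult); // null
```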

Note: Things like this:
    <(span).+?>
can also be written like this:
    <(span)[^>]+>
which works differently but I expect gives the same result.
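
One case where the two forms do give different results is when the rest of the pattern fails right after the first '>'. The lazy '.+?' is then allowed to grow past that '>', while '[^>]+' can never cross it. A sketch in JavaScript, using an invented string built to trigger the difference:

```javascript
// Invented sample: the first <span> is NOT followed by <strong>, the second is.
const html = '<span a>x</span><span b><strong>y</strong></span>';

// Lazy dot: after failing at the first '>', .+? keeps expanding across tags.
const dotLazy = html.match(/<span.+?><strong>/)[0];

// Negated class: [^>]+ cannot cross a '>', so the match stays inside one tag.
const negClass = html.match(/<span[^>]+><strong>/)[0];

console.log(dotLazy);  // '<span a>x</span><span b><strong>'
console.log(negClass); // '<span b><strong>'
```

On input where every <span> is immediately followed by <strong> or <em>, both forms should match the same text.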
Hi, and thanks for all the help. What is essential to my match is the word alligator in its different forms. You will notice that all your versions also capture the word Lion when it is in tags. I think that is the challenge. Alligator represents dynamic text, which will then make the pattern unique. To be clear, I only want to capture tags that wrap a specific word, not anything wrapped in the tags. Yes, I want both the tags and the word captured. The exact word is important. The words are replaced as needed.
Can you show how to write the regex so that it captures Alligator, alligator, Alligators, alligators? These words are only placeholders for whatever words we will swap in for alligator. However, when we change alligator to chicken, it should only select chicken wrapped in tags, not Lion or alligator wrapped in tags. Thanks so much
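
One possible way to approach this request is to build the pattern from the word at run time. The sketch below is in JavaScript and is not the accepted solution, which is not shown in this thread. The helper names escapeRegex and wrappedWordRegex are hypothetical, the sample HTML is invented, and the sketch assumes the word starts with a letter and forms its plural with a plain "s". It accepts one or more opening tags before the word and one or more closing tags after it, without checking that they pair up correctly.

```javascript
// Hypothetical helper: escape characters that are special inside a regex.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a regex for `word` (either case on the first
// letter, optional plural "s") wrapped directly in <em>, <strong> or <span>.
// Assumes `word` starts with a letter.
function wrappedWordRegex(word) {
  const first = word[0];
  const head = `[${first.toUpperCase()}${first.toLowerCase()}]`;
  const rest = escapeRegex(word.slice(1));
  const open = '(?:<(?:em|strong|span)\\b[^>]*>)+'; // one or more opening tags
  const close = '(?:<\\/(?:em|strong|span)>)+';     // one or more closing tags
  return new RegExp(`${open}${head}${rest}s?${close}`, 'g');
}

// Invented sample input.
const html =
  '<p><span style="color: #3a9ee3;"><strong>Alligators</strong></span> eat. ' +
  '<em>Lion</em> and <em>alligator</em> and plain alligator.</p>';

const alligatorHits = [...html.matchAll(wrappedWordRegex('alligator'))].map(m => m[0]);
const chickenHits = [...html.matchAll(wrappedWordRegex('chicken'))].map(m => m[0]);

console.log(alligatorHits);
// [ '<span style="color: #3a9ee3;"><strong>Alligators</strong></span>',
//   '<em>alligator</em>' ]
console.log(chickenHits); // []
```

The tagged Lion and the untagged "plain alligator" are both left alone, and swapping the word to chicken finds nothing in this sample because chicken does not appear in it.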

ASKER CERTIFIED SOLUTION
tel2
This solution is only available to members.
I appreciate the great help and advice on the first part. Now I would like to get some help on the second part. That is, a regex that would select only the specific word when it is not wrapped in <em>, <strong> or <span>. The <span> tag has a color attribute a lot of the time. The challenge I have had with this one is working close to other tags, e.g., <p>, <div>, and their closing tags. Thanks

https://regex101.com/r/EGDMWv/1

Hi Morgan.

I see from your link above that you're using this:
/(?!>)\b[Aa]lligator(s)?\b(?!(<\/s[pt]|<\/e))/gm

Questions:
Q1. Why are you using "[Aa]lligator(s)?" instead of "[Aa]lligators?" which I suggested before?  Although both will work, the former is less concise and does an extra unnecessary (capture) which just adds (slightly) to the processing required.  See how the "s" at the end of the 1st occurrence of "Alligators" is green at regex101?  That's because it's a separate (capture).
Q2. What is the "[pt]" and "/e" for?
Q3. What is wrong with the results you're currently getting at the regex101 demo link you've provided?
Q1a. I did not know what you explained. Now I can work smarter; that is why I came here. :)
 Q2a. I was thinking that <s[pt] would select the beginning part of <span> and <strong>. They are what I do not want on the other side of the word (Alligators).
 Q3a. I see that your good eye for this, and your questions, exposed what is wrong with my choices. I want to make sure I write regex that follows best practice. Just getting the job done can cause problems when things get more explicit later.
These are great, helpful questions. This forum is surpassing my expectations. What you and the other members provide is worth every penny. Thank you.

Instead of writing "Q1a" (Question 1 answer) etc, maybe try "A1" (Answer 1) etc.   Simpler.  No extra charge for that advice.  8)
"Q1a. I did not know what you explained. Now I can work smarter, that is why I came here. :)"
I explained it at the bottom of this post.
OK, but I did suggest that method before, and you still didn't use it.

Before I help you any further, what is the answer to the 2nd part of Q2?
That is also what I do not want the word (Alligators) to be near: a word boundary, and then none of the closing tags </span>, </strong>, </em>.
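
One general technique for "match the word only when it is not wrapped" avoids lookarounds: let the first alternative consume every wrapped region, let the second alternative capture the bare word, and keep only the matches where that capture is set. The sketch below is in JavaScript and is not the solution posted in this thread, which is not shown. The function name unwrappedWords is hypothetical, the sample HTML is invented, and the sketch assumes tags of the same name are not nested inside each other.

```javascript
// Hypothetical helper: return the occurrences of alligator/alligators that
// are NOT inside <em>, <strong> or <span>.
// Alternative 1 swallows a whole wrapped region (group 2 stays undefined).
// Alternative 2 captures the bare word in group 2.
function unwrappedWords(html) {
  const pattern = /<(em|strong|span)\b[^>]*>[\s\S]*?<\/\1>|\b([Aa]lligators?)\b/g;
  return [...html.matchAll(pattern)]
    .filter(m => m[2] !== undefined) // drop the wrapped regions
    .map(m => m[2]);
}

// Invented sample input, including a word in parentheses and one next to </p>.
const html =
  '<p>Alligators swim. <em>Alligator</em> (alligator) ' +
  '<span style="color: #3a9ee3;"><strong>alligators</strong></span> alligator</p>';

const bare = unwrappedWords(html);
console.log(bare); // [ 'Alligators', 'alligator', 'alligator' ]
```

Because the wrapped regions are consumed first, the words inside <em> and inside the <span> are never reached by the second alternative, and nearby characters such as parentheses or the closing </p> do not need special handling. A word that appears inside an attribute value of some other tag would still be reported, which is a limit of this sketch.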

SOLUTION
This solution is only available to members.
Thanks again. This help was super, from you and all who contributed. :)
Hi Scott,

Call me biased, but I think this post of mine should be accepted as a solution:
Reason: See Morgan's response which starts with "I appreciate the great help and advice on the first part. Now I would like to get some help on the second part."

And I think this post of mine should be accepted as a solution or at least as being helpful.
Reason: Again, see Morgan's response.

The comments from the other people were probably helpful for future purposes (so could be awarded as "helpful"), but it looks as if they aren't what the asker wanted in this particular case.

tel2