Morgan Fuentez

Help to construct a regular expression.

Hello all, I hope things at this time are going OK for you and your families. I am working on a project, and I need some help with regular expressions.  I am trying to create a regular expression that can remove certain types of HTML tags, another one that will ignore them and lastly, one that selects both types. I have made some attempts. However, there is always something missing. I am learning more about regular expressions in the process.

From here on out, I will refer to regular expressions as regex. I want one regex that catches all words surrounded by three types of tags. I only need these three. The tags I need to make part of my pattern are <em>, <span>, and <strong>. The <span> is the only one with attributes, e.g. <span style="color: #3a9ee3;">. I also need to select the closing tags for each. In the example I link below, I highlight what I want. However, I don’t think I am doing this in the best way.

At times I will need an inverted version of the request above. I need to select words that are not wrapped in any tag. Coming up with the correct regex for this has been more difficult. I always seem to be selecting something I do not want: an unexpected tag, a semicolon, or something else near the word. I sometimes need to select opening and closing parentheses. I have not been able to make the closing one work even once.

I need the last regex to select both the words between the tags and those without any tag wrapped around them. For the ones that have tags, I also need the tags selected.

Could somebody help construct this type of regex?

Here is a link to test at: https://regex101.com/r/5VUKsi/1

Regular Expressions, HTML

Last Comment
tel2

8/22/2022 - Mon
NVIT

Like this?
(<span.+>.+<\/span>|<em>.+<\/em>|<strong>.+<\/strong>)


Morgan Fuentez

ASKER
Hi NVIT, thanks for the reply. The regex you posted just starts at the first <span> and grabs everything in between. That is not what I want to select.
tel2

Hi Morgan,

Tweaking NVIT's answer as follows might do part of what you want:
(<span.+?>.+?<\/span>|<em>.+?<\/em>|<strong>.+?<\/strong>)



That doesn't handle everything you've asked for, but it's a start.
Here it is on regex101:
  https://regex101.com/r/720Ujy/1
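
To see why the added '?' matters, here is a small JavaScript sketch comparing the greedy and minimal forms of one alternative from the pattern. The sample HTML is made up for illustration (it is not the test data behind the regex101 link), and JavaScript is used only as a convenient way to run the pattern.

```javascript
// Sample HTML is an assumption for illustration, not the asker's test data.
const html = '<p><em>Alligator</em> and <em>Lion</em> and <strong>alligators</strong></p>';

// Greedy: .+ runs as far as it can, so one match swallows everything up to the last </em>.
const greedy = html.match(/<em>.+<\/em>/g);

// Minimal (lazy): .+? stops at the first place the rest of the pattern can match.
const lazy = html.match(/<em>.+?<\/em>/g);

console.log(greedy); // one long match covering both <em> elements
console.log(lazy);   // one match per <em> element
```

The greedy version returns a single match that starts at the first <em> and ends at the last </em>, which is the "grabs everything in between" behaviour described earlier in the thread.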

I would expect that doing such things properly yourself could be very hard, to cater for all possible cases, so usually people would use modules which have already been written.

Is this for PHP or what?
tel2

P.S. You could change the '+' to '*' to cater for the possibility of 0 chars which match '.', like this:
(<span.*?>.*?<\/span>|<em>.*?<\/em>|<strong>.*?<\/strong>)
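
A quick sketch of what the change from '+' to '*' does in practice, assuming a made-up input that contains an empty element. With '+', the empty <em></em> cannot match on its own, so the minimal match is forced to keep going until it reaches a later closing tag.

```javascript
// Sample HTML is an assumption for illustration: the first <em> is empty.
const sample = '<em></em> <em>Alligator</em>';

// '+' needs at least one character between the tags.
const withPlus = sample.match(/<em>.+?<\/em>/g);

// '*' also accepts zero characters between the tags.
const withStar = sample.match(/<em>.*?<\/em>/g);

console.log(withPlus); // a single match running from the first <em> to the last </em>
console.log(withStar); // two matches, one per element
```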



Morgan Fuentez

ASKER
Is this for PHP or what?                
Hey tel2, this is for an application we built in Xojo. We are using a web-like interface with other plugins. I am just fixing this web part. The only thing missing is the correct regex. It is a big help to see so much less syntax do so much. I was trying to make the regex capture only the tags around the word of choice. This word gets populated dynamically. In this case I want the word Alligator in any form (proper case, lower case, plural; e.g. alligator, alligators). I do not want a match when a different word is inside the tags. So in the example, Lion surrounded by tags should not get highlighted. Do I have to modify every place I see a .*? This is super helpful. Thank you :)
tel2

Hi Morgan,
Do I have to modify every place I see a .*?
Modify it how?  What are you referring to?
I have already changed the 4 instances of:
  .+?
to:
  .*?
in my last post.
Morgan Fuentez

ASKER
I thought it was .+? that was making it OK to select "anything" between the tags. I thought if I made it specific I could get what I want. However, it does not work. This is what I tried.
(<span.*?>[Aa]lligator(s)?<\/span>|<em>[Aa]lligator(s)?<\/em>|<strong>[Aa]lligator(s)?<\/strong>)


Julian Hansen

Any reason you want to use a RegEx and not an HTML parser that will allow you to deal with the document in a structured way?

There are some really great tools out there that will allow you to parse HTML and manipulate the document.

What is the use case?
Michel Plungjan

If this is in a browser, you can use JavaScript to parse the HTML using this kind of processing. It is recommended to not use RegExp to parse HTML

https://jsfiddle.net/mplungjan/1nb6tx3L/

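// `str` holds the HTML fragment to inspect; this sample value is an assumption for illustration.
const str = '<p>Alligator <span style="color: #3a9ee3;">Lion</span> alligators</p>';
// Let the browser parse the fragment inside a detached <div>.
const partial = document.createElement('div');
partial.innerHTML = str;
// Text of every <span> in the fragment.
const spanContent = [...partial.querySelectorAll('span')].map(span => span.textContent);
console.log(spanContent);
// Text nodes (nodeType 3) directly inside the first element, i.e. text not wrapped in a <span>.
const nonSpanContent = [...partial.firstChild.childNodes].filter(node => node.nodeType === 3).map(node => node.textContent.trim());
console.log(nonSpanContent);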


tel2

I thought it was .+? that was making it OK to select "anything" between the tags. I thought if I made it specific I could get what I want. However, it does not work. This is what I tried.
'.' (dot) matches any 1 character (except a newline, by default).
'+' matches 1 or more of the previous item.
'*' matches 0 or more of the previous item.
'?' after a '+' or '*' makes the greedy match a minimal match.

I added the '?'s to NVIT's solution to make the matches match the least they can (i.e. minimal); otherwise they'll match as much as they can (i.e. greedy).

I then changed the '+' to '*' so that if there was ever nothing at that position then it would still match, but with your current test data I don't think you need that.

Since 's' is a single character, you can abbreviate this:
    [Aa]lligator(s)?
to this:
    [Aa]lligators?
if you want to.
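
As a small illustration of the difference between the two spellings, the JavaScript sketch below runs both on the same word. Both match the same text; the only difference is that the parenthesised form adds a capture group to the result. The input string is just an example.

```javascript
const text = 'Alligators';

// With parentheses, the optional "s" becomes capture group 1.
const withGroup = text.match(/[Aa]lligator(s)?/);

// Without parentheses, there is only the overall match.
const withoutGroup = text.match(/[Aa]lligators?/);

console.log(withGroup[0], withGroup.length);       // full match plus one capture group
console.log(withoutGroup[0], withoutGroup.length); // full match only
```

If the grouping is needed for some other reason but the capture is not, a non-capturing group such as (?:s)? is a common alternative.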
Morgan Fuentez

ASKER
Any reason you want to use a RegEx and not an HTML parser...
Hi Julian Hansen, yes, the reason is that this is built on top of HTML that uses a plugin. It is as simple as making changes to the current unparsed HTML. When we mess with JavaScript, it causes significant delays in how things natively work. The idea I am asking for help with is clean and fast. Using the regex I have come up with, I can see this is totally possible. I just need the correct regex. What is excellent is that a script can run it more than once. So if I come up with the correct regex to select the word surrounded by tags (also selecting the tags) and the regex that only selects the word not surrounded by tags, it will be great. Thanks so much for the reply and the excellent question.
Morgan Fuentez

ASKER
Hello Michel Plungjan, I hope all is well. I wish I could just fire up jQuery to accomplish this. I explained in the message above that this is a special case where we need a solution that works with the HTML parser already being used.
It is recommended to not use RegExp to parse HTML 
I did not know that. I will keep that in mind for future projects. Thanks for the reply and your time.

tel2

Hi again Morgan,

Did you see my last post?  Did you understand it?

Does this do what you want?
(<(span).+?><(strong|em)>(.+?)<\/\3><\/\2>)


https://regex101.com/r/dn8Xqx/1

Or this, which also captures the entire (<strong>...</strong>) or (<em>...</em>) tags.
(<(span).+?>(<(strong|em)>(.+?)<\/\4>)<\/\2>)


https://regex101.com/r/osEQw7/1

Or here's that last one written slightly simpler:
(<span.+?>(<(strong|em)>(.+?)<\/\3>)<\/span>)


https://regex101.com/r/4fAnxk/1

If the above don't do what you want, how are they not meeting your requirements?

All of the above solutions assume that every <span> contains either a <strong> or an <em> tag (not both), and no other tags.
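
Here is a short JavaScript sketch of how the capture groups and the backreference in the simpler version line up. The sample HTML is an assumption that follows the stated restriction (each <span> holds exactly one <strong> or <em>); it is not the data behind the regex101 links.

```javascript
// Sample HTML is an assumption for illustration: each <span> holds exactly one <strong> or <em>.
const nested =
  '<p><span style="color: #3a9ee3;"><strong>Alligator</strong></span> and ' +
  '<span style="color: #3a9ee3;"><em>Lion</em></span></p>';

// Group 1: whole <span>...</span>, group 2: inner element,
// group 3: inner tag name, group 4: the text. \3 forces the closing tag to match the opening one.
const pattern = /(<span.+?>(<(strong|em)>(.+?)<\/\3>)<\/span>)/g;

const found = [...nested.matchAll(pattern)].map(m => ({ inner: m[2], tag: m[3], text: m[4] }));
console.log(found);
```

Note that the pattern matches any text inside the tags, so in this sample both the Alligator and the Lion elements are returned.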

Note: Things like this:
    <(span).+?>
can also be written like this:
    <(span)[^>]+>
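which works differently ('[^>]+' can never run past the first '>', whereas '.+?' can if that is what it takes for the rest of the pattern to match), but I expect it gives the same result here.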
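
The two forms can give different results on other input. The JavaScript sketch below uses a made-up string (the class attributes and the <b> tag are assumptions for illustration) in which the first <span> is not followed by <em>. The dot version stretches across the first <span> to reach a later one, while the negated character class cannot cross a '>'.

```javascript
// Sample HTML is an assumption for illustration.
const twoSpans = '<span class="a"><b>Lion</b></span> <span class="b"><em>Alligator</em></span>';

// '.+?' is minimal, but it may still grow past '>' characters until "><em>" is found.
const lazyDot = twoSpans.match(/<span.+?><em>/)[0];

// '[^>]+' can only consume characters up to the first '>', so it stays inside one tag.
const negated = twoSpans.match(/<span[^>]+><em>/)[0];

console.log(lazyDot); // starts at the first <span> and runs into the second one
console.log(negated); // only the second <span ...><em>
```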
Morgan Fuentez

ASKER
Hi, and thanks for all the help. Essential to what I match is the word alligator in its different forms. You will notice that all your versions also capture the word Lion when it is in tags. I think that is the challenge. Alligator represents dynamic text, which will then make the pattern unique. To be clear, I only want to capture tags that contain a specific word, not just anything wrapped in the tag. Yes, I want both the tags and the word captured. The exact word is important. The words are replaced as needed.
Can you show how to write the regex so that it captures Alligator, alligator, Alligators, alligators? These words are only placeholders for all the words we will swap in for alligator. However, when we change alligator to chicken, it should only select chicken wrapped in tags, not Lion or alligator wrapped in tags. Thanks so much
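
One possible way to approach this request is to build the pattern from the word at run time. The JavaScript sketch below is only an illustration under stated assumptions, not the accepted answer from this thread: the helper names escapeRegex and buildTaggedWordRegex are hypothetical, the word is assumed to start with a letter, its plural is assumed to be a plain "s", and the word is assumed to sit directly inside a single <span>, <em> or <strong>.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex.
function escapeRegex(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: pattern for `word` (first letter in either case, optional plural "s")
// wrapped directly in <span ...>, <em> or <strong>. Assumes `word` starts with a letter.
function buildTaggedWordRegex(word) {
  const first = word[0];
  const rest = escapeRegex(word.slice(1));
  const wordPattern = `[${first.toUpperCase()}${first.toLowerCase()}]${rest}s?`;
  // \1 makes the closing tag match whichever opening tag was found.
  return new RegExp(`<(span|em|strong)\\b[^>]*>${wordPattern}</\\1>`, 'g');
}

// Sample HTML is an assumption for illustration.
const page =
  '<p><em>Alligator</em>, <strong>Lion</strong>, ' +
  '<span style="color: #3a9ee3;">alligators</span>, alligator</p>';

console.log(page.match(buildTaggedWordRegex('alligator'))); // tagged alligator forms only
console.log(page.match(buildTaggedWordRegex('lion')));      // tagged Lion only
console.log(page.match(buildTaggedWordRegex('chicken')));   // null, nothing to match
```

Swapping the word changes what is selected without touching the rest of the pattern. If a word can be nested, for example a <strong> inside a <span>, this sketch would select only the innermost tag pair.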

ASKER CERTIFIED SOLUTION
tel2

Log in or sign up to see answer
Morgan Fuentez

ASKER
I appreciate the great help and advice on the first part. Now I would like to get some help on the second part: a regex that would select only the specific word when it is not wrapped in <em>, <strong>, or <span>. The <span> has a color attribute a lot of the time. The challenge I have had with this one is matching close to other tags, e.g., <p>, <div>, and their closing tags. Thanks

https://regex101.com/r/EGDMWv/1

tel2

Hi Morgan.

I see from your link above that you're using this:
/(?!>)\b[Aa]lligator(s)?\b(?!(<\/s[pt]|<\/e))/gm


Questions:
Q1. Why are you using "[Aa]lligator(s)?" instead of "[Aa]lligators?" which I suggested before?  Although both will work, the former is less concise and does an extra unnecessary (capture) which just adds (slightly) to the processing required.  See how the "s" at the end of the 1st occurrence of "Alligators" is green at regex101?  That's because it's a separate (capture).
Q2. What is the "[pt]" and "/e" for?
Q3. What is wrong with the results you're currently getting at the regex101 demo link you've provided?
Morgan Fuentez

ASKER
Q1a. I did not know what you explained. Now I can work smarter; that is why I came here. :)
Q2a. I was thinking that <s[pt] would select the beginning part of <span> and <strong>. They are what I do not want on the other side of the word (Alligators).
Q3a. I see that your good eye for this and your questions exposed what is wrong with my choices. I wanted to make sure I write regex that follows best practice. Just getting the job done can cause problems when things get more explicit later.
These are great helpful questions. This forum is surpassing my expectations. What you and the other members provide is worth every penny. Thank You.

tel2

Instead of writing "Q1a" (Question 1 answer) etc, maybe try "A1" (Answer 1) etc.   Simpler.  No extra charge for that advice.  8)
Q1a. I did not know what you explained. Now I can work smarter, that is why I came here. :) 
I explained it at the bottom of this post.
OK, but I did suggest that method before, and you still didn't use it.

Before I help you any further, what is the answer to the 2nd part of Q2?
Morgan Fuentez

ASKER
That is also what I do not want the word (Alligators) to be near: a word boundary, and then none of the closing tags </span>, </strong>, </em>.
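
A rough JavaScript sketch of that idea, offered as an illustration rather than as the solution given in this thread. It assumes a tagged word is always immediately followed by its closing tag, so a negative lookahead for </span>, </strong> or </em> after the word is enough to skip it. It also shows that the leading (?!>) in the earlier attempt has no effect: a lookahead placed before the word inspects the word's own first letter, which is never '>'.

```javascript
// Sample HTML is an assumption for illustration.
const doc =
  '<p>Alligator (alligators) <em>Alligator</em> ' +
  '<span style="color: #3a9ee3;">alligators</span> alligator</p>';

// Word on a boundary, NOT immediately followed by one of the three closing tags.
const untagged = doc.match(/\b[Aa]lligators?\b(?!<\/(?:span|strong|em)>)/g);
console.log(untagged);

// The leading (?!>) changes nothing: both patterns find the same number of words.
const withLeadingLookahead = doc.match(/(?!>)\b[Aa]lligators?\b/g).length;
const withoutIt = doc.match(/\b[Aa]lligators?\b/g).length;
console.log(withLeadingLookahead === withoutIt);
```

Words that directly follow <p> or sit inside parentheses are still selected, because only the three listed closing tags are excluded. A tagged phrase with extra text after the word, such as <em>Alligator eats</em>, would not be excluded by this sketch.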

SOLUTION
Log in to continue reading
Morgan Fuentez

ASKER
Thanks again. This help was super, from you and all who contributed. :)
tel2

Hi Scott,

Call me biased, but I think this post of mine should be accepted as a solution:
Reason: See Morgan's response which starts with "I appreciate the great help and advice on the first part. Now I would like to get some help on the second part."

And I think this post of mine should be accepted as a solution or at least as being helpful.
Reason: Again, see Morgan's response.

The comments from the other people were probably helpful for future purposes (so could be awarded as "helpful"), but it looks as if they aren't what the asker wanted in this particular case.

tel2