  • Status: Solved

Using regex to remove everything between 2 tags

I am trying to use a regex replace to remove everything in a string, strText, between two tags, but it is not working. "Everything" here is HTML code spread over a couple of lines, basically a table with plenty of spaces etc. The block starts with the tag #StartPreviousEmail# and ends with the tag #EndPreviousEmail#.

Here is what I am currently using, but it does not work:
Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.Pattern = "#StartPreviousEmail#([^#]+)#EndPreviousEmail#"
strText = regEx.Replace(strText,"")

I would be grateful for the correct way to do this. Thanks.
Asked by: pld51
 
käµfm³d 👽 Commented:
What does the data look like and what kind of result are you receiving?

pld51 (Author) Commented:
The data I am trying to remove, between the 'tags', is below.

The result right now is that the same content still appears in the document, rather than nothing between the tags.

#StartPreviousEmail#<table width="100%" border="0" cellspacing="2" cellpadding="2">
  <tr>
    <td><table width="98%" border="1" align="center" cellpadding="2" cellspacing="0" bordercolor="#B9B9B9">
          <tr>
            <td bgcolor="#FFFFFF"><table width="99%" border="0" align="center" cellpadding="2" cellspacing="1">
                <tr>
                  <td width="40" align="right"><div align="left"><span class="style16"><strong><font color="#909090">To</font></strong></span></div></td>
                  <td><strong>REPLACE_ToPerson</strong> (REPLACE_ToEmail)</td>
                  <td width="20%" align="center">&nbsp;</td>
                </tr>

                <tr>
                  <td width="40" align="right"><div align="left"><span class="style16"><strong><font color="#909090">From</font> </strong></span></div></td>
                  <td colspan="2"><strong>REPLACE_FromPerson</strong> (REPLACE_FromEmail)</td>
                </tr>

                <tr>
                  <td width="40" align="right"><div align="left"><strong><font color="#909090">Sent</font></strong></div></td>
                  <td colspan="2">REPLACE_DateMsge REPLACE_TimeMsge</td>
                </tr>
                <tr>
                  <td width="40" align="right"><div align="left"><span class="style16"><strong><font color="#909090">Re</font></strong> </span></div></td>
                  <td colspan="2">REPLACE_Subject</td>
                </tr>

                <tr>
                  <td colspan="3" align="right" valign="top"><table width="98%" border="0" align="center" cellpadding="2" cellspacing="0" class="t-topborder2">
                    <tr>
                      <td><div align="left">REPLACE_previousmessage</div></td>
                    </tr>
                  </table></td>
              </tr>
            </table></td>
          </tr>
        </table></td>
  </tr>
</table>#EndPreviousEmail#

käµfm³d 👽 Commented:
Here's your biggest roadblock:

...
<td bgcolor="#FFFFFF">
...


Note that you are using "[^#]" in your pattern. What happens when the pattern encounters that # in the color string? This is why your pattern fails. Let's try modifying the pattern to account for such occurrences:

regEx.Pattern = "#StartPreviousEmail#((?:[^#]|#(?!EndPreviousEmail#))+)#EndPreviousEmail#"
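A minimal sketch of the difference between the two patterns, assuming the script runs under Windows Script Host (cscript.exe) so that WScript.Echo is available. The shortened strText sample is made up for illustration and is not the asker's full table:

```vbscript
Option Explicit

Dim regEx, strText

' Hypothetical, shortened sample: note the # inside the color value.
strText = "Hello" & _
          "#StartPreviousEmail#<td bgcolor=""#FFFFFF"">" & vbCrLf & _
          "old message</td>#EndPreviousEmail#" & _
          "Goodbye"

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True

' Original pattern: [^#]+ cannot get past the # in #FFFFFF, so nothing matches.
regEx.Pattern = "#StartPreviousEmail#([^#]+)#EndPreviousEmail#"
If regEx.Test(strText) Then
    WScript.Echo "original pattern: match"
Else
    WScript.Echo "original pattern: no match"
End If

' Revised pattern: a # is allowed unless it starts the end tag.
regEx.Pattern = "#StartPreviousEmail#((?:[^#]|#(?!EndPreviousEmail#))+)#EndPreviousEmail#"
WScript.Echo regEx.Replace(strText, "")   ' expected output: HelloGoodbye
```

With this sample the first check should report no match, while the revised pattern removes the tags and everything between them.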


 
pld51 (Author) Commented:
Well spotted, will test it out. Otherwise, perhaps using another symbol will do the trick.

pld51 (Author) Commented:
This works perfectly and is the answer, thanks. Before I click to reward and close, I would much appreciate it if you could explain the logic, as it is not clear to me.

käµfm³d 👽 Commented:
Sure. I added a negative lookahead [ (?! ... ) ] to ensure that if a # was found, it was not followed by the target string "EndPreviousEmail#". This, combined with the existing logic, allows you to search for any character that is either not a #, or if it is a #, then it is not followed by the remainder of the sentinel value. The net effect should be that any solitary (or repeated) # within the text should be captured by the regex before the sentinel is found.

One thing you might need to change--if the sentinels can occur more than once within the same block of text--is changing the plus to a non-greedy version (+?); otherwise, the regex engine might capture more than you expect it to.
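As a rough illustration of the multiple-block case, the sketch below runs both the plain plus and the non-greedy plus against a made-up string containing two tagged blocks. It assumes Windows Script Host (WScript.Echo), and the input text is hypothetical:

```vbscript
Option Explicit

Dim regEx, strText

' Hypothetical input: two tagged blocks with text to keep around them.
strText = "A#StartPreviousEmail#one #1#EndPreviousEmail#" & _
          "B#StartPreviousEmail#two #2#EndPreviousEmail#C"

Set regEx = New RegExp
regEx.Global = True       ' replace every tagged block, not just the first
regEx.IgnoreCase = True

' Greedy plus.
regEx.Pattern = "#StartPreviousEmail#((?:[^#]|#(?!EndPreviousEmail#))+)#EndPreviousEmail#"
WScript.Echo regEx.Replace(strText, "")   ' expected output: ABC

' Non-greedy plus (+?).
regEx.Pattern = "#StartPreviousEmail#((?:[^#]|#(?!EndPreviousEmail#))+?)#EndPreviousEmail#"
WScript.Echo regEx.Replace(strText, "")   ' expected output: ABC
```

In this sketch both forms should give the same result, because the lookahead already stops the group at the first end tag. The non-greedy quantifier matters more for patterns that have no such guard.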
 
pld51 (Author) Commented:
Thanks indeed. Extremely fast, clear comments, great solution!

mccarl (IT Business Systems Analyst / Software Developer) Commented:
I know the points are already gone, and maybe I'm missing something, but couldn't you just use this...

regEx.Pattern = "#StartPreviousEmail#(.*?)#EndPreviousEmail#"

pld51 (Author) Commented:
Thanks, mccarl, for the suggestion. I just tried it and it didn't work. Pity, as it was elegantly simple.

käµfm³d 👽 Commented:
pld51 wrote: "I just tried it and it didn't work. Pity, as it was elegantly simple."
That's because dot does not match newlines by default. As far as I recall, there is no option to change this in VB(Script). However, you can fake it by using a modified character class instead:

regEx.Pattern = "#StartPreviousEmail#([\s\S]*?)#EndPreviousEmail#"
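A small sketch comparing the dot version with the character-class version on text that spans a line break. It assumes Windows Script Host (WScript.Echo), and the two-line sample is invented for illustration:

```vbscript
Option Explicit

Dim regEx, strText

' Hypothetical two-line sample between the tags.
strText = "Hello#StartPreviousEmail#line one" & vbCrLf & _
          "line two#EndPreviousEmail#Goodbye"

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True

' Dot stops at the line break, so the end tag is never reached.
regEx.Pattern = "#StartPreviousEmail#(.*?)#EndPreviousEmail#"
If regEx.Test(strText) Then
    WScript.Echo "dot pattern: match"
Else
    WScript.Echo "dot pattern: no match"
End If

' [\s\S] also matches line breaks, so the whole block is found.
regEx.Pattern = "#StartPreviousEmail#([\s\S]*?)#EndPreviousEmail#"
WScript.Echo regEx.Replace(strText, "")   ' expected output: HelloGoodbye
```

If the text between the tags sat on a single line, both patterns should behave the same; the difference only shows up once a newline is involved.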

pld51 (Author) Commented:
Pity the points are already allocated, because that does seem to work OK and it remains simple. Any chance you could explain what you mean by faking it with a modified character class? Thanks anyway!

pld51 (Author) Commented:
Sorry kaufmed, just noticed it was you who provided the modified solution. Unfortunately, no double points for double solutions!

käµfm³d 👽 Commented:
As I mentioned, dot does not match newlines; however it does match every other character. In order to make dot match newlines, you typically use the "single-line" option--which, as I mentioned, VB does not have. With single-line turned on, dot does match every character. So in order to reproduce the functionality of matching every single character we use two character classes--one being a negation of the other. It doesn't really matter which two classes we use so long as one is the negation of the other. In my example, I used "any whitespace"--"\s"--and any non-whitespace--"\S". I'm sure you'd agree that "any non-whitespace" is the opposite of "any whitespace". The net effect is that we match any character, be it a whitespace character or a non-whitespace character.

As I mentioned, we could use any character class. For example, we could have used word characters:

[\w\W]


...or digits:

[\d\D]


etc. We just need to have opposites (basically, the lowercase and uppercase versions of the class). If we do this, we can simulate a dot that matches every character, including newlines.
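To check that the paired classes really are interchangeable, a sketch like the following could loop over them. It assumes Windows Script Host (WScript.Echo); the classes array and the sample text are hypothetical names and data used only for this example:

```vbscript
Option Explicit

Dim regEx, strText, classes, i

' Hypothetical sample with a line break between the tags.
strText = "Hello#StartPreviousEmail#row 1" & vbCrLf & _
          "row 2#EndPreviousEmail#Goodbye"

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True

' Each entry pairs a class with its negation, so it matches any character.
classes = Array("[\s\S]", "[\w\W]", "[\d\D]")

For i = 0 To UBound(classes)
    regEx.Pattern = "#StartPreviousEmail#(" & classes(i) & "*?)#EndPreviousEmail#"
    ' Each output line is expected to end with HelloGoodbye.
    WScript.Echo classes(i) & " -> " & regEx.Replace(strText, "")
Next
```

All three patterns should strip the same block, which is why the choice between them is mostly a matter of taste.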