Using regex to remove everything between 2 tags

I am trying to use a regex replace to remove everything in a string, strText, between two tags, but it is not working. "Everything" here is HTML spread over several lines, basically a table with plenty of whitespace. The block starts with the tag #StartPreviousEmail# and ends with the tag #EndPreviousEmail#.

Here is what I am currently using, but it does not work:
Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.Pattern = "#StartPreviousEmail#([^#]+)#EndPreviousEmail#"
strText = regEx.Replace(strText,"")

I would be grateful for the correct way to do this. Thanks.

kaufmed

What does the data look like and what kind of result are you receiving?

pld51

ASKER

The data I am trying to remove, between the 'tags', is below.

The result, right now, is that the same content still appears in the document, rather than nothing between the tags.

#StartPreviousEmail#<table width="100%" border="0" cellspacing="2" cellpadding="2">
  <tr>
    <td><table width="98%" border="1" align="center" cellpadding="2" cellspacing="0" bordercolor="#B9B9B9">
          <tr>
            <td bgcolor="#FFFFFF"><table width="99%" border="0" align="center" cellpadding="2" cellspacing="1">
                <tr>
                  <td width="40" align="right"><div align="left"><span class="style16"><strong><font color="#909090">To</font></strong></span></div></td>
                  <td><strong>REPLACE_ToPerson</strong> (REPLACE_ToEmail)</td>
                  <td width="20%" align="center">&nbsp;</td>
                </tr>

                <tr>
                  <td width="40" align="right"><div align="left"><span class="style16"><strong><font color="#909090">From</font> </strong></span></div></td>
                  <td colspan="2"><strong>REPLACE_FromPerson</strong> (REPLACE_FromEmail)</td>
                </tr>

                <tr>
                  <td width="40" align="right"><div align="left"><strong><font color="#909090">Sent</font></strong></div></td>
                  <td colspan="2">REPLACE_DateMsge REPLACE_TimeMsge</td>
                </tr>
                <tr>
                  <td width="40" align="right"><div align="left"><span class="style16"><strong><font color="#909090">Re</font></strong> </span></div></td>
                  <td colspan="2">REPLACE_Subject</td>
                </tr>

                <tr>
                  <td colspan="3" align="right" valign="top"><table width="98%" border="0" align="center" cellpadding="2" cellspacing="0" class="t-topborder2">
                    <tr>
                      <td><div align="left">REPLACE_previousmessage</div></td>
                    </tr>
                  </table></td>
              </tr>
            </table></td>
          </tr>
        </table></td>
  </tr>
</table>#EndPreviousEmail#

kaufmed

Here's your biggest roadblock:

...
<td bgcolor="#FFFFFF">
...

Note that you are using "[^#]" in your pattern. What happens when the pattern encounters that # in the color string? This is why your pattern fails. Let's try modifying the pattern to account for such occurrences:

regEx.Pattern = "#StartPreviousEmail#((?:[^#]|#(?!EndPreviousEmail#))+)#EndPreviousEmail#"
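
To make the difference concrete, here is a minimal sketch comparing the two patterns. It assumes the script is run under Windows Script Host (for example with cscript), and the one-line sample string is made up for illustration; it is not the asker's actual data.

```vbscript
Option Explicit

Dim regEx, strText

' Hypothetical sample: the block contains a "#" (as in bgcolor="#FFFFFF")
strText = "Hello #StartPreviousEmail#<td bgcolor=""#FFFFFF"">old</td>#EndPreviousEmail# World"

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True

' Original pattern: [^#]+ stops at the "#" of #FFFFFF, and what follows
' is not #EndPreviousEmail#, so nothing matches (Test returns False)
regEx.Pattern = "#StartPreviousEmail#([^#]+)#EndPreviousEmail#"
WScript.Echo regEx.Test(strText)

' Revised pattern, piece by piece:
'   [^#]                    any character that is not "#"
'   #(?!EndPreviousEmail#)  a "#", but only if it does not start the end tag
'   (?: ... | ... )+        one or more of either alternative
regEx.Pattern = "#StartPreviousEmail#((?:[^#]|#(?!EndPreviousEmail#))+)#EndPreviousEmail#"
WScript.Echo regEx.Replace(strText, "")   ' Hello  World
```

Because [^#] is a negated class, it also matches line breaks, which is why this pattern copes with HTML spread over several lines.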

pld51

ASKER

well spotted, will test out. otherwise perhaps using another symbol will do the trick.

pld51

ASKER

This works perfectly and is the answer, thanks. Before I click to reward and close, I would much appreciate it if you could explain the logic, as it is not clear to me.

ASKER CERTIFIED SOLUTION

kaufmed

pld51

ASKER

Thanks indeed. Extremely fast, clear comments, great solution!

mccarl

I know the points are already gone, and maybe I'm missing something but couldn't you just use this...

regEx.Pattern = "#StartPreviousEmail#(.*?)#EndPreviousEmail#"

pld51

ASKER

Thanks mccarl for the suggestion. I just tried it and it didn't work. Pity, as was elegantly simple.

kaufmed

"Thanks mccarl for the suggestion. I just tried it and it didn't work. Pity, as was elegantly simple."

That's because dot does not match newlines by default. As far as I recall, there is no option to change this in VB(Script). However, you can fake it by using a modified character class instead:

regEx.Pattern = "#StartPreviousEmail#([\s\S]*?)#EndPreviousEmail#"
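
A small sketch of the difference, assuming Windows Script Host (cscript) and a made-up two-line sample string:

```vbscript
Option Explicit

Dim regEx, strText

' Hypothetical sample: vbCrLf puts a line break inside the tagged block
strText = "Keep #StartPreviousEmail#<table>" & vbCrLf & "</table>#EndPreviousEmail# this"

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True

' Dot cannot cross the line break, so .*? never reaches the end tag
' and Replace hands the text back unchanged
regEx.Pattern = "#StartPreviousEmail#(.*?)#EndPreviousEmail#"
WScript.Echo regEx.Replace(strText, "")

' [\s\S] accepts line breaks as well, so the whole block is removed
regEx.Pattern = "#StartPreviousEmail#([\s\S]*?)#EndPreviousEmail#"
WScript.Echo regEx.Replace(strText, "")   ' Keep  this
```

The lazy quantifier *? matters when the text holds more than one tagged block: it stops at the first #EndPreviousEmail# instead of running on to the last one.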

pld51

ASKER

Pity the points are already allocated, because that does seem to work OK and it remains simple. Any chance you could explain what you mean by faking it with a modified character class? Thanks anyway!

pld51

ASKER

Sorry kaufmed, just noticed it was you that provided the modified solution. Unfortunately no double points for double solutions!

kaufmed

As I mentioned, dot does not match newlines; however it does match every other character. In order to make dot match newlines, you typically use the "single-line" option--which, as I mentioned, VB does not have. With single-line turned on, dot does match every character. So in order to reproduce the functionality of matching every single character we use two character classes--one being a negation of the other. It doesn't really matter which two classes we use so long as one is the negation of the other. In my example, I used "any whitespace"--"\s"--and any non-whitespace--"\S". I'm sure you'd agree that "any non-whitespace" is the opposite of "any whitespace". The net effect is that we match any character, be it a whitespace character or a non-whitespace character.

As I mentioned, we could use any character class paired with its negation. For example, we could have used word characters:

[\w\W]

...or digits:

[\d\D]

etc. We just need to have opposites (basically, the lowercase and uppercase versions of the same class). If we do this, we can simulate a dot that matches every character, newlines included.
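
As a quick check that the choice of class pair makes no difference, here is a sketch that tries all three on the same input. It assumes Windows Script Host (cscript); the sample string and the variable name cls are made up for illustration.

```vbscript
Option Explicit

Dim regEx, strText, cls

' Hypothetical sample with a line break inside the tagged block
strText = "Keep #StartPreviousEmail#<tr>" & vbCrLf & "</tr>#EndPreviousEmail# this"

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True

' Each pair is a class plus its negation, so each one matches any character
For Each cls In Array("[\s\S]", "[\w\W]", "[\d\D]")
    regEx.Pattern = "#StartPreviousEmail#(" & cls & "*?)#EndPreviousEmail#"
    ' Every iteration should print the same cleaned text: Keep  this
    WScript.Echo cls & " -> " & regEx.Replace(strText, "")
Next
```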