pld51
Using regex to remove everything between 2 tags
I am trying to use regex replace to remove everything in a string strText between two tags, but it is not working. "Everything" here is HTML code spread over a couple of lines, basically a table with plenty of spaces etc. At the start is the tag #StartPreviousEmail#, at the end the tag #EndPreviousEmail#.
Here is what I am currently using, but it does not work:-
Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.Pattern = "#StartPreviousEmail#([^#]+)#EndPreviousEmail#"
strText = regEx.Replace(strText,"")
Grateful for correct way to do this. Thanks
What does the data look like and what kind of result are you receiving?
ASKER
The data I am trying to remove, between the 'tags', is below.
The result, right now, is that the same content still appears in the document, rather than nothing between the tags.
#StartPreviousEmail#<table width="100%" border="0" cellspacing="2" cellpadding="2">
<tr>
<td><table width="98%" border="1" align="center" cellpadding="2" cellspacing="0" bordercolor="#B9B9B9">
<tr>
<td bgcolor="#FFFFFF"><table width="99%" border="0" align="center" cellpadding="2" cellspacing="1">
<tr>
<td width="40" align="right"><div align="left"><span class="style16"><strong><font color="#909090">To</font></strong></span></div></td>
<td><strong>REPLACE_ToPerson</strong> (REPLACE_ToEmail)</td>
<td width="20%" align="center"> </td>
</tr>
<tr>
<td width="40" align="right"><div align="left"><span class="style16"><strong><font color="#909090">From</font> </strong></span></div></td>
<td colspan="2"><strong>REPLACE_FromPerson</strong> (REPLACE_FromEmail)</td>
</tr>
<tr>
<td width="40" align="right"><div align="left"><strong><font color="#909090">Sent</font></strong></div></td>
<td colspan="2">REPLACE_DateMsge REPLACE_TimeMsge</td>
</tr>
<tr>
<td width="40" align="right"><div align="left"><span class="style16"><strong><font color="#909090">Re</font></strong> </span></div></td>
<td colspan="2">REPLACE_Subject</td>
</tr>
<tr>
<td colspan="3" align="right" valign="top"><table width="98%" border="0" align="center" cellpadding="2" cellspacing="0" class="t-topborder2">
<tr>
<td><div align="left">REPLACE_previousmessage</div></td>
</tr>
</table></td>
</tr>
</table></td>
</tr>
</table></td>
</tr>
</table>#EndPreviousEmail#
Here's your biggest roadblock:
...
<td bgcolor="#FFFFFF">
...
Note that you are using "[^#]" in your pattern. What happens when the pattern encounters that # in the color string? This is why your pattern fails. Let's try modifying the pattern to account for such occurrences:
regEx.Pattern = "#StartPreviousEmail#((?:[^#]|#(?!EndPreviousEmail#))+)#EndPreviousEmail#"
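One way to read that pattern: the group `(?:[^#]|#(?!EndPreviousEmail#))+` walks forward one character at a time, accepting either a non-# character or a # that is not the start of the closing tag. The `(?!...)` part is a negative lookahead, which checks what follows without consuming it. Because `[^#]` is a negated class, it also accepts line breaks. Below is a minimal sketch, assuming the script runs under Windows Script Host (hence WScript.Echo) and using a made-up two-line sample instead of the full table:

```vbscript
Option Explicit

Dim regEx, strText

' Hypothetical sample: the block contains a "#" (the color value) and a line break
strText = "Hello [" & _
          "#StartPreviousEmail#<td bgcolor=""#FFFFFF"">" & vbCrLf & _
          "old message</td>#EndPreviousEmail#" & _
          "] world"

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
' [^#]                    any character except "#", line breaks included
' #(?!EndPreviousEmail#)  a "#" that does not begin the closing tag
regEx.Pattern = "#StartPreviousEmail#((?:[^#]|#(?!EndPreviousEmail#))+)#EndPreviousEmail#"

strText = regEx.Replace(strText, "")
WScript.Echo strText   ' prints: Hello [] world
```

In classic ASP or VBA the output call would differ (Response.Write or Debug.Print, for instance); the pattern itself stays the same.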
ASKER
well spotted, will test out. otherwise perhaps using another symbol will do the trick.
ASKER
This works perfectly and is the answer, thanks. Before I click to reward & close, I would much appreciate it if you could explain the logic, as it is not clear to me.
ASKER
Thanks indeed. Extremely fast, clear comments, great solution!
I know the points are already gone, and maybe I'm missing something but couldn't you just use this...
regEx.Pattern = "#StartPreviousEmail#(.*?)#EndPreviousEmail#"
ASKER
Thanks mccarl for the suggestion. I just tried it and it didn't work. Pity, as was elegantly simple.
That's because dot does not match newlines by default. As far as I recall, there is no option to change this in VB(Script). However, you can fake it by using a modified character class instead:
regEx.Pattern = "#StartPreviousEmail#([\s\S]*?)#EndPreviousEmail#"
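To see the difference side by side, here is a small sketch that tests both patterns against the same text. The sample string and the Report helper are made up for illustration, and WScript.Echo assumes the script runs under Windows Script Host:

```vbscript
Option Explicit

Dim regEx, strText

' Hypothetical sample with a line break between the two tags
strText = "#StartPreviousEmail#line one" & vbCrLf & "line two#EndPreviousEmail#"

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True

' Hypothetical helper: reports whether the given pattern finds a match
Sub Report(label, pattern)
    regEx.Pattern = pattern
    If regEx.Test(strText) Then
        WScript.Echo label & ": match"
    Else
        WScript.Echo label & ": no match"
    End If
End Sub

' Dot stops at the line break, so the closing tag is never reached
Report "dot", "#StartPreviousEmail#(.*?)#EndPreviousEmail#"            ' dot: no match

' [\s\S] accepts every character, line breaks included
Report "class", "#StartPreviousEmail#([\s\S]*?)#EndPreviousEmail#"     ' class: match
```

The lazy `*?` matters once a string holds more than one tagged block: it stops at the first closing tag instead of running on to the last one.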
ASKER
Pity the points already allocated, because that does seem to work OK and it remains simple. Any chance you could explain what you mean by faking it with modified character class? Thanks anyway!
ASKER
Sorry kaufmed, just noticed it was you that provided the modified solution. Unfortunately no double points for double solutions!
As I mentioned, dot does not match newlines; however it does match every other character. In order to make dot match newlines, you typically use the "single-line" option--which, as I mentioned, VB does not have. With single-line turned on, dot does match every character. So in order to reproduce the functionality of matching every single character we use two character classes--one being a negation of the other. It doesn't really matter which two classes we use so long as one is the negation of the other. In my example, I used "any whitespace"--"\s"--and any non-whitespace--"\S". I'm sure you'd agree that "any non-whitespace" is the opposite of "any whitespace". The net effect is that we match any character, be it a whitespace character or a non-whitespace character.
As I mentioned, we could use any character class. For example, we could have used word characters:
[\w\W]
...or digits:
[\d\D]
etc. We just need to have opposites (basically, the lowercase and uppercase versions of the same class). If we do this, we can simulate a dot that matches every character, newlines included.
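A quick way to check that the pairs are interchangeable is to run the same replace with each one. This is a sketch only: the sample text is made up, and WScript.Echo assumes Windows Script Host. Each line of output should show the same result, with the tagged block removed.

```vbscript
Option Explicit

Dim regEx, strText, anyChar

' Hypothetical sample with a line break inside the tagged block
strText = "keep [#StartPreviousEmail#row 1" & vbCrLf & "row 2#EndPreviousEmail#] keep"

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True

' Try each "class plus its negation" pair as the match-anything token
For Each anyChar In Array("[\s\S]", "[\w\W]", "[\d\D]")
    regEx.Pattern = "#StartPreviousEmail#(" & anyChar & "*?)#EndPreviousEmail#"
    WScript.Echo anyChar & " -> " & regEx.Replace(strText, "")
Next
```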