[Okta Webinar] Learn how to a build a cloud-first strategyRegister Now

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1015
  • Last Modified:

Problem with Regular Expression to replace some HTML in VBA

Hi,

I am trying to use a regular expression in VBA to replace some HTML in a file with new HTML.  What I want to replace is the following:

"<tr> blah blah blah Preliminary blah blah blah</tr>"  

There can be white space or whatever between the 'Preliminary' and the '</tr>'.

Here is my Regular Expression:  re.Pattern = "<tr>.*(Preliminary){1}[.\s]*tr>"

It's not working for me.  Nothing gets replaced.

Can anyone help me out?

Thanks a lot,
Maureen
0
mfb47
Asked:
mfb47
  • 9
  • 6
  • 4
  • +2
1 Solution
 
dlwyatt82Commented:
I'm not sure I see what you're trying to accomplish here...  Please give one or two "before and after" examples of what you want your HTML tags to look like.
0
 
mfb47Author Commented:
Hi, What I'm doing is replacing a preliminary set of totals with a final set of totals, for the top line of this report by year.  So I have to replace a chunk of HTML that starts with <tr> and ends with </tr> and has 'Preliminary' in it.  This HTML I need to replace is after a bunch of HTML that puts out a header row, and before a bunch of HTML that is for previous years.

Here is a before:
      <th align="center"><font size="2">Grades 10 - 12</font></th>
      <th align="center"><font size="2">Ungraded</font></th>
      <th align="center"><font size="2">Total</font></th>
</tr>

<tr><td align="right"><font size="2">2005/2006 Preliminary</font></td>
      <td align="right"><font size="2">77,877</font></td>
      
      <td align="right"><font size="2">22,088 </font></td>
      <td align="right"><font size="2">1,000 </font></td>
      <td align="right"><font size="2">40,916 </font></td>
</tr>

<tr><td align="right"><font size="2">2004/2005</font></td>
      <td align="right"><font size="2">77,877</font></td>
      
      <td align="right"><font size="2">22,088 </font></td>
      <td align="right"><font size="2">1,000 </font></td>
</tr>



Here is the After:

<th align="center"><font size="2">Grades 10 - 12</font></th>
      <th align="center"><font size="2">Ungraded</font></th>
      <th align="center"><font size="2">Total</font></th>
</tr>

<tr><td align="right"><font size="2">2005/2006 </font></td>
      <td align="right"><font size="2">99,999</font></td>
      
      <td align="right"><font size="2">99,999</font></td>
      <td align="right"><font size="2">99,999</font></td>
      <td align="right"><font size="2">99,999</font></td>
</tr>

<tr><td align="right"><font size="2">2004/2005</font></td>
      <td align="right"><font size="2">77,877</font></td>
      
      <td align="right"><font size="2">22,088 </font></td>
      <td align="right"><font size="2">1,000 </font></td>
</tr>

Thanks

0
 
dlwyatt82Commented:
Did you deliberately change some of the numbers to 99,999 in your example (as part of what you want your regex to do), or is your intention just to remove the word "Preliminary" from the row header?
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
lojk.Net and Infrastructure ConsultantCommented:
If

<tr><td align="right"><font size="2">2005/2006 Preliminary</font></td>
     <td align="right"><font size="2">#%VAR1%#</font></td>
     
     <td align="right"><font size="2">#%VAR2%#</font></td>
     <td align="right"><font size="2">#%VAR3%#</font></td>
     <td align="right"><font size="2">#%VAR4%#</font></td>
</tr>

Why not do a

TextToFind="#%Var1%#"
TextToReplace="Whatever the value should be"

AllHTML=replace(1,AllHTML,TextToFind,TextToReplace)

for each variable that you need replacing? Assuming you can see how this can be extrapolated for the whole document...
0
 
mfb47Author Commented:
Hi

dlwyatt82: Sorry for being unclear - I changed those to 99,999 to indicate that some new number is coming in there.  What I am doing is replacing one hunk of HTML with another hunk of HTML:

re.Pattern = "<tr>.*(Preliminary){1}[.\s]*tr>"        
fileContent = re.Replace(outFile.ReadAll, outString)

outString has my new stuff in it.

lojk: I don't think I can do that, but I will look at it more carefully (I'm pretty new at VBA)

Thanks!
0
 
wnrossCommented:
re.Pattern = "<tr>.*(Preliminary){1}[.\s]*tr>"

1) You don't need a quantity for single expressions
re.Pattern = "<tr>.*(Preliminary)[.\s]*tr>"

2) Your closing tag looks like </tr>, so an expression of tr> will not match
re.Pattern = "<tr>.*(Preliminary)[.\s]*</tr>"

3) You have additional text  "</font>" which your class "[.\s]" does not match, either
a) refine
re.Pattern = "<tr>.*(Preliminary)[.\s]*</font></tr>"

b) generalize
re.Pattern = "<tr>.*(Preliminary).*</tr>"

Oh, long term, <font> is not an HTML tag, you might want to get rid of them entirely.
http://www.w3.org


0
 
dlwyatt82Commented:
Be very very careful with this one.  wnross's final solution will NOT do what you expect.  Unfortunately, the VBScript implementation does not support Atomic Groups or Posessive Quantifiers, which would be perfect in this case.  IF these regexp implementations supported Atomic Groups, this would be the correct pattern:

<tr>[\s\S]*?Preliminary[\s\S]*?(?></tr>)

Notice the (?>  )  surrounding the "</tr>".  The reason this is needed is so your expression doesn't match multiple <tr> .... </tr> blocks at once.  There may not be a way to do this in VBA with a simple call to Replace.  Instead you might need to work around this limitation with a little more complicated macro.
0
 
dlwyatt82Commented:
Here is the only way I could find to accomplish this task using VBScript (or VBA, since it's the same Regular Expression engine either way).  It works by matching each <tr>...</tr> block in your file, one at a time (this is accomplished with a simple regular expression using a lazy quantifier).  Then, using normal VB string manipulation, it checks for the presence of "preliminary", and replaces the entire contents of that match in the "fileContents" string, if found.  This version will only find / replace one occurence of a "Preliminary" row, which seems to have been your intention.

Note that two new object variables are used in this code, "colMatches" and "objMatch".  If your VBA macro has Option Explicit enabled, make sure you declare those two variables with Dim statements at some point.

'*******************************************

fileContent = outFile.ReadAll

re.IgnoreCase = True
re.Global = True
re.Pattern = "<tr>[\s\S]*?</tr>"

Set colMatches = re.Execute(fileContent)

For Each objMatch In colMatches
  If (InStr(LCase(objMatch.value), "preliminary") > 0) Then
    fileContent = _
      Left(fileContent, objMatch.FirstIndex) & _
      outString & _
      Mid(fileContent, objMatch.FirstIndex + objMatch.Length + 1)
    Exit For
  End If
Next

'***********************************************
0
 
dlwyatt82Commented:
One other thing.  In a VBA macro, you may need to change "Next" to "Next objMatch".  I was testing that code as a VBScript.
0
 
mfb47Author Commented:
Hi wnross,

I've already tried the generalize one you suggested.  

I also took all the font tags out of my input file (but I was wondering why my pattern would not have match the font tags - it seems like it should).

Also, if I take out the \s, I think white space characters like new line won't be matched.



Thanks!
0
 
mfb47Author Commented:
dlwyatt82:

That looks promising!  Thanks a lot, I wouldn't have thought of that.

I don't understand why my expression might match multiple <tr> </tr> blocks when there is the literal 'preliminary' specified as part of the pattern, and that only occurs once in the file.  I don't quite have a handle on these regular expressions, I guess!

0
 
wnrossCommented:
I suspect mfb47 was matching line-by-line so globbing the entie line didn't seem to be a concern.

However, the MS Regex implementation does respect perl "greediness" modifiers

So modify:
re.Pattern = "<tr>.*(Preliminary).*</tr>"
to include a "greediness" modifer
re.Pattern = "<tr>.*(Preliminary).*?</tr>"

Oh dlwyatt, that wasn't my final answer, I gave two options to explore.  This is just the generalization version.

Cheers,
-Bill
0
 
dlwyatt82Commented:
It's sometimes hard to see how the RegEx engine is working.  I'll try to explain why doing this all with one pattern and Replace call doesn't work (unless your regex engine supports Atomic Groups)

Say your search string is this:

<tr>SomeStuff
STuff
Stuff
Stuff
</tr>

<tr>SomeStuff Preliminary
Stuff
Stuff
Stuff
</tr>

And you try to match the pattern "<tr>[\s\S]*?Preliminary[\s\S]*?</tr>", here's what happens:

<tr> matches the first <tr>.

[\s\S]*? matches "STuff
Stuff
Stuff
</tr>

<tr>SomeStuff "

Preliminary matches that word, literally.

[\s\S]*? matches "

Stuff
Stuff
Stuff
"

And finally, </tr> matches that tag literally.

Notice how the "everything" pattern [\s\S]*? matched an instance of </tr> and another <tr> because it couldn't find the word "Preliminary" in the first block.
0
 
ozoCommented:
If  re.Pattern doesn't understand *? then how about
"<tr>[^<>]*Preliminary[^<>]*</tr>"
0
 
dlwyatt82Commented:
I'm not trying to be a jerk here, just pointing out that this is a very tricky question, due in part to Microsoft's incomplete implementation of RegEx.  Greedy / Lazy quantifiers are only part of the problem, and Atomic Groups are something entirely different.  If you read through my previous post explaining how the wild cards are matching, you'll see what I mean.
0
 
wnrossCommented:
mfb47: Your <font> tag wasn't getting caught because
> I also took all the font tags out of my input file (but I was wondering why my pattern would not have match the font tags - it seems like it should).


 re.Pattern = "<tr>.*(Preliminary){1}[.\s]*tr>"
-------------------------------------------^^^
[.\s] matches "." and white space.  

"." loses its special meaning when being used as part of a class.

> Also, if I take out the \s, I think white space characters like new line won't be matched.
.* matches everything.  Thats also the answer to
> I don't understand why my expression might match multiple <tr> </tr> blocks ....

Cheers,
-Bill
0
 
dlwyatt82Commented:
Unfortunately it's very difficult to explain debugging a regular expression in words, and I'm not good at it.  In any case, the solution I posted works as intended with the Microsoft VB / VBScript RegExp implementation.
0
 
dlwyatt82Commented:
. does NOT match everything.  It matches everything but newline characters.  Microsoft's VBScript engine doesn't support the "single line" mode, aka "dot matches newline".  However, you can get around this limitation by using the [\s\S] character class instead of .  (as I did in my examples).
0
 
wnrossCommented:
dlwyatt82:
Cool.
0
 
mfb47Author Commented:
HI!

wnross: No, I took the font tag out later - I was wondering about what you said earlier:"3) You have additional text  "</font>" which your class "[.\s]" does not match, either ".  I didn't know that the meaning of the . changes when it's in a class, so that explains why I don't need the /s.  Thanks a lot!

dlwyatt82: thanks for stepping through regular expressions - I am going to have to ponder this for awhile.  Still have not been able to try your solution.  


 


0
 
mfb47Author Commented:
dlwyatt82, What a great expert you are.  Thanks a lot - it's FINALLY working, and I learnt a lot.

0

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

  • 9
  • 6
  • 4
  • +2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now