Solved

Avoiding Greedy matches and capturing Multiple lines using Regular Expressions (VB)

Posted on 2004-08-24
Last Modified: 2010-08-05
Hi,
I need to extract the lines of text between a specified "start" and "end". I am using Microsoft VBScript Regular Expressions 5.5; here is my VB code:

'-------------------------------------------------------
Dim objReg As New RegExp
Dim objMatchCol As MatchCollection
With objReg
    .Global = True
    .IgnoreCase = True
    .Pattern ="start(.*?)end"
    sSourceString="start : Welcome to Experts-Exchnge : end"
    Set objMatchCol = .Execute(sSourceString)
End With
 '--------------------------------------------------------------
It works well when everything is on one line, but it fails to match when the text between start and end spans multiple lines. See the following text:
'------------------------------------------
start

 : Welcome
         to
        Experts-Exchnge
 :
 end
'---------------------------------------
How can I get these lines between 'start' and 'end' as a single match?
Please help.
Thanks
Shiju

Question by:Shiju Sasidharan
 
LVL 84

Accepted Solution

by:
ozo earned 80 total points
ID: 11887980
.Pattern ="start([\s\S]*?)end"
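
A minimal sketch of how this pattern might be used from VB6, assuming a project reference to Microsoft VBScript Regular Expressions 5.5 as in the question. The routine name ExtractSections and the sample text are made up for illustration. [\s\S] matches any character, line breaks included, and *? keeps each match as short as possible, so each start...end pair should come back as its own match.

```vb
' Assumes a project reference to "Microsoft VBScript Regular Expressions 5.5".
' ExtractSections is a hypothetical helper name, not part of the original post.
Public Sub ExtractSections()
    Dim objReg As RegExp
    Dim objMatchCol As MatchCollection
    Dim objMatch As Match
    Dim sSourceString As String

    ' Two sections: one on a single line, one spread over several lines.
    sSourceString = "start : one line : end" & vbCrLf & _
                    "start" & vbCrLf & _
                    " : Welcome" & vbCrLf & _
                    "   to" & vbCrLf & _
                    " : " & vbCrLf & _
                    "end"

    Set objReg = New RegExp
    With objReg
        .Global = True          ' return every match, not just the first
        .IgnoreCase = True
        .Pattern = "start([\s\S]*?)end"
        Set objMatchCol = .Execute(sSourceString)
    End With

    Debug.Print objMatchCol.Count & " match(es)"
    For Each objMatch In objMatchCol
        ' SubMatches(0) is the text captured between start and end.
        Debug.Print "[" & objMatch.SubMatches(0) & "]"
    Next objMatch
End Sub
```

Because the pattern is not anchored with ^ or $, the sections can sit anywhere inside a larger string.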
 
LVL 1

Assisted Solution

by:sgartner
sgartner earned 20 total points
ID: 11889393
First, I'm not a VB expert, but the Microsoft web site indicates that the regular expression syntax for VB is the same as for VBScript. The period stands for any character in an RE except the newline character ("\n"). So, for the single-line example you have in VB you could simply use "^start(.*)end$", but this does not match the longer sample because the period *will not* match a line-end character.

To match all of the text between start and end including multiple lines: "^start((.|\n)*)end$"

This assumes, like both samples, that the entire string begins with "start" and ends with "end". If you expect other text before or after, you would remove the caret and the dollar sign.

What this says is that the matched string begins with "start" and is then followed by either any character (".") or a newline ("\n"), zero or more times ("*"). The outer parentheses create the submatch you want in the MatchCollection; the inner ones only serve to group the two alternatives, so their submatch can be ignored.
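
The grouping described above can be sketched roughly as follows. This is an illustration only: it assumes a project reference to Microsoft VBScript Regular Expressions 5.5, the routine name ShowSubMatches and the sample text are invented, and the sample uses a bare line feed (vbLf) as the line separator so that it lines up with the "\n" in the pattern.

```vb
' Assumes a project reference to "Microsoft VBScript Regular Expressions 5.5".
' ShowSubMatches and the sample text are hypothetical, for illustration only.
Public Sub ShowSubMatches()
    Dim objReg As RegExp
    Dim objMatchCol As MatchCollection
    Dim sSourceString As String

    ' The whole string begins with "start" and ends with "end";
    ' lines are separated by a line feed (vbLf, i.e. "\n").
    sSourceString = "start" & vbLf & " : Welcome" & vbLf & "end"

    Set objReg = New RegExp
    objReg.IgnoreCase = True
    objReg.Pattern = "^start((.|\n)*)end$"
    Set objMatchCol = objReg.Execute(sSourceString)

    If objMatchCol.Count > 0 Then
        ' SubMatches(0): outer group, everything between start and end.
        Debug.Print "[" & objMatchCol(0).SubMatches(0) & "]"
        ' SubMatches(1): inner group, only the last character it matched.
        Debug.Print Len(objMatchCol(0).SubMatches(1))
    Else
        Debug.Print "no match"
    End If
End Sub
```

Only SubMatches(0) is of interest here; the inner group exists purely to combine the two alternatives.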

One last thing that confuses me:  In VBScript (and Perl, etc.) the construct ".*?" is not legal, so you may need to adjust my RE syntax to match that of VB.

Sorry for submitting this twice, the first time I submitted early, and found that I can't edit my submissions on this site.  If a moderator reading this has the ability, please remove my previous post.
 

Expert Comment

by:akhil_thakur
ID: 11890485
Which OS are you using?

 

Expert Comment

by:akhil_thakur
ID: 11890708
Oh, it's Microsoft...
No regular expressions match multiple lines. Your problem can be solved by writing a Perl script.
Install the free software ActivePerl for Windows from
http://www.activestate.com/Products/ActivePerl/

Copy the following to a .pl file, say startend.pl:

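use strict;
use warnings;

# Print every line that lies between a line containing "start"
# and the next line containing "end" (the marker lines are not printed).
my $inSection = 0;
while (<STDIN>)
{
  if (m/end/)
  {
    $inSection = 0;   # closing marker: stop printing
  }
  if ($inSection)
  {
    print;            # inside the section: echo the current line
  }
  if (m/start/)
  {
    $inSection = 1;   # opening marker: print from the next line on
  }
}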

Assume the source file is called temp.txt.

Use the following command:
perl startend.pl < temp.txt

You may need to provide the full path if the Perl script or temp.txt is not in the current directory.
The required section is printed to STDOUT, which can also be redirected to a file.


 
LVL 14

Author Comment

by:Shiju Sasidharan
ID: 11899816
Hello,
Thank you all for posting comments.

Hi sgartner,
    I tried your code
          "^start((.|\n)*)end$"
but this doesn't seem to solve it. It works only if the entire match is within a single line.
The sample text I gave in the question can be in the middle of a large string. I need to find all such matches in the entire string.

Hi ozo,
    I tried your code
                            .Pattern ="start([\s\S]*?)end"
This also has the same effect: it only finds matches within a single line.
I am using VB6; the OS is Windows 2000.

Hoping for more comments.

Shiju
         



 
LVL 1

Expert Comment

by:sgartner
ID: 11899837
Shiju,

The ^ and $ force it to be by itself, which is how your samples were.  Just remove those two characters and it should work.

Scott
 
LVL 14

Author Comment

by:Shiju Sasidharan
ID: 11939350
Hi Ozo,
Thank you for your code; I am accepting your answer.
The code you gave works well. I am really sorry that I posted a comment saying it was not working. It was my mistake: I used an invalid string for verification.
It took only 20 milliseconds to run the pattern you gave against my source string.

             .Pattern ="start([\s\S]*?)end"

Hi Scott,
Thank you for your post; I am accepting your answer as a supporting one.
After removing ^ and $ the pattern was greedy, but adding a ? to your code made it match successfully:

             .Pattern ="^start((.|\n)*?)end$"

This pattern solved my problem, but it took 90 milliseconds to run against my source string.

Regards
Shiju
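
A timing comparison like the one above could be reproduced with a sketch along these lines. It is only an illustration under stated assumptions: a project reference to Microsoft VBScript Regular Expressions 5.5, invented names (TimePattern, CompareTimes), a made-up test string, and the un-anchored form of both patterns. The VB Timer function has coarse resolution and resets at midnight, so each pattern is run many times and the total is reported.

```vb
' Assumes a project reference to "Microsoft VBScript Regular Expressions 5.5".
' TimePattern, CompareTimes and the test string are hypothetical.
Private Function TimePattern(ByVal sPattern As String, _
                             ByVal sSource As String, _
                             ByVal nRuns As Long) As Single
    Dim objReg As RegExp
    Dim objMatchCol As MatchCollection
    Dim nCount As Long
    Dim i As Long
    Dim sngStart As Single

    Set objReg = New RegExp
    objReg.Global = True
    objReg.IgnoreCase = True
    objReg.Pattern = sPattern

    sngStart = Timer                      ' seconds since midnight
    For i = 1 To nRuns
        Set objMatchCol = objReg.Execute(sSource)
        nCount = objMatchCol.Count        ' read the result of each run
    Next i
    TimePattern = Timer - sngStart        ' total seconds for nRuns executions
End Function

Public Sub CompareTimes()
    Dim sSource As String
    sSource = "x start" & vbLf & " : Welcome" & vbLf & " end y"

    Debug.Print "[\s\S]*? :", TimePattern("start([\s\S]*?)end", sSource, 1000)
    Debug.Print "(.|\n)*? :", TimePattern("start((.|\n)*?)end", sSource, 1000)
End Sub
```

Absolute numbers will depend on the machine and on the source string, so only the relative difference between the two patterns is meaningful.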


