Regular expression with optional match

Hi,

I am trying to figure out a regular expression to parse the following lines of text line by line.

Send Request 208JRB03~    Job:        JQZ1881H  Destination:  AA.test.internal~    State:      Working   Status:    Active     Action:    None~    Total size: 0       k Remaining: 3668    k Heartbeat: 17:46~    Start:      12:00     Finish:    12:00      Retry:     17:46~    SWD PkgID:  X0300053  SWD Pkg Version: 3~  $$<SMS_PACKAGE_TRANSFER_MANAGER><01-07-2014 17:46:26.144-660><thread=7796 (0x1E74)>
Send Request 208JTB03~    Job:        JYKKQX84  Destination:  BB.test.internal~    State:      Working   Status:    Active     Action:    None~    Total size: 0       k Remaining: 3246    k Heartbeat: 17:46~    Start:      12:00     Finish:    12:00      Retry:     17:46~    SWD PkgID:  X0300052  SWD Pkg Version: 2~  $$<SMS_PACKAGE_TRANSFER_MANAGER><01-07-2014 17:46:26.157-660><thread=7796 (0x1E74)>
Send Request 208JUB03~    Job:        JBBYPU4Y  Destination:  CC.test.internal~    State:      Working   Status:    Active     Action:    None~    Total size: 0       k Remaining: 20      k Heartbeat: 17:46~    Start:      12:00     Finish:    12:00      Retry:     17:46~    SWD PkgID:  X0300052  SWD Pkg Version: 2~  $$<SMS_PACKAGE_TRANSFER_MANAGER><01-07-2014 17:46:26.178-660><thread=7796 (0x1E74)>
Send Request 208JWB03~    Job:        JXQT2T11  Destination:  DD.test.internal~    State:      Working   Status:    Active     Action:    None~    Total size: 0       k Remaining: 2419040 k Heartbeat: 17:46~    Start:      12:00     Finish:    12:00      Retry:     17:46~    SWD PkgID:  X0300054  SWD Pkg Version: 4~  $$<SMS_PACKAGE_TRANSFER_MANAGER><01-07-2014 17:46:26.192-660><thread=7796 (0x1E74)>
Send Request 208JXB03~    Job:        J2ODKBP0  Destination:  AA.test.internal~    State:      Working   Status:    Active     Action:    None~    Total size: 0       k Remaining: 2375789 k Heartbeat: 17:46~    Start:      12:00     Finish:    12:00      Retry:     17:46~    SWD PkgID:  X0300054  SWD Pkg Version: 4~  $$<SMS_PACKAGE_TRANSFER_MANAGER><01-07-2014 17:46:26.204-660><thread=7796 (0x1E74)>
Send Request 208K1B03~    Job:        JB1BVVWP  Destination:  GG.test.internal~    State:      Working   Status:    Active     Action:    None~    Total size: 0       k Remaining: 3142207 k Heartbeat: 17:46~    Start:      12:00     Finish:    12:00      Retry:     17:46~    SWD PkgID:  X0300055  SWD Pkg Version: 4~  $$<SMS_PACKAGE_TRANSFER_MANAGER><01-07-2014 17:46:26.217-660><thread=7796 (0x1E74)>
Send Request 208JYB03~    Job:        JXL6Q1VY  Destination:  TT.test.internal~    State:      Pending   Status:               Action:    None~    Total size: 0       k Remaining: 0       k Heartbeat: 15:55~    Start:      12:00     Finish:    12:00      Retry:          ~    SWD PkgID:  X0300053  SWD Pkg Version: 3~  $$<SMS_PACKAGE_TRANSFER_MANAGER><01-07-2014 17:46:26.223-660><thread=7796 (0x1E74)>
Send Request 208JZB03~    Job:        J49K1PHU  Destination:  YY.test.internal~    State:      Pending   Status:               Action:    None~    Total size: 0       k Remaining: 0       k Heartbeat: 15:55~    Start:      12:00     Finish:    12:00      Retry:          ~    SWD PkgID:  X0300053  SWD Pkg Version: 3~  $$<SMS_PACKAGE_TRANSFER_MANAGER><01-07-2014 17:46:26.229-660><thread=7796 (0x1E74)>
Send Request 208K0B03~    Job:        JUVML621  Destination:  ZZ.test.internal~    State:      Pending   Status:               Action:    None~    Total size: 0       k Remaining: 0       k Heartbeat: 15:55~    Start:      12:00     Finish:    12:00      Retry:          ~    SWD PkgID:  X0300054  SWD Pkg Version: 4~  $$<SMS_PACKAGE_TRANSFER_MANAGER><01-07-2014 17:46:26.236-660><thread=7796 (0x1E74)>

Here is my regular expression:

Send Request\s(?<send_request>\S*)\s*Job:\s*(?<job>\S*)\s*Destination:\s*(?<destination>\S*)~\s*State:\s*(?<state>\S*)\s*Status:\s*(?<status>\S*)\s*Action:\s*(?<action>\S*)~\s*Total size:\s(?<total_size>\d*).*Remaining:\s(?<remaining>\d*).*Heartbeat:\s*(?<heartbeat>\d\d:\d\d)~\s*Start:\s*(?<start>\d\d:\d\d)\s*Finish:\s*(?<finish>\d\d:\d\d)\s*Retry:\s*(?<retry>\d\d:\d\d)~\s*SWD PkgID:\s*(?<package_id>\S*)\s*SWD Pkg Version:\s*(?<package_version>\d*)

The problem I am having is with the status field. As you can see above, status is optional. I still want to parse the last 3 lines, with status either not captured or returned as blank.
So how can I change my regular expression to deal with this?

PS. I have attached the question as a text file to make it easier for people to read.

Thanks,

Ward.
question.txt
Asked by: whorsfall

Kent Dyer (IT Security Analyst Senior) commented:
Can't you simplify this down to the following?

\b.test.internal~    State:      Working   Status:    Active\b

I just checked that in EditPad and it works great!

Ref - http://www.regular-expressions.info/wordboundaries.html
Q-28332143-results.txt

whorsfall (author) commented:
Kent,

Thanks for responding. Reading your response and my original question, I realized I did not state it correctly - my fault :)

What I want is for the regular expression to handle all the lines in the file and *optionally* match the word after "Status:" and before "Action:" if there is one. So for the first six lines it would capture "Active" into the named capture group "status".

For the last three lines, nothing (or blank) would be captured in the capture group "status", but all the other capture groups would still match. This is why I am calling it an optional capture: Status might or might not have data there.

Hope this makes sense :)

Thanks,

Ward

ozo commented:
\S* seems to work fine for the optional <status>,
but then \d\d:\d\d fails for the missing <retry>.
So try making the <retry> group optional:
Send Request\s(?<send_request>\S*)\s*Job:\s*(?<job>\S*)\s*Destination:\s*(?<destination>\S*)~\s*State:\s*(?<state>\S*)\s*Status:\s*(?<status>\S*)\s*Action:\s*(?<action>\S*)~\s*Total size:\s(?<total_size>\d*).*Remaining:\s(?<remaining>\d*).*Heartbeat:\s*(?<heartbeat>\d\d:\d\d)~\s*Start:\s*(?<start>\d\d:\d\d)\s*Finish:\s*(?<finish>\d\d:\d\d)\s*Retry:\s*(?<retry>\d\d:\d\d)?~\s*SWD PkgID:\s*(?<package_id>\S*)\s*SWD Pkg Version:\s*(?<package_version>\d*)
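
For anyone who wants to check this without a .NET project at hand, here is a rough Python sketch of the same pattern. It assumes Python's `re` module, so every `(?<name>...)` is rewritten as `(?P<name>...)`; the pattern is otherwise unchanged. The two sample strings are the first "Working" line and the first "Pending" line from the question, with the trailing `$$<...>` part left off. The names `PATTERN`, `WORKING`, `PENDING` and `parse` are made up for this example.

```python
import re

# Same pattern as above, using Python's (?P<name>...) named-group syntax.
# The "?" after the retry group is the only change from the question's pattern.
PATTERN = re.compile(
    r"Send Request\s(?P<send_request>\S*)\s*Job:\s*(?P<job>\S*)"
    r"\s*Destination:\s*(?P<destination>\S*)~"
    r"\s*State:\s*(?P<state>\S*)\s*Status:\s*(?P<status>\S*)"
    r"\s*Action:\s*(?P<action>\S*)~"
    r"\s*Total size:\s(?P<total_size>\d*).*Remaining:\s(?P<remaining>\d*)"
    r".*Heartbeat:\s*(?P<heartbeat>\d\d:\d\d)~"
    r"\s*Start:\s*(?P<start>\d\d:\d\d)\s*Finish:\s*(?P<finish>\d\d:\d\d)"
    r"\s*Retry:\s*(?P<retry>\d\d:\d\d)?~"
    r"\s*SWD PkgID:\s*(?P<package_id>\S*)"
    r"\s*SWD Pkg Version:\s*(?P<package_version>\d*)"
)

# A line with both Status and Retry filled in.
WORKING = (
    "Send Request 208JRB03~    Job:        JQZ1881H  Destination:  AA.test.internal~"
    "    State:      Working   Status:    Active     Action:    None~"
    "    Total size: 0       k Remaining: 3668    k Heartbeat: 17:46~"
    "    Start:      12:00     Finish:    12:00      Retry:     17:46~"
    "    SWD PkgID:  X0300053  SWD Pkg Version: 3~"
)

# A line where Status and Retry are both blank.
PENDING = (
    "Send Request 208JYB03~    Job:        JXL6Q1VY  Destination:  TT.test.internal~"
    "    State:      Pending   Status:               Action:    None~"
    "    Total size: 0       k Remaining: 0       k Heartbeat: 15:55~"
    "    Start:      12:00     Finish:    12:00      Retry:          ~"
    "    SWD PkgID:  X0300053  SWD Pkg Version: 3~"
)


def parse(line):
    """Return a dict of named groups, or None if the line does not match."""
    m = PATTERN.search(line)
    return m.groupdict() if m else None


working = parse(WORKING)
pending = parse(PENDING)
print(working["status"], working["retry"])
print(repr(pending["status"]), repr(pending["retry"]))
```

In Python, a group that matched nothing (status) comes back as an empty string, while an optional group that was skipped (retry) comes back as `None`, so the two blank fields may need slightly different handling.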

Fernando Soto (Retired) commented:
Hi whorsfall;

I think you will find this pattern works for both cases. There is a second issue with the Retry field: some lines have digits there and others do not, so I took care of that as well.

string pattern = @"Send Request\s(?<send_request>\S*)\s*Job:\s*(?<job>\S*)\s*Destination:\s*(?<destination>\S*)~\s*State:\s*(?<state>\S*)\s*Status:\s*(?<status>.*)Action:\s*(?<action>\S*)~\s*Total size:\s(?<total_size>\d*).*Remaining:\s(?<remaining>\d*).*Heartbeat:\s*(?<heartbeat>\d\d:\d\d)~\s*Start:\s*(?<start>\d\d:\d\d)\s*Finish:\s*(?<finish>\d\d:\d\d)\s*Retry:\s*(?<retry>(?:\d\d:\d\d~)|(?:~))\s*SWD PkgID:\s*(?<package_id>\S*)\s*SWD Pkg Version:\s*(?<package_version>\d*)";

ozo commented:
http:#a39763594 changes not only how your last 3 lines match, but also how your first 6 lines match.
http:#a39761680 assumed that you were satisfied with how your original expression matched the first 6 lines, and that you only wanted to change it to also match the last 3 lines.
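
To make that difference concrete, here is a small Python sketch that runs only the Status and Retry fragments of the two patterns posted above against excerpts of the first sample line. It assumes Python's `re` module (so `(?<name>...)` becomes `(?P<name>...)`), and the variable and function names are made up for the example.

```python
import re

# Excerpts from the first ("Working") sample line in the question.
STATUS_TEXT = "State:      Working   Status:    Active     Action:    None~"
RETRY_TEXT = "Retry:     17:46~    SWD PkgID:  X0300053"

# Status and Retry fragments from the \S* pattern with the optional retry group.
STATUS_NON_SPACE = re.compile(r"Status:\s*(?P<status>\S*)\s*Action:")
RETRY_OPTIONAL = re.compile(r"Retry:\s*(?P<retry>\d\d:\d\d)?~")

# Status and Retry fragments from the .* pattern.
STATUS_DOT_STAR = re.compile(r"Status:\s*(?P<status>.*)Action:")
RETRY_WITH_TILDE = re.compile(r"Retry:\s*(?P<retry>(?:\d\d:\d\d~)|(?:~))")


def capture(pattern, text, name):
    """Return the named group from the first match, or None if no match."""
    m = pattern.search(text)
    return m.group(name) if m else None


status_a = capture(STATUS_NON_SPACE, STATUS_TEXT, "status")
status_b = capture(STATUS_DOT_STAR, STATUS_TEXT, "status")
retry_a = capture(RETRY_OPTIONAL, RETRY_TEXT, "retry")
retry_b = capture(RETRY_WITH_TILDE, RETRY_TEXT, "retry")

# repr() makes trailing spaces and the "~" visible.
for label, value in [
    ("status with \\S*", status_a),
    ("status with .*", status_b),
    ("retry, optional group", retry_a),
    ("retry, tilde inside group", retry_b),
]:
    print(f"{label}: {value!r}")
```

With the `.*` version, the status capture keeps the padding spaces before "Action:" and the retry capture includes the trailing "~", so both would need trimming afterwards.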

Derek Jensen commented:
Wow, well...you've definitely got a head-scratcher there.
Let me go ahead and post what I was going to say before I tested my regex and found this to be much, much more difficult than I first anticipated (I could've sworn I've done this a dozen times before!), and I will close with my proposed solution:


Apologies, I'm not a C# guru, but I do know regex, and if what I'm seeing is correct, the change in http:#a39763594 from
Status:\s*(?<status>\S*)\s*Action:\s*(?<action>\S*)
to:
Status:\s*(?<status>.*)Action:\s*(?<action>\S*)
still doesn't solve the problem.
The capture of \s* after "Status:" is still a greedy match, and so will still match up to but not including the first letter of "Action:", causing the entire regex to fail to match lines where status is not present.

However, I believe one of the following two changes should work:
In the part of the regex between "Status:" and "Action:",
\s*(?<status>\S*)\s*
should become:
\s*(?<status>.*)?\s*
This makes the entire named capture group <status> optional, meaning a group named status may or may not exist for each line, or it may break your regex entirely (preliminary research suggests it won't).

Alternatively:
(?<status>.*?)
This simply turns the capture regex for populating <status> into a non-greedy "match everything" search (.*?). This of course means it's going to capture all the spaces before/after the "Active" or whatever word might be there, so you'll have to strip those out separately.
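
A quick Python sketch of that non-greedy idea, assuming the fragment between "Status:" and "Action:" is replaced entirely by the lazy group and the padding is stripped afterwards. It uses Python's `re` module, so the group is written `(?P<status>...)`, and the names `STATUS_LAZY` and `status_of` are made up for the example.

```python
import re

# Lazy capture of everything between "Status:" and the next "Action:".
STATUS_LAZY = re.compile(r"Status:(?P<status>.*?)Action:")


def status_of(text):
    """Return the status with surrounding spaces removed, or None if no match."""
    m = STATUS_LAZY.search(text)
    return m.group("status").strip() if m else None


# Excerpts from a "Working" line and a "Pending" line in the question.
with_status = "State:      Working   Status:    Active     Action:    None~"
without_status = "State:      Pending   Status:               Action:    None~"

# The raw capture still contains the padding spaces.
raw = STATUS_LAZY.search(with_status).group("status")

print(repr(raw))
print(repr(status_of(with_status)))
print(repr(status_of(without_status)))
```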



...Okay, I think I got it! :-D

So forget all the above regexes I said to try, and try this one out instead:
string pattern = @"Send Request\s*(?<send_request>\S+)~\s*Job:\s*(?<job>\S+)\s*Destination:\s*(?<destination>\S+)~\s*State:\s*(?<state>\S+)\s*Status:\s*(?!Action:)(?|(?<status>\S+)\s*?|(?<status>(\s|\S)*?))\s*Action:\s*(?<action>\S+)~\s*Total size:\s*(?<total_size>\d+)(\s|k)*Remaining:\s*(?<remaining>\d+)(\s|k)*Heartbeat:\s*(?<heartbeat>\d\d:\d\d)~\s*Start:\s*(?<start>\d\d:\d\d)\s*Finish:\s*(?<finish>\d\d:\d\d)\s*Retry:\s*(?!SWD)(?|(?<retry>\d\d:\d\d)~|((\s|\S)*?)~)\s*SWD PkgID:\s*(?<package_id>\S+)\s*SWD Pkg Version:\s*(?<package_version>\d+).*";

I also found you were having the same problem with Retry as you were with Status, as your .* after <total_size> was eating up the rest of the line...so I fixed the remaining regex also. :-)

skullnobrains commented:
if my understanding is correct, the problem you have is the 3 last lines do not match at all

if I'm correct, this is how to break it up :

Status:                     matches "Status:"
\s*                           matches all the whitespace
(?<status>\S*)          matches "Action:"
\s*                           matches nothing
Action                      not found so the ereg does not match

i guess switching to ungreedy mode should be enough to make your existing expression work. i'd also change the \s* to \s+ for more safety

of course if you know what might be a valid status, you can always try something like

Status:\s*(?<status>(?:Active|Inactive|))\s*Action

allowing for "Active" "Inactive" or "" statuses
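
a rough Python sketch of that last fragment, assuming Python's `re` module (so the group is written `(?P<status>...)`). the "Failed" line is a made-up value that does not appear in the log, included only to show what happens with a status outside the list; the names are invented for the example.

```python
import re

# Only "Active", "Inactive" or an empty status are accepted.
STATUS_KNOWN = re.compile(r"Status:\s*(?P<status>(?:Active|Inactive|))\s*Action")


def known_status(text):
    """Return the status if it is one of the listed values, else None."""
    m = STATUS_KNOWN.search(text)
    return m.group("status") if m else None


active = "Status:    Active     Action:    None~"
blank = "Status:               Action:    None~"
unknown = "Status:    Failed     Action:    None~"  # hypothetical value

print(repr(known_status(active)))
print(repr(known_status(blank)))
print(repr(known_status(unknown)))
```

the trade-off is that any status not in the list makes the whole line fail to match, so the list has to be kept complete.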