Solved

Perl Regex If-Then-Else Problem

Posted on 2013-05-10
Last Modified: 2013-05-21
All,

I am trying to get a regex if-then-else expression working "correctly". Here's the scenario:

I normally have numeric data, a float, in a field of my report. Sometimes instead of numeric data I get a series of dashes (--,----). Here is my regex to try and deal with it correctly:
'(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--,----))'


What I am trying to do is look ahead and see if it's a number - if so, great! Capture that number (this seems to work). If it's a series of dashes, then copy the dashes. What happens now is that if it's not a number, the regex just merrily skips on to someplace else in the file and grabs that number - sometimes 6 or 8 lines later in the file. And of course this unexpected grab is the wrong data entirely. From there the regex never seems to recover correctly.

How do I check if it's a number, match if it is, or just return the dashes or a zero if it's not? A zero might be best in the end, but dashes are in the report.

Thanks!
Question by:AllThumbsGeek
22 Comments
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 39156267
Are the "dashes or a zero" a part of the data? Regex cannot generate characters, it can only match them.
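To make that point concrete: the pattern can capture either the number or the dash placeholder, and any substitution of a zero then happens in ordinary Perl code after the match. This is only a minimal sketch. It assumes the value sits between `bw =` and `Kbps`, and the helper name `bw_or_zero` is made up for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: return the captured value, or 0 when the field
# holds a dash placeholder. The "bw = <value> Kbps" layout is an assumption.
sub bw_or_zero {
    my ($line) = @_;
    my ($raw) = $line =~ /\bbw\s*=\s*(\S+)\s+Kbps/;
    return unless defined $raw;                 # field not present at all
    return looks_like_number($raw) ? $raw : 0;  # the regex matches, Perl substitutes
}

print bw_or_zero('bw = 5932.673547 Kbps'), "\n";
print bw_or_zero('bw = --.--- Kbps'), "\n";
```

The regex only decides what text was found; turning dashes into a zero is a separate step.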
 

Author Comment

by:AllThumbsGeek
ID: 39156288
Dashes are the data. Example:

Successfully captured with the expression listed above:
Hop char:          rtt = 7.498663 ms, bw = 5932.673547 Kbps


This fails and skips past to lands undiscovered:
Hop char:          rtt = 1.171581 ms, bw = --.--- Kbps

 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 39156327
Your regex shows a comma inside the dashes. Your data example shows a period/decimal point.
 

Author Comment

by:AllThumbsGeek
ID: 39156360
Curious. The data file for sure is a comma. Copying the line into txt2re translates it to a period. It's reporting a speed in Kbps so it's a comma.
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39156469
The regex has 4 dashes while the data you posted only has 3.  I'd try this slight variation (2 or more dashes):
'(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--,--+))'

 
LVL 28

Expert Comment

by:FishMonger
ID: 39157094
How about simplifying the regex and using the Scalar::Util module to test if it's a number?

For example:
#!/usr/bin/perl

use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

my @str = ('Hop char:          rtt = 7.498663 ms, bw = 5932.673547 Kbps',
           'Hop char:          rtt = 1.171581 ms, bw = --.--- Kbps'
);

foreach my $str (@str) {
    my ($num) = $str =~ /bw \s+ = \s+ (\S+)/x;
    if ( looks_like_number($num) ) {
        print "$num is a number\n";
    }
    else {
        print "$num is not a number\n";
    }
}



Results:
[root@099-91-RKB-2 ~]# ./test.pl
5932.673547 is a number
--.--- is not a number
 

Author Comment

by:AllThumbsGeek
ID: 39161697
@wilcoxon - I love the elegance of your approach but I am unable to get that to work comma or period. I do not believe I am implementing the if-then-else regex correctly.

@FishMonger - that seems to work, but I am clueless how to implement/integrate it in my Perl parser. Should I just search for the desired field (such as bw, r2, avg...) and then use the same approach? If that will work, it seems to me maybe I should just implement a series of 'if' conditionals? Any thoughts?

THANKS! I have hope this will end this morning!!!
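One way the "search for the desired field, then test it" idea could look is a small helper that takes the field name as a parameter, so the same code serves bw, r2, avg and so on. This is a sketch under assumptions: the name `numeric_field` is hypothetical, and it assumes every field appears as `name = value` on a single line.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: pull the value that follows "<name> =" on one line.
# Returns undef when the field is absent or not numeric (e.g. dashes).
sub numeric_field {
    my ($line, $name) = @_;
    my ($raw) = $line =~ /\b\Q$name\E \s* = \s* (\S+)/x;
    return unless defined $raw;
    $raw =~ s/,\z//;    # drop a trailing comma if the value ends a list item
    return looks_like_number($raw) ? $raw : undef;
}

my $line = 'Hop char:          rtt = 1.171581 ms, bw = --.--- Kbps';
for my $name (qw(rtt bw)) {
    my $value = numeric_field($line, $name);
    printf "%s: %s\n", $name, defined $value ? $value : 'N/A';
}
```

With one helper like this, a chain of per-field `if` conditionals reduces to a loop over field names.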
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39161821
I was assuming your original regex worked as posted.  However, when I actually tried it, it doesn't match even the correct numeric line (I'm assuming issues with copy-paste, txt2re, or EE).  Try this simpler regex.  It works on both and will always capture the bandwidth (numeric or dashes).  I made it work with either comma or period since there's some question on which is actually present.

m{\bbw\s+=\s+(\d+(?:[.,]\d+)?|-+(?:[.,]-+)?)\s+Kbps\b};
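For anyone who wants to try that pattern quickly, here is a short test harness that runs it against the two sample lines posted earlier in the thread. The wrapper name `capture_bw` is made up for this sketch; the regex itself is the one given above, compiled with `qr`.

```perl
use strict;
use warnings;

# The pattern from the comment above, precompiled so it can be reused.
my $bw_re = qr{\bbw\s+=\s+(\d+(?:[.,]\d+)?|-+(?:[.,]-+)?)\s+Kbps\b};

my @samples = (
    'Hop char:          rtt = 7.498663 ms, bw = 5932.673547 Kbps',
    'Hop char:          rtt = 1.171581 ms, bw = --.--- Kbps',
);

# Hypothetical wrapper: return the captured bandwidth text, or undef.
sub capture_bw {
    my ($line) = @_;
    my ($bw) = $line =~ $bw_re;
    return $bw;
}

for my $line (@samples) {
    my $bw = capture_bw($line);
    print defined $bw ? "captured: $bw\n" : "no match\n";
}
```

Both lines match, and the capture is either the number or the dashes exactly as they appear in the report.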

 
LVL 28

Expert Comment

by:FishMonger
ID: 39161892
Should I just search for the desired field (such as bw, r2, avg...) and then use the same approach? If that will work it seems to me maybe I should just implement a series of 'if' conditionals? Any thoughts?
Hard to answer that question since you haven't provided enough info about your data or what you're needing to accomplish.

Have you come across any cases where the simple regex that I gave does not work as expected when searching for the bw data?

Can you provide a more complete sample of your data and what you need to extract/accomplish?
 
LVL 28

Expert Comment

by:FishMonger
ID: 39161965
Additionally, you should post a reasonable sample of the related code.  An even better suggestion would be to post a short but complete script that demonstrates the problem so that we can test various adjustments.  Often you will end up solving the problem while writing the short test script.  But if not, we should be able to help solve it.
 

Author Comment

by:AllThumbsGeek
ID: 39171828
FishMonger and Wilcoxon:

Just back from a business trip. I have attached my original source code and data file as requested.

The primary problem I have is that when a regex does not match (usually due to the dashes), it skips to the next block of data where it does match - which, depending on the data file, may be 3 or more blocks of data out of order.
a1.pl.txt
pignose1.txt
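The skipping described here can be reproduced on a very small scale. The sketch below is not the attached script. It uses two hop lines (the second one is made up) and a cut-down pattern that, like the original, relies on lazy `.*?` filler with the `/s` modifier. Because `/s` lets `.` match newlines, the filler simply stretches until it finds a float somewhere later in the text.

```perl
use strict;
use warnings;

# Two hop lines; the first has the dash placeholder, the second is made up.
my $data = join "\n",
    'Hop char:  rtt = 1.171581 ms, bw = --.--- Kbps',
    'Hop char:  rtt = 2.500000 ms, bw = 7000.125000 Kbps';

# Whole-file style: lazy filler plus /s lets the match run past the newline.
sub first_bw_whole_file {
    my ($text) = @_;
    my ($bw) = $text =~ / Hop \s+ char .*? bw .*? ( [+-]? \d* \. \d+ ) /sx;
    return $bw;
}

# Line-by-line style: each match is confined to a single line.
sub bw_per_line {
    my ($text) = @_;
    return map { /\bbw\s*=\s*(\S+)/ ? $1 : () } split /\n/, $text;
}

print "whole file: ", first_bw_whole_file($data), "\n";
print "per line:   ", join(' ', bw_per_line($data)), "\n";
```

The whole-file version reports 2.500000, which is the rtt of the next hop and not a bandwidth at all. The per-line version returns `--.---` and `7000.125000`, one value per hop.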
 
LVL 28

Expert Comment

by:FishMonger
ID: 39171854
Please repost your script attachment.  I'm getting a permission error when accessing that link.
 

Author Comment

by:AllThumbsGeek
ID: 39172010
#!/usr/bin/perl 
#===============================================================================
#
#         FILE: a1.pl
#
#        USAGE: ./a1.pl  
#
#        STATUS: PARTIALLY WORKING - DO NOT BREAK IT!   May 8, 2013 
#
#===============================================================================

use strict;
use warnings;

use Scalar::Util qw(looks_like_number);

my $int1;
my $ipaddress1;
my $int2;
my $int3;
my $int4;
my $int5;
my $int6;
my $float1;
my $float2;
my $float5;
my $float6;
my $float7;

my $line;
my $file = $ARGV[0];
my $fh;
my $test;

my $re1='(\\d+)';
my $re2='(:)';
my $re3='.*?';
my $re4='((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))(?![\\d])';   # IPv4 IP Address 1
my $re5='.*?'; # Non-greedy match on filler
my $re6='(Partial)';   # Word 1
my $re7='.*?'; # Non-greedy match on filler
my $re8='(loss)';  # Word 2
my $re9='.*?'; # Non-greedy match on filler
my $re10='(\\d+)'; # Integer Number 2
my $re11='.*?';    # Non-greedy match on filler
my $re12='(\\d+)'; # Integer Number 3
my $re13='.*?';    # Non-greedy match on filler
my $re14='(\\d+)'; # Integer Number 4
my $re15='.*?';    # Non-greedy match on filler
my $re16='(Partial)';  # Word 3
my $re17='.*?';    # Non-greedy match on filler
my $re18='(char)'; # Word 4
my $re19='.*?';    # Non-greedy match on filler
my $re20='(rtt)';  # Word 5
my $re21='.*?';    # Non-greedy match on filler
my $re22='(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+))';   # Float 1
my $re23='.*?';    # Non-greedy match on filler
my $re24='(r2)';   # Alphanum 1
my $re25='.*?';    # Non-greedy match on filler
my $re26='(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+))';   # Float 2
my $re27='.*?';    # Non-greedy match on filler
my $re28='(stddev)';   # Word 6
my $re29='.*?';    # Non-greedy match on filler
my $re30='(rtt)';  # Word 7
my $re31='.*?';    # Non-greedy match on filler
my $re32='(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+))';   # Float 3
my $re33='.*?';    # Non-greedy match on filler
my $re34='(Partial)';  # Word 8
my $re35='.*?';    # Non-greedy match on filler
my $re36='(queueing)'; # Word 9
my $re37='.*?';    # Non-greedy match on filler
my $re38='(avg)';  # Word 10
my $re39='.*?';    # Non-greedy match on filler
my $re40='(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+))';   # Float 4
my $re41='.*?';    # Non-greedy match on filler
my $re42='(\\d+)'; # Integer Number 5
my $re43='.*?';    # Non-greedy match on filler
my $re44='(Hop)';  # Word 11
my $re45='.*?';    # Non-greedy match on filler
my $re46='(char)'; # Word 12
my $re47='.*?';    # Non-greedy match on filler
my $re48='(rtt)';  # Word 13
my $re49='.*?';    # Non-greedy match on filler
my $re50='(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+))';   # Float 5
my $re51='.*?';    # Non-greedy match on filler
my $re52='(bw)';   # Word 14
my $re53='.*?';    # Non-greedy match on filler
#my $re54='(?=(\d+))([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--,----)';
my $re54='(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+))';   # Float 6
my $re55='.*?';    # Non-greedy match on filler
my $re56='(queueing)'; # Word 15
my $re57='.*?';    # Non-greedy match on filler
my $re58='(avg)';  # Word 16
my $re59='.*?';    # Non-greedy match on filler
my $re60='(?=(\\d+))([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+)';   # Float 7
my $re61='.*?';    # Non-greedy match on filler
my $re62='(\\d+)'; # Integer Number 6
my $re63='.*?';    # Non-greedy match on filler
my $re64='(bytes)';    # Word 17

my $re=$re3.$re1.$re2.$re3.$re4.$re5.$re6.$re7.$re8.$re9.$re10.$re11.$re12.$re13.$re14.$re15.$re16.$re17.$re18.$re19.$re20.$re21.$re22.$re23.$re24.$re25.$re26.$re27.$re28.$re29.$re30.$re31.$re32.$re33.$re34.$re35.$re36.$re37.$re38.$re39.$re40.$re41.$re42.$re43.$re44.$re45.$re46.$re47.$re48.$re49.$re50.$re51.$re52.$re53.$re54.$re55.$re56.$re57.$re58.$re59.$re60.$re61.$re62.$re63.$re64;
my $reFirst=$re3.$re1.$re2.$re3.$re4;
my $reSecond=$re5.$re6.$re7.$re8.$re9.$re10.$re11.$re12.$re13.$re14.$re15.$re16.$re17.$re18.$re19.$re20.$re21.$re22.$re23.$re24.$re25.$re26.$re27.$re28.$re29.$re30.$re31.$re32.$re33.$re34.$re35.$re36.$re37.$re38.$re39.$re40.$re41.$re42.$re43.$re44.$re45.$re46.$re47.$re48.$re49.$re50.$re51.$re52.$re53.$re54.$re55.$re56.$re57.$re58.$re59.$re60.$re61.$re62.$re63.$re64;

my $data = '';
open ($fh, "<", $file) or die ("Can not open data file");
{
    local $/;
    $data = <$fh>;
}
close $fh;
$line=$data;
    while ($line =~ m/$reFirst/igsx ) {
#    if ($line =~ m/$reFirst/igs) {   
        $int1=$1; 
        $ipaddress1=$3;

        print "$int1  $ipaddress1 \n";
        }
        
        while ($line =~ m/$reSecond/isgx) {
            $int2=$3;
            $int3=$4;
            $int4=$5;
#            if ($9 > 0) {
                $float1=$9; #sprintf("%.2f",$9);
#            } else {
#                $float1="N/A ";
#            }
#            if ($11 > 0) {
                $float2=$12; #sprintf("%.2f",($11 * 100));  # Reliability
#            } else {
#                $float2="N/A ";
#            }
            $int5=$23;  # PQ_Value
#            if ($23 > 0) {
                $float5=$27; # HC_RTT
#            } else {
#                $float5="N/A ";
#            }
#            if ($25 > 0) {
                $float6=$30; #sprintf("%.3f",($25 / 1000)); # BW
#            } else {
#                $float6="N/A ";
#            }
#            if ($28 > 0) {
                $float7=$35;
#            } else {
#                $float7="N/A  ";
#            }
#            $int6=$36;

            print "PL:$int2   Tests:$int3   PL%:$int4   PC_RTT:$float1   Rel:$float2   PQ_Value:$int5   BW:$float6   HQ_AVG:$float7 \n";
        }
        
#        print "$int1  $ipaddress1  $int2  $int3  $int4  $float1  $float2  $float6  $float7 $int6\n";
#}

 
LVL 26

Expert Comment

by:wilcoxon
ID: 39172096
Honestly, I would rewrite that script using named captures and greatly simplify the regexes (if nothing else, you're capturing a lot of things you never use).  I'll see what I can do today or tomorrow...
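As a rough illustration of what named captures buy you (this is not the promised rewrite, only a sketch): with `(?<name>...)` the values are read from the `%+` hash by name, so nobody has to count parentheses to find out that the bandwidth is `$30`. The line layout and the helper name `parse_hop_char` are assumptions based on the sample lines posted earlier. Named captures need Perl 5.10 or later.

```perl
use strict;
use warnings;

# One pattern per line type, with each value captured by name.
my $hop_char_re = qr{
    \b rtt \s* = \s* (?<rtt> \S+ ) \s+ ms ,
    \s* bw \s* = \s* (?<bw>  \S+ ) \s+ Kbps
}x;

# Hypothetical helper: return a hashref of the named fields, or nothing.
sub parse_hop_char {
    my ($line) = @_;
    return unless $line =~ $hop_char_re;
    return { rtt => $+{rtt}, bw => $+{bw} };
}

my $hop = parse_hop_char('Hop char:          rtt = 1.171581 ms, bw = --.--- Kbps');
print "rtt=$hop->{rtt} bw=$hop->{bw}\n" if $hop;
```

Adding or removing a group elsewhere in the pattern no longer shifts the numbering of every capture after it.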
 
LVL 28

Expert Comment

by:FishMonger
ID: 39172117
That's a good example of fragile and very difficult to maintain regex coding.

You may want to rethink your approach.
 

Author Comment

by:AllThumbsGeek
ID: 39172181
No doubt, regex is my mortal enemy and seeks my ruination. As you probably guessed, I went to txt2re and used its output. If the file fully completes, it runs just fine. If any matches don't complete (usually because of the dashes), then it skips.

FM - I am horrible at regex and usually use the RE Cookbook for a recipe or txt2re, but I have to process this report into either a tabular form or CSV for loading into a web report - and I am weeks late, so all of your help is enormously appreciated and I am happy to reciprocate any way I can.
 

Author Comment

by:AllThumbsGeek
ID: 39174775
Wilcoxon - Thank You!
 
LVL 26

Assisted Solution

by:wilcoxon
wilcoxon earned 500 total points
ID: 39177271
Sorry I didn't have time to do this until this morning...
Let me know if you have any questions on this code...
#!/usr/bin/perl
#===============================================================================
#
#         FILE: a1.pl
#
#        USAGE: ./a1.pl
#
#        STATUS: PARTIALLY WORKING - DO NOT BREAK IT!   May 8, 2013
#
#===============================================================================

use strict;
use warnings;
use File::Slurp;

my $file = $ARGV[0] or die "Usage: $0 input_file\n";

# could put read_file directly into @lines but I find this easier to read
my $txt = read_file($file) or die "could not read data file $file: $!";
my @lines = split /\n/, $txt;
$txt = undef;

# get rid of "header" rows (checking @lines avoids looping forever
# when the file contains no hop line at all)
while (@lines and $lines[0] !~ m{^\s?\d+:\s+}) {
    shift @lines;
}

# capture all data for later manipulation/output
my (@data, $idx);
while (@lines) {
    my $ln = shift @lines;
    if ($ln =~ m{^\s?(\d+):\s.*?\b((?:\d{1,3}\.){3}\d{1,3})\b}) {
        $idx = $1;
        if (ref $data[$idx]{ip}) {
            push @{$data[$idx]{ip}}, $2;
        } else {
            $data[$idx]{ip} = [$2];
        }
    } elsif ($ln =~ m{^\s?(\d+):\s+no\s+probe\s+responses\s*$}) {
        $idx = $1;
    } elsif ($ln =~ m{^\s+Partial\s+loss:\s+(\d+)\s*/\s*(\d+)\s*\((\d+)%\)}) {
        $data[$idx]{PL} = "PL:$1\tTests:$2\tPL%:$3";
    } elsif ($ln =~ m{^\s+Partial\s+char:\s+rtt\s+=\s+(\d+\.\d+)\s+ms,.*?,\s+r2\s+=\s+(\d+\.\d+)}) {
        $data[$idx]{PC} = "PC_RTT:$1\tRel:$2";
    } elsif ($ln =~ m{^\s+stddev\s+rtt\b}) {
        # do nothing
    } elsif ($ln =~ m{^\s+Partial\s+queueing:\s.*?\((\d+)\s+bytes\)}) {
        $data[$idx]{PQ} = "PQ_Value:$1";
    } elsif ($ln =~ m{^\s+Hop\s+char:\s.*?,\s+bw\s+=\s+(\d+\.\d+|-+\.-+)\s+Kbps}) {
        $data[$idx]{BW} = "BW:$1";
    } elsif ($ln =~ m{^\s+Hop\s+queueing:\s+avg\s+=\s+(-?\d+\.\d+)\s+ms}) {
        $data[$idx]{HQ} = "HQ_AVG:$1";
    } elsif ($ln =~ m{^\s+End\s+of\s+path\s+not\s+reached\s+after\b}
             or $ln =~ m{^\s+(?:Start|End)\s+time:\s+}) {
        # do nothing
    } else {
        die "don't know what to do with line in section $idx:\n$ln\n";
    }
}

for my $i (0..@data-1) {
    next unless defined($data[$i]);
    next unless defined($data[$i]{ip});
    print "$i  $_\n" for @{$data[$i]{ip}};
}

for my $i (0..@data-1) {
    next unless defined($data[$i]);
    print join("\t", @{$data[$i]}{qw(PL PC PQ BW HQ)}), "\n";
}

 

Author Comment

by:AllThumbsGeek
ID: 39181076
Wilcoxon - the parsing works. How do I align the rest of the data by block to the Hop and IP?

Like this (all in one line):

0  192.168.1.105  PL:40      Tests:100      PL%:40      PC_RTT:0.541331      Rel:0.962127      PQ_Value:59      BW:26104.265185      HQ_AVG:0.000018
1  192.168.1.1  PL:0      Tests:100      PL%:0      PC_RTT:8.039994      Rel:0.376515      PQ_Value:1340      BW:5932.673547      HQ_AVG:0.001728
2  10.88.64.1  PL:0      Tests:100      PL%:0      PC_RTT:9.211575      Rel:0.188959      PQ_Value:1340      BW:--.---      HQ_AVG:-0.000143

and so on.
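One possible way to get that layout is to build each output line from the hop number, the IP, and whichever fields were collected for that hop. This is a sketch that assumes the `@data` structure filled in by the script above (an `ip` array reference plus the `PL`, `PC`, `PQ`, `BW` and `HQ` strings). The helper name `hop_lines` is made up, and hops without any IP are simply left out here.

```perl
use strict;
use warnings;

# Hypothetical helper: one output line per hop/IP pair.
sub hop_lines {
    my (@data) = @_;
    my @out;
    for my $i (0 .. $#data) {
        my $hop = $data[$i] or next;    # skip hop numbers with no data
        # keep only the fields that were actually captured for this hop
        my @fields = grep { defined $_ } @{$hop}{qw(PL PC PQ BW HQ)};
        for my $ip (@{ $hop->{ip} || [] }) {
            push @out, join("\t", "$i  $ip", @fields);
        }
    }
    return @out;
}

# Sample entry using values from the desired output shown above.
my @data;
$data[2] = {
    ip => ['10.88.64.1'],
    PL => "PL:0\tTests:100\tPL%:0",
    BW => 'BW:--.---',
    HQ => 'HQ_AVG:-0.000143',
};
print "$_\n" for hop_lines(@data);
```

Filtering out undefined fields also avoids "uninitialized value" warnings when a hop is missing one of the five entries.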
 

Author Comment

by:AllThumbsGeek
ID: 39182138
I think this will work, but I am having trouble wrapping the printed Hop# and IP in XML tags.

for my $i (0..@data-1) {
    next unless defined($data[$i]);
    print "$i  $_\n" for @{$data[$i]{ip}};
    print join("\t", @{$data[$i]}{qw(PL PC PQ BW HQ)}), "\n";
}


This at least places the data directly following the IP address, which is great. I have wrapped the PL through HQ_AVG in XML tags for easy connection to my web control, but am trying to figure out how to do that for the Hop# and IP. But I'm working on it...

ATG
 
LVL 26

Accepted Solution

by:wilcoxon
wilcoxon earned 500 total points
ID: 39182356
What XML tags would you like to use?  Do you want the other XML nested under hop?  If not, how do you want to handle hops that have multiple IP addresses?

This should be close...
for my $i (0..@data-1) {
    next unless defined($data[$i]);
    if ($data[$i]{ip}) {
        print "<hop>$i</hop><ip>$_</ip>\n" for @{$data[$i]{ip}};
    }
    print join("\t", @{$data[$i]}{qw(PL PC PQ BW HQ)}), "\n";
}
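If nesting turns out to be the better fit, one hypothetical layout is a single `<hop>` element per hop with one `<ip>` child per address, which also settles the multiple-IP question. The element and attribute names below are assumptions, not something the web control is known to expect, and the helper names are made up.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical layout: <hop number="N"> wrapping one <ip> per address.
sub hop_xml {
    my ($number, @ips) = @_;
    my $xml = qq{<hop number="$number">\n};
    $xml .= '  <ip>' . xml_escape($_) . "</ip>\n" for @ips;
    $xml .= "</hop>\n";
    return $xml;
}

# The second address is made up to show a hop with two IPs.
print hop_xml(1, '192.168.1.1', '192.168.1.2');
```

The other fields could be emitted as further children of `<hop>` in the same way.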

 

Author Closing Comment

by:AllThumbsGeek
ID: 39184443
Much simpler to maintain than what I was attempting and it loops through the multiline file correctly.