Perl Regex If-Then-Else Problem

All,

I am trying to get a regex if-then-else expression working "correctly". Here's the scenario:

I normally have numeric data, a float, in a field of my report. Sometimes instead of numeric data I get a series of dashes (--,----). Here is my regex to try and deal with it correctly:
'(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--,----))'

What I am trying to do is look ahead and see if it's a number. If so, great! Capture that number (this seems to work). If it's a series of dashes, then copy the dashes. What happens now is that if it's not a number, the regex just merrily skips on to someplace else in the file and grabs that number, sometimes 6 or 8 lines later in the file. And of course this unexpected grab is the wrong data entirely. From there the regex never seems to recover correctly.

How do I check if it's a number, match if it is, or just return the dashes or a zero if it's not? A zero might be best in the end, but dashes are what's in the report.

Thanks!
AllThumbsGeekAsked:
 
wilcoxonCommented:
What XML tags would you like to use?  Do you want the other XML nested under hop?  If not, how do you want to handle hops that have multiple IP addresses?

This should be close...
for my $i (0..@data-1) {
    next unless defined($data[$i]);
    if ($data[$i]{ip}) {
        print "<hop>$i</hop><ip>$_</ip>\n" for @{$data[$i]{ip}};
    }
    print join("\t", @{$data[$i]}{qw(PL PC PQ BW HQ)}), "\n";
}
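
If the nested layout asked about above is what's wanted, here is a minimal sketch of one way it could look. The helper names `xml_escape` and `hop_xml` and the `number` attribute are made up for illustration; only the `hop` and `ip` tag names come from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Hypothetical helper: one <hop> element per hop, one <ip> child per address.
# The "number" attribute is an assumption, not something from the report.
sub hop_xml {
    my ($num, @ips) = @_;
    my $children = join '', map { '<ip>' . xml_escape($_) . '</ip>' } @ips;
    return qq{<hop number="$num">$children</hop>};
}

print hop_xml(2, '10.88.64.1'), "\n";
```

A hop with several addresses just gets several `<ip>` children, which sidesteps the question of repeating the hop number on every line.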

 
käµfm³d 👽Commented:
Are the "dashes or a zero" a part of the data? Regex cannot generate characters, it can only match them.
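
To illustrate the point: the regex only captures what is there, and turning dashes into a zero has to happen in ordinary Perl code after the match. A rough sketch, assuming the field looks like `bw = <value>` and that a placeholder is dashes around a single period or comma (`bw_or_zero` is a made-up name):

```perl
use strict;
use warnings;

# Hypothetical helper: return the bw field as captured, or 0 when the
# report shows a dash placeholder such as "--.---" or "--,----".
sub bw_or_zero {
    my ($line) = @_;
    my ($raw) = $line =~ /\bbw\s*=\s*(\S+)/
        or return;                          # no bw field on this line
    return $raw =~ /^-+[.,]-+$/ ? 0 : $raw; # substitute 0 for the placeholder
}

print bw_or_zero('Hop char: rtt = 7.498663 ms, bw = 5932.673547 Kbps'), "\n";
print bw_or_zero('Hop char: rtt = 1.171581 ms, bw = --.--- Kbps'), "\n";
```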
 
AllThumbsGeekAuthor Commented:
Dashes are the data. Example:

Successfully captured with the expression listed above:
Hop char:          rtt = 7.498663 ms, bw = 5932.673547 Kbps

This fails and skips past to lands undiscovered:
Hop char:          rtt = 1.171581 ms, bw = --.--- Kbps

 
käµfm³d 👽Commented:
Your regex shows a comma inside the dashes. Your data example shows a period/decimal point.
 
AllThumbsGeekAuthor Commented:
Curious. The data file for sure is a comma. Copying the line into txt2re translates it to a period. It's reporting a speed in Kbps so it's a comma.
 
wilcoxonCommented:
The regex has 4 dashes while the data you posted only has 3.  I'd try this slight variation (2 or more dashes):
'(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--,--+))'

 
FishMongerCommented:
How about simplifying the regex and using the Scalar::Util module to test if it's a number?

For example:
#!/usr/bin/perl

use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

my @str = ('Hop char:          rtt = 7.498663 ms, bw = 5932.673547 Kbps',
           'Hop char:          rtt = 1.171581 ms, bw = --.--- Kbps'
);

foreach my $str (@str) {
    my ($num) = $str =~ /bw \s+ = \s+ (\S+)/x;
    if ( looks_like_number($num) ) {
        print "$num is a number\n";
    }
    else {
        print "$num is not a number\n";
    }
}

Results:
[root@099-91-RKB-2 ~]# ./test.pl
5932.673547 is a number
--.--- is not a number
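
The same idea can be stretched to other `name = value` fields on a line. This is only a sketch: `field` is a made-up helper, and it assumes every field of interest is written as a name, an equals sign, and a value with no embedded whitespace.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: find "name = value" on a line and return the raw
# value plus a flag saying whether it looks like a number.
sub field {
    my ($line, $name) = @_;
    my ($value) = $line =~ /\b\Q$name\E\s*=\s*(\S+)/
        or return;                          # field not present on this line
    return ($value, looks_like_number($value) ? 1 : 0);
}

my $line = 'Hop char:          rtt = 1.171581 ms, bw = --.--- Kbps';
for my $name (qw(rtt bw)) {
    my ($value, $is_num) = field($line, $name);
    print "$name: ", ($is_num ? $value : "not numeric ($value)"), "\n";
}
```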
 
AllThumbsGeekAuthor Commented:
@wilcoxon - I love the elegance of your approach, but I am unable to get it to work with either a comma or a period. I do not believe I am implementing the if-then-else regex correctly.

@FishMonger - that seems to work, but I am clueless how to implement/integrate it in my Perl parser. Should I just search for the desired field (such as bw, r2, avg...) and then use the same approach? If that will work it seems to me maybe I should just implement a series of 'if' conditionals? Any thoughts?

THANKS! I have hope this will end this morning!!!
 
wilcoxonCommented:
I was assuming your original regex worked as posted.  However, when I actually tried it, it doesn't match even the correct numeric line (I'm assuming issues with copy-paste, txt2re, or EE).  Try this simpler regex.  It works on both and will always capture the bandwidth (numeric or dashes).  I made it work with either comma or period since there's some question on which is actually present.

m{\bbw\s+=\s+(\d+(?:[.,]\d+)?|-+(?:[.,]-+)?)\s+Kbps\b};
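
A quick way to check that pattern against the sample lines from this thread. The third line, with a comma, is an assumed variant based on the original regex rather than data that was actually posted.

```perl
use strict;
use warnings;

# The pattern from the post above, stored as a compiled regex.
my $bw_re = qr{\bbw\s+=\s+(\d+(?:[.,]\d+)?|-+(?:[.,]-+)?)\s+Kbps\b};

my @lines = (
    'Hop char:          rtt = 7.498663 ms, bw = 5932.673547 Kbps',
    'Hop char:          rtt = 1.171581 ms, bw = --.--- Kbps',
    'Hop char:          rtt = 1.171581 ms, bw = --,---- Kbps',  # assumed comma variant
);

# Capture the bandwidth field from each line, numeric or dashes.
my @bw = map { /$bw_re/ ? $1 : 'NO MATCH' } @lines;
print "$_\n" for @bw;
```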

 
FishMongerCommented:
"Should I just search for the desired field (such as bw, r2, avg...) and then use the same approach? If that will work it seems to me maybe I should just implement a series of 'if' conditionals? Any thoughts?"
Hard to answer that question since you haven't provided enough info about your data or what you're needing to accomplish.

Have you come across any cases where the simple regex that I gave does not work as expected when searching for the bw data?

Can you provide a more complete sample of your data and what you need to extract/accomplish?
 
FishMongerCommented:
Additionally, you should post a reasonable sample of the related code.  An even better suggestion would be to post a short but complete script that demonstrates the problem so that we can test various adjustments.  Often you will end up solving the problem while writing the short test script.  But if not, we should be able to help solve it.
 
AllThumbsGeekAuthor Commented:
FishMonger and Wilcoxon:

Just back from a business trip. I have attached my original source code and data file as requested.

The primary problem I have is that when a regex does not match (usually due to the dashes), it skips to the next block of data where it does match, which depending on the data file may be 3 or more blocks of data out of order.
a1.pl.txt
pignose1.txt
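
A small reproduction of the skipping, using two made-up report lines modelled on the samples in this thread (not the attached data file). The likely cause, as far as this sketch shows: with the whole file in one string and the /s modifier, the condition `(?=.*[0-9])` is true whenever any digit follows anywhere later in the file, so the number branch is chosen even on a dash field, fails there, and the lazy `.*?` filler stretches forward to the next number.

```perl
use strict;
use warnings;

# Made-up data: the first block has a dash placeholder for bw.
my $data = <<'END';
Hop char:          rtt = 1.171581 ms, bw = --.--- Kbps
Hop char:          rtt = 7.498663 ms, bw = 5932.673547 Kbps
END

# Conditional as used in the original script; /s lets "." cross newlines.
my $cond = qr/(?(?=.*[0-9])([+-]?\d*\.\d+)(?![-+0-9.])|(--.--+))/s;

# Whole-file match: the filler runs past the dashes into the next block.
my ($slurped) = $data =~ /bw.*?$cond/s;

# Line-by-line match: each line can only match within itself.
my @per_line;
for my $line (split /\n/, $data) {
    my ($num, $dashes) = $line =~ /bw.*?$cond/s;
    push @per_line, defined $num ? $num : $dashes;
}

print "slurped:  $slurped\n";
print "per line: @per_line\n";
```

In this sketch the whole-file match returns the rtt of the second block instead of the dashes, while matching one line at a time keeps each value with its own line.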
 
FishMongerCommented:
Please repost your script attachment.  I'm getting a permission error when accessing that link.
 
AllThumbsGeekAuthor Commented:
#!/usr/bin/perl 
#===============================================================================
#
#         FILE: a1.pl
#
#        USAGE: ./a1.pl  
#
#        STATUS: PARTIALLY WORKING - DO NOT BREAK IT!   May 8, 2013 
#
#===============================================================================

use strict;
use warnings;

use Scalar::Util qw(looks_like_number);

my $int1;
my $ipaddress1;
my $int2;
my $int3;
my $int4;
my $int5;
my $int6;
my $float1;
my $float2;
my $float5;
my $float6;
my $float7;

my $line;
my $file = $ARGV[0];
my $fh;
my $test;

my $re1='(\\d+)';
my $re2='(:)';
my $re3='.*?';
my $re4='((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))(?![\\d])';   # IPv4 IP Address 1
my $re5='.*?'; # Non-greedy match on filler
my $re6='(Partial)';   # Word 1
my $re7='.*?'; # Non-greedy match on filler
my $re8='(loss)';  # Word 2
my $re9='.*?'; # Non-greedy match on filler
my $re10='(\\d+)'; # Integer Number 2
my $re11='.*?';    # Non-greedy match on filler
my $re12='(\\d+)'; # Integer Number 3
my $re13='.*?';    # Non-greedy match on filler
my $re14='(\\d+)'; # Integer Number 4
my $re15='.*?';    # Non-greedy match on filler
my $re16='(Partial)';  # Word 3
my $re17='.*?';    # Non-greedy match on filler
my $re18='(char)'; # Word 4
my $re19='.*?';    # Non-greedy match on filler
my $re20='(rtt)';  # Word 5
my $re21='.*?';    # Non-greedy match on filler
my $re22='(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+))';   # Float 1
my $re23='.*?';    # Non-greedy match on filler
my $re24='(r2)';   # Alphanum 1
my $re25='.*?';    # Non-greedy match on filler
my $re26='(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+))';   # Float 2
my $re27='.*?';    # Non-greedy match on filler
my $re28='(stddev)';   # Word 6
my $re29='.*?';    # Non-greedy match on filler
my $re30='(rtt)';  # Word 7
my $re31='.*?';    # Non-greedy match on filler
my $re32='(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+))';   # Float 3
my $re33='.*?';    # Non-greedy match on filler
my $re34='(Partial)';  # Word 8
my $re35='.*?';    # Non-greedy match on filler
my $re36='(queueing)'; # Word 9
my $re37='.*?';    # Non-greedy match on filler
my $re38='(avg)';  # Word 10
my $re39='.*?';    # Non-greedy match on filler
my $re40='(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+))';   # Float 4
my $re41='.*?';    # Non-greedy match on filler
my $re42='(\\d+)'; # Integer Number 5
my $re43='.*?';    # Non-greedy match on filler
my $re44='(Hop)';  # Word 11
my $re45='.*?';    # Non-greedy match on filler
my $re46='(char)'; # Word 12
my $re47='.*?';    # Non-greedy match on filler
my $re48='(rtt)';  # Word 13
my $re49='.*?';    # Non-greedy match on filler
my $re50='(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+))';   # Float 5
my $re51='.*?';    # Non-greedy match on filler
my $re52='(bw)';   # Word 14
my $re53='.*?';    # Non-greedy match on filler
#my $re54='(?=(\d+))([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--,----)';
my $re54='(?(?=.*[0-9])([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+))';   # Float 6
my $re55='.*?';    # Non-greedy match on filler
my $re56='(queueing)'; # Word 15
my $re57='.*?';    # Non-greedy match on filler
my $re58='(avg)';  # Word 16
my $re59='.*?';    # Non-greedy match on filler
my $re60='(?=(\\d+))([+-]?\\d*\\.\\d+)(?![-+0-9\\.])|(--.--+)';   # Float 7
my $re61='.*?';    # Non-greedy match on filler
my $re62='(\\d+)'; # Integer Number 6
my $re63='.*?';    # Non-greedy match on filler
my $re64='(bytes)';    # Word 17

my $re=$re3.$re1.$re2.$re3.$re4.$re5.$re6.$re7.$re8.$re9.$re10.$re11.$re12.$re13.$re14.$re15.$re16.$re17.$re18.$re19.$re20.$re21.$re22.$re23.$re24.$re25.$re26.$re27.$re28.$re29.$re30.$re31.$re32.$re33.$re34.$re35.$re36.$re37.$re38.$re39.$re40.$re41.$re42.$re43.$re44.$re45.$re46.$re47.$re48.$re49.$re50.$re51.$re52.$re53.$re54.$re55.$re56.$re57.$re58.$re59.$re60.$re61.$re62.$re63.$re64;
my $reFirst=$re3.$re1.$re2.$re3.$re4;
my $reSecond=$re5.$re6.$re7.$re8.$re9.$re10.$re11.$re12.$re13.$re14.$re15.$re16.$re17.$re18.$re19.$re20.$re21.$re22.$re23.$re24.$re25.$re26.$re27.$re28.$re29.$re30.$re31.$re32.$re33.$re34.$re35.$re36.$re37.$re38.$re39.$re40.$re41.$re42.$re43.$re44.$re45.$re46.$re47.$re48.$re49.$re50.$re51.$re52.$re53.$re54.$re55.$re56.$re57.$re58.$re59.$re60.$re61.$re62.$re63.$re64;

my $data = '';
open ($fh, "<", $file) or die ("Can not open data file");
{
    local $/;
    $data = <$fh>;
}
close $fh;
$line=$data;
    while ($line =~ m/$reFirst/igsx ) {
#    if ($line =~ m/$reFirst/igs) {   
        $int1=$1; 
        $ipaddress1=$3;

        print "$int1  $ipaddress1 \n";
        }
        
        while ($line =~ m/$reSecond/isgx) {
            $int2=$3;
            $int3=$4;
            $int4=$5;
#            if ($9 > 0) {
                $float1=$9; #sprintf("%.2f",$9);
#            } else {
#                $float1="N/A ";
#            }
#            if ($11 > 0) {
                $float2=$12; #sprintf("%.2f",($11 * 100));  # Reliability
#            } else {
#                $float2="N/A ";
#            }
            $int5=$23;  # PQ_Value
#            if ($23 > 0) {
                $float5=$27; # HC_RTT
#            } else {
#                $float5="N/A ";
#            }
#            if ($25 > 0) {
                $float6=$30; #sprintf("%.3f",($25 / 1000)); # BW
#            } else {
#                $float6="N/A ";
#            }
#            if ($28 > 0) {
                $float7=$35;
#            } else {
#                $float7="N/A  ";
#            }
#            $int6=$36;

            print "PL:$int2   Tests:$int3   PL%:$int4   PC_RTT:$float1   Rel:$float2   PQ_Value:$int5   BW:$float6   HQ_AVG:$float7 \n";
        }
        
#        print "$int1  $ipaddress1  $int2  $int3  $int4  $float1  $float2  $float6  $float7 $int6\n";
#}

 
wilcoxonCommented:
Honestly, I would rewrite that script using named captures and greatly simplify the regexes (if nothing else, you're capturing a lot of things you never use).  I'll see what I can do today or tomorrow...
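
For anyone unfamiliar with named captures, a minimal sketch of the idea on one "Hop char" line (needs Perl 5.10 or later). `parse_hop_char` and the capture names `rtt` and `bw` are made up for illustration, and the line layout is assumed from the samples posted in this thread.

```perl
use strict;
use warnings;

# Value pattern shared by both fields: a decimal number or a dash placeholder.
my $value = qr/-?\d+\.\d+|-+[.,]-+/;

# Hypothetical helper: parse one "Hop char" line into a hash of named fields.
sub parse_hop_char {
    my ($line) = @_;
    return unless $line =~ m{
        \bHop \s+ char: \s+
        rtt \s* = \s* (?<rtt> $value ) \s+ ms, \s+
        bw  \s* = \s* (?<bw>  $value ) \s+ Kbps
    }x;
    # %+ holds the named captures of the last successful match.
    return { rtt => $+{rtt}, bw => $+{bw} };
}

my $hop = parse_hop_char('Hop char:          rtt = 1.171581 ms, bw = --.--- Kbps');
print "rtt=$hop->{rtt} bw=$hop->{bw}\n";
```

Reading `$+{bw}` instead of `$30` means the code keeps working when a capture group is added or removed elsewhere in the pattern.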
 
FishMongerCommented:
That's a good example of fragile and very difficult to maintain regex coding.

You may want to rethink your approach.
 
AllThumbsGeekAuthor Commented:
No doubt, regex is my mortal enemy and seeks my ruination. As you probably guessed, I went to txt2re and used its output. If every match in the file completes, it runs just fine. If any match doesn't complete (usually because of the dashes), it skips.

FM - I am horrible at regex and usually use the RE Cookbook for a recipe or txt2re, but I have to process this report into either a tabular form or CSV for loading into a web report, and I am weeks late, so all of your help is enormously appreciated and I am happy to reciprocate any way I can.
 
AllThumbsGeekAuthor Commented:
Wilcoxon - Thank You!
 
wilcoxonCommented:
Sorry I didn't have time to do this until this morning...
Let me know if you have any questions on this code...
#!/usr/bin/perl
#===============================================================================
#
#         FILE: a1.pl
#
#        USAGE: ./a1.pl
#
#        STATUS: PARTIALLY WORKING - DO NOT BREAK IT!   May 8, 2013
#
#===============================================================================

use strict;
use warnings;
use File::Slurp;

my $file = $ARGV[0] or die "Usage: $0 input_file\n";

# could put read_file directly into @data but I find this easier to read
my $txt = read_file($file) or die "could not read data file $file: $!";
my @lines = split /\n/, $txt;
$txt = undef;

# get rid of "header" rows (stop if the file has no hop lines at all)
while (@lines and $lines[0] !~ m{^\s?\d+:\s+}) {
    shift @lines;
}

# capture all data for later manipulation/output
my (@data, $idx);
while (@lines) {
    my $ln = shift @lines;
    if ($ln =~ m{^\s?(\d+):\s.*?\b((?:\d{1,3}\.){3}\d{1,3})\b}) {
        $idx = $1;
        if (ref $data[$idx]{ip}) {
            push @{$data[$idx]{ip}}, $2;
        } else {
            $data[$idx]{ip} = [$2];
        }
    } elsif ($ln =~ m{^\s?(\d+):\s+no\s+probe\s+responses\s*$}) {
        $idx = $1;
    } elsif ($ln =~ m{^\s+Partial\s+loss:\s+(\d+)\s*/\s*(\d+)\s*\((\d+)%\)}) {
        $data[$idx]{PL} = "PL:$1\tTests:$2\tPL%:$3";
    } elsif ($ln =~ m{^\s+Partial\s+char:\s+rtt\s+=\s+(\d+\.\d+)\s+ms,.*?,\s+r2\s+=\s+(\d+\.\d+)}) {
        $data[$idx]{PC} = "PC_RTT:$1\tRel:$2";
    } elsif ($ln =~ m{^\s+stddev\s+rtt\b}) {
        # do nothing
    } elsif ($ln =~ m{^\s+Partial\s+queueing:\s.*?\((\d+)\s+bytes\)}) {
        $data[$idx]{PQ} = "PQ_Value:$1";
    } elsif ($ln =~ m{^\s+Hop\s+char:\s.*?,\s+bw\s+=\s+(\d+\.\d+|-+\.-+)\s+Kbps}) {
        $data[$idx]{BW} = "BW:$1";
    } elsif ($ln =~ m{^\s+Hop\s+queueing:\s+avg\s+=\s+(-?\d+\.\d+)\s+ms}) {
        $data[$idx]{HQ} = "HQ_AVG:$1";
    } elsif ($ln =~ m{^\s+End\s+of\s+path\s+not\s+reached\s+after\b}
             or $ln =~ m{^\s+(?:Start|End)\s+time:\s+}) {
        # do nothing
    } else {
        die "don't know what to do with line in section $idx:\n$ln\n";
    }
}

for my $i (0..@data-1) {
    next unless defined($data[$i]);
    next unless defined($data[$i]{ip});
    print "$i  $_\n" for @{$data[$i]{ip}};
}

for my $i (0..@data-1) {
    next unless defined($data[$i]);
    print join("\t", @{$data[$i]}{qw(PL PC PQ BW HQ)}), "\n";
}

 
AllThumbsGeekAuthor Commented:
Wilcoxon - the parsing works. How do I align the rest of the data by block to the Hop and IP?

Like this (all in one line):

0  192.168.1.105  PL:40      Tests:100      PL%:40      PC_RTT:0.541331      Rel:0.962127      PQ_Value:59      BW:26104.265185      HQ_AVG:0.000018
1  192.168.1.1  PL:0      Tests:100      PL%:0      PC_RTT:8.039994      Rel:0.376515      PQ_Value:1340      BW:5932.673547      HQ_AVG:0.001728
2  10.88.64.1  PL:0      Tests:100      PL%:0      PC_RTT:9.211575      Rel:0.188959      PQ_Value:1340      BW:--.---      HQ_AVG:-0.000143

and so on.
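
One way the per-hop layout above could be produced from the `@data` structure, as a rough sketch. `hop_rows` is a made-up helper; joining multiple IP addresses with commas and printing `N/A` for a missing field are assumptions, and the sample values are copied from the lines in this post.

```perl
use strict;
use warnings;

# Hypothetical helper: build one output row per hop from @data.
sub hop_rows {
    my @data = @_;
    my @rows;
    for my $i (0 .. $#data) {
        my $hop = $data[$i] or next;                 # skip hops with no data
        my $ips = join ',', @{ $hop->{ip} || [] };   # assumption: comma-join multiple IPs
        my @fields = map { defined $hop->{$_} ? $hop->{$_} : "$_:N/A" }
                     qw(PL PC PQ BW HQ);
        push @rows, join "\t", "$i  $ips", @fields;
    }
    return @rows;
}

# Sample entries shaped like the ones the parsing script stores.
my @data = (
    { ip => ['192.168.1.105'], PL => "PL:40\tTests:100\tPL%:40",
      PC => "PC_RTT:0.541331\tRel:0.962127", PQ => 'PQ_Value:59',
      BW => 'BW:26104.265185', HQ => 'HQ_AVG:0.000018' },
    { ip => ['192.168.1.1'], PL => "PL:0\tTests:100\tPL%:0",
      PC => "PC_RTT:8.039994\tRel:0.376515", PQ => 'PQ_Value:1340',
      BW => 'BW:5932.673547', HQ => 'HQ_AVG:0.001728' },
    { ip => ['10.88.64.1'], PL => "PL:0\tTests:100\tPL%:0",
      PC => "PC_RTT:9.211575\tRel:0.188959", PQ => 'PQ_Value:1340',
      BW => 'BW:--.---', HQ => 'HQ_AVG:-0.000143' },
);

print "$_\n" for hop_rows(@data);
```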
 
AllThumbsGeekAuthor Commented:
I think this will work, but I am having trouble wrapping the printed Hop# and IP in XML tags.

for my $i (0..@data-1) {
    next unless defined($data[$i]);
    print "$i  $_\n" for @{$data[$i]{ip}};
    print join("\t", @{$data[$i]}{qw(PL PC PQ BW HQ)}), "\n";
}

This at least places the data directly following the IP address, which is great. I have wrapped the PL through HQ_AVG fields in XML tags for easy connection to my web control, but am trying to figure out how to do that for the Hop# and IP. But I'm working on it...

ATG
 
AllThumbsGeekAuthor Commented:
Much simpler to maintain than what I was attempting and it loops through the multiline file correctly.
Question has a verified solution.
