summer_soccer

Perl script takes too long to finish

I have written a Perl script to parse and process a large number of large gzipped text files line by line. There are about 700k .gz files, and their total size is around 120 GB. The decompressed files should be tens of times larger.

I found that it takes about 8 hours to process just 800 .gz files. At that rate (700,000 / 800 × 8 hours ≈ 7,000 hours), processing all of them would take the better part of a year.

I am wondering why Perl takes so long to process them. Would it be possible to improve the running speed by rewriting the code in C?
Tintin

You haven't shown us any Perl code, or even told us what type of processing you do, so it's impossible to say where the bottleneck is.

Are you decompressing the files first before processing?  If so, that will increase the processing time quite considerably.
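
If the files are being unpacked to disk first, one alternative is to stream each archive and handle the lines as they are decompressed. The following is only a minimal sketch, assuming the core IO::Uncompress::Gunzip module is available; `count_lines` is a hypothetical helper and the sample data is made up so that the snippet runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);
use File::Temp qw(tempfile);

# Build a tiny .gz file so the example is self-contained (made-up data).
my ($tmpfh, $gzfile) = tempfile(SUFFIX => '.gz', UNLINK => 1);
close($tmpfh);
my $text = "hop one\nhop two\nhop three\n";
gzip(\$text => $gzfile) or die "gzip failed: $GzipError";

# Hypothetical helper: stream a .gz file line by line without
# writing the decompressed data to disk.
sub count_lines {
    my ($path) = @_;
    my $z = IO::Uncompress::Gunzip->new($path)
        or die "cannot open $path: $GunzipError";
    my $count = 0;
    while (defined(my $line = $z->getline())) {
        chomp $line;
        $count++;    # real per-line processing would go here
    }
    $z->close();
    return $count;
}

my $lines = count_lines($gzfile);
print "read $lines lines from $gzfile\n";
```

Another option that may be worth measuring is reading from an external decompressor through a pipe, e.g. `open(my $fh, '-|', 'gzip', '-dc', $file)`; which of the two is faster depends on the system, so treat either as something to benchmark rather than a guaranteed win.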
As Tintin says, if we saw your code and had some idea of the manipulations you were performing on the files, maybe we could offer suggestions to speed the task up.  In answer to your other question, a C program would very likely be much faster, but we don't know yet what part of the operation is causing the process to be so slow -- the manipulations done by the script on the contents of the archives, or the actions done on the archives themselves (decompressing and perhaps recompressing, for example).
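
One rough way to see which part is slow is to accumulate wall-clock time per phase. This is only a sketch, assuming the core Time::HiRes module; `timed` and the two stand-in phases are hypothetical and would be replaced by the script's real reading and processing steps.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

my %elapsed;    # phase name => accumulated seconds

# Hypothetical helper: run a code block and charge its wall-clock
# time to the named phase.
sub timed {
    my ($phase, $code) = @_;
    my $t0     = time();
    my @result = $code->();
    $elapsed{$phase} += time() - $t0;
    return @result;
}

# Stand-ins for the real phases (reading archives, processing lines).
my @lines = timed(read => sub { map { "line $_" } 1 .. 1000 });
my ($total) = timed(process => sub {
    my $n = 0;
    $n += length for @lines;
    return $n;
});

# Report how much time each phase consumed.
printf "%-8s %.6f s\n", $_, $elapsed{$_} for sort keys %elapsed;
```

Wrapping the decompression, the per-line parsing, and the database work separately in this way should show which one dominates before any decision about rewriting in C is made.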

ASKER

Okay, I have copied my code below. It is very long, more than 1000 lines.

#!/usr/bin/perl -w


use strict;
use File::Find;
use File::Basename;
use DBI;
use Net::IP;
use Net::Patricia;
use Time::Local;


## Arguments: start-time and end-time bound the traceroute data to process;
## good-traceroute-file-list and corrupted-traceroute-file-list name the input file lists;
## prefix-as-mapping-file is the prefix-to-AS mapping file;
## inconsistent-as-path-output and inconsistent-pop-path-output store inconsistent AS-path and PoP-path entries;
## policy-filtering-stats-output stores statistics on discarded files and traceroutes.

if($#ARGV != 7) {
    print "usage: process-traceroute.pl start-time end-time good-traceroute-file-list corrupted-traceroute-file-list prefix-as-mapping-file inconsistent-as-path-output inconsistent-pop-path-output policy-filtering-stats-output\n";
    print "start-time and end-time in format YYMMDDHHMMSS\n";
    exit(1);
}

my ($startingtime, $endingtime, $goodfilelist, $corruptedfilelist, $prefixasfile, $aspathoutput, $poppathoutput, $policyoutput) = @ARGV;


## open the prefix-as mapping file and store them in the Patricia handler
open(INPUT1, "<$prefixasfile") || die "cannot open $prefixasfile file for read.";

## open the good file list
open(INPUT2, "<$goodfilelist") || die "cannot open $goodfilelist file for read.";

## open the corrupted file list
open(INPUT3, "<$corruptedfilelist") || die "cannot open $corruptedfilelist file for read.";

## open the inconsistent as-path file for write
open(OUTPUT1, ">$aspathoutput") || die "cannot open $aspathoutput file for write.";

## open the inconsistent pop-path file for write
open(OUTPUT2, ">$poppathoutput") || die "cannot open $poppathoutput file for write.";

## open the policy-filtering-stats-output file for write
open(OUTPUT3, ">$policyoutput") || die "cannot open $policyoutput file for write.";


my $pt = new Net::Patricia;
my $prefixt = new Net::Patricia;

$startingtime =~ /(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/;
my ($yy1, $mm1, $dd1, $hh1, $min1, $ss1) = ($1, $2, $3, $4, $5, $6);

$endingtime =~ /(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/;
my ($yy2, $mm2, $dd2, $hh2, $min2, $ss2) = ($1, $2, $3, $4, $5, $6);


print OUTPUT3 "Traceroute files between 20$yy1-$mm1-$dd1 $hh1:$min1:$ss1 and 20$yy2-$mm2-$dd2 $hh2:$min2:$ss2 are checked.\n";


my $fileasn;


my $begintime = time();  ## get time in seconds since 1970

print OUTPUT3 "The code starts time is $begintime\n";


my @filecache = ();

while(my $oneline = <INPUT1>) {
    push(@filecache, $oneline);
}


foreach my $oneline (@filecache) {
    chomp($oneline);

    my ($oneprefix, $oneas) = split(/\s+/, $oneline);

    if($oneprefix =~ /\d{1,3}(\.\d{1,3}){3}\/\d{1,2}/) {   ## dots escaped so only a literal "." separates octets

        $pt->add_string($oneprefix, $oneas);
        $prefixt->add_string($oneprefix);
    }
}

close(INPUT1);

my %timefilehash = ();

my $totalfiles = 0;
my $corruptedfiles = 0;
my $workingfiles = 0;


@filecache = ();

while(my $file=<INPUT2>) {
    push(@filecache, $file);
}


foreach my $file (@filecache) {
    chomp($file);

    if( -f $file && $file =~ /_(\d{12})_re_\d+/ ) {
        my $cnttime = $1;

        my $diff1 = to_seconds($cnttime) - to_seconds($startingtime);
        my $diff2 = to_seconds($endingtime) - to_seconds($cnttime);

        if( $diff1 >= 0 && $diff2 >= 0 ) {
            $totalfiles++;
            $workingfiles++;

            my $timefiles = $timefilehash{$cnttime};

            if(not(defined($timefiles))) {
                $timefilehash{$cnttime} = [$file];
            }
            else {
                push(@{$timefilehash{$cnttime}}, $file);
            }
        }
    }
}

@filecache = ();

while(my $file=<INPUT3>) {
    push(@filecache, $file);
}

foreach my $file (@filecache) {
    chomp($file);

    if(-f $file && $file =~ /_(\d{12})_re_\d+/ ) {
        my $cnttime = $1;

        my $diff1 = to_seconds($cnttime) - to_seconds($startingtime);
        my $diff2 = to_seconds($endingtime) - to_seconds($cnttime);

        if( $diff1 >= 0 && $diff2 >= 0 ) {
            $totalfiles++;
            $corruptedfiles++;
        }
    }
}


my @files = ();


sub to_seconds
{
    use integer;

    my $x = $_[0];

    my $year = "20".substr($x,0,2);
    my $mo = substr($x,2,2);
    my $day = substr($x,4,2);
    my $hour = substr($x,6,2);
    my $minute = substr($x,8,2);
    my $second = substr($x,10,2);

    my $t = timelocal($second,$minute,$hour,$day,$mo - 1,$year - 1900);
    return($t);
}


my $numgroups= keys %timefilehash;

for my $onetime (sort keys %timefilehash) {
    my @cntset = @{$timefilehash{$onetime}};
    my $goodfiles = $#cntset+1;
    foreach my $onefile (@{$timefilehash{$onetime}}) {
        push(@files, $onefile);
        print "The file is $onefile\n";
    }
}


# connect to mySQL database for later data query and retrieval
my $dsn = "DBI:mysql:test_bm";   # data source name
my $user_name = "root";          # user name
my $password = "NewPw";          # password


my %ipasntable = ();
my %iplockeytable = ();
my %lockeyloctable = ();
my %iploctable = ();


my %bgphash = ();
my %igphash = ();


# connect to database
my $dbh = DBI->connect ($dsn, $user_name, $password,
    { RaiseError => 1, PrintError => 0 });


## select ipAddress, asn, lockey from the ipAddress table
my $sth = $dbh->prepare("SELECT ipAddress, asn, locKey FROM ipAddress");
$sth->execute();

## fetch query results from ipAddress table
while(my @ary = $sth->fetchrow_array()) {
    my ($cntip, $cntasn, $cntkey) = @ary;

    if($cntasn ne "NULL") {
        if($cntasn > 0) {
            $ipasntable{$cntip} = $cntasn;
        }
    }
    else {
## look up asn value from the prefix-as mapping patricia handler
        $cntasn = $pt->match_string($cntip);

        if(defined($cntasn)) {
            $ipasntable{$cntip} = $cntasn;
        }
    }

    if($cntkey ne "NULL") {
        if($cntkey > 1) {
            $iplockeytable{$cntip} = $cntkey;
        }
    }
}

# connect to database
$dbh = DBI->connect ($dsn, $user_name, $password,
    { RaiseError => 1, PrintError => 0 });


## select lockey, locName from the location table
$sth = $dbh->prepare("SELECT locKey, locName FROM location");
$sth->execute();

## fetch query results from location table
while(my @ary = $sth->fetchrow_array()) {
    my ($cntkey, $cntloc) = @ary;

    if($cntkey ne "NULL") {
        if($cntkey > 1) {
            $lockeyloctable{$cntkey} = $cntloc;
        }
    }
}

while ( my ($oneip, $onekey) = each(%iplockeytable) ) {
    my $oneloc = $lockeyloctable{$onekey};
    my $oneasn = $ipasntable{$oneip};

    # print "For ip $oneip, its ASN is $oneasn, its PoP is $oneloc\n";

    $iploctable{$oneip} = $oneloc;
}

## release iplockeytable and lockeyloctable memory
%iplockeytable = ();
%lockeyloctable = ();

# connect to database
$dbh = DBI->connect ($dsn, $user_name, $password,
    { RaiseError => 1, PrintError => 0 });


## drop inferred BGP table if it exists
my $bgpdrop = "
DROP TABLE IF EXISTS bgp";

$sth = $dbh->prepare($bgpdrop);
$sth->execute();


## create inferred BGP table
my $bgpcreate = "
CREATE TABLE bgp (
bkey      int(12) unsigned NOT NULL auto_increment,
ptime     datetime NOT NULL,
tstart    datetime NOT NULL,
vpip      varchar(16) NOT NULL,
dip   varchar(24) NOT NULL,
cntas     int(8) unsigned NOT NULL,
cntpop    varchar(32) NOT NULL,
nextas    int(8) unsigned,
nextpop   varchar(32),
aspath    varchar(64),
PRIMARY KEY (bkey)
)
ENGINE=InnoDB DEFAULT CHARSET=utf8";


$sth = $dbh->prepare($bgpcreate);
$sth->execute();


## drop intra-AS PoP-path table if it exists
my $poppathdrop = "
DROP TABLE IF EXISTS poppath";

$sth = $dbh->prepare($poppathdrop);
$sth->execute();


## create intra-AS PoP-path table
my $poppathcreate = "
CREATE TABLE poppath (
pkey      int(12) unsigned NOT NULL auto_increment,
ptime     datetime NOT NULL,
tstart    datetime NOT NULL,
vpip      varchar(16) NOT NULL,
dip       varchar(16) NOT NULL,
asn       int(8) unsigned NOT NULL,
srcpop    varchar(32) NOT NULL,
dstpop    varchar(32) NOT NULL,
poppath   varchar(256) NOT NULL,
ippathlen int(4)  NOT NULL,
PRIMARY KEY (pkey)
)
ENGINE=InnoDB DEFAULT CHARSET=utf8";

$sth = $dbh->prepare($poppathcreate);
$sth->execute();


## subroutine to check whether an ASN is a targeted ASN
sub istargetas {
    my ($asn) = @_;   ## take the first argument; $_ is not set by a subroutine call

    if($asn eq "1239" || $asn eq "16631" || $asn eq "1668" || $asn eq "209" ||
        $asn eq "2828" || $asn eq "2856" || $asn eq "2914" || $asn eq "3257" ||
        $asn eq "3320" || $asn eq "3356" || $asn eq "3549" || $asn eq "3561" ||
        $asn eq "5511" || $asn eq "6395" || $asn eq "6453" || $asn eq "6461" ||
        $asn eq "701" || $asn eq "7018") {
        return 1;
    }
    else {
        return 0;
    }

}


sub bgpcontains {
    my ($first, $second) = @_;

    foreach my $one (@{$first}) {

        if($one eq $second) {
            return 1;
        }
    }

    return 0;
}


sub igpcontains {
    my ($first, $second) = @_;

    foreach my $one (@{$first}) {
        if($one eq $second) {
            return 1;
        }
    }

    return 0;
}


my $totalnextingress = 0;
my $nextingressmorethanonestarskipped = 0;
my $nextingressonestarnounknownincluded = 0;    ## current last PoP is not NULL, next hop is *
my $nextingressonestarunknownincluded = 0;      ## current last PoP is NULL, next hop is *
my $nextingressnostarnounknownincluded = 0;     ## current last PoP is not NULL, next hop is non-*
my $nextingressnostarunknownincluded = 0;       ## current last PoP is NULL, next hop is non-*;

my $totalsameasnexthop = 0;
my $sameasnexthopegresstwounknownsdiscarded = 0;
my $sameasnexthopegressnotunknown = 0;
my $sameasnexthopegressunknown = 0;

my $totalaspath = 0;
my $aspathmorethanonestarskipped = 0;
my $aspathonestarincluded = 0;
my $aspathnostarincluded = 0;


my $totalpoppath = 0;
my $poppathmorethanoneunknownskipped = 0;
my $poppathoneunknownincluded = 0;
my $poppathnounknownincluded = 0;


## subroutine to process one traceroute, create bgp entries and poppath entries, and insert entries into bgp table and poppath table
sub processoneprobe {

    my ($probetime, $starttime, @hops) = @_;
    my $asstr = "";
    my $lastasn="-1";
    my @asgroups = ();
    my $cntgroup = "";
    my $cntindex = 0;

    my $srchop = $hops[0];
    my $dsthop = $hops[$#hops];

    my ($srcip, $dummy1, $dummy2, $srcasn, $srcpop) = split(/:/, $srchop);
    my ($dstip, $dummy3, $dummy4, $dstasn, $dstpop) = split(/:/, $dsthop);

    my @noduplicates = ();

    ## remove duplicate IPs in the hops
    my $lastip = "-1";

    for(my $i=0; $i<=$#hops; $i++) {

        if($hops[$i] eq "*") {
            push(@noduplicates, $hops[$i]);
        }
        else {
            my ($cntip, $cntdummy1, $cntdummy2, $cntasn, $cntpop) = split(/:/, $hops[$i]);

            if($cntip ne $lastip) {
                push(@noduplicates, $hops[$i]);
                $lastip = $cntip;
            }
        }
    }

    @hops = @noduplicates;

    ## get as-path and divide hops into AS groups by getting each AS group's hop indices
    foreach my $onehop (@hops) {

        ## skip stars
        if($onehop eq "*") {
            $cntindex++;
            next;
        }

        my ($cntIP, $cntrtt, $cntttl, $cntasn, $cntpop) = split(/:/, $onehop);

        ## this is the first AS in the traceroute path
        if($lastasn eq "-1") {
            if($cntasn ne "NULL" && $cntasn ne "0") {
                $asstr = $cntasn;
                $cntgroup = $cntindex;
                $lastasn = $cntasn;
            }
        }
        else {  ## Non-first AS in the traceroute path
            if($cntasn ne "NULL" && $cntasn ne "0" && $cntasn ne $lastasn) {

                push(@asgroups, $cntgroup);

                $asstr .= ">$cntasn";
                $cntgroup = $cntindex;
                $lastasn = $cntasn;

            }
            elsif($cntasn ne "NULL" && $cntasn ne "0" && $cntasn eq $lastasn) {
                $cntgroup .= ":$cntindex";
            }

            if($cntindex == $#hops) {
                push(@asgroups, $cntgroup);
            }
        }

        $cntindex++;
    }

    ## get the PoP-level paths for the as groups
    my @groups = ();

    foreach my $onegroup (@asgroups) {
        my (@indices) = split(/:/, $onegroup);
        my $firstindex = $indices[0];
        my $lastindex = $indices[$#indices];


        my $groupstr = "";
        my $lastpop = "-1";
        my $i;

        for($i=$firstindex; $i<=$lastindex; $i++) {
            my $cnthop = $hops[$i];

            if($cnthop eq "*") {

                if($groupstr eq "") {
                    $groupstr .= $cnthop;
                    $lastpop = "NULL";
                }
                else {
                    $groupstr .= "|$cnthop";
                    $lastpop = "NULL";
                }
            }
            else {
                my ($cntIP, $cntrtt, $cntttl, $cntasn, $cntpop) = split(/:/, $cnthop);

                if($cntpop eq "NULL") {
                    if($groupstr eq "") {
                        $groupstr .= "$cntIP:$cntasn:$cntpop:$i";  ## i is the index of the hop in hops
                        $lastpop = "NULL";
                    }
                    else {
                        $groupstr .= "|$cntIP:$cntasn:$cntpop:$i";
                        $lastpop = "NULL";
                    }

                }
                elsif($cntpop ne $lastpop) {
                    if($groupstr eq "") {
                        $groupstr .= "$cntIP:$cntasn:$cntpop:$i";
                        $lastpop = $cntpop;
                    }
                    else {
                        $groupstr .= "|$cntIP:$cntasn:$cntpop:$i";
                        $lastpop = $cntpop;
                    }
                }
            }
        } ## end for($i ... ...)

        push(@groups, $groupstr);
    }


    ## begin to process these groups one by one

    my $ii;    ## index for groups
    for($ii=0; $ii<=$#groups; $ii++) {
        my $onegroup = $groups[$ii];

        my (@cntpops) = split(/\|/, $onegroup);

        my ($cntIP, $cntasn, $cntpop, $cntindex) = split(/:/, $cntpops[0]);

        if($cntasn != $fileasn) {
            next;
        }

        my $cntlastasnpop = $cntpops[$#cntpops];

        my $afterstars = 0;
        my $afterhasstar = 0;

        if($cntlastasnpop eq "*") {   ## sanity check: the last hop of the current group must never be *
            print "Current asnpop is *. \n";
            print "Something is wrong with the code. Please check and fix it.\n";
            exit(1);
        }

        my $tmpi;
        my ($cntlastIP, $cntlastasn, $cntlastpop, $cntlastindex) = split(/:/, $cntlastasnpop);

        for($tmpi=$cntlastindex+1; $tmpi<=$#hops; $tmpi++) {
            if($hops[$tmpi] eq "*") {
                $afterstars++;
                $afterhasstar = 1;

                if($afterstars == 2) {
                    last;
                }
            }
            else {
                $afterstars = 0;
            }
        }

        my $partialaspath;

        if($afterstars < 2) {

            ### get the partial as path from current as to the destination as
            my (@ashops) = split(/>/, $asstr);
            my $asindex;
            my $jj;

            for($jj=0; $jj<=$#ashops; $jj++) {
                my $cntasn = $ashops[$jj];

                if($cntasn == $cntlastasn) {
                    $asindex = $jj;
                    next;
                }
            }

            $partialaspath = $ashops[$asindex];

            for($jj=$asindex+1; $jj<=$#ashops; $jj++) {
                $partialaspath .= ">$ashops[$jj]";
            }
        }
        else {
            $partialaspath = "NULL";   ## do not use the AS path if there exists two consecutive "*" after the current AS
        }

        ## begin to test previous egress--next ingress PoP entry and populate it into the bgp table
        if($ii<$#groups) {      ## it is not the last AS
            my $nextgroup = $groups[$ii+1];
            my (@nextpops) = split(/\|/, $nextgroup);
            my $nextfirstasnpop = $nextpops[0];

            my $nextegresstype = 0;
            my ($cntsecondlastIP, $cntsecondlastasn, $cntsecondlastpop, $secondcntlastindex);

            if($cntlastasn ne "NULL" && $cntlastpop ne "NULL")  {  ## current last pop is valid
                $nextegresstype = 1;
            }
            elsif($#cntpops >= 1) {
                my $cntsecondlastasnpop = $cntpops[$#cntpops-1];

                if($cntsecondlastasnpop ne "*") {
                    ($cntsecondlastIP, $cntsecondlastasn, $cntsecondlastpop, $secondcntlastindex) = split(/:/, $cntsecondlastasnpop);

                    if($cntsecondlastasn ne "NULL" && $cntsecondlastpop ne "NULL") {
                        $nextegresstype = 2;
                    }
                }
            }

            if($nextegresstype > 0) {  

                if($nextfirstasnpop ne "*") {
                    my ($nextfirstIP, $nextfirstasn, $nextfirstpop, $nextfirstindex) = split(/:/, $nextfirstasnpop);


                    ## check whether there are more than 1 consecutive * hop between the last PoP and next-AS ingress IP
                    my $betweenstars = 0;

                    for($tmpi=$cntlastindex+1; $tmpi<$nextfirstindex; $tmpi++) {
                        if($hops[$tmpi] eq "*") {
                            $betweenstars++;
                        }
                    }

                    my $key;

                    if($nextegresstype == 1) {
                        $key = "$srcip<$dstip<$cntlastasn<$cntlastpop<$starttime";
                    }
                    else {
                        $key = "$srcip<$dstip<$cntsecondlastasn<$cntsecondlastpop<$starttime";
                    }

                    my $oneentry;

                    if($betweenstars >= 2) {
                        $oneentry = "$probetime<$nextfirstasn<NULL<$partialaspath";
                    }
                    else {
                        $oneentry = "$probetime<$nextfirstasn<$nextfirstIP<$partialaspath";
                    }

                    my $entries = $bgphash{$key};

                    if(not(defined($entries))) {
                        $totalnextingress++;

                        if($betweenstars >= 2) {
                            $nextingressmorethanonestarskipped++;
                        }
                        else {

                            if($betweenstars == 1 && $nextegresstype == 1) {
                                $nextingressonestarnounknownincluded++;
                            }
                            elsif($betweenstars == 1 && $nextegresstype == 2) {
                                $nextingressonestarunknownincluded++;
                            }
                            elsif($betweenstars == 0 && $nextegresstype == 1) {
                                $nextingressnostarnounknownincluded++;
                            }
                            elsif($betweenstars == 0 && $nextegresstype == 2) {
                                $nextingressnostarunknownincluded++;
                            }
                        }  ## end if($betweenstars >= 2) { ... } else { ... }

                        $totalaspath++;

                        if($afterstars >= 2) {
                            $aspathmorethanonestarskipped++;
                        }
                        elsif($afterhasstar == 1) {
                            $aspathonestarincluded++;
                        }
                        else {
                            $aspathnostarincluded++;
                        }

                        if($betweenstars < 2 || $afterstars < 2) {
                            @{$bgphash{$key}} = ($oneentry);
                        }
                    }
                    else {
                        if(bgpcontains(\@{$entries}, $oneentry) == 0) {
                            $totalnextingress++;

                            if($betweenstars >= 2) {
                                $nextingressmorethanonestarskipped++;
                            }
                            else {

                                if($betweenstars == 1 && $nextegresstype == 1) {
                                    $nextingressonestarnounknownincluded++;
                                }
                                elsif($betweenstars == 1 && $nextegresstype == 2) {
                                    $nextingressonestarunknownincluded++;
                                }
                                elsif($betweenstars == 0 && $nextegresstype == 1) {
                                    $nextingressnostarnounknownincluded++;
                                }
                                elsif($betweenstars == 0 && $nextegresstype == 2) {
                                    $nextingressnostarunknownincluded++;
                                }
                            }  ## end if($betweenstars >= 2) { ... } else { ... }

                            $totalaspath++;

                            if($afterstars >= 2) {
                                $aspathmorethanonestarskipped++;
                            }
                            elsif($afterhasstar == 1) {
                                $aspathonestarincluded++;
                            }
                            else {
                                $aspathnostarincluded++;
                            }

                            if($betweenstars < 2 || $afterstars < 2) {
                                push(@{$bgphash{$key}}, $oneentry);
                            }
                        }
                    }
                }  ## end if($nextfirstasnpop ne "*")
            } ## end if($nextegresstype > 0)
        }


        ## begin to generate bgp and IGP PoP-path entries for this group

        my (@asnpops) = split(/\|/, $onegroup);

        if($asnpops[0] eq "*") {
            print "The first pop is *.\n";
            print "The asnpops are @asnpops\n";
            print "Something is wrong with the code. Please check and fix it.\n";
            exit(1);
        }

        if($asnpops[$#asnpops] eq "*") {
            print "The last PoP is *\n";
            print "The asnpops are @asnpops\n";
            print "Something is wrong with the code. Please check and fix it.\n";
            exit(1);
        }

        my ($egressIP, $egressasn, $egresspop, $egressindex) = split(/:/, $asnpops[$#asnpops]);

        my $egresstype = 0;   ## this variable keeps the type of the egress

        if($egresspop ne "NULL") {
            $egresstype = 1;
        }
        else {
            if($#asnpops>=1) {
                if($asnpops[$#asnpops-1] ne "*") {
                    ($egressIP, $egressasn, $egresspop, $egressindex) = split(/:/, $asnpops[$#asnpops-1]);
                    if($egresspop ne "NULL") {
                        $egresstype = 2;
                    }
                }
            }
        }

        my $i;
        my $j;


        for($i=0; $i<=$#asnpops-1; $i++) {

            if($asnpops[$i] eq "*") {
                next;
            }

            my ($startIP, $startasn, $startpop, $startindex) = split(/:/, $asnpops[$i]);

            if($startasn eq "NULL" || $startpop eq "NULL" || $startpop eq $egresspop) {
                next;
            }

            $totalaspath++;

            if($afterstars >= 2) {
                $aspathmorethanonestarskipped++;
            }
            elsif($afterhasstar == 1) {
                $aspathonestarincluded++;
            }
            else {
                $aspathnostarincluded++;
            }


            my $key = "$srcip<$dstip<$startasn<$startpop<$starttime";

            my $entries = $bgphash{$key};
            my $oneentry = "$probetime<$egressasn<$egresspop<$partialaspath";


            if(not(defined($entries))) {
                $totalsameasnexthop++;

                if($egresstype != 0) {
                    if($egresstype == 1) {
                        $sameasnexthopegressnotunknown++;
                    }
                    elsif($egresstype == 2) {
                        $sameasnexthopegressunknown++;
                    }

                    @{$bgphash{$key}} = ($oneentry);
                }
                else {
                    $sameasnexthopegresstwounknownsdiscarded++;
                }
            }
            else {
                if(bgpcontains(\@{$entries}, $oneentry) == 0) {
                    $totalsameasnexthop++;

                    if($egresstype != 0) {
                        if($egresstype == 1) {
                            $sameasnexthopegressnotunknown++;
                        }
                        elsif($egresstype == 2) {
                            $sameasnexthopegressunknown++;
                        }
                        push(@{$bgphash{$key}}, $oneentry);
                    }
                    else {
                        $sameasnexthopegresstwounknownsdiscarded++;
                    }
                }
            }

            my $poppath = $startpop;

            ## begin to process the right-side paths of the current pop
            for($j=$i+1; $j<=$#asnpops; $j++) {
                if($asnpops[$j] eq "*") {
                    $poppath .= ">*";
                    next;
                }
                else {
                    my ($endip, $endasn, $endpop, $endindex) = split(/:/, $asnpops[$j]);

                    if($endasn ne "NULL" && $endasn ne $startasn) {
                        print "The end pop asn is $endasn, and the start asn is $startasn\n";
                        print "The group is $onegroup\n";
                        print "Something is wrong with the code. Please check and fix it.\n";
                        exit(1);
                    }

                    if($endpop eq "NULL") {
                        $poppath .= ">NULL";
                        next;
                    }
                    else {
                        $poppath .= ">$endpop";

                        ## begin to inspect and compress the PoP-path

                        my @pops = split(/>/, $poppath);
                        my @knownpops = ();

                        for(my $tmpkk=0; $tmpkk<=$#pops; $tmpkk++) {

                            ## keep known PoP indices into PoP list
                            if($pops[$tmpkk] ne "*" && $pops[$tmpkk] ne "NULL") {
                                push(@knownpops, $tmpkk);
                            }
                        }

                        if($knownpops[0] != 0 || $knownpops[$#knownpops] != $#pops) {
                            print "The first known PoP index is $knownpops[0], and the last known PoP index is $knownpops[$#knownpops].\n";
                            print "Something is wrong with the code. Please check and fix it.\n";
                            exit(1);
                        }

                        my $twounknownnotbetweensamePoP = 0;
                        my $oneunknownnotbetweensamePoP = 0;

                        my $newpoppath = "$pops[0]";
                        my $lastkeptPoP = $pops[0];
                        my $lastPoPindex = 0;

                        for(my $tmpkk=1; $tmpkk<=$#knownpops; $tmpkk++) {
                            my $cntPoP = $pops[$knownpops[$tmpkk]];

                            if($cntPoP ne $lastkeptPoP) {
                                ## there is more than one NULL PoP or * between current PoP and last known PoP
                                if($knownpops[$tmpkk]-$lastPoPindex > 2) {
                                    $twounknownnotbetweensamePoP = 1;
                                    last;
                                }
                                ## there is one NULL PoP or * between current PoP and last known PoP, add a wild card
                                elsif($knownpops[$tmpkk]-$lastPoPindex == 2) {
                                    if($oneunknownnotbetweensamePoP == 0) {  ## this is the first NULL PoP or * between two known PoPs
                                        $oneunknownnotbetweensamePoP = 1;
                                        $lastPoPindex = $knownpops[$tmpkk];
                                        $lastkeptPoP = $cntPoP;
                                        $newpoppath .= ">*>$cntPoP";
                                    }
                                    else {
                                        $twounknownnotbetweensamePoP = 1;
                                        last;
                                    }
                                }
                                else {     ## there is no NULL PoP or * between two known PoPs
                                    $lastPoPindex = $knownpops[$tmpkk];
                                    $lastkeptPoP = $cntPoP;
                                    $newpoppath .= ">$cntPoP";
                                }
                            }
                        }

                        ## begin to find the earliest IP index in the same PoP as startIP
                        my $earliestindex = $startindex;
                        for(my $lll=$startindex; $lll>=0; $lll--) {
                            my $cnthop = $hops[$lll];

                            if($hops[$lll] ne "*") {
                                my ($cntip, $cntdummy1, $cntdummy2, $cntasn, $cntpop) = split(/:/, $hops[$lll]);

                                if($cntasn eq $startasn && $cntpop eq $startpop) {
                                    $earliestindex = $lll;
                                }
                                elsif($cntpop ne "NULL" && $cntpop ne $startpop) {
                                    last;
                                }
                                elsif($cntasn ne "NULL" && $cntasn ne $startasn) {
                                    last;
                                }
                            }
                        }

                        ## begin to find the latest IP index in the same PoP as endIP
                        my $latestindex = $endindex;
                        for(my $lll=$endindex; $lll<=$#hops; $lll++) {
                            my $cnthop = $hops[$lll];

                            if($hops[$lll] ne "*") {
                                my ($cntip, $cntdummy1, $cntdummy2, $cntasn, $cntpop) = split(/:/, $hops[$lll]);

                                if($cntasn eq $endasn && $cntpop eq $endpop) {
                                    $latestindex = $lll;
                                }
                                elsif($cntpop ne "NULL" && $cntpop ne $endpop) {
                                    last;
                                }
                                elsif($cntasn ne "NULL" && $cntasn ne $endasn) {
                                    last;
                                }
                            }
                        }



                        my $ippathlen = $latestindex - $earliestindex + 1;

                        my $key = "$srcip<$dstip<$startasn<$startpop<$endpop<$starttime";
                        my $oneentry = "$probetime<$newpoppath<$ippathlen";

                        my $entries = $igphash{$key};

                        if(not(defined($entries))) {
                            $totalpoppath++;

                            if($twounknownnotbetweensamePoP == 1) {
                                $poppathmorethanoneunknownskipped++;
                            }
                            else {
                                if($oneunknownnotbetweensamePoP == 1) {
                                    $poppathoneunknownincluded++;
                                }
                                else {
                                    $poppathnounknownincluded++;
                                }
                                @{$igphash{$key}} = ($oneentry);
                            }
                        }
                        else {

                            if(igpcontains(\@{$entries}, $oneentry) == 0) {
                                $totalpoppath++;

                                if($twounknownnotbetweensamePoP == 1) {
                                    $poppathmorethanoneunknownskipped++;
                                }
                                else {
                                    if($oneunknownnotbetweensamePoP == 1) {
                                        $poppathoneunknownincluded++;
                                    }
                                    else {
                                        $poppathnounknownincluded++;
                                    }

                                    push(@{$igphash{$key}}, $oneentry);
                                }
                            }  ## end if(igpcontains(\@{$entries}, $oneentry) == 0) { ... }
                        }  ## end if(not(defined($entries))) { ... } else { ... }
                    }  ## end  if($endpop eq "NULL") { ... } else { ... }
                }  ## end  if($asnpops[$j] eq "*") { ... } else { ... }
            }  ## end  for($j=$i+1; $j<=$#asnpops; $j++)
        } ## end for($i=0; $i<=$#asnpops-1; $i++)
    } ## end for($ii=0; $ii<=$#groups; $ii++)
}



sub populatehashes {

# connect to database
    $dbh = DBI->connect ($dsn, $user_name, $password,
        { RaiseError => 1, PrintError => 0 });


    my $str = "LOCK TABLES bgp WRITE";
    $sth = $dbh->prepare($str);
    $sth->execute();

    $str = "INSERT INTO bgp (ptime, tstart, vpip, dip, cntas, cntpop, nextas, nextpop, aspath) VALUES ";
    my $first = 1;
    my $entrycount = 0;

    while ( my ($key, $val) = each(%bgphash) ) {
        my ($srcip, $dstip, $asn, $pop, $starttime) = split(/</, $key);

        my $count=0;
        my $strtowrite = "";

        $entrycount++;

        foreach my $one (@{$val}) {
            my ($probetime, $nextasn, $nextpop, $aspath) = split(/</, $one);

            $count++;

            if($first == 1) {
                $str .= "('$probetime', '$starttime', '$srcip', '$dstip', $asn, '$pop', $nextasn, '$nextpop', '$aspath')";
                $first = 0;
            }
            else {
                $str .= ", ('$probetime', '$starttime', '$srcip', '$dstip', $asn, '$pop', $nextasn, '$nextpop', '$aspath')";
            }

            $strtowrite .= "$probetime\t$starttime\t$srcip\t$dstip\t$asn\t$pop\t$nextasn\t$nextpop\t$aspath\n";
        }

        if($count>1) {
            print OUTPUT2 "$strtowrite";
        }

        if($entrycount >= 3000) {
            $sth = $dbh->prepare($str);
            $sth->execute();

            $str = "INSERT INTO bgp (ptime, tstart, vpip, dip, cntas, cntpop, nextas, nextpop, aspath) VALUES ";
            $first = 1;
            $entrycount = 0;
        }

    }

    if($entrycount > 0) {
        $sth = $dbh->prepare($str);
        $sth->execute();
    }


    $str = "LOCK TABLES poppath WRITE";
    $sth = $dbh->prepare($str);
    $sth->execute();


    $str = "INSERT INTO poppath (ptime, tstart, vpip, dip, asn, srcpop, dstpop, poppath, ippathlen) VALUES ";
    $first = 1;
    $entrycount = 0;


    while ( my ($key, $val) = each(%igphash) ) {
        my ($srcip, $dstip, $asn, $srcpop, $dstpop, $starttime) = split(/</, $key);
        my $count = 0;
        my $strtowrite="";

        $entrycount++;

        foreach my $one (@{$val}) {
            my ($probetime, $poppath, $ippathlen) = split(/</, $one);

            $count++;

            if($first == 1) {
                $str .= "('$probetime', '$starttime', '$srcip', '$dstip', $asn, '$srcpop', '$dstpop', '$poppath', '$ippathlen')";
                $first = 0;
            }
            else {
                $str .= ", ('$probetime', '$starttime', '$srcip', '$dstip', $asn, '$srcpop', '$dstpop', '$poppath', '$ippathlen')";
            }

            $strtowrite .= "$probetime\t$starttime\t$srcip\t$dstip\t$asn\t$srcpop\t$dstpop\t$poppath\t$ippathlen\n";
        }

        if($count>1) {
            print OUTPUT1 "$strtowrite";
        }

        if($entrycount >= 3000) {
            $sth = $dbh->prepare($str);
            $sth->execute();

            $str = "INSERT INTO poppath (ptime, tstart, vpip, dip, asn, srcpop, dstpop, poppath, ippathlen) VALUES ";
            $first = 1;
            $entrycount = 0;
        }
    }

    if($entrycount>0) {
        $sth = $dbh->prepare($str);
        $sth->execute();
    }


    $str = "UNLOCK TABLES";

    $sth = $dbh->prepare($str);
    $sth->execute();

    %bgphash = ();
    %igphash = ();

}



my $filecount = 0;

my $discardedIPloops = 0;
my $discardedPoPloops = 0;
my $discardedASloops = 0;
my $processedtraceroutes = 0;


sub discardpath {
    my (@hops) = @_;

    ## check for IP, PoP, AS loops
    my %iphash = ();
    my %pophash = ();
    my %ashash = ();
    my $lastip = "-1";
    my $lastasn = -1;
    my $lastpop = "-1";

    my $aspath = "";

    foreach my $onehop (@hops) {

        if($onehop ne "*") {
            my ($ip, $rtt, $ttl, $asn, $pop) = split(/:/, $onehop);

            if($ip ne $lastip) {
                if(not(defined($iphash{$ip}))) {
                    $iphash{$ip} = 1;
                    $lastip = $ip;
                }
                else {
                    $discardedIPloops++;

                    ## print "The traceroute @hops contains an IP loop.\n";
                    ## print "$ip appeared more than once.\n";
                    return 1;
                }
            }

            if($asn ne "NULL") {

                if($pop ne "NULL") {
                    my $cntpop = "$asn->$pop";

                    if($cntpop ne $lastpop) {

                        if(not(defined($pophash{$cntpop}))) {
                            $pophash{$cntpop} = 1;
                            $lastpop = $cntpop;
                        }
                        else {
                            $discardedPoPloops++;

                            ## print "The traceroute @hops contains a PoP loop.\n";
                            return 2;
                        }
                    }
                }

                $aspath .= "$asn|";

                if($asn ne $lastasn) {
                    if(not(defined($ashash{$asn}))) {
                        $ashash{$asn} = 1;
                        $lastasn = $asn;
                    }
                    else {

                        $discardedASloops++;

                        return 3;
                    }

                }
            }

        }
    }

    return 0;
}



my $totaltraceroutes = 0;

my $processbegintime = time();  ## get time in seconds since 1970
my $lastprocesstime = $processbegintime;
my $cntprocesstime;
my $totalprocesstime = 0;


# begin to process traceroute plain texts, and map intermediate IPs to their AS numbers and locations (POPs)
foreach my $file (@files) {

    my $lastinserttime="-1";     ## this stores the old data collection hour

    # this is traceroute plain text file
    if(-f $file) {


        open(INPUT1, "zcat $file | ") || die "can't open file $file for read";

        $filecount++;

        print "File $filecount is $file\n";

        my $tag;
        my $date;
        my $time;
        my $srcaddr;
        my $arrow;
        my $dstaddr;
        my $icmpstatus;
        my $hopcount;
        my @hops = ();
        my $lastindex=0;
        my $dummy;
        my $cntindex=0;
        my $cntIP;
        my $cntrtt;
        my $cntttl;
        my $firsthop;
        my $lasthop;


        ## begin to extract probing time from the file name
        #  extension is in the format of .*

        $file =~ /.*\/([^\/]+)/;
        my $filename = $1;
        my ($site, $datetime, $re, $targetasn) = split(/_/, $filename);

        $fileasn = $targetasn;

        $datetime =~ /(\d\d)(\d\d)(\d\d)(\d\d)(\d\d)(\d\d)/;
        my ($yy, $mm, $dd, $hh, $min, $ss) = ($1, $2, $3, $4, $5, $6);

        my $yyyy;
        my $probetime;
        my $starttime;

        $probetime = "20$yy-$mm-$dd $hh:$min:$ss";

        my $startmin = 15*int(($min/15));
        $starttime = "20$yy-$mm-$dd $hh:$startmin:00";

        @filecache = ();

        while(my $line = <INPUT1>) {
            push(@filecache, $line);
        }

        ## begin to process lines iteratively
        foreach my $line (@filecache) {
            chomp($line);

            ## this is the beginning of a new traceroute probe
            if($line =~ /->/) {
                ($tag, $date, $time, $srcaddr, $arrow, $dstaddr, $icmpstatus, $hopcount) = split(/\s+/, $line);

                ## find out the asn and pop for the first IP
                my $oneIP;

                if(not($oneIP = new Net::IP($srcaddr))) {
                    print "Net::IP::Error()\n";
                    print "The file is $filename, the line is $line\n";
                    last;
                }

                my $IPnum = $oneIP->intip();

                my $cntasn;
                my $cntpop;

                $cntasn = $ipasntable{$IPnum};

                if(not(defined($cntasn))) {

                    ## look up asn value from the prefix-as mapping patricia handler
                    $cntasn = $pt->match_string($srcaddr);

                    if(not(defined($cntasn))) {
                        $cntasn = "NULL";
                    }
                }



                $cntpop = $iploctable{$IPnum};

                if(not(defined($cntpop))) {
                    $cntpop = "NULL";
                }

                $firsthop = "$srcaddr:-1:-1:$cntasn:$cntpop";


                if($lastinserttime ne "-1") {

                    my $oldtime = to_seconds($lastinserttime);
                    my $newtime = to_seconds("$yy$mm$dd$hh$min$ss");

                    my $timediff = $newtime - $oldtime;

                    ## populate bgp and igp hash into database if it has been 3 hours since last insertion
                    if( $timediff > 10800  ) {
                        $cntprocesstime = time();
                        $totalprocesstime += $cntprocesstime - $lastprocesstime;
                        $lastprocesstime = $processbegintime;

                        populatehashes();
                        $lastinserttime = "$yy$mm$dd$hh$min$ss";
                    }

                }
                else {
                    $lastinserttime = "$yy$mm$dd$hh$min$ss";
                }


                ## find out the asn and pop for the last IP
                if(not($oneIP = new Net::IP($dstaddr))) {
                    print "Net::IP::Error()\n";
                    print "The file is $filename, the line is $line\n";

                    last;
                }

                $IPnum = $oneIP->intip();

                $cntasn = $ipasntable{$IPnum};

                if(not(defined($cntasn))) {

                    ## look up asn value from the prefix-as mapping patricia handler
                    $cntasn = $pt->match_string($dstaddr);

                    if(not(defined($cntasn))) {
                        $cntasn = "NULL";
                    }
                }

                $cntpop = $iploctable{$IPnum};

                if(not(defined($cntpop))) {
                    $cntpop = "NULL";
                }

                $lasthop = "$dstaddr:-1:-1:$cntasn:$cntpop";

            }
            elsif($line =~ /duration/) {
                next;
            }
            elsif($line !~ /^\s*$/) {   # process the line if it contains other than white spaces
                ($dummy, $cntindex, $cntIP, $cntrtt, $cntttl) = split(/\s+/, $line);

                if($lastindex != 0) {  # not the beginning of the first hop
                    my $numstars = $cntindex-$lastindex-1;
                    my $i;

                    ## fill in * for those missing hops
                    for($i=0; $i<$numstars; $i++) {
                        push(@hops, "*");
                    }
                }

                # print "Current ip is $cntIP\n";
                my $oneIP;

                if(not($oneIP = new Net::IP($cntIP))) {
                    print "Net::IP::Error()\n";
                    print "The file is $filename, the line is $line\n";

                    last;
                }

                my $IPnum = $oneIP->intip();

                my $cntasn;
                my $cntpop;


                $cntasn = $ipasntable{$IPnum};

                if(not(defined($cntasn))) {

                    ## look up asn value from the prefix-as mapping patricia handler
                    $cntasn = $pt->match_string($cntIP);

                    if(not(defined($cntasn))) {
                        $cntasn = "NULL";
                    }
                }

                $cntpop = $iploctable{$IPnum};

                if(not(defined($cntpop))) {
                    $cntpop = "NULL";
                }



                my $hopstr = "$cntIP:$cntrtt:$cntttl:$cntasn:$cntpop";
                push(@hops, $hopstr);

                $lastindex = $cntindex;
            }
            elsif($line =~ /^\s*$/) {   # this is the white space lines

                ## begin to process one traceroute probing result
                if($#hops >=0) {
                    my @cnthops = ($firsthop, @hops, $lasthop);

                    $totaltraceroutes++;

                    if(discardpath(@cnthops) == 0) {
                        $processedtraceroutes++;
                        processoneprobe($probetime, $starttime, @cnthops);
                    }
                }

                ## last line is the end of one traceroute probe
                if($lastindex != 0) {
                    @hops = ();        # reset hops array to empty
                    $lastindex = 0;    # reset lastindex to 0
                }
            }
        }

        close(INPUT1);
    }
}

$cntprocesstime = time();
$totalprocesstime += $cntprocesstime - $lastprocesstime;
$lastprocesstime = $processbegintime;

populatehashes();


$sth->finish ();
$dbh->disconnect ();

my $ratio;

print OUTPUT3 "$totalfiles traceroute files are checked.\n";
print OUTPUT3 "There are $numgroups group(s) of probing in the checking interval.\n";

for my $onetime (sort keys %timefilehash) {
    my @cntset = @{$timefilehash{$onetime}};
    my $goodfiles = $#cntset+1;
    print OUTPUT3 "At time $onetime, there are $goodfiles good files.\n";
    foreach my $onefile (@{$timefilehash{$onetime}}) {
        print "The file is $onefile\n";
    }
}

print OUTPUT3 "\n\n";



$ratio = $corruptedfiles/$totalfiles;
print OUTPUT3 "$corruptedfiles, counted as $ratio, traceroute files are corrupted and are discarded.\n";
$ratio = $workingfiles/$totalfiles;
print OUTPUT3 "$workingfiles, counted as $ratio, traceroute files are good.\n\n\n";


print OUTPUT3 "$totaltraceroutes traceroute paths are parsed.\n";
$ratio = $discardedIPloops/$totaltraceroutes;
print OUTPUT3 "$discardedIPloops, counted as $ratio, traceroute paths are discarded due to IP loops.\n";
$ratio = $discardedPoPloops/$totaltraceroutes;
print OUTPUT3 "$discardedPoPloops, counted as $ratio, traceroute paths are discarded due to PoP loops.\n";
$ratio = $discardedASloops/$totaltraceroutes;
print OUTPUT3 "$discardedASloops traceroute paths, counted as $ratio, are discarded due to AS loops.\n";
$ratio = $processedtraceroutes/$totaltraceroutes;
print OUTPUT3 "$processedtraceroutes, counted as $ratio, traceroutes are processed.\n\n\n";



print OUTPUT3 "$totalnextingress BGP next-ingresses are checked.\n";
$ratio = $nextingressmorethanonestarskipped/$totalnextingress;
print OUTPUT3 "$nextingressmorethanonestarskipped, counted as $ratio, BGP next-ingresses are discarded due to more than one consecutive stars between the egress and next-ingress IP.\n";
$ratio = $nextingressonestarnounknownincluded/$totalnextingress;
print OUTPUT3 "$nextingressonestarnounknownincluded, counted as $ratio, BGP next-ingresses are included with last IP in AS mapped to PoP and one * between the egress and next-ingress IP.\n";
$ratio = $nextingressonestarunknownincluded/$totalnextingress;
print OUTPUT3 "$nextingressonestarunknownincluded, counted as $ratio, BGP next-ingresses are included with last IP in AS unmapped while second last IP mapped to PoP, and one * between the last IP and next-ingress IP.\n";
$ratio = $nextingressnostarnounknownincluded/$totalnextingress;
print OUTPUT3 "$nextingressnostarnounknownincluded, counted as $ratio, BGP next-ingresses are included with last IP in AS mapped to PoP and no star between the egress and next-ingress IP.\n";
$ratio = $nextingressnostarunknownincluded/$totalnextingress;
print OUTPUT3 "$nextingressnostarunknownincluded, counted as $ratio, BGP next-ingresses are included with last IP in AS unmapped while second last IP mapped to PoP, and no star between the egress and next-ingress IP.\n\n\n";


print OUTPUT3 "$totalsameasnexthop BGP same-AS egresses are checked.\n";
$ratio = $sameasnexthopegresstwounknownsdiscarded/$totalsameasnexthop;
print OUTPUT3 "$sameasnexthopegresstwounknownsdiscarded, counted as $ratio, BGP same-AS egresses are discarded due to last and second last IPs in AS unmapped.\n";
$ratio = $sameasnexthopegressnotunknown/$totalsameasnexthop;
print OUTPUT3 "$sameasnexthopegressnotunknown, counted as $ratio, BGP same-AS egresses are included with last IP in AS mapped to PoP.\n";
$ratio = $sameasnexthopegressunknown/$totalsameasnexthop;
print OUTPUT3 "$sameasnexthopegressunknown, counted as $ratio, BGP same-AS egresses are included with last IP in AS unmapped and second last IP mapped.\n\n\n";



print OUTPUT3 "$totalaspath BGP AS-paths are checked.\n";
$ratio = $aspathmorethanonestarskipped/$totalaspath;
print OUTPUT3 "$aspathmorethanonestarskipped, counted as $ratio, BGP AS-paths are discarded due to more than one consecutive stars on IP-path from current AS to destination host.\n";
$ratio = $aspathonestarincluded/$totalaspath;
print OUTPUT3 "$aspathonestarincluded, counted as $ratio, BGP AS-paths are included with single star(s) on IP-path from current AS to destination host.\n";
$ratio = $aspathnostarincluded/$totalaspath;
print OUTPUT3 "$aspathnostarincluded, counted as $ratio, BGP AS-paths are included with no star on IP-path from current AS to destination host.\n\n\n";


print OUTPUT3 "$totalpoppath IGP PoP-paths are checked.\n";
$ratio = $poppathmorethanoneunknownskipped/$totalpoppath;
print OUTPUT3 "$poppathmorethanoneunknownskipped, counted as $ratio, IGP PoP-paths are discarded due to more than one consecutive unknown PoPs (either NULL PoP or *).\n";
$ratio = $poppathoneunknownincluded/$totalpoppath;
print OUTPUT3 "$poppathoneunknownincluded, counted as $ratio, IGP PoP-paths are included with single unknown PoP(s) (either NULL PoP or *).\n";
$ratio = $poppathnounknownincluded/$totalpoppath;
print OUTPUT3 "$poppathnounknownincluded, counted as $ratio, IGP PoP-paths are included with no unknown PoP(s) (either NULL PoP or *).\n\n\n";


my $stoptime = time();  ## get time in seconds since 1970

my $elapsedtime = $stoptime - $begintime;

print OUTPUT3 "The code stops at time $stoptime\n\n";
print OUTPUT3 "The code runs $elapsedtime seconds\n";
print OUTPUT3 "The traceroute processing runs $totalprocesstime seconds\n";


close(OUTPUT1);
close(OUTPUT2);
close(OUTPUT3);


exit (0);



I don't see anything in the Perl code that handles gzip files.  Do you have a wrapper script calling it?
See the line:

open(INPUT1, "zcat $file | ") || die "can't open file $file for read";
ASKER CERTIFIED SOLUTION
Tintin

I agree with Tintin.  Look at it this way -- if it's taking 8 hours to process 800 files, that's 36 seconds per file.  Of that 36 seconds, I'd bet 35 are spent on zcat operations, DBI operations, or network overhead.  Since these are essentially fixed operations (they'll be the same regardless of the language the script is written in), there's little to gain by switching languages.
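
For what it's worth, the arithmetic can be checked in a few lines of Perl. This is only a sketch using the figures quoted in this thread (800 files in 8 hours, roughly 700k files in total); the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Figures quoted in the thread: 800 files in 8 hours, about 700k files total.
my $hours_per_batch = 8;
my $files_per_batch = 800;
my $total_files     = 700_000;

# Average wall-clock seconds spent on one file.
my $secs_per_file = $hours_per_batch * 3600 / $files_per_batch;

# Projected run time for the whole data set, in days (86400 seconds per day).
my $total_days = $secs_per_file * $total_files / 86_400;

printf "%.0f seconds per file, about %.0f days for all files\n",
    $secs_per_file, $total_days;
```

At that rate the full run comes to a bit under 300 days, which matches the "about one year" estimate in the question.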

Your biggest time chunk is undoubtedly used decompressing the files.   I don't know the zcat program -- is there a faster alternative (like tar or gzip)?  You're already using MySQL, which is about the fastest SQL DB program out there....
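
On the zcat question: zcat does the same job as gzip -dc, so swapping one for the other is unlikely to change much. An alternative that avoids starting an external process for every file is the IO::Uncompress::Gunzip module that ships with recent Perls. The following is a minimal sketch, not a drop-in replacement for the script; each_gz_line is a hypothetical helper name, and whether this is actually faster than the zcat pipe would have to be measured on the real data.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: call $callback once for each decompressed line of
# the gzip file $path, and return the number of lines read.
sub each_gz_line {
    my ($path, $callback) = @_;

    # MultiStream lets the reader continue past concatenated gzip members.
    my $gz = IO::Uncompress::Gunzip->new($path, MultiStream => 1)
        or die "cannot open $path: $GunzipError";

    my $count = 0;
    while (defined(my $line = $gz->getline())) {
        chomp $line;
        $callback->($line);
        $count++;
    }

    $gz->close();
    return $count;
}

# Usage (the file name is a placeholder):
#   each_gz_line("some_traceroute_file.gz", sub { my ($line) = @_; ... });
```

Reading line by line this way also means the whole file never has to sit in an array before processing starts.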

Can you distribute the work amongst several computers?
Or, depending on what you're doing, is it necessary to completely decompress the files?  I assume each archive has several files in it -- is it necessary to decompress the entire file, or can you just extract the one you need to manipulate?
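If distributing the work is an option, one way to cut up the file list is by probing day, so that files from the same time window stay on the same worker. This is a rough sketch only: split_by_day is a hypothetical helper, it assumes file names carry a YYMMDDHHMMSS stamp followed by _re_ and the AS number (the pattern the script already matches), and the script itself would need changes before several copies could run side by side (for example, it currently drops and recreates its tables at startup).

```perl
use strict;
use warnings;

# Hypothetical helper: group files by the YYMMDD part of their timestamp,
# then deal whole days out to $n workers in turn. Returns $n array refs.
sub split_by_day {
    my ($n, @files) = @_;

    my %by_day;
    for my $file (@files) {
        # Skip names that do not carry the expected timestamp.
        next unless $file =~ /_(\d{6})\d{6}_re_\d+/;
        push @{ $by_day{$1} }, $file;
    }

    my @chunks = map { [] } 1 .. $n;
    my $i = 0;
    for my $day (sort keys %by_day) {
        push @{ $chunks[$i % $n] }, @{ $by_day{$day} };
        $i++;
    }

    return @chunks;
}
```

Each returned list could then be written to its own file list and handed to a separate run of the script.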
Actually, I timed the code. The DBI and zcat operations took only a very small amount of time. Also, each gz file contains only one compressed plain text file, and there are no network operations in my code.
Your script does this for INPUT1, INPUT2, INPUT3:
    my @filecache = ();
    while(my $oneline = <INPUT1>) {
        push(@filecache, $oneline);
    }
    foreach my $oneline (@filecache) {

It might be faster to do this (replace all of the above lines with this one line):
    while(my $oneline=<INPUT1>) {
=======================
In this loop:
    for my $onetime (sort keys %timefilehash) {
        my @cntset = @{$timefilehash{$onetime}};
        my $goodfiles = $#cntset+1;
        foreach my $onefile (@{$timefilehash{$onetime}}) {
            push(@files, $onefile);
            print "The file is $onefile\n";
        }
    }
You create @cntset and use it to compute $goodfiles.  Neither is used after this.  Removing those two lines should give you a little speed.

I haven't looked through the rest of the file... when you looked at the timing, which portions were using the most time?
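To answer that kind of question, one low-tech approach is to wrap the suspect calls in a timer and print the totals at the end. The sketch below uses Time::HiRes; timed, report_timings and %elapsed are hypothetical names introduced here, not part of the posted script.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my %elapsed;   # section name => accumulated wall-clock seconds

# Hypothetical helper: run $code and add the time it took to $elapsed{$name}.
# The code's return value is passed through to the caller.
sub timed {
    my ($name, $code) = @_;
    my $t0     = time();
    my @result = $code->();
    $elapsed{$name} += time() - $t0;
    return wantarray ? @result : $result[0];
}

# Print the sections, slowest first, to STDERR.
sub report_timings {
    for my $name (sort { $elapsed{$b} <=> $elapsed{$a} } keys %elapsed) {
        printf STDERR "%-20s %.3f s\n", $name, $elapsed{$name};
    }
}

# Usage inside the main loop might look like:
#   if (timed('discardpath', sub { discardpath(@cnthops) }) == 0) { ... }
#   timed('processoneprobe', sub { processoneprobe($probetime, $starttime, @cnthops) });
# followed by report_timings() before exit.
```

A profiler such as Devel::NYTProf from CPAN would give a per-line breakdown without editing the script, if installing a module is possible.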
I have modified the code to make it more readable. The new code is as follows.


#!/usr/bin/perl -w
#  process-traceroute-insert-mysql.pl


use strict;
use File::Find;
use File::Basename;
use DBI;
use Net::IP;
use Net::Patricia;
use Time::Local;


## start-time is the selected starting time, end-time is the selected ending time for traceroute data processing, prefixasfile is the prefix-as mapping file name, inconsistent-as-path-output is the file to store inconsistent aspath entries, inconsistent-pop-path-output is the file to store inconsistent poppath entries, discarding-stats-output is the file to store files and traceroutes being discarded

if($#ARGV != 7) {
    print "usage: process-traceroute.pl start-time end-time good-traceroute-file-list corrupted-traceroute-file-list prefix-as-mapping-file inconsistent-as-path-output inconsistent-pop-path-output policy-filtering-stats-output\n";
    print "start-time and end-time in format YYMMDDHHMMSS\n";
    exit(1);
}

my ($startingtime, $endingtime, $goodfilelist, $corruptedfilelist, $prefixasfile, $aspathoutput, $poppathoutput, $policyoutput) = @ARGV;


## open the prefix-as mapping file and store them in the Patricia handler
open(INPUT1, "<$prefixasfile") || die "cannot open $prefixasfile file for read.";

## open the good file list
open(INPUT2, "<$goodfilelist") || die "cannot open $goodfilelist file for read.";

## open the corrupted file list
open(INPUT3, "<$corruptedfilelist") || die "cannot open $corruptedfilelist file for read.";

## open the inconsistent as-path file for write
open(OUTPUT1, ">$aspathoutput") || die "cannot open $aspathoutput file for write.";

## open the inconsistent pop-path file for write
open(OUTPUT2, ">$poppathoutput") || die "cannot open $poppathoutput file for write.";

## open the policy-filtering-stats-output file for write
open(OUTPUT3, ">$policyoutput") || die "cannot open $policyoutput file for write.";

## open the AS-loop file for write
open(OUTPUT4, ">as-loops.txt") || die "cannot open as-loops.txt for write.";

## open the AS-loop distribution file for write
open(OUTPUT5, ">as-loops-num-ips.txt") || die "cannot open as-loops-num-ips.txt for write.";


my $pt = new Net::Patricia;
my $prefixt = new Net::Patricia;

$startingtime =~ /(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/;
my ($yy1, $mm1, $dd1, $hh1, $min1, $ss1) = ($1, $2, $3, $4, $5, $6);

$endingtime =~ /(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/;
my ($yy2, $mm2, $dd2, $hh2, $min2, $ss2) = ($1, $2, $3, $4, $5, $6);


print OUTPUT3 "Traceroute files between 20$yy1-$mm1-$dd1 $hh1:$min1:$ss1 and 20$yy2-$mm2-$dd2 $hh2:$min2:$ss2 are checked.\n";


my $fileasn;


my $begintime = time();  ## get time in seconds since 1970

print OUTPUT3 "The code starts time is $begintime\n";



while(my $oneline = <INPUT1>) {
    chomp($oneline);

    my ($oneprefix, $oneas) = split(/\s+/, $oneline);

    if($oneprefix =~ /\d{1,3}(\.\d{1,3}){3}\/\d{1,2}/) {

        $pt->add_string($oneprefix, $oneas);
        $prefixt->add_string($oneprefix);
    }
}

close(INPUT1);

my %timefilehash = ();

my $totalfiles = 0;
my $corruptedfiles = 0;
my $workingfiles = 0;


while(my $file=<INPUT2>) {
    chomp($file);

    if( -f $file && $file =~ /_(\d{12})_re_\d+/ ) {
        my $cnttime = $1;

        my $diff1 = to_seconds($cnttime) - to_seconds($startingtime);
        my $diff2 = to_seconds($endingtime) - to_seconds($cnttime);

        if( $diff1 >= 0 && $diff2 >= 0 ) {
            $totalfiles++;
            $workingfiles++;

            my $timefiles = $timefilehash{$cnttime};

            if(not(defined($timefiles))) {
                $timefilehash{$cnttime} = [$file];
            }
            else {
                push(@{$timefilehash{$cnttime}}, $file);
            }
        }
    }
}

while(my $file=<INPUT3>) {
    chomp($file);

    if(-f $file && $file =~ /_(\d{12})_re_\d+/ ) {
        my $cnttime = $1;

        my $diff1 = to_seconds($cnttime) - to_seconds($startingtime);
        my $diff2 = to_seconds($endingtime) - to_seconds($cnttime);

        if( $diff1 >= 0 && $diff2 >= 0 ) {
            $totalfiles++;
            $corruptedfiles++;
        }
    }
}


my @files = ();


sub to_seconds
{
    use integer;

    my $x = $_[0];

    my $year = "20".substr($x,0,2);
    my $mo = substr($x,2,2);
    my $day = substr($x,4,2);
    my $hour = substr($x,6,2);
    my $minute = substr($x,8,2);
    my $second = substr($x,10,2);

    my $t = timelocal($second,$minute,$hour,$day,$mo - 1,$year - 1900);
    return($t);
}


my $numgroups= keys %timefilehash;

for my $onetime (sort keys %timefilehash) {
    foreach my $onefile (@{$timefilehash{$onetime}}) {
        push(@files, $onefile);
        print "The file is $onefile\n";
    }
}


# connect to mySQL database for later data query and retrieval
my $dsn = "DBI:mysql:test_bm";   # data source name
my $user_name = "root";          # user name
my $password = "NewPw";          # password


my %ipasntable = ();       ## this hash table keeps the ASN of an IP from DNS name mapping
my %iplockeytable = ();    ## this hash table keeps the PoP key of an IP from DNS name mapping
my %lockeyloctable = ();   ## this hash table keeps the PoP of an PoP key from DNS naming mapping
my %iploctable = ();       ## this hash table keeps the PoP of an IP from DNS name mapping
my %ipasnpoptable = ();   ## this hashtable keeps the ASN and PoP value


my %bgphash = ();
my %igphash = ();


# connect to database
my $dbh = DBI->connect ($dsn, $user_name, $password,
    { RaiseError => 1, PrintError => 0 });


## select ipAddress, asn, lockey from the ipAddress table
my $sth = $dbh->prepare("SELECT ipAddress, asn, locKey FROM ipAddress");
$sth->execute();

## fetch query results from ipAddress table
while(my @ary = $sth->fetchrow_array()) {
    my ($cntip, $cntasn, $cntkey) = @ary;

    if($cntasn ne "NULL") {
        if($cntasn > 0) {
            $ipasntable{$cntip} = $cntasn;
        }
    }
    else {
## look up asn value from the prefix-as mapping patricia handler
        $cntasn = $pt->match_string($cntip);

        if(defined($cntasn)) {
            $ipasntable{$cntip} = $cntasn;
        }
    }

    if($cntkey ne "NULL") {
        if($cntkey > 1) {
            $iplockeytable{$cntip} = $cntkey;
        }
    }
}

# connect to database
$dbh = DBI->connect ($dsn, $user_name, $password,
    { RaiseError => 1, PrintError => 0 });


## select lockey, locName from the location table
$sth = $dbh->prepare("SELECT locKey, locName FROM location");
$sth->execute();

## fetch query results from location table
while(my @ary = $sth->fetchrow_array()) {
    my ($cntkey, $cntloc) = @ary;

    if($cntkey ne "NULL") {
        if($cntkey > 1) {
            $lockeyloctable{$cntkey} = $cntloc;
        }
    }
}

while ( my ($oneip, $onekey) = each(%iplockeytable) ) {
    my $oneloc = $lockeyloctable{$onekey};
    my $oneasn = $ipasntable{$oneip};

    # print "For ip $oneip, its ASN is $oneasn, its PoP is $oneloc\n";

    $iploctable{$oneip} = $oneloc;
}

## release iplockeytable and lockeyloctable memory
%iplockeytable = ();
%lockeyloctable = ();

# connect to database
$dbh = DBI->connect ($dsn, $user_name, $password,
    { RaiseError => 1, PrintError => 0 });


## drop inferred BGP table if it exists
my $bgpdrop = "
DROP TABLE IF EXISTS bgp";

$sth = $dbh->prepare($bgpdrop);
$sth->execute();


## create inferred BGP table
my $bgpcreate = "
CREATE TABLE bgp (
bkey      int(12) unsigned NOT NULL auto_increment,
ptime     datetime NOT NULL,
tstart    datetime NOT NULL,
vpip      varchar(16) NOT NULL,
dip   varchar(24) NOT NULL,
cntas     int(8) unsigned NOT NULL,
cntpop    varchar(32) NOT NULL,
nextas    int(8) unsigned,
nextpop   varchar(32),
aspath    varchar(64),
PRIMARY KEY (bkey)
)
ENGINE=InnoDB DEFAULT CHARSET=utf8";


$sth = $dbh->prepare($bgpcreate);
$sth->execute();


## drop intra-AS PoP-path table if it exists
my $poppathdrop = "
DROP TABLE IF EXISTS poppath";

$sth = $dbh->prepare($poppathdrop);
$sth->execute();


## create intra-AS PoP-path table
my $poppathcreate = "
CREATE TABLE poppath (
pkey      int(12) unsigned NOT NULL auto_increment,
ptime     datetime NOT NULL,
tstart    datetime NOT NULL,
vpip      varchar(16) NOT NULL,
dip       varchar(16) NOT NULL,
asn       int(8) unsigned NOT NULL,
srcpop    varchar(32) NOT NULL,
dstpop    varchar(32) NOT NULL,
poppath   varchar(256) NOT NULL,
ippathlen int(4)  NOT NULL,
PRIMARY KEY (pkey)
)
ENGINE=InnoDB DEFAULT CHARSET=utf8";

$sth = $dbh->prepare($poppathcreate);
$sth->execute();


## subroutine to check whether an ASN is a targeted ASN
sub istargetas {
    my ($asn) = @_;

    if($asn eq "1239" || $asn eq "16631" || $asn eq "1668" || $asn eq "209" ||
        $asn eq "2828" || $asn eq "2856" || $asn eq "2914" || $asn eq "3257" ||
        $asn eq "3320" || $asn eq "3356" || $asn eq "3549" || $asn eq "3561" ||
        $asn eq "5511" || $asn eq "6395" || $asn eq "6453" || $asn eq "6461" ||
        $asn eq "701" || $asn eq "7018") {
        return 1;
    }
    else {
        return 0;
    }

}


sub bgpcontains {
    my ($first, $second) = @_;

    foreach my $one (@{$first}) {

        if($one eq $second) {
            return 1;
        }
    }

    return 0;
}


sub igpcontains {
    my ($first, $second) = @_;

    foreach my $one (@{$first}) {
        if($one eq $second) {
            return 1;
        }
    }

    return 0;
}


my $totalnextingress = 0;
my $nextingressincluded = 0;
my $nextingressskipped = 0;

my $totalnextegress = 0;
my $nextegressskipped = 0;
my $nextegressincluded = 0;

my $totalaspath = 0;
my $aspathskipped = 0;
my $aspathincluded = 0;


my $totalpoppath = 0;
my $poppathskipped = 0;
my $poppathincluded = 0;


## this function returns the IP-path of a traceroute path as a string
sub getippath {
    my (@hops) = @_;

    my ($ippath) = split(/:/, $hops[0]);

    for(my $i=1; $i<=$#hops; $i++) {
        my ($cntip) = split(/:/, $hops[$i]);
        $ippath .= "->$cntip";
    }

    return $ippath;
}

sub comparelist {
    my ($list1, $list2) = @_;
    my $listsize = $#{$list1};   ## last index of the first list
    my $i;

    for($i=0; $i<=$listsize; $i++) {
        if($list1->[$i] ne $list2->[$i]) {
            return 1;
        }
    }

    return 0;
}


## this subroutine removes duplicate IPs
sub removeduplicates {
    my @hops = @_;
    my @noduplicates = ();

    ## remove duplicate IPs in the hops
    my $lastip = "-1";

    for(my $i=0; $i<=$#hops; $i++) {

        if($hops[$i] eq "*") {
            push(@noduplicates, $hops[$i]);
        }
        else {
            my ($cntip, $cntdummy1, $cntdummy2, $cntasn, $cntpop) = split(/:/, $hops[$i]);

            if($cntip ne $lastip) {
                push(@noduplicates, $hops[$i]);
                $lastip = $cntip;
            }
        }
    }

    return @noduplicates;
}



## this subroutine divides IP-hops into AS groups
sub dividehopsintoasgroups {

    my (@hops) = @_;
    my $asstr = "";

    my @asgroups = ();
    my $cntindex = 0;
    my $cntgroup = "";
    my $lastasn="-1";
    my $hopsize = $#hops;


    ## get as-path and divide hops into AS groups by getting each AS group's hop indices
    foreach my $onehop (@hops) {

        ## skip stars
        if($onehop eq "*") {
            $cntindex++;
            next;
        }

        my ($cntIP, $cntrtt, $cntttl, $cntasn, $cntpop) = split(/:/, $onehop);

        ## this is the first AS in the traceroute path
        if($lastasn eq "-1") {
            if($cntasn ne "NULL" && $cntasn ne "0") {
                $asstr = $cntasn;
                $cntgroup = $cntindex;
                $lastasn = $cntasn;
            }
        }
        else {  ## not first AS in the traceroute path
            if($cntasn ne "NULL" && $cntasn ne "0" && $cntasn ne $lastasn) {

                push(@asgroups, $cntgroup);

                $asstr .= ">$cntasn";
                $cntgroup = $cntindex;
                $lastasn = $cntasn;
            }
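            elsif($cntasn ne "NULL" && $cntasn ne "0" && $cntasn eq $lastasn) {
                $cntgroup .= ":$cntindex";
            }
        }

        $cntindex++;
    }

    ## push the final AS group here rather than inside the loop, so that it is
    ## not lost when the path ends with "*" or contains only one valid hop
    if($cntgroup ne "") {
        push(@asgroups, $cntgroup);
    }

    return($asstr, @asgroups);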
}


## this subroutine gets PoP-level path for the as groups
sub getpoppaths {
    ## get the PoP-level paths for the as groups
    my ($hops, $asgroups) = @_;
    my @groups = ();

    foreach my $onegroup (@{$asgroups}) {
        my (@indices) = split(/:/, $onegroup);
        my $firstindex = $indices[0];
        my $lastindex = $indices[$#indices];


        my $groupstr = "";
        my $lastpop = "-1";
        my $i;

        for($i=$firstindex; $i<=$lastindex; $i++) {
            my $cnthop = @{$hops}[$i];

            if($cnthop eq "*") {

                if($groupstr eq "") {
                    $groupstr .= $cnthop;
                    $lastpop = "NULL";
                }
                else {
                    $groupstr .= "|$cnthop";
                    $lastpop = "NULL";
                }
            }
            else {
                my ($cntIP, $cntrtt, $cntttl, $cntasn, $cntpop) = split(/:/, $cnthop);

                # if($cntpop eq "NULL") {
                if($groupstr eq "") {
                    $groupstr .= "$cntIP:$cntasn:$cntpop:$i";  ## i is the index of the hop in hops
                    $lastpop = "NULL";
                }
                else {
                    $groupstr .= "|$cntIP:$cntasn:$cntpop:$i";
                    $lastpop = "NULL";
                }

                # }
                #elsif($cntpop ne $lastpop) {
                #if($groupstr eq "") {
                #    $groupstr .= "$cntIP:$cntasn:$cntpop:$i";
                #    $lastpop = $cntpop;
                #}
                #else {
                #    $groupstr .= "|$cntIP:$cntasn:$cntpop:$i";
                #    $lastpop = $cntpop;
                #}
                #}
            }
        } ## end for($i ... ...)

        push(@groups, $groupstr);
    }

    return @groups;

}


## subroutine to get AS-path from current AS to destination host
sub getpartialaspath {

    my ($afterstars, $asstr, $cntlastasn) = @_;

    #print "The as string is $asstr\n";

    my $partialaspath;

    if($afterstars < 2) {

        ### get the partial as path from current as to the destination as
        my (@ashops) = split(/>/, $asstr);
        my $asindex;
        my $jj;

        for($jj=0; $jj<=$#ashops; $jj++) {
            my $cntasn = $ashops[$jj];

            if($cntasn == $cntlastasn) {
                $asindex = $jj;
                last;   ## stop at the matching AS; no need to scan the rest
            }
        }

        $partialaspath = $ashops[$asindex];

        for($jj=$asindex+1; $jj<=$#ashops; $jj++) {
            $partialaspath .= ">$ashops[$jj]";
        }
    }
    else {
        $partialaspath = "NULL";   ## do not use the AS path if there exists two consecutive "*" after the current AS
    }

    return $partialaspath;
}

## this subroutine counts the unknown hops between the current group's last valid PoP and the next group's first IP
sub checkunknownsbetweentwoases {

    my ($groups, $cntgroupindex, $hops) = @_;

    my $groupsize = $#{$groups};
    my @cntgrouphops = split(/\|/, $groups->[$cntgroupindex]);
    my $cntgroupsize = $#cntgrouphops;

    #print "Current group hops are @cntgrouphops\n";

    my ($cntgrouplastvalidIP, $cntgrouplastvalidasn, $cntgrouplastvalidpop, $cntgrouplastvalidindex);
    my ($nextgroupfirstIP, $nextgroupfirstasn, $nextgroupfirstpop, $nextgroupfirstindex);

    my $cntgrouphasvalidpop = 0;
    my $islastgroup = 0;

    ## get the last valid hop elements at current group: scan backwards and stop at the first match
    for(my $i=$cntgroupsize; $i>=0; $i--) {
        if($cntgrouphops[$i] ne "*") {
            my ($cntIP, $cntasn, $cntpop, $cntindex) = split(/:/, $cntgrouphops[$i]);

            if($cntasn ne "NULL" && $cntasn ne "0" && $cntpop ne "NULL") {
                $cntgrouphasvalidpop = 1;
                ($cntgrouplastvalidIP, $cntgrouplastvalidasn, $cntgrouplastvalidpop, $cntgrouplastvalidindex) = ($cntIP, $cntasn, $cntpop, $cntindex);
                last;
            }
        }
    }


    ## begin to check previous egress--next ingress PoP entry and populate it into the bgp table
    if($cntgroupindex<$groupsize) {      ## it is not the last AS
        my $nextgroup = @{$groups}[$cntgroupindex+1];
        my (@nextgrouphops) = split(/\|/, $nextgroup);
        my $nextgroupfirsthop = $nextgrouphops[0];

        if($nextgroupfirsthop eq "*") {
            print "The first pop is *.\n";
            print "The next group hops are @nextgrouphops\n";
            print "Something is wrong with the code. Please check and fix it.\n";
            exit(1);
        }

        ($nextgroupfirstIP, $nextgroupfirstasn, $nextgroupfirstpop, $nextgroupfirstindex) = split(/:/, $nextgroupfirsthop);
    }
    else {
        $islastgroup = 1;
    }

    my $betweenunknowns;
    ## this variable indicates how many unknowns between the current group's last valid pop and next group's first IP
    if($islastgroup == 0 && defined($cntgrouplastvalidindex) && defined($nextgroupfirstindex)) {
        $betweenunknowns = $nextgroupfirstindex - $cntgrouplastvalidindex - 1;
    }

    return($islastgroup, $cntgrouphasvalidpop, $betweenunknowns, $cntgrouplastvalidasn, $cntgrouplastvalidpop, $cntgrouplastvalidindex, $nextgroupfirstasn, $nextgroupfirstIP);
}


## subroutine to generate next-ingress field for a bgp table entry
sub generatenextingressbgpentry {

    my ($key, $oneentry, $afterstars) = @_;

    my $entries = $bgphash{$key};

    if(not(defined($entries))) {
        $totalaspath++;

        if($afterstars >= 2) {
            $aspathskipped++;
        }
        else {
            $aspathincluded++;
        }

        $totalnextingress++;     ## increase the next ingress entry count
        $nextingressincluded++;
        @{$bgphash{$key}} = ($oneentry);
    }
    else {
        if(bgpcontains(\@{$entries}, $oneentry) == 0) {

            $totalaspath++;

            if($afterstars >= 2) {
                $aspathskipped++;
            }
            else {
                $aspathincluded++;
            }
            $totalnextingress++;
            $nextingressincluded++;
            push(@{$bgphash{$key}}, $oneentry);
        }
    }
}

## this subroutine creates one bgp next-egress entry and stores it in the bgp hash table
sub generatenextegressbgpentry {

    my ($key, $oneentry, $afterstars) = @_;

    $totalaspath++;
    if($afterstars >= 2) {
        $aspathskipped++;
    }
    else {
        $aspathincluded++;
    }


    my $entries = $bgphash{$key};

    if(not(defined($entries))) {
        $totalnextegress++;
        $nextegressincluded++;

        @{$bgphash{$key}} = ($oneentry);
    }
    else {
        if(bgpcontains(\@{$entries}, $oneentry) == 0) {
            $totalnextegress++;
            $nextegressincluded++;

            push(@{$bgphash{$key}}, $oneentry);
        }
    }

}


## subroutine to generate igp poppath table entries
sub generatepoppathentries {

    my ($hops, $cntgrouphops, $cnthopindex, $startasn, $startpop, $startindex, $srcip, $dstip, $starttime, $probetime) = @_;
    my $cntgroupsize = $#{$cntgrouphops};
    my $hopsize = $#{$hops};

    my ($i, $j);
    my $poppath = $startpop;


    ## begin to process the right-side paths of the current pop
    for($i=$cnthopindex+1; $i<=$cntgroupsize; $i++) {
        if(@{$cntgrouphops}[$i] eq "*") {
            $poppath .= ">*";
            next;
        }
        else {
            my ($endip, $endasn, $endpop, $endindex) = split(/:/, @{$cntgrouphops}[$i]);

            if($endasn ne "NULL" && $endasn ne $startasn) {
                print "The start PoP ASN is $startasn, and the end PoP ASN is $endasn\n";
                print "Something is wrong with the code. Please check and fix it.\n";
                exit(1);
            }

            if($endpop eq "NULL") {
                $poppath .= ">NULL";
                next;
            }
            else {
                $poppath .= ">$endpop";

                ## begin to inspect and compress the PoP-path
                my @pops = split(/>/, $poppath);
                my @knownpops = ();

                for($j=0; $j<=$#pops; $j++) {

                    ## keep known PoP indices into PoP list
                    if($pops[$j] ne "*" && $pops[$j] ne "NULL") {
                        push(@knownpops, $j);
                    }
                }

                if($knownpops[0] != 0 || $knownpops[$#knownpops] != $#pops) {
                    print "The first known PoP index is $knownpops[0], and the last known PoP index is $knownpops[$#knownpops].\n";
                    print "Something is wrong with the code. Please check and fix it.\n";
                    exit(1);
                }

                my $twounknownbetweendifferentpops = 0;
                my $oneunknownbetweendifferentpops = 0;

                my $newpoppath = "$pops[0]";
                my $lastkeptPoP = $pops[0];
                my $lastPoPindex = 0;

                for(my $j=1; $j<=$#knownpops; $j++) {
                    my $cntPoP = $pops[$knownpops[$j]];

                    if($cntPoP ne $lastkeptPoP) {
                        ## there are more than one NULL PoP or * between current PoP and last known PoP
                        if($knownpops[$j]-$lastPoPindex > 2) {
                            $twounknownbetweendifferentpops = 1;
                            last;
                        }
                        ## there is one NULL PoP or * between current PoP and last known PoP, add a wild card
                        elsif($knownpops[$j]-$lastPoPindex == 2) {
                            if($oneunknownbetweendifferentpops == 0) {  ## this is the first NULL PoP or * between two known PoPs
                                $oneunknownbetweendifferentpops = 1;
                                $lastPoPindex = $knownpops[$j];
                                $lastkeptPoP = $cntPoP;
                                $newpoppath .= ">*>$cntPoP";
                            }
                            else {
                                $twounknownbetweendifferentpops = 1;
                                last;
                            }
                        }
                        else {     ## this is no NULL PoP or * between two known PoPs
                            $lastPoPindex = $knownpops[$j];
                            $lastkeptPoP = $cntPoP;
                            $newpoppath .= ">$cntPoP";
                        }
                    }
                }

                ## begin to find the earliest IP index in the same PoP as startIP
                my $earliestindex = $startindex;
                for(my $j=$startindex; $j>=0; $j--) {
                    my $cnthop = @{$hops}[$j];

                    if(@{$hops}[$j] ne "*") {
                        my ($cntip, $cntdummy1, $cntdummy2, $cntasn, $cntpop) = split(/:/, @{$hops}[$j]);

                        if($cntasn eq $startasn && $cntpop eq $startpop) {
                            $earliestindex = $j;
                        }
                        elsif($cntpop ne "NULL" && $cntpop ne $startpop) {
                            last;
                        }
                        elsif($cntasn ne "NULL" && $cntasn ne $startasn) {
                            last;
                        }
                    }
                }

                ## begin to find the latest IP index in the same PoP as endIP
                my $latestindex = $endindex;
                for(my $j=$endindex; $j<=$hopsize; $j++) {
                    my $cnthop = @{$hops}[$j];

                    if(@{$hops}[$j] ne "*") {
                        my ($cntip, $cntdummy1, $cntdummy2, $cntasn, $cntpop) = split(/:/, @{$hops}[$j]);

                        if($cntasn eq $endasn && $cntpop eq $endpop) {
                            $latestindex = $j;
                        }
                        elsif($cntpop ne "NULL" && $cntpop ne $endpop) {
                            last;
                        }
                        elsif($cntasn ne "NULL" && $cntasn ne $endasn) {
                            last;
                        }
                    }
                }

                my $ippathlen = $latestindex - $earliestindex + 1;

                my $key = "$srcip<$dstip<$startasn<$startpop<$endpop<$starttime";
                my $oneentry = "$probetime<$newpoppath<$ippathlen";

                my $entries = $igphash{$key};

                if(not(defined($entries))) {
                    $totalpoppath++;

                    if($twounknownbetweendifferentpops == 1) {
                        $poppathskipped++;
                    }
                    else {
                        $poppathincluded++;
                        @{$igphash{$key}} = ($oneentry);
                    }
                }
                else {

                    if(igpcontains(\@{$entries}, $oneentry) == 0) {
                        $totalpoppath++;

                        if($twounknownbetweendifferentpops == 1) {
                            $poppathskipped++;
                        }
                        else {
                            $poppathincluded++;
                            push(@{$igphash{$key}}, $oneentry);
                        }
                    }  ## end if(igpcontains(\@{$entries}, $oneentry) == 0) { ... }
                }  ## end if(not(defined($entries))) { ... } else { ... }
            }  ## end  if($endpop eq "NULL") { ... } else { ... }
        }  ## end if(@{$cntgrouphops}[$i] eq "*") {...} else {...}
    }  ## end  for($i=$cnthopindex+1; $i<=$cntgroupsize; $i++)
}


## get the consecutive * from current group index to the end of the group
sub getconsecutivestarsonpath {
    my ($cntgrouplastindex, $hops) = @_;
    my ($i, $hopsize);
    my $afterstars = 0;

    $hopsize = $#{$hops};

    for($i=$cntgrouplastindex+1; $i<=$hopsize; $i++) {
        if(@{$hops}[$i] eq "*") {
            $afterstars++;

            if($afterstars == 2) {
                last;
            }
        }
        else {
            $afterstars = 0;
        }
    }

    return $afterstars;
}



## this subroutine generates bgp and poppath table entries for each AS group
sub generatetableentriesforasgroups {

    my ($hops, $groups, $asstr, $srcip, $dstip, $starttime, $probetime) = @_;
    my ($hopsize, $groupsize) = ($#{$hops}, $#{$groups});
    my ($i, $j);

    for($i=0; $i<=$groupsize; $i++) {  ## iterate through AS groups
        my $onegroup = @{$groups}[$i];

        my (@cntgrouphops) = split(/\|/, $onegroup);
        my ($cntIP, $cntasn, $cntpop, $cntindex) = split(/:/, $cntgrouphops[0]);

        if($cntasn != $fileasn) {   ## only process the as group whose asn is specified in the file name
            next;
        }

        if($cntgrouphops[0] eq "*") {
            print "Current group first hop is *.\n";
            print "Something is wrong with the code. Please check and fix it.\n";
            exit(1);
        }


        my $cntgrouplasthop = $cntgrouphops[$#cntgrouphops];

        if($cntgrouplasthop eq "*") {   ## the last hop of a group should never be *
            print "Current group last hop is *. \n";
            print "Something is wrong with the code. Please check and fix it.\n";
            exit(1);
        }

        my ($cntgrouplastIP, $cntgrouplastasn, $cntgrouplastpop, $cntgrouplastindex) = split(/:/, $cntgrouplasthop);

        my $afterstars = getconsecutivestarsonpath($cntgrouplastindex, \@{$hops});


        my $partialaspath = getpartialaspath($afterstars, $asstr, $cntgrouplastasn);

        ## print "The current asn is $cntlastasn, and the as path is $partialaspath, and the as string is $asstr, and file asn is $fileasn\n";

        my ($islastgroup, $cntgrouphasvalidpop, $betweenunknowns, $cntgrouplastvalidasn, $cntgrouplastvalidpop, $cntgrouplastvalidindex, $nextgroupfirstasn, $nextgroupfirstIP) = checkunknownsbetweentwoases(\@{$groups}, $i, \@{$hops});

        if($islastgroup == 0) {
            if($cntgrouphasvalidpop == 1 && $betweenunknowns <= 1) {  
                my $key = "$srcip<$dstip<$cntgrouplastvalidasn<$cntgrouplastvalidpop<$starttime";
                my $oneentry = "$probetime<$nextgroupfirstasn<$nextgroupfirstIP<$partialaspath";

                generatenextingressbgpentry($key, $oneentry, $afterstars);
            }
            else {
                $totalnextingress++;
                $nextingressskipped++;
            }## end if($cntgrouphasvalidpop == 1 && $betweenunknowns <= 1) {...} else {...}
        }


        ## begin to generate bgp and IGP PoP-path entries for this group
        for($j=0; $j<=$#cntgrouphops-1; $j++) {
            if($cntgrouphops[$j] eq "*") {
                next;
            }

            my ($startIP, $startasn, $startpop, $startindex) = split(/:/, $cntgrouphops[$j]);
            if($startasn eq "0" || $startasn eq "NULL" || $startpop eq "NULL") {
                next;
            }

            ## test $cntgrouphasvalidpop first so that an undefined $betweenunknowns is never compared
            if($cntgrouphasvalidpop == 1 && ($islastgroup == 1 || $betweenunknowns <= 1)) {
                if($startindex >= $cntgrouplastvalidindex || $startpop eq $cntgrouplastvalidpop) {
                    next;
                }

                my $key = "$srcip<$dstip<$startasn<$startpop<$starttime";
                my $oneentry = "$probetime<$cntgrouplastvalidasn<$cntgrouplastvalidpop<$partialaspath";
                generatenextegressbgpentry($key, $oneentry, $afterstars);
            }
            else {
                $totalnextegress++;
                $nextegressskipped++;
            }

            ## generate IGP pop-path table entries for this group
            generatepoppathentries($hops, \@cntgrouphops, $j, $startasn, $startpop, $startindex, $srcip, $dstip, $starttime, $probetime);

        } ## end for($j=0; $j<=$#cntgrouphops-1; $j++)
    } ## end for($i=0; $i<=$#groups; $i++)
}


## subroutine to process one traceroute, create bgp entries and poppath entries, and insert entries into bgp table and poppath table
sub processonetraceroute {

    my ($probetime, $starttime, @hops) = @_;

    my $srchop = $hops[0];
    my $dsthop = $hops[$#hops];

    my ($srcip, $dummy1, $dummy2, $srcasn, $srcpop) = split(/:/, $srchop);
    my ($dstip, $dummy3, $dummy4, $dstasn, $dstpop) = split(/:/, $dsthop);


    @hops = removeduplicates(@hops);   ## remove consecutive duplicate IPs in the traceroute hops

    my ($asstr, @asgroups) = dividehopsintoasgroups(@hops);  ## divide hops into as groups
    my @groups = getpoppaths(\@hops, \@asgroups);    ## get pop-path in ASes separated by "|"

    ## the callee expects $starttime before $probetime
    generatetableentriesforasgroups(\@hops, \@groups, $asstr, $srcip, $dstip, $starttime, $probetime);

}



## this subroutine inserts bgp and igp entries into mysql database
sub populatehashes {

    # reuse the existing database connection; reconnect only if it has dropped
    if(!defined($dbh) || !$dbh->ping()) {
        $dbh = DBI->connect ($dsn, $user_name, $password,
            { RaiseError => 1, PrintError => 0 });
    }


    my $str = "LOCK TABLES bgp WRITE";
    $sth = $dbh->prepare($str);
    $sth->execute();

    $str = "INSERT INTO bgp (ptime, tstart, vpip, dip, cntas, cntpop, nextas, nextpop, aspath) VALUES ";
    my $first = 1;
    my $entrycount = 0;

    while ( my ($key, $val) = each(%bgphash) ) {
        my ($srcip, $dstip, $asn, $pop, $starttime) = split(/</, $key);

        my $count=0;
        my $strtowrite = "";

        $entrycount++;

        foreach my $one (@{$val}) {
            my ($probetime, $nextasn, $nextpop, $aspath) = split(/</, $one);

            $count++;

            if($first == 1) {
                $str .= "('$probetime', '$starttime', '$srcip', '$dstip', $asn, '$pop', $nextasn, '$nextpop', '$aspath')";
                $first = 0;
            }
            else {
                $str .= ", ('$probetime', '$starttime', '$srcip', '$dstip', $asn, '$pop', $nextasn, '$nextpop', '$aspath')";
            }

            $strtowrite .= "$probetime\t$starttime\t$srcip\t$dstip\t$asn\t$pop\t$nextasn\t$nextpop\t$aspath\n";
        }

        if($count>1) {
            print OUTPUT2 "$strtowrite";
        }

        if($entrycount >= 3000) {
            $sth = $dbh->prepare($str);
            $sth->execute();

            $str = "INSERT INTO bgp (ptime, tstart, vpip, dip, cntas, cntpop, nextas, nextpop, aspath) VALUES ";
            $first = 1;
            $entrycount = 0;
        }

    }

    if($entrycount > 0) {
        $sth = $dbh->prepare($str);
        $sth->execute();
    }


    $str = "LOCK TABLES poppath WRITE";
    $sth = $dbh->prepare($str);
    $sth->execute();


    $str = "INSERT INTO poppath (ptime, tstart, vpip, dip, asn, srcpop, dstpop, poppath, ippathlen) VALUES ";
    $first = 1;
    $entrycount = 0;


    while ( my ($key, $val) = each(%igphash) ) {
        my ($srcip, $dstip, $asn, $srcpop, $dstpop, $starttime) = split(/</, $key);
        my $count = 0;
        my $strtowrite="";

        $entrycount++;

        foreach my $one (@{$val}) {
            my ($probetime, $poppath, $ippathlen) = split(/</, $one);

            $count++;

            if($first == 1) {
                $str .= "('$probetime', '$starttime', '$srcip', '$dstip', $asn, '$srcpop', '$dstpop', '$poppath', '$ippathlen')";
                $first = 0;
            }
            else {
                $str .= ", ('$probetime', '$starttime', '$srcip', '$dstip', $asn, '$srcpop', '$dstpop', '$poppath', '$ippathlen')";
            }

            $strtowrite .= "$probetime\t$starttime\t$srcip\t$dstip\t$asn\t$srcpop\t$dstpop\t$poppath\t$ippathlen\n";
        }

        if($count>1) {
            print OUTPUT1 "$strtowrite";
        }

        if($entrycount >= 3000) {
            $sth = $dbh->prepare($str);
            $sth->execute();

            $str = "INSERT INTO poppath (ptime, tstart, vpip, dip, asn, srcpop, dstpop, poppath, ippathlen) VALUES ";
            $first = 1;
            $entrycount = 0;
        }
    }

    if($entrycount>0) {
        $sth = $dbh->prepare($str);
        $sth->execute();
    }


    $str = "UNLOCK TABLES";

    $sth = $dbh->prepare($str);
    $sth->execute();

    %bgphash = ();
    %igphash = ();

}



my $filecount = 0;

my $discardedIPloops = 0;
my $discardedPoPloops = 0;
my $discardedASloops = 0;
my $processedtraceroutes = 0;



sub findnumIPs {
    my @hops = @_;

    my %lastasindex = ();
    my $lastasn = "-1";
    my $loopstr = "";

    for(my $i=0; $i<=$#hops; $i++) {
        if($hops[$i] ne "*") {
            my ($ip, $rtt, $ttl, $asn, $pop) = split(/:/, $hops[$i]);

            if($asn eq "NULL") {
                next;
            }

            if(not(defined($lastasindex{$asn}))) {
                $lastasindex{$asn} = $i;
                $lastasn = $asn;
            }
            else {
                if($asn eq $lastasn) {
                    $lastasindex{$asn} = $i;
                }
                else {
                    my $lastindex = $lastasindex{$asn};
                    my $numipsonloop = $i - $lastindex;
                    my $distinctas = 0;
                    my $keptasn = -1;

                    for(my $j=$lastindex; $j<=$i; $j++) {
                        if($hops[$j] eq "*") {
                            if($loopstr eq "") {
                                $loopstr = "*";
                            }
                            else {
                                $loopstr .= ">*";
                            }
                            next;
                        }

                        my ($cntip, $cntrtt, $cntttl, $cntasn, $cntpop) = split(/:/, $hops[$j]);
                        if($loopstr eq "") {
                            $loopstr = "$cntip:$cntasn";
                        }
                        else {
                            $loopstr .= ">$cntip:$cntasn";
                        }

                        if($cntasn ne "NULL" && $cntasn ne $keptasn) {
                            $distinctas++;
                            $keptasn = $cntasn;
                        }
                    } ## end for(my $j ... ...)

                    $distinctas--;  ## decrease by one due to last repeating as

                    print OUTPUT5 "$numipsonloop    $distinctas\n";

                    return $loopstr;
                }
            }
        }
    }

    return $loopstr;
}




sub discardpath {
    my (@hops) = @_;

    ## check for IP, PoP, AS loops
    my %iphash = ();
    my %pophash = ();
    my %ashash = ();
    my $lastip = "-1";
    my $lastasn = -1;
    my $lastpop = "-1";

    my $aspath = "";

    foreach my $onehop (@hops) {

        if($onehop ne "*") {
            my ($ip, $rtt, $ttl, $asn, $pop) = split(/:/, $onehop);

            if($ip ne $lastip) {
                if(not(defined($iphash{$ip}))) {
                    $iphash{$ip} = 1;
                    $lastip = $ip;
                }
                else {
                    $discardedIPloops++;

                    ## print "The traceroute @hops contains an IP loop.\n";
                    ## print "$ip appeared more than once.\n";
                    return 1;
                }
            }

            if($asn ne "NULL") {

                if($pop ne "NULL") {
                    my $cntpop = "$asn->$pop";

                    if($cntpop ne $lastpop) {

                        if(not(defined($pophash{$cntpop}))) {
                            $pophash{$cntpop} = 1;
                            $lastpop = $cntpop;
                        }
                        else {
                            $discardedPoPloops++;

                            ## print "The traceroute @hops contains a PoP loop.\n";
                            return 2;
                        }
                    }
                }

                $aspath .= "$asn|";

                if($asn ne $lastasn) {
                    if(not(defined($ashash{$asn}))) {
                        $ashash{$asn} = 1;
                        $lastasn = $asn;
                    }
                    else {

                        my $loopstr = findnumIPs(@hops);

                        print OUTPUT4 "The hops are @hops, there is an AS loop. The AS loops are $aspath\n";
                        print OUTPUT4 "The AS loop segments are $loopstr\n";

                        $discardedASloops++;

                        return 3;
                    }

                }
            }

        }
    }

    return 0;
}



my $totaltraceroutes = 0;

my $processbegintime = time();  ## get time in seconds since 1970
my $lastprocesstime = $processbegintime;
my $cntprocesstime;
my $totalprocesstime = 0;


# begin to process traceroute plain texts, and map intermediate IPs to their AS numbers and locations (POPs)
foreach my $file (@files) {

    my $lastinserttime="-1";     ## this stores the old data collection hour

    # this is traceroute plain text file
    if(-f $file) {


        ## use the list form of open so the file name is never passed through the shell
        open(INPUT1, "-|", "zcat", $file) || die "can't open file $file for read: $!";

        $filecount++;

        print "File $filecount is $file\n";

        my $tag;
        my $date;
        my $time;
        my $srcaddr;
        my $arrow;
        my $dstaddr;
        my $icmpstatus;
        my $hopcount;
        my @hops = ();
        my $lastindex=0;
        my $dummy;
        my $cntindex=0;
        my $cntIP;
        my $cntrtt;
        my $cntttl;
        my $firsthop;
        my $lasthop;


        ## begin to extract probing time from the file name
        #  extension is in the format of .*

        my $filename = basename($file);   ## File::Basename is already loaded at the top of the script
        my ($site, $datetime, $re, $targetasn) = split(/_/, $filename);

        $fileasn = $targetasn;

        $datetime =~ /(\d\d)(\d\d)(\d\d)(\d\d)(\d\d)(\d\d)/;
        my ($yy, $mm, $dd, $hh, $min, $ss) = ($1, $2, $3, $4, $5, $6);

        my $yyyy;
        my $probetime;
        my $starttime;

        $probetime = "20$yy-$mm-$dd $hh:$min:$ss";

        my $startmin = sprintf("%02d", 15*int($min/15));   ## round down to the quarter hour, zero-padded
        $starttime = "20$yy-$mm-$dd $hh:$startmin:00";

        ## begin to process lines iteratively
        while(my $line = <INPUT1>) {
            chomp($line);

            ## this is the beginning of a new traceroute probe
            if($line =~ /->/) {
                ($tag, $date, $time, $srcaddr, $arrow, $dstaddr, $icmpstatus, $hopcount) = split(/\s+/, $line);

                my ($firsthop, $lasthop, $oneIP, $IPnum, $cntasn, $cntpop);

                if(not(defined($ipasnpoptable{$srcaddr}))) {

                    ## find out the asn and pop for the first IP
                    if(not($oneIP = new Net::IP($srcaddr))) {
                        print Net::IP::Error(), "\n";
                        print "The file is $filename, the line is $line\n";
                        last;
                    }

                    $IPnum = $oneIP->intip();
                    $cntasn = $ipasntable{$IPnum};

                    if(not(defined($cntasn))) {

                        ## look up asn value from the prefix-as mapping patricia handler
                        $cntasn = $pt->match_string($srcaddr);

                        if(not(defined($cntasn))) {
                            $cntasn = "NULL";
                        }
                    }

                    $cntpop = $iploctable{$IPnum};
                    if(not(defined($cntpop))) {
                        $cntpop = "NULL";
                    }

                    $firsthop = "$srcaddr:-1:-1:$cntasn:$cntpop";
                    $ipasnpoptable{$srcaddr} = $firsthop;
                }
                else {
                    $firsthop = $ipasnpoptable{$srcaddr};
                }


                if($lastinserttime ne "-1") {

                    my $oldtime = to_seconds($lastinserttime);
                    my $newtime = to_seconds("$yy$mm$dd$hh$min$ss");

                    my $timediff = $newtime - $oldtime;

                    ## populate bgp and igp hash into database if it has been 3 hours since last insertion
                    if( $timediff > 10800  ) {
                        $cntprocesstime = time();
                        $totalprocesstime += $cntprocesstime - $lastprocesstime;
                        $lastprocesstime = $processbegintime;

                        populatehashes();
                        $lastinserttime = "$yy$mm$dd$hh$min$ss";
                    }

                }
                else {
                    $lastinserttime = "$yy$mm$dd$hh$min$ss";
                }

                if(not(defined($ipasnpoptable{$dstaddr}))) {


                    ## find out the asn and pop for the last IP
                    if(not($oneIP = new Net::IP($dstaddr))) {
                        print Net::IP::Error(), "\n";
                        print "The file is $filename, the line is $line\n";

                        last;
                    }

                    $IPnum = $oneIP->intip();
                    $cntasn = $ipasntable{$IPnum};
                    if(not(defined($cntasn))) {

                        ## look up asn value from the prefix-as mapping patricia handler
                        $cntasn = $pt->match_string($dstaddr);
                        if(not(defined($cntasn))) {
                            $cntasn = "NULL";
                        }
                    }

                    $cntpop = $iploctable{$IPnum};
                    if(not(defined($cntpop))) {
                        $cntpop = "NULL";
                    }

                    $lasthop = "$dstaddr:-1:-1:$cntasn:$cntpop";
                    $ipasnpoptable{$dstaddr} = $lasthop;
                }
                else {
                    $lasthop = $ipasnpoptable{$dstaddr};
                }

            }
            elsif($line =~ /duration/) {
                next;
            }
            elsif($line !~ /^\s*$/) {   # process the line if it contains other than white spaces
                ($dummy, $cntindex, $cntIP, $cntrtt, $cntttl) = split(/\s+/, $line);

                if($lastindex != 0) {  # not the beginning of the first hop
                    my $numstars = $cntindex-$lastindex-1;
                    my $i;

                    ## fill in * for those missing hops
                    for($i=0; $i<$numstars; $i++) {
                        push(@hops, "*");
                    }
                }

                my $hopstr;

                if(not(defined($ipasnpoptable{$cntIP}))) {
                    # print "Current ip is $cntIP\n";
                    my $oneIP;

                    if(not($oneIP = new Net::IP($cntIP))) {
                        print Net::IP::Error(), "\n";
                        print "The file is $filename, the line is $line\n";

                        last;
                    }

                    my $IPnum = $oneIP->intip();

                    my $cntasn;
                    my $cntpop;


                    $cntasn = $ipasntable{$IPnum};

                    if(not(defined($cntasn))) {

                        ## look up asn value from the prefix-as mapping patricia handler
                        $cntasn = $pt->match_string($cntIP);

                        if(not(defined($cntasn))) {
                            $cntasn = "NULL";
                        }
                    }

                    $cntpop = $iploctable{$IPnum};

                    if(not(defined($cntpop))) {
                        $cntpop = "NULL";
                    }

                    $hopstr = "$cntIP:$cntrtt:$cntttl:$cntasn:$cntpop";
                    $ipasnpoptable{$cntIP} = $hopstr;
                }
                else {
                    $hopstr = $ipasnpoptable{$cntIP};
                }
                push(@hops, $hopstr);

                $lastindex = $cntindex;
            }
            elsif($line =~ /^\s*$/) {   # this is the white space lines

                ## begin to process one traceroute probing result
                if($#hops >=0) {
                    my @cnthops = ($firsthop, @hops, $lasthop);

                    $totaltraceroutes++;

                    if(discardpath(@cnthops) == 0) {
                        $processedtraceroutes++;
                        processonetraceroute($probetime, $starttime, @cnthops);
                    }
                }

                ## last line is the end of one traceroute probe
                if($lastindex != 0) {
                    @hops = ();        # reset hops array to empty
                    $lastindex = 0;    # reset lastindex to 0
                }
            }
        }

        close(INPUT1);
    }
}

$cntprocesstime = time();
$totalprocesstime += $cntprocesstime - $lastprocesstime;
$lastprocesstime = $processbegintime;

populatehashes();


$sth->finish ();
$dbh->disconnect ();

my $ratio;

print OUTPUT3 "$totalfiles traceroute files are checked.\n";
print OUTPUT3 "There are $numgroups group(s) of probing in the checking interval.\n";

for my $onetime (sort keys %timefilehash) {
    my @cntset = @{$timefilehash{$onetime}};
    my $goodfiles = $#cntset+1;
    print OUTPUT3 "At time $onetime, there are $goodfiles good files.\n";
    foreach my $onefile (@{$timefilehash{$onetime}}) {
        print "The file is $onefile\n";
    }
}

print OUTPUT3 "\n\n";



$ratio = $corruptedfiles/$totalfiles;
print OUTPUT3 "$corruptedfiles, counted as $ratio, traceroute files are corrupted and are discarded.\n";
$ratio = $workingfiles/$totalfiles;
print OUTPUT3 "$workingfiles, counted as $ratio, traceroute files are good.\n\n\n";


print OUTPUT3 "$totaltraceroutes traceroute paths are parsed.\n";
$ratio = $discardedIPloops/$totaltraceroutes;
print OUTPUT3 "$discardedIPloops, counted as $ratio, traceroute paths are discarded due to IP loops.\n";
$ratio = $discardedPoPloops/$totaltraceroutes;
print OUTPUT3 "$discardedPoPloops, counted as $ratio, traceroute paths are discarded due to PoP loops.\n";
$ratio = $discardedASloops/$totaltraceroutes;
print OUTPUT3 "$discardedASloops traceroute paths, counted as $ratio, are discarded due to AS loops.\n";
$ratio = $processedtraceroutes/$totaltraceroutes;
print OUTPUT3 "$processedtraceroutes, counted as $ratio, traceroutes are processed.\n\n\n";



print OUTPUT3 "$totalnextingress BGP next-ingresses are checked.\n";
$ratio = $nextingressskipped/$totalnextingress;
print OUTPUT3 "$nextingressskipped, counted as $ratio, BGP next-ingresses are discarded due to more than one consecutive unknowns between last valid PoP and next-ingress IP.\n";
$ratio = $nextingressincluded/$totalnextingress;
print OUTPUT3 "$nextingressincluded, counted as $ratio, BGP next-ingresses are included.\n";


print OUTPUT3 "$totalnextegress BGP same-AS egresses are checked.\n";
$ratio = $nextegressskipped/$totalnextegress;
print OUTPUT3 "$nextegressskipped, counted as $ratio, BGP same-AS egresses are discarded due to more than one consecutive unknowns between last valid PoP and next-ingress IP.\n";
$ratio = $nextegressincluded/$totalnextegress;
print OUTPUT3 "$nextegressincluded, counted as $ratio, BGP same-AS egresses are included.\n";



print OUTPUT3 "$totalaspath BGP AS-paths are checked.\n";
$ratio = $aspathskipped/$totalaspath;
print OUTPUT3 "$aspathskipped, counted as $ratio, BGP AS-paths are discarded due to more than one consecutive stars on IP-path from current AS to destination host.\n";
$ratio = $aspathincluded/$totalaspath;
print OUTPUT3 "$aspathincluded, counted as $ratio, BGP AS-paths are included.\n";


print OUTPUT3 "$totalpoppath IGP PoP-paths are checked.\n";
$ratio = $poppathskipped/$totalpoppath;
print OUTPUT3 "$poppathskipped, counted as $ratio, IGP PoP-paths are discarded due to more than one unknown PoPs (either NULL PoP or *) between different neighboring PoPs.\n";
$ratio = $poppathincluded/$totalpoppath;
print OUTPUT3 "$poppathincluded, counted as $ratio, IGP PoP-paths are included.\n";


my $stoptime = time();  ## get time in seconds since 1970

my $elapsedtime = $stoptime - $begintime;

print OUTPUT3 "The code stops at time $stoptime\n\n";
print OUTPUT3 "The code runs $elapsedtime seconds\n";
print OUTPUT3 "The traceroute processing runs $totalprocesstime seconds\n";


close(OUTPUT1);
close(OUTPUT2);
close(OUTPUT3);

close(OUTPUT4);
close(OUTPUT5);


exit (0);



If I copy and paste the above, the "exit(0)" lands on line 1647, so I'll use line numbers that match that:

Lines 99-100: you call to_seconds($startingtime) and to_seconds($endingtime) inside the while loop that starts on line 93.  Those values never change, so calculate them once before the loop, e.g.:
   $startingtimeseconds=to_seconds($startingtime);    #then use this inside the loop on line 99
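
A minimal sketch of that hoisting, not the script's actual loop. The to_seconds() here is a hypothetical stand-in that parses a "yymmddhhmmss" string, and the sample timestamps and the @filetimes and @selected names are made up for illustration:

```perl
use strict;
use warnings;
use Time::Local;

# Hypothetical stand-in for the script's to_seconds(): parses "yymmddhhmmss".
sub to_seconds {
    my ($stamp) = @_;
    my ($yy, $mm, $dd, $hh, $min, $ss) =
        $stamp =~ /^(\d\d)(\d\d)(\d\d)(\d\d)(\d\d)(\d\d)$/
        or die "bad timestamp: $stamp";
    return timegm($ss, $min, $hh, $dd, $mm - 1, 2000 + $yy);
}

# Made-up bounds and per-file timestamps, for illustration only.
my ($startingtime, $endingtime) = ("080101000000", "080102000000");
my @filetimes = ("071231235959", "080101120000", "080103000000");

# Convert the fixed bounds once, before the loop, instead of on every pass.
my $startingtimeseconds = to_seconds($startingtime);
my $endingtimeseconds   = to_seconds($endingtime);

my @selected;
foreach my $filetime (@filetimes) {
    my $seconds = to_seconds($filetime);   # only the changing value is converted here
    next if $seconds < $startingtimeseconds || $seconds > $endingtimeseconds;
    push @selected, $filetime;
}

print "selected: @selected\n";
```

With 700k files this saves two conversions per file, which is cheap to fix but unlikely to be the main cost.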

Lines 106-113: this can be replaced with:
    push @{$timefilehash{$cnttime}}, $file;
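
This works because Perl autovivifies the array reference the first time a key is used, so there is no need to check whether the key exists first. A small sketch, with made-up times and file names (only %timefilehash, $cnttime and $file come from the script):

```perl
use strict;
use warnings;

my %timefilehash;

# Made-up (time, file) pairs, for illustration only.
my @samples = (
    ["2008-01-01 00:00:00", "siteA_080101000312_re_7018.gz"],
    ["2008-01-01 00:00:00", "siteB_080101000455_re_7018.gz"],
    ["2008-01-01 00:15:00", "siteA_080101001501_re_7018.gz"],
);

foreach my $pair (@samples) {
    my ($cnttime, $file) = @$pair;

    # No exists() test or explicit [] needed: the array ref is
    # created automatically on the first push for each key.
    push @{ $timefilehash{$cnttime} }, $file;
}

for my $onetime (sort keys %timefilehash) {
    my $goodfiles = scalar @{ $timefilehash{$onetime} };
    print "At time $onetime, there are $goodfiles good files.\n";
}
```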

Lines 124-125: Same as lines 99-100

Lines 194-214: Don't create a copy of @ary in line 194... Just use $ary[0], $ary[1], and $ary[2]

Lines 228-234: Same as lines 194-214

That's as far as I've got so far. I don't think any of those things will make a big difference, though.