Need to isloate differences/matched in csv files using perl

richsark
richsark used Ask the Experts™
on
Hello,

I am in need of an expert here to help me. I have two directories that contain 50,000+ files that are 1K in size ending with .csv

The contents looks like this and its same format through out:

For example this file 2.0.1.0.csv has this:

SubnetAddress=2.0.1.0
SubnetName=mig-Official Test Subnet
SubnetMask=255.255.255.0
NetworkAddress=2.0.0.0
LocationID=2324
Street1=2121 richie hwy
Street2=
City=florida
State=FL
Zip=23111
Country=USA
ContactID=0
ContactLastName=
ContactFirstName=
ContactEmail=
ContactPhone=
ContactPager=
Application=""
Domain=atm.sarkie.net sarkie.net svr.sarkie.net ad.sarkie.net b1test.dot cmg.sarkie.net az.sarkie.net net.sarkie.net prt.sarkie.net addev.lab.sarkie.net dmz.sarkie.net svr.us.rich-sark.net wil.rich.net img.sarkie.net
TftpServer=
ShowUsage=N
CheckUsage=Y
SubnetDescription=
SharedNetwork=
HardwareType=Ethernet
WarningType=0
WarningPercent=0
DNSServers=nsvt0234.svr.us.rich-sark.net
TimeServers=
DefaultRouters=2.0.1.1 2.0.1.142
DHCPServer=kyrtp1dns.svr.sarkie.net
DHCPOptionTemplate=florida-RTP
DHCPPolicyTemplate=
LeasedDynamicObjects/TotalDynamicObjects (percent)=0/1 (0)
PrimaryInterface=No
AllowDHCPClientsModifyDynamicObjectResourceRecords=Same As in Global Policies


I am requesting that a script be created that can look in directory A which is located in  k:\raw and directory B located in K:\merged

What I need to do is match both directories and pull out what is different from both at the same time flag the ones that are duplicated in contents.

Perhaps the report can contain from where it originally matched from for easy verification

Since the csv files has many tokens, as you can see from above. I am only interested in these fields:

SubnetAddress=
SubnetName=
SubnetMask=
NetworkAddress=
Domain=
TftpServer=
DNSServers=
TimeServers=
DefaultRouters=
DHCPServer=
DHCPOptionTemplate=
DHCPPolicyTemplate=
AllowDHCPClientsModifyDynamicObjectResourceRecords=

Thanks in advance for your help !

Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®

For a given subnet address, will both directories give the file the same name (e.g. 2.0.1.0.csv)?  

If so, use diff to find the files that only exist in one directory or the other as well as files with the same name that are completely identical.   Then use perl to only find the differences in the variables you mention above for files that exist in both directories but are not completely identical.

If they may have different names, what criteria do you use to consider them candidates for differencing?




Author

Commented:
Hi
the e.g. is just one example. The files are basically 50,000 subnet.csv files

I took contents from file A from another source. I would like to know what is different from file B in terms of my condition I mentioned above:

SubnetAddress=
SubnetName=
SubnetMask=
NetworkAddress=
Domain=
TftpServer=
DNSServers=
TimeServers=
DefaultRouters=
DHCPServer=
DHCPOptionTemplate=
DHCPPolicyTemplate=
AllowDHCPClientsModifyDynamicObjectResourceRecords=

I am basically looking to see what matches what doesn't and whats missing and create a report based on above fields

Diff is just to big to sort 50,000 csv files. That is why I am looking for an expert to script this for me.

Is that do-able?

 Thanks

I wasn't clear with my question.  

1) If these are two directories with random filenames, and you're not comparing a/xyz with b/xyz, then the problem isn't really tractable.   You're basically asking to do a fuzzy match of 1 file against 50k possible candidates... and then repeat that for all files in a.

2) If you want to compare any files that have the same name in directories a and b using only the fields you mention above, then that's fairly easy.  

3) Or if you want to compare based on a matching primary key (say SubnetAddress) but don't know the name of the file it will appear in, that's also easy, but requires a second pass through the files...or storing the mapping  between subnet address and filename.

I'm guessing that case 2 is what you have, since you call these subnet.csv files....and the 1 example you show is named 2.0.1.0.csv.   Is the following a correct re-phrasing of your question?

Given two directories with subnet.csv files, if the same filename exists in both directories, compare the contents using only  the fields listed above.


If so, then the following may work for you:


#!/usr/bin/perl -w
use strict;
 
my $apath = "k:/raw";
my $bpath = "k:/merged";
my %a = map {s,^$apath/,,; $_ => 1} <$apath/*.csv>;
my %b = map {s,^$bpath/,,; $_ => 1} <$bpath/*.csv>;
 
foreach (keys %a) {
    unless (exists $b{$_}) {
        print "$_ not found in $bpath\n";
        delete $a{$_};
    }
}
foreach (keys %b) {
    print "$_ not found in $apath\n" unless exists $a{$_};
}
 
my @fields = ('SubnetAddress',
              'SubnetName',
              'SubnetMask',
              'NetworkAddress',
              'Domain',
              'TftpServer',
              'DNSServers',
              'TimeServers',
              'DefaultRouters',
              'DHCPServer',
              'DHCPOptionTemplate',
              'DHCPPolicyTemplate',
              'AllowDHCPClientsModifyDynamicObjectResourceRecords');
 
foreach my $file (keys %a) {
    my $af = readFile("$apath/$file");
    my $bf = readFile("$bpath/$file");
    my $diffFound = 0;
 
    foreach my $field (@fields) {
        if ($af->{$field} ne $bf->{$field}) {
            if ($diffFound == 0) {
                print "\nFile $file has the following differences:\n";
                $diffFound = 1;
            }
            print "\t< $field=$af->{$field}\n";
            print "\t> $field=$bf->{$field}\n";
        }
    }
}
 
exit;
 
sub readFile {
    my $file = shift;
    my %rval;
    open (FILE, "<$file") or die "Could not open $file\n";
    while (<FILE>) {
        next unless /^\s*([^=]+)=(.*?)\s*$/;
        $rval{$1} = $2;
    }
    close FILE;
    return { %rval };
}

Open in new window

Introduction to R

R is considered the predominant language for data scientist and statisticians. Learn how to use R for your own data science projects.

Author

Commented:
Hello climbgunks,

Thanks for your quick response.

Let me clarify my query a tad....  I have 2 different systems that contain subnets files like I have shown above.

These 2 different systems were merged together to become one. ( system 3)  What I need to do is see from 1 of the systems that all the subnets have been merged into system 3.

First, I was looking for

- What is matched exactly based on the key fields I provided and report on it
- What is different from e.g. ( I will pick another subnet.csv file) 5.1.10.0.csv) from directory A from directory B

Directory A:

SubnetAddress=5.1.10.0
SubnetName=Sark21
SubnetMask=255.255.255.0
NetworkAddress=5.0.0.0
Domain=lab.sarkie.net
TftpServer=
DNSServers=svt0234.svr.us.rich-sark.net
TimeServers=
DefaultRouters=5.1.10.254
DHCPServer=kyrtp1dns.svr.rich.net
DHCPOptionTemplate=
DHCPPolicyTemplate=
AllowDHCPClientsModifyDynamicObjectResourceRecords=

Directory B:

SubnetAddress=5.1.10.0
SubnetName=Sark21
SubnetMask=255.255.252.0
NetworkAddress=5.0.0.0
Domain=lab.sarkie.net dns3.rich.net
TftpServer=
DNSServers=svt0234.svr.us.rich-sark.net
TimeServers=
DefaultRouters=5.1.10.2
DHCPServer=
DHCPOptionTemplate=
DHCPPolicyTemplate=
AllowDHCPClientsModifyDynamicObjectResourceRecords=

As you can see from both files that there are some fields that have different information from one another. In this case I would need to report that and identify what is different and where it came from.

- another case would be that I have 5.24.23.0.csv in Directory A but not B, Once agian I would need to identify that as well.

- Most of them should ( I hope ) have the same information. I need to report on that as well

I hope you can see a pattern where I am going

Cheers


Did you try the program above?  It should do exactly what you want...

Author

Commented:
Hello Climbgunks,

I did, I like your idea, but there is just so many of < and > to sort 50,000 csv file with the diff you did.

Can  the report be a little better in terms of ouput, maybe create a report at the end once all is loaded in memory?

I would take any suggestions

Each diff between the fields of interest produces two lines (< for the file from patha, > for the file from pathb).   Since that's what you asked for, and I don't know what you really want, I'm at a loss here how to proceed.

If you just want it to say the files differ, that's easy.. .or if you want it to just print out the fields that differ (without the values).

I don't see how storing the output in memory buys you anything, other than possibly watching your machine thrash.

If it's slow, it's because you have 50K files in a single directory, which most filesystems aren't optimized for, and even if they are, the lookup for a single file is still O(log N).

code below tells you whether the file differs or not.... and prints out all the fields that differ
foreach my $file (keys %a) {
    my $af = readFile("$apath/$file");
    my $bf = readFile("$bpath/$file");
    my @diffs = ();
 
    foreach my $field (@fields) {
        if ($af->{$field} ne $bf->{$field}) {
            push @diffs, $field;
        }
    }
 
    if (@diffs == 0) {
        print "File $file is the same\n";
    }
    else {
        print "File $file differs in these fields: @diffs\n";
    }
}

Open in new window

Author

Commented:
OK, will give it a shot

Do I run as is? or do I embed it to your other code?

I would need the #!perl -w statement on top right?

ALso for the memory thing, it was just an idea. Will you code also tell me of one subnet.csv file is present or not in either directory?

Thanks so much for your kind help

you replace lines 3-48 with the new code...


If you had run the first code, the first thing it does is tell you  whether a file exists in only 1 directory or not and which directory it's missing from.

Author

Commented:
Thanks, will let you MOnday. Its been a long day. Thanks for all your help !
sorry..that was lines 33-48..typo

Author

Commented:
Hello climbgunks, sorry for the delay... I like the report format in the first option. I will use that moving forward.

One last change in that code, rather then the report showing < and > 

Can they be changed to names like merged and raw for easy viewing

Thanks !

Author

Commented:
Hello climbgunks, sorry for the delay... I like the report format in the first option. I will use that moving forward.

One last change in that code, rather then the report showing < and > 

Can they be changed to names like merged and raw for easy viewing

Thanks !




            print "\traw\t$field=$af->{$field}\n";
            print "\tmerged\t$field=$bf->{$field}\n";

Author

Commented:
OK, that go's in rows 44 and 45 of the first code right

Author

Commented:
HI, That did it, The final result I used is below.

Thanks for all your help !



#!/usr/bin/perl -w
use strict;
 
my $apath = "k:/raw";
my $bpath = "k:/merged";
my %a = map {s,^$apath/,,; $_ => 1} <$apath/*.csv>;
my %b = map {s,^$bpath/,,; $_ => 1} <$bpath/*.csv>;
 
foreach (keys %a) {
    unless (exists $b{$_}) {
        print "$_ not found in $bpath\n";
        delete $a{$_};
    }
}
foreach (keys %b) {
    print "$_ not found in $apath\n" unless exists $a{$_};
}
 
my @fields = ('SubnetAddress',
              'SubnetName',
              'SubnetMask',
              'NetworkAddress',
              'Domain',
              'TftpServer',
              'DNSServers',
              'TimeServers',
              'DefaultRouters',
              'DHCPServer',
              'DHCPOptionTemplate',
              'DHCPPolicyTemplate',
              'AllowDHCPClientsModifyDynamicObjectResourceRecords');
 
foreach my $file (keys %a) {
    my $af = readFile("$apath/$file");
    my $bf = readFile("$bpath/$file");
    my $diffFound = 0;
 
    foreach my $field (@fields) {
        if ($af->{$field} ne $bf->{$field}) {
            if ($diffFound == 0) {
                print "\nFile $file has the following differences:\n";
                $diffFound = 1;
            }
            
            print "\traw\t$field=$af->{$field}\n";
            print "\tmerged\t$field=$bf->{$field}\n";
        }
    }
}
 
exit;
 
sub readFile {
    my $file = shift;
    my %rval;
    open (FILE, "<$file") or die "Could not open $file\n";
    while (<FILE>) {
        next unless /^\s*([^=]+)=(.*?)\s*$/;
        $rval{$1} = $2;
    }
    close FILE;
    return { %rval };
}

Open in new window

Author

Commented:
Thanks for all your help ! At this rate you will be a genius in no time !

Author

Commented:
HI, I know I closed this, but if I want to add a 3rd directory to compare, would I simply need to do this?

I modified the script to show that. ( I hope I did it right)


#!/usr/bin/perl -w
use strict;
 
my $apath = "k:/raw";
my $bpath = "k:/merged";
my $cpath = "k:/merged2";
my %a = map {s,^$apath/,,; $_ => 1} <$apath/*.csv>;
my %b = map {s,^$bpath/,,; $_ => 1} <$bpath/*.csv>;
my %b = map {s,^$cpath/,,; $_ => 1} <$bpath/*.csv>;
 
foreach (keys %a) {
    unless (exists $b{$_}) {
        print "$_ not found in $bpath\n";
        delete $a{$_};
    }
}
foreach (keys %b) {
    print "$_ not found in $apath\n" unless exists $a{$_};
}
 
my @fields = ('SubnetAddress',
              'SubnetName',
              'SubnetMask',
              'NetworkAddress',
              'Domain',
              'TftpServer',
              'DNSServers',
              'TimeServers',
              'DefaultRouters',
              'DHCPServer',
              'DHCPOptionTemplate',
              'DHCPPolicyTemplate',
              'AllowDHCPClientsModifyDynamicObjectResourceRecords');
 
foreach my $file (keys %a) {
    my $af = readFile("$apath/$file");
    my $bf = readFile("$bpath/$file");
    my $cf = readFile("$cpath/$file");
    my $diffFound = 0;
 
    foreach my $field (@fields) {
        if ($af->{$field} ne $bf->{$field}) {
            if ($diffFound == 0) {
                print "\nFile $file has the following differences:\n";
                $diffFound = 1;
            }
            print "\t< $field=$af->{$field}\n";
            print "\t> $field=$bf->{$field}\n";
            print "\t> $field=$cf->{$field}\n";
        }
    }
}
 
exit;
 
sub readFile {
    my $file = shift;
    my %rval;
    open (FILE, "<$file") or die "Could not open $file\n";
    while (<FILE>) {
        next unless /^\s*([^=]+)=(.*?)\s*$/;
        $rval{$1} = $2;
    }
    close FILE;
    return { %rval };
}

Open in new window

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial