Using Perl to write a specific script to handle CSV data

Hi,

I'm kinda brand new to perl (well, programming in general), and so I'm trying to write a script which can automate work I manually perform on a day-to-day basis.

I've attached an overview of what I'm trying to achieve as a text file (species-script-overview.txt).
I've tried to put as much detail as I can into this text file detailing what i'm trying to achieve out of my script.

20130730-p12-A7.csv is an example CSV datafile I've attached which would contain the raw data used by the script.
SpeciesId.txt will also be used by the script to match scores against the .csv file and to assign Species depending on match scores and %.

I hope "species-script-overview.txt" makes sense... any help or assistance with writing such a script would be greatly appreciated.

Thanks,

Stephen.
SpeciesId.txt
species-script-overview.txt
CSV-Data.jpg
20130730-p12-A7.csv
StephenMcGowanAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

wilcoxonCommented:
Here's my stab at it.  I have not installed and tested with your data.  Let me know if it doesn't work (and how)...
#!/usr/bin/perl
use strict;
use warnings;

##########################################################################
#Script to identify animal species using monoisotopic peak markers against
#MS data
##########################################################################

# forward slashes in dir name should work
my $dir = 'C:/Users/Stephen/Desktop/monopd/monopeaklists';
chdir $dir or die "could not cd to $dir: $!";

# create or overwrite SpeciesId
open my $out, '>', 'SpeciesId' or die "could not write SpeciesId: $!";

##########################################################################
#FILE HANDLING
##########################################################################

# get the list of csv files
opendir my $dir, '.' or die "could not open dir: $!";
my @files = sort grep m{^\d+_\w+_[A-P]\d+\.csv$}, readdir $dir;
closedir $dir;

####################
#FOR EACH CSV FILE:
####################

foreach my $fil (@files) {
    # get wellposition from filename
    my ($wellposition) = $fil =~ m{^\d+_\w+_([A-P]\d+)\.csv$};
    open my $in, '<', $fil or die "could not open $fil: $!";
    # record all masses from the file
    my %masses;
    while (<$in>) {
        chomp;
        # skip header line
        next if m{mass.*intensity};
        my ($mass) = split /,/;
        unless ($mass =~ m{^\d+(?:\.\d+)$}) {
            warn "mass ($mass) not a recognized number - skipping";
            next;
        }
        $mass = round($mass);
        $masses{$mass}++;
    }
    close $in;
    # pass masses hash to subroutine
    my $data = analyze(\%masses);
    output($wellposition, $data);
}

close $out;

##########################################################################
#SUB-ROUTINES
##########################################################################

sub round {
    my ($num) = @_;
    my ($start, $dig) = m{^(\d+(?:\.\d)?(\d)?};
    $start += 0.1 if $dig >= 5;
    return $start;
}

# main sub
{ # closure
# keep %species local to sub-routine but only init it once
my %species;
sub _init {
    open my $in, '<', 'SpeciesId.txt' or die "could not open SpeciesId.txt: $!";
    my $spec;
    while (<$in>) {
        chomp;
        next if /^\s*$/; # skip blank lines
        if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) {
            # handle letter = lines
            $species{$spec}{$1} = [$2];
            push @{$species{$spec}{$1}}, $3 if $3;
        } else {
            # handle species name lines
            $spec = $_;
        }
    }
    close $in;
}
sub analyze {
    my ($masses) = @_;
    _init() unless %species;
    my %data;
    # loop over species entries
SPEC:
    foreach my $spec (keys %species) {
        # loop over each letter of a species
LTR:
        foreach my $ltr (keys %{$species{$spec}}) {
            # loop over each mass for a letter
            foreach my $mass (@{$species{$spec}{$ltr}}) {
                # skip to next letter if it is not found
                next LTR unless exists($masses->{$mass});
            }
            # if we get here, all mass values were found for the species/letter
            $data{$spec}{cnt}++;
        }
    }
    # add percentages
    foreach my $spec (keys %data) {
        $data{$spec}{pct} = round($data{$spec}{cnt} / scalar(@{$species{$spec}}) * 100);
    }
    return \%data;
}
} # end closure

##########################################################################
#RESULTS
##########################################################################

sub output {
    my ($wellposition, $data) = @_;
    my @order = sort $cust_sort keys %$data;
    print "Wellposition ($wellposition) Results:\n\n",
          "Top 5 Species Identities:\n";
    # print out the top 5
    for my $i (0..4) {
        my $spec = $order[$i];
        unless ($order[$i]) {
            print "no more matches\n";
            last; # exit loop
        }
        printf {$out} "%d) %s\t%d matches\t%d\%\n", $i, $spec, $data->{$spec}{cnt}, $data->{$spec}{pct};
    }
}

Open in new window

0
StephenMcGowanAuthor Commented:
Hi wilcoxon,

Thanks a lot for getting back to me and spending the time to go through my 'script'.

I copy/pasted the above code into a file: "Id_script.pl" and located it to the desktop.

I ran the script in the windows command prompt and received the following errors in return:

C:\Users\Stephen\Desktop>perl Id_script.pl
"my" variable $dir masks earlier declaration in same scope at Id_script.pl line
22.
Unmatched ( in regex; marked by <-- HERE in m/^( <-- HERE \d+(?:\.\d)?(\d)?/ at
Id_script.pl line 62.

Open in new window



Thanks,

Stephen.
0
wilcoxonCommented:
Sorry about that - it was getting late.  I made a few typos and forgot to include the custom sort function...
#!/usr/bin/perl
use strict;
use warnings;

##########################################################################
#Script to identify animal species using monoisotopic peak markers against
#MS data
##########################################################################

# forward slashes in dir name should work
my $dir = 'C:/Users/Stephen/Desktop/monopd/monopeaklists';
chdir $dir or die "could not cd to $dir: $!";

# create or overwrite SpeciesId
open my $out, '>', 'SpeciesId' or die "could not write SpeciesId: $!";

##########################################################################
#FILE HANDLING
##########################################################################

# get the list of csv files
opendir DIR, '.' or die "could not open dir: $!";
my @files = sort grep m{^\d+_\w+_[A-P]\d+\.csv$}, readdir DIR;
closedir DIR;

####################
#FOR EACH CSV FILE:
####################

foreach my $fil (@files) {
    # get wellposition from filename
    my ($wellposition) = $fil =~ m{^\d+_\w+_([A-P]\d+)\.csv$};
    open my $in, '<', $fil or die "could not open $fil: $!";
    # record all masses from the file
    my %masses;
    while (<$in>) {
        chomp;
        # skip header line
        next if m{mass.*intensity};
        my ($mass) = split /,/;
        unless ($mass =~ m{^\d+(?:\.\d+)$}) {
            warn "mass ($mass) not a recognized number - skipping";
            next;
        }
        $mass = round($mass);
        $masses{$mass}++;
    }
    close $in;
    # pass masses hash to subroutine
    my $data = analyze(\%masses);
    output($wellposition, $data);
}

close $out;

##########################################################################
#SUB-ROUTINES
##########################################################################

sub round {
    my ($num) = @_;
    my ($start, $dig) = m{^(\d+(?:\.\d)?)(\d)?};
    $start += 0.1 if $dig >= 5;
    return $start;
}

# main sub
{ # closure
# keep %species local to sub-routine but only init it once
my %species;
sub _init {
    open my $in, '<', 'SpeciesId.txt' or die "could not open SpeciesId.txt: $!";
    my $spec;
    while (<$in>) {
        chomp;
        next if /^\s*$/; # skip blank lines
        if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) {
            # handle letter = lines
            $species{$spec}{$1} = [$2];
            push @{$species{$spec}{$1}}, $3 if $3;
        } else {
            # handle species name lines
            $spec = $_;
        }
    }
    close $in;
}
sub analyze {
    my ($masses) = @_;
    _init() unless %species;
    my %data;
    # loop over species entries
SPEC:
    foreach my $spec (keys %species) {
        # loop over each letter of a species
LTR:
        foreach my $ltr (keys %{$species{$spec}}) {
            # loop over each mass for a letter
            foreach my $mass (@{$species{$spec}{$ltr}}) {
                # skip to next letter if it is not found
                next LTR unless exists($masses->{$mass});
            }
            # if we get here, all mass values were found for the species/letter
            $data{$spec}{cnt}++;
        }
    }
    # add percentages
    foreach my $spec (keys %data) {
        $data{$spec}{pct} = round($data{$spec}{cnt} / scalar(@{$species{$spec}}) * 100);
    }
    return \%data;
}
} # end closure

##########################################################################
#RESULTS
##########################################################################

{ # begin closure
my $data;
sub _cust_sort {
    if ($data->{$a}{pct} == $data->{$b}{pct}) {
        return $data->{$a}{cnt} <=> $data->{$b}{cnt};
    }
    return $data->{$a}{pct} <=> $data->{$b}{pct};
}
sub output {
    my $wellposition = shift;
    $data = shift;
    my @order = sort _cust_sort keys %$data;
    print "Wellposition ($wellposition) Results:\n\n",
          "Top 5 Species Identities:\n";
    # print out the top 5
    for my $i (0..4) {
        my $spec = $order[$i];
        unless ($order[$i]) {
            print "no more matches\n";
            last; # exit loop
        }
        printf {$out} "%d) %s\t%d matches\t%d\%\n", $i, $spec, $data->{$spec}{cnt}, $data->{$spec}{pct};
    }
}
} # end closure

Open in new window

Ideally, the DIR used for the opendir/readdir/closdir should be a my var but I was feeling lazy and used a glob to disambiguate with the other "my $dir".  There's no functional difference in this case but it can (rarely) have some subtle gotchas...
0
Learn Ruby Fundamentals

This course will introduce you to Ruby, as well as teach you about classes, methods, variables, data structures, loops, enumerable methods, and finishing touches.

StephenMcGowanAuthor Commented:
Thanks again wilcoxon,

I tried it again and received the following output/error message:

C:\Users\Stephen\Desktop>perl Id_script2.pl
Wellposition (A1) Results:

Top 5 Species Identities:
no more matches
Wellposition (A10) Results:

Top 5 Species Identities:
no more matches
Not an ARRAY reference at Id_script2.pl line 109.

C:\Users\Stephen\Desktop>

Open in new window

0
wilcoxonCommented:
Okay, this time I did grab your sample files and run them.  Here's the corrected version (multiple small errors).  You will want to look at the round subroutine and likely uncomment one of the two lines depending on if a rounded mass should be 123.0 or 123.
#!/usr/bin/perl
use strict;
use warnings;

my $len = 0; # hack global because it's simpler

##########################################################################
#Script to identify animal species using monoisotopic peak markers against
#MS data
##########################################################################

# forward slashes in dir name should work
my $dir = 'C:/Users/Stephen/Desktop/monopd/monopeaklists';
chdir $dir or die "could not cd to $dir: $!";

# create or overwrite SpeciesId
open my $out, '>', 'SpeciesId' or die "could not write SpeciesId: $!";

##########################################################################
#FILE HANDLING
##########################################################################

# get the list of csv files
opendir DIR, '.' or die "could not open dir: $!";
my @files = sort grep m{^\d+_\w+_[A-P]\d+\.csv$}, readdir DIR;
closedir DIR;

####################
#FOR EACH CSV FILE:
####################

foreach my $fil (@files) {
    # get wellposition from filename
    my ($wellposition) = $fil =~ m{^\d+_\w+_([A-P]\d+)\.csv$};
    open my $in, '<', $fil or die "could not open $fil: $!";
    # record all masses from the file
    my %masses;
    while (<$in>) {
        chomp;
        # skip header line
        next if m{mass.*intensity};
        my ($mass) = split /,/;
        unless ($mass =~ m{^\d+(?:\.\d+)$}) {
            warn "mass ($mass) not a recognized number - skipping";
            next;
        }
        $mass = round($mass);
        $masses{$mass}++;
    }
    close $in;
    # pass masses hash to subroutine
    my $data = analyze(\%masses);
    output($wellposition, $data);
}

close $out;

##########################################################################
#SUB-ROUTINES
##########################################################################

sub round {
    my ($num) = @_;
    my ($start, $dig) = $num =~ m{^(\d+(?:\.\d)?)(\d)?};
    $start += 0.1 if (defined $dig and $dig >= 5);
    # XXX - you probably want one of these two uncommented
    # remove .0 from end of number
    # $start =~ s{\.0$}{};
    # add .0 to end of number if no decimal
    # $start .= '.0' unless ($start =~ m{\.\d$});
    return $start;
}

# main sub
{ # closure
# keep %species local to sub-routine but only init it once
my %species;
sub _init {
    open my $in, '<', 'SpeciesId.txt' or die "could not open SpeciesId.txt: $!";
    my $spec;
    while (<$in>) {
        chomp;
        next if /^\s*$/; # skip blank lines
        if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) {
            # handle letter = lines
            $species{$spec}{$1} = [$2];
            push @{$species{$spec}{$1}}, $3 if $3;
        } else {
            # handle species name lines
            $spec = $_;
            $len = length($spec) if (length($spec) > $len);
        }
    }
    close $in;
}
sub analyze {
    my ($masses) = @_;
    _init() unless %species;
    my %data;
    # loop over species entries
SPEC:
    foreach my $spec (keys %species) {
        # loop over each letter of a species
LTR:
        foreach my $ltr (keys %{$species{$spec}}) {
            # loop over each mass for a letter
            foreach my $mass (@{$species{$spec}{$ltr}}) {
                # skip to next letter if it is not found
                next LTR unless exists($masses->{$mass});
            }
            # if we get here, all mass values were found for the species/letter
            $data{$spec}{cnt}++;
        }
    }
    # add percentages
    foreach my $spec (keys %data) {
        $data{$spec}{pct} = round($data{$spec}{cnt} / scalar(keys %{$species{$spec}}) * 100);
    }
    return \%data;
}
} # end closure

##########################################################################
#RESULTS
##########################################################################

{ # begin closure
my $data;
sub _cust_sort {
    if ($data->{$b}{pct} == $data->{$a}{pct}) {
        return $data->{$b}{cnt} <=> $data->{$a}{cnt};
    }
    return $data->{$b}{pct} <=> $data->{$a}{pct};
}
sub output {
    my $wellposition = shift;
    $data = shift;
    my @order = sort _cust_sort keys %$data;
    print {$out} "Wellposition ($wellposition) Results:\n\n",
                 "Top 5 Species Identities:\n";
    # print out the top 5
    for my $i (0..4) {
        my $spec = $order[$i];
        unless ($order[$i]) {
            print "no more matches\n";
            last; # exit loop
        }
        printf {$out} "%d) %-${len}s  %d matches  %0.1f%%\n", $i+1, $spec, $data->{$spec}{cnt}, $data->{$spec}{pct};
    }
}
} # end closure

Open in new window

0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
StephenMcGowanAuthor Commented:
Hi Wilcoxon,

Thanks for getting back to me again.
I've run the script and it seems to run fine - although I've noticed that something isn't quite right with the percentage score feedback:

For example, I have taken B17 as an example:

Wellposition (B17) Results:

Top 5 Species Identities:
1) Sumatran Rhino    2 matches  50.0%
2) Black Rhino       2 matches  50.0%
3) Rat               2 matches  28.6%
4) Rabbit            2 matches  28.6%
5) Hare              2 matches  28.6%

Open in new window


Here, Sumatran Rhino has 2 matches and a score of 50% (therefore 4 matches = 100%).

However, Sumatran Rhino has 7 markers (see SpeciesId.txt):

Sumatran Rhino

A = 1182.6 AND 1198.6
B = 1453.7
C = 1550.8
D = 2145.1
E = 2808.4 
F = 2869.5 AND 2885.5
G = 2983.4 AND 2999.4

Open in new window


Therefore, the correct percentage if two markers were matched would be 2/7*100 = 29% as opposed to the 50% score given.

I think this has happened for most of the scores.

Any chance you could possibly take a look at this?

Thanks again,

Stephen.
0
Ray PaseurCommented:
I got a neglected question alert on this.  For many of us, PHP is a good solution (sometimes better than Perl) because PHP is a very rich language, free and open source, so widely adopted, has many user groups, etc.  From a quick look at the description and data set, it looks like this is a candidate for a short PHP script and a small data base, perhaps something that is also free like MySQL.  With respect for your time, this is not really a question with an answer, so much as a need for application development, and for that you might want to hire an application developer, perhaps from eLance or similar.  If you decide you want to learn a little PHP and try it, this book is an excellent place to start.
http://www.amazon.com/PHP-MySQL-Web-Development-Edition/dp/0321833899

Best regards, ~Ray
0
ozoCommented:
It looks like some of the lines in SpeciesId.txt have blank space at the end, so are not matching
 m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}
try changing it to
 m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?\s*$}
0
wilcoxonCommented:
Point allocation on this question seems extremely unfair.  I spent a significant amount of time writing the script that works almost perfectly.  Allocating all of the points to ozo who fixed the one minor problem with my code seems wrong.

I would request that the asker reallocate points to award the majority of the points to the majority of the work (my script).
0
ozoCommented:
I agree.  You should request that a moderator reopen the question.
0
StephenMcGowanAuthor Commented:
I totally agree, I accidentally assigned ozo the points but I didn't know I could re-assign them once submitting.

Sorry wilcoxon, it was a complete accident. Could a moderator please re-open this question?

Thanks
0
StephenMcGowanAuthor Commented:
I have requested the attention of a moderator.
0
wilcoxonCommented:
Thank you.  I completely understand these things happen (and EE is not always an intuitive interface).
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Perl

From novice to tech pro — start learning today.