Avatar of StephenMcGowan
StephenMcGowan asked on

Using Perl to write a specific script to handle CSV data

Hi,

I'm kinda brand new to perl (well, programming in general), and so I'm trying to write a script which can automate work I manually perform on a day-to-day basis.

I've attached an overview of what I'm trying to achieve as a text file (species-script-overview.txt).
I've tried to put as much detail as I can into this text file detailing what i'm trying to achieve out of my script.

20130730-p12-A7.csv is an example CSV datafile I've attached which would contain the raw data used by the script.
SpeciesId.txt will also be used by the script to match scores against the .csv file and to assign Species depending on match scores and %.

I hope "species-script-overview.txt" makes sense... any help or assistance with writing such a script would be greatly appreciated.

Thanks,

Stephen.
SpeciesId.txt
species-script-overview.txt
CSV-Data.jpg
20130730-p12-A7.csv
PerlWeb Languages and Standards

Avatar of undefined
Last Comment
wilcoxon

8/22/2022 - Mon
wilcoxon

Here's my stab at it.  I have not installed and tested with your data.  Let me know if it doesn't work (and how)...
#!/usr/bin/perl
use strict;
use warnings;

##########################################################################
#Script to identify animal species using monoisotopic peak markers against
#MS data
##########################################################################

# forward slashes in dir name should work
my $dir = 'C:/Users/Stephen/Desktop/monopd/monopeaklists';
chdir $dir or die "could not cd to $dir: $!";

# create or overwrite SpeciesId
open my $out, '>', 'SpeciesId' or die "could not write SpeciesId: $!";

##########################################################################
#FILE HANDLING
##########################################################################

# get the list of csv files
opendir my $dir, '.' or die "could not open dir: $!";
my @files = sort grep m{^\d+_\w+_[A-P]\d+\.csv$}, readdir $dir;
closedir $dir;

####################
#FOR EACH CSV FILE:
####################

foreach my $fil (@files) {
    # get wellposition from filename
    my ($wellposition) = $fil =~ m{^\d+_\w+_([A-P]\d+)\.csv$};
    open my $in, '<', $fil or die "could not open $fil: $!";
    # record all masses from the file
    my %masses;
    while (<$in>) {
        chomp;
        # skip header line
        next if m{mass.*intensity};
        my ($mass) = split /,/;
        unless ($mass =~ m{^\d+(?:\.\d+)$}) {
            warn "mass ($mass) not a recognized number - skipping";
            next;
        }
        $mass = round($mass);
        $masses{$mass}++;
    }
    close $in;
    # pass masses hash to subroutine
    my $data = analyze(\%masses);
    output($wellposition, $data);
}

close $out;

##########################################################################
#SUB-ROUTINES
##########################################################################

sub round {
    my ($num) = @_;
    my ($start, $dig) = m{^(\d+(?:\.\d)?(\d)?};
    $start += 0.1 if $dig >= 5;
    return $start;
}

# main sub
{ # closure
# keep %species local to sub-routine but only init it once
my %species;
sub _init {
    open my $in, '<', 'SpeciesId.txt' or die "could not open SpeciesId.txt: $!";
    my $spec;
    while (<$in>) {
        chomp;
        next if /^\s*$/; # skip blank lines
        if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) {
            # handle letter = lines
            $species{$spec}{$1} = [$2];
            push @{$species{$spec}{$1}}, $3 if $3;
        } else {
            # handle species name lines
            $spec = $_;
        }
    }
    close $in;
}
sub analyze {
    my ($masses) = @_;
    _init() unless %species;
    my %data;
    # loop over species entries
SPEC:
    foreach my $spec (keys %species) {
        # loop over each letter of a species
LTR:
        foreach my $ltr (keys %{$species{$spec}}) {
            # loop over each mass for a letter
            foreach my $mass (@{$species{$spec}{$ltr}}) {
                # skip to next letter if it is not found
                next LTR unless exists($masses->{$mass});
            }
            # if we get here, all mass values were found for the species/letter
            $data{$spec}{cnt}++;
        }
    }
    # add percentages
    foreach my $spec (keys %data) {
        $data{$spec}{pct} = round($data{$spec}{cnt} / scalar(@{$species{$spec}}) * 100);
    }
    return \%data;
}
} # end closure

##########################################################################
#RESULTS
##########################################################################

sub output {
    my ($wellposition, $data) = @_;
    my @order = sort $cust_sort keys %$data;
    print "Wellposition ($wellposition) Results:\n\n",
          "Top 5 Species Identities:\n";
    # print out the top 5
    for my $i (0..4) {
        my $spec = $order[$i];
        unless ($order[$i]) {
            print "no more matches\n";
            last; # exit loop
        }
        printf {$out} "%d) %s\t%d matches\t%d\%\n", $i, $spec, $data->{$spec}{cnt}, $data->{$spec}{pct};
    }
}

Open in new window

ASKER
StephenMcGowan

Hi wilcoxon,

Thanks a lot for getting back to me and spending the time to go through my 'script'.

I copy/pasted the above code into a file: "Id_script.pl" and located it to the desktop.

I ran the script in the windows command prompt and received the following errors in return:

C:\Users\Stephen\Desktop>perl Id_script.pl
"my" variable $dir masks earlier declaration in same scope at Id_script.pl line
22.
Unmatched ( in regex; marked by <-- HERE in m/^( <-- HERE \d+(?:\.\d)?(\d)?/ at
Id_script.pl line 62.

Open in new window



Thanks,

Stephen.
wilcoxon

Sorry about that - it was getting late.  I made a few typos and forgot to include the custom sort function...
#!/usr/bin/perl
use strict;
use warnings;

##########################################################################
#Script to identify animal species using monoisotopic peak markers against
#MS data
##########################################################################

# forward slashes in dir name should work
my $dir = 'C:/Users/Stephen/Desktop/monopd/monopeaklists';
chdir $dir or die "could not cd to $dir: $!";

# create or overwrite SpeciesId
open my $out, '>', 'SpeciesId' or die "could not write SpeciesId: $!";

##########################################################################
#FILE HANDLING
##########################################################################

# get the list of csv files
opendir DIR, '.' or die "could not open dir: $!";
my @files = sort grep m{^\d+_\w+_[A-P]\d+\.csv$}, readdir DIR;
closedir DIR;

####################
#FOR EACH CSV FILE:
####################

foreach my $fil (@files) {
    # get wellposition from filename
    my ($wellposition) = $fil =~ m{^\d+_\w+_([A-P]\d+)\.csv$};
    open my $in, '<', $fil or die "could not open $fil: $!";
    # record all masses from the file
    my %masses;
    while (<$in>) {
        chomp;
        # skip header line
        next if m{mass.*intensity};
        my ($mass) = split /,/;
        unless ($mass =~ m{^\d+(?:\.\d+)$}) {
            warn "mass ($mass) not a recognized number - skipping";
            next;
        }
        $mass = round($mass);
        $masses{$mass}++;
    }
    close $in;
    # pass masses hash to subroutine
    my $data = analyze(\%masses);
    output($wellposition, $data);
}

close $out;

##########################################################################
#SUB-ROUTINES
##########################################################################

sub round {
    my ($num) = @_;
    my ($start, $dig) = m{^(\d+(?:\.\d)?)(\d)?};
    $start += 0.1 if $dig >= 5;
    return $start;
}

# main sub
{ # closure
# keep %species local to sub-routine but only init it once
my %species;
sub _init {
    open my $in, '<', 'SpeciesId.txt' or die "could not open SpeciesId.txt: $!";
    my $spec;
    while (<$in>) {
        chomp;
        next if /^\s*$/; # skip blank lines
        if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) {
            # handle letter = lines
            $species{$spec}{$1} = [$2];
            push @{$species{$spec}{$1}}, $3 if $3;
        } else {
            # handle species name lines
            $spec = $_;
        }
    }
    close $in;
}
sub analyze {
    my ($masses) = @_;
    _init() unless %species;
    my %data;
    # loop over species entries
SPEC:
    foreach my $spec (keys %species) {
        # loop over each letter of a species
LTR:
        foreach my $ltr (keys %{$species{$spec}}) {
            # loop over each mass for a letter
            foreach my $mass (@{$species{$spec}{$ltr}}) {
                # skip to next letter if it is not found
                next LTR unless exists($masses->{$mass});
            }
            # if we get here, all mass values were found for the species/letter
            $data{$spec}{cnt}++;
        }
    }
    # add percentages
    foreach my $spec (keys %data) {
        $data{$spec}{pct} = round($data{$spec}{cnt} / scalar(@{$species{$spec}}) * 100);
    }
    return \%data;
}
} # end closure

##########################################################################
#RESULTS
##########################################################################

{ # begin closure
my $data;
sub _cust_sort {
    if ($data->{$a}{pct} == $data->{$b}{pct}) {
        return $data->{$a}{cnt} <=> $data->{$b}{cnt};
    }
    return $data->{$a}{pct} <=> $data->{$b}{pct};
}
sub output {
    my $wellposition = shift;
    $data = shift;
    my @order = sort _cust_sort keys %$data;
    print "Wellposition ($wellposition) Results:\n\n",
          "Top 5 Species Identities:\n";
    # print out the top 5
    for my $i (0..4) {
        my $spec = $order[$i];
        unless ($order[$i]) {
            print "no more matches\n";
            last; # exit loop
        }
        printf {$out} "%d) %s\t%d matches\t%d\%\n", $i, $spec, $data->{$spec}{cnt}, $data->{$spec}{pct};
    }
}
} # end closure

Open in new window

Ideally, the DIR used for the opendir/readdir/closdir should be a my var but I was feeling lazy and used a glob to disambiguate with the other "my $dir".  There's no functional difference in this case but it can (rarely) have some subtle gotchas...
I started with Experts Exchange in 2004 and it's been a mainstay of my professional computing life since. It helped me launch a career as a programmer / Oracle data analyst
William Peck
ASKER
StephenMcGowan

Thanks again wilcoxon,

I tried it again and received the following output/error message:

C:\Users\Stephen\Desktop>perl Id_script2.pl
Wellposition (A1) Results:

Top 5 Species Identities:
no more matches
Wellposition (A10) Results:

Top 5 Species Identities:
no more matches
Not an ARRAY reference at Id_script2.pl line 109.

C:\Users\Stephen\Desktop>

Open in new window

ASKER CERTIFIED SOLUTION
wilcoxon

Log in or sign up to see answer
Become an EE member today7-DAY FREE TRIAL
Members can start a 7-Day Free trial then enjoy unlimited access to the platform
Sign up - Free for 7 days
or
Learn why we charge membership fees
We get it - no one likes a content blocker. Take one extra minute and find out why we block content.
See how we're fighting big data
Not exactly the question you had in mind?
Sign up for an EE membership and get your own personalized solution. With an EE membership, you can ask unlimited troubleshooting, research, or opinion questions.
ask a question
ASKER
StephenMcGowan

Hi Wilcoxon,

Thanks for getting back to me again.
I've run the script and it seems to run fine - although I've noticed that something isn't quite right with the percentage score feedback:

For example, I have taken B17 as an example:

Wellposition (B17) Results:

Top 5 Species Identities:
1) Sumatran Rhino    2 matches  50.0%
2) Black Rhino       2 matches  50.0%
3) Rat               2 matches  28.6%
4) Rabbit            2 matches  28.6%
5) Hare              2 matches  28.6%

Open in new window


Here, Sumatran Rhino has 2 matches and a score of 50% (therefore 4 matches = 100%).

However, Sumatran Rhino has 7 markers (see SpeciesId.txt):

Sumatran Rhino

A = 1182.6 AND 1198.6
B = 1453.7
C = 1550.8
D = 2145.1
E = 2808.4 
F = 2869.5 AND 2885.5
G = 2983.4 AND 2999.4

Open in new window


Therefore, the correct percentage if two markers were matched would be 2/7*100 = 29% as opposed to the 50% score given.

I think this has happened for most of the scores.

Any chance you could possibly take a look at this?

Thanks again,

Stephen.
Ray Paseur

I got a neglected question alert on this.  For many of us, PHP is a good solution (sometimes better than Perl) because PHP is a very rich language, free and open source, so widely adopted, has many user groups, etc.  From a quick look at the description and data set, it looks like this is a candidate for a short PHP script and a small data base, perhaps something that is also free like MySQL.  With respect for your time, this is not really a question with an answer, so much as a need for application development, and for that you might want to hire an application developer, perhaps from eLance or similar.  If you decide you want to learn a little PHP and try it, this book is an excellent place to start.
http://www.amazon.com/PHP-MySQL-Web-Development-Edition/dp/0321833899

Best regards, ~Ray
Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.
ozo

It looks like some of the lines in SpeciesId.txt have blank space at the end, so are not matching
 m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}
try changing it to
 m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?\s*$}
wilcoxon

Point allocation on this question seems extremely unfair.  I spent a significant amount of time writing the script that works almost perfectly.  Allocating all of the points to ozo who fixed the one minor problem with my code seems wrong.

I would request that the asker reallocate points to award the majority of the points to the majority of the work (my script).
ozo

I agree.  You should request that a moderator reopen the question.
All of life is about relationships, and EE has made a viirtual community a real community. It lifts everyone's boat
William Peck
ASKER
StephenMcGowan

I totally agree, I accidentally assigned ozo the points but I didn't know I could re-assign them once submitting.

Sorry wilcoxon, it was a complete accident. Could a moderator please re-open this question?

Thanks
ASKER
StephenMcGowan

I have requested the attention of a moderator.
wilcoxon

Thank you.  I completely understand these things happen (and EE is not always an intuitive interface).
Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.