Solved

Modifying an existing perl script

Posted on 2013-10-25
7
279 Views
Last Modified: 2013-11-01
Hi,

I'm trying to modify an existing perl script: Speciesmatch.pl

The script currently takes CSV files in the following naming convention: 20130730-p12-A1.csv; i.e. "date, p-number, location.csv"

However, the structure of these files has now changed in that the file names have now been shortened to simply their location (e.g. "A1") followed with a short reference name (e.g. "DOL1").
I've attached 3 example csv data files to this question.

Briefly about the script:

The script (speciesmatch.txt) should process the CSV files one at a time.
For each csv file, it should look at the mass column (first column) and go through the column of scores.
For each score it should take this value and check this mass score against the scores for each animal species contained in SpeciesId.txt. It should record a running tally to see which animal species has the most number of matches to the mass column in the CSV file. The results are recorded in a SpeciesID file where the "top 5 species" are recorded with their number of matches and  percentage match.

The script is already set up correctly to do this, although I'm struggling to run it with the altered ".csv" file naming convention (A1-DOL1.csv)


One other modification I'd like to make to this script is to take the top 5 animal species match for a file (say in this case A1)

1. Dolphin 12 matches 100%
2. Cow  5 matches...
3. Snake 2 matches...
etc

Take the very top match-up (in this case Horse)
I would then like the script to open the A1_DOL1.csv file and then write to the very first row:

"Best match: Dolphin"

This would mean that the "mass/intensity/filename/samplename" table would need to be shifted down a couple of lines for the "Best match: Dolphin" to be written in to the CSV file.

Sorry to come across as long winded... but I just thought I'd put as much detail into this question as possible, as I found it quite hard to explain what I'm trying to achieve out of this. Hopefully the script will make sense and will have most of the hard work already done.

Any help with this would be very much appreciated.

Thanks,

Stephen.
Speciesmatch.txt
A1-DOL1.csv
A2-HOR.csv
A3-HIPP.csv
SpeciesId.txt
0
Comment
Question by:StephenMcGowan
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 3
  • 3
7 Comments
 
LVL 5

Expert Comment

by:Dave Gould
ID: 39600658
The line "my @files = sort grep m{^\d+_\w+_[A-P]\d+\.csv$}, readdir DIR" creates an array called files and will add all files in the directory DIR (C:/Users/Stephen/Desktop/20130219_PH12A/relmonopeaklists) thet meet the following requirements:
^ ==> start of line
\d+ ==> one or more numbers
_ ==> an underscore char
\w ==> any single "word" character (a letter, digit or underscore)
_ ==>  an underscore char
[A-P] ==> A capital letter between A & P
\d+ ==> one or more numbers
\.csv$ ==> a . followed by csv followed by end of line ( the backslash is so as not to interpret the .)

In order to know exactly how to modify the line, you need to know what are the location naming conventions. Same for the short reference but you could use ^.+\.csv$
This would match any line ending in .csv
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39600801
This should do what you are asking:
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

my $len = 0; # hack global because it's simpler

##########################################################################
#Script to identify animal species using monoisotopic peak markers against
#MS data
##########################################################################

# forward slashes in dir name should work
my $dir = 'C:/Users/Stephen/Desktop/20130219_PH12A/relmonopeaklists';
chdir $dir or die "could not cd to $dir: $!";

# create or overwrite SpeciesId
open my $out, '>', 'SpeciesId' or die "could not write SpeciesId: $!";

##########################################################################
#FILE HANDLING
##########################################################################

# get the list of csv files
opendir DIR, '.' or die "could not open dir: $!";
my @files = sort grep m{^[A-P]\d+[-_]\w+\.csv$}, readdir DIR;
# match A-P followed by at least 1 number followed by
## _ or - (description says - but code says _) and any letters/numbers
# alternately, you could simply do m{\.csv$} as trappa suggested but it may
## pick up files you don't want it to
closedir DIR;

####################
#FOR EACH CSV FILE:
####################

foreach my $fil (@files) {
    # get wellposition from filename
    my ($wellposition) = $fil =~ m{^\d+_\w+_([A-P]\d+)\.csv$};
    open my $in, '<', $fil or die "could not open $fil: $!";
    # record all masses from the file
    my %masses;
    while (<$in>) {
        chomp;
        # skip header line
        next if m{mass.*intensity};
        my ($mass) = split /,/;
        unless ($mass =~ m{^\d+(?:\.\d+)$}) {
            warn "mass ($mass) not a recognized number - skipping";
            next;
        }
        $mass = round($mass);
        $masses{$mass}++;
    }
    close $in;
    # pass masses hash to subroutine
    my $data = analyze(\%masses);
    output($wellposition, $data, $fil);
}

close $out;

##########################################################################
#SUB-ROUTINES
##########################################################################

sub round {
    my ($num) = @_;
    my ($start, $dig) = $num =~ m{^(\d+(?:\.\d)?)(\d)?};
    $start += 0.1 if (defined $dig and $dig >= 5);
    # XXXXX - XXXXXX
    # remove .0 from end of number
    # $start =~ s{\.0$}{};
    #add .0 to end of number if no decimal
    $start .= '.0' unless ($start =~ m{\.\d$});
    return $start;
}

# main sub
{ # closure
my $Z='Z';
sub _init {
    open my $in, '<', 'SpeciesId3.txt' or die "could not open SpeciesId.txt: $!";
    my $spec;
    while (<$in>) {
        chomp;
        next if /^\s*$/; # skip blank lines
        if (m{^([A-Z]?)\s*=?\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?\s*$}) {
            # handle letter = lines
            push @{$species{$spec}{$1||++$Z}}, $2;
            push @{$species{$spec}{$1||$Z}}, $3 if $3;
        } else {
            # handle species name lines
            $spec = $_;
            $len = length($spec) if (length($spec) > $len);
        }
    }
    close $in;
}

sub analyze {
    my ($masses) = @_;
    _init() unless %species;
    my %data;
    # loop over species entries
SPEC:
    foreach my $spec (keys %species) {
        # loop over each letter of a species
LTR:
        foreach my $ltr (keys %{$species{$spec}}) {
            # loop over each mass for a letter
            foreach my $mass (@{$species{$spec}{$ltr}}) {
                # skip to next letter if it is not found
                next LTR unless exists($masses->{$mass});
            }
            # if we get here, all mass values were found for the species/letter
            $data{$spec}{cnt}++;
        }
    }
    # add percentages
    foreach my $spec (keys %data) {
        $data{$spec}{pct} = round($data{$spec}{cnt} / scalar(keys %{$species{$spec}}) * 100);
    }
    return \%data;
}
} # end closure

##########################################################################
#RESULTS
##########################################################################

{ # begin closure
my $data;
sub _cust_sort {
    if ($data->{$b}{pct} == $data->{$a}{pct}) {
        return $data->{$b}{cnt} <=> $data->{$a}{cnt};
    }
    return $data->{$b}{pct} <=> $data->{$a}{pct};
}
sub output {
    my ($wellposition, $data, $fil) = @_;
    my @order = sort _cust_sort keys %$data;
    print {$out} "Wellposition ($wellposition) Results:\n\n",
                 "Top 5 Species Identities:\n";
    # print out the top 5
    for my $i (0..4) {
        my $spec = $order[$i];
        unless ($spec) {
            print "no more matches\n";
            last; # exit loop
        }
        printf {$out} "%d) %-${len}s  %d matches  %0.1f%%\n", $i+1, $spec, $data->{$spec}{cnt}, $data->{$spec}{pct};
    }
    if ($order[0]) {
        # could be done without Tie::File but I find it easiest for modifying files
        tie my @lines, 'Tie::File', $fil or die "could not tie $fil: $!";
        unshift @lines, "Best match: $order[0]", '';
        untie @lines;
    }
}
} # end closure

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39605186
Hi Wilcoxon,

The script seems to be falling over at the moment, with %species requiring explicit package name:

%species requires explicit package name
Thanks,

Stephen
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:StephenMcGowan
ID: 39609167
Ok, so I've tried to fix the above problem.

At the beginning of the script I declared %species using

my %species;

#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

my $len = 0; # hack global because it's simpler
my %species;

Open in new window


The script seems to run through to completion, although I'm receiving a large number of error messages which seem to center around:

uninitialized value in numeric comparison  (<=>) at line 137
uninitialized value in numeric eq (==) at line 136
uninitialized value $wellposition in concatenation (.) or string at line 144

(see the attached file)

Checking the SpeciesId results file, it seems that the wellposition ($wellposition) hasn't been written to the file:

i.e.

Wellposition () Results:

Top 5 Species Identities:
1) Mic Oeconomus     1 matches  1.0%
2) Field Mouse       1 matches  1.5%
Wellposition () Results:

where it should read:

Wellposition (A1) Results:

Top 5 Species Identities:
1) Mic Oeconomus     1 matches  1.0%
2) Field Mouse       1 matches  1.5%
Wellposition (A10) Results:

etc etc
errors.jpg
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39609901
I'll try to take a look tomorrow.  It's been a busy week.
0
 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 39615920
Sorry it took so long to get back to this.  This should work (tested).  There were a few errors in the script you provided, I missed one necessary change, and I added one new bug...
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

my $len = 0; # hack global because it's simpler

##########################################################################
#Script to identify animal species using monoisotopic peak markers against
#MS data
##########################################################################

# forward slashes in dir name should work
my $dir = 'C:/Users/Stephen/Desktop/20130219_PH12A/relmonopeaklists';
chdir $dir or die "could not cd to $dir: $!";

# create or overwrite SpeciesId
open my $out, '>', 'SpeciesId' or die "could not write SpeciesId: $!";

##########################################################################
#FILE HANDLING
##########################################################################

# get the list of csv files
opendir DIR, '.' or die "could not open dir: $!";
my @files = sort grep m{^[A-P]\d+[-_]\w+\.csv$}, readdir DIR;
# match A-P followed by at least 1 number followed by
## _ or - (description says - but code says _) and any letters/numbers
# alternately, you could simply do m{\.csv$} as trappa suggested but it may
## pick up files you don't want it to
closedir DIR;

####################
#FOR EACH CSV FILE:
####################

foreach my $fil (@files) {
    # get wellposition from filename
    my ($wellposition) = $fil =~ m{^([A-P]\d+)[-_]\w+\.csv$};
    open my $in, '<', $fil or die "could not open $fil: $!";
    # record all masses from the file
    my %masses;
    while (<$in>) {
        chomp;
        # skip header line
        next if m{mass.*intensity};
        my ($mass) = split /,/;
        unless ($mass =~ m{^\d+(?:\.\d+)$}) {
            warn "mass ($mass) not a recognized number - skipping";
            next;
        }
        $mass = round($mass);
        $masses{$mass}++;
    }
    close $in;
    # pass masses hash to subroutine
    my $data = analyze(\%masses);
    output($wellposition, $data, $fil);
}

close $out;

##########################################################################
#SUB-ROUTINES
##########################################################################

sub round {
    my ($num) = @_;
    my ($start, $dig) = $num =~ m{^(\d+(?:\.\d)?)(\d)?};
    $start += 0.1 if (defined $dig and $dig >= 5);
    # XXXXX - XXXXXX
    # remove .0 from end of number
    # $start =~ s{\.0$}{};
    #add .0 to end of number if no decimal
    $start .= '.0' unless ($start =~ m{\.\d$});
    return $start;
}

# main sub
{ # closure
my $Z='Z';
my %species;
sub _init {
    open my $in, '<', 'SpeciesId3.txt' or die "could not open SpeciesId.txt: $!";
    my $spec;
    while (<$in>) {
        chomp;
        next if /^\s*$/; # skip blank lines
        if (m{^([A-Z]?)\s*=?\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?\s*$}) {
            # handle letter = lines
            push @{$species{$spec}{$1||++$Z}}, $2;
            push @{$species{$spec}{$1||$Z}}, $3 if $3;
        } else {
            # handle species name lines
            $spec = $_;
            $len = length($spec) if (length($spec) > $len);
        }
    }
    close $in;
}

sub analyze {
    my ($masses) = @_;
    _init() unless %species;
    my %data;
    # loop over species entries
SPEC:
    foreach my $spec (keys %species) {
        # loop over each letter of a species
LTR:
        foreach my $ltr (keys %{$species{$spec}}) {
            # loop over each mass for a letter
            foreach my $mass (@{$species{$spec}{$ltr}}) {
                # skip to next letter if it is not found
                next LTR unless exists($masses->{$mass});
            }
            # if we get here, all mass values were found for the species/letter
            $data{$spec}{cnt}++;
        }
    }
    # add percentages
    foreach my $spec (keys %data) {
        $data{$spec}{pct} = round($data{$spec}{cnt} / scalar(keys %{$species{$spec}}) * 100);
    }
    return \%data;
}
} # end closure

##########################################################################
#RESULTS
##########################################################################

{ # begin closure
my $data;
sub _cust_sort {
    if ($data->{$b}{pct} == $data->{$a}{pct}) {
        return $data->{$b}{cnt} <=> $data->{$a}{cnt};
    }
    return $data->{$b}{pct} <=> $data->{$a}{pct};
}
sub output {
    my ($wellposition, $fil);
    ($wellposition, $data, $fil) = @_;
    my @order = sort _cust_sort keys %$data;
    print {$out} "Wellposition ($wellposition) Results:\n\n",
                 "Top 5 Species Identities:\n";
    # print out the top 5
    for my $i (0..4) {
        my $spec = $order[$i];
        unless ($spec) {
            print "no more matches\n";
            last; # exit loop
        }
        printf {$out} "%d) %-${len}s  %d matches  %0.1f%%\n", $i+1, $spec, $data->{$spec}{cnt}, $data->{$spec}{pct};
    }
    print {$out} "\n";
    if ($order[0]) {
        # could be done without Tie::File but I find it easiest for modifying files
        tie my @lines, 'Tie::File', $fil or die "could not tie $fil: $!";
        unshift @lines, "Best match: $order[0]", '';
        untie @lines;
    }
}
} # end closure

Open in new window

0
 

Author Closing Comment

by:StephenMcGowan
ID: 39616417
Excellent! - does exactly as intended.
0

Featured Post

Enroll in June's Course of the Month

June's Course of the Month is now available! Every 10 seconds, a consumer gets hit with ransomware. Refresh your knowledge of ransomware best practices by enrolling in this month's complimentary course for Premium Members, Team Accounts, and Qualified Experts.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Having just graduated from college and entered the workforce, I don’t find myself always using the tools and programs I grew accustomed to over the past four years. However, there is one program I continually find myself reverting back to…R.   So …
The purpose of this article is to demonstrate how we can use conditional statements using Python.
The viewer will learn how to implement Singleton Design Pattern in Java.
The viewer will learn how to use the return statement in functions in C++. The video will also teach the user how to pass data to a function and have the function return data back for further processing.

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question