StephenMcGowan


Modifying perl code

Hi there,

I have a CSV file (exported from Excel) which contains two columns:

Column A: mass
Column B: intensity


I also have a section of code which I'm currently looking to modify:

# record all masses from the file
    my %masses;
    while (<$in>) {
        chomp;
        # skip header line
        next if m{mass.*intensity};
        my ($mass) = split /,/;
        # accept whole numbers (e.g. 4) as well as decimals (e.g. 1.9)
        unless ($mass =~ m{^\d+(?:\.\d+)?$}) {
            warn "mass ($mass) not a recognized number - skipping";
            next;
        }
        $mass = round($mass);
        $masses{$mass}++;
    }
    close $in;
    # pass masses hash to subroutine
    my $data = analyze(\%masses);
    output($wellposition, $data);
}

close $out;



At the moment, the code records all of the masses from the file.

I'm looking to change the code so that:

1) the script sorts the CSV file into "intensity" order: highest to lowest. So initially it focuses on column B.

2) the script then uses the "mass" values (column A) for the first 50 intensity values

For example, the top 5 would work like this:

mass       intensity
0.3               7
4                 0.8
5                 0.1
0.8               4
1.9               9
2.6               2
3                 5.6
2                 3.2

1) sort into Intensity Order (Highest first)

mass       intensity
1.9               9
0.3               7
3                 5.6
0.8               4
2                 3.2
2.6               2
4                 0.8
5                 0.1

2) Take the first 5 masses from the list

mass
1.9
0.3
3
0.8
2

This is just a small example; in practice I'd be doing this for the first 50.
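
The two steps above could be sketched in Perl roughly as follows. This is only an illustration on the eight example rows: the sub name top_masses and the array-of-pairs layout are assumptions made for the example, not part of the existing script.

```perl
use strict;
use warnings;

# Hypothetical helper: return the masses of the $n most intense rows.
# Each row is an array ref [mass, intensity].
sub top_masses {
    my ($rows, $n) = @_;
    # 1) sort by intensity (column B), highest first
    my @sorted = sort { $b->[1] <=> $a->[1] } @$rows;
    # 2) keep at most $n rows, even if there are fewer rows than that
    $n = @sorted if $n > @sorted;
    return map { $_->[0] } @sorted[0 .. $n - 1];
}

# The example table from the question, in file order.
my @rows = (
    [0.3, 7], [4, 0.8], [5, 0.1], [0.8, 4],
    [1.9, 9], [2.6, 2], [3, 5.6], [2, 3.2],
);

my @top5 = top_masses(\@rows, 5);
print "$_\n" for @top5;
```

For the real data the same call would use 50 instead of 5; the guard on $n means a file with fewer than 50 rows simply returns all of its masses.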

Thanks,

Stephen.
ASKER CERTIFIED SOLUTION
ozo
StephenMcGowan

ASKER

Hi ozo,

Thanks for getting back to me.

I ran the script and was given two errors regarding @top:

"Global symbol "@top" requires explicit package name at Id_script4.pl line 70"

which is this line:

 push @top,[$mass,$intensity];

and

"Global symbol "@top" requires explicit package name at Id_script4.pl line 74"

which is this line:

"$masses{round($_->[0])}++ for (sort{$b->[1]<=>$a->[1]}@top[0..49])[0..4];"



Stephen.
echo 'Global symbol "@top" requires explicit package name at Id_script4.pl line 70' | splain
Global symbol "@top" requires explicit package name at Id_script4.pl line 70 (#1)
    (F) You've said "use strict" or "use strict vars", which indicates
    that all variables must either be lexically scoped (using "my" or "state"),
    declared beforehand using "our", or explicitly qualified to say
    which package the global variable is in (using "::")
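
As a minimal illustration of that message (the two data lines here are made up for the example): under "use strict", an array has to be declared, for instance with "my", before the first push onto it.

```perl
use strict;
use warnings;

# Without this declaration, "use strict" aborts compilation with
# 'Global symbol "@top" requires explicit package name'.
my @top;

# Made-up rows standing in for lines read from a CSV file.
for my $line ('1.9,9', '0.3,7') {
    my ($mass, $intensity) = split /,/, $line;
    push @top, [$mass, $intensity];
}

print scalar(@top), " rows stored\n";
```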


I did not include the declaration because it seemed like the section of code I was modifying was also missing some declarations. Without the rest of the code, I did not know whether you would want to put the new declaration together with whatever other declarations you might have, or even whether you were using strict vars.

Ohh I think I see...

The complete script is shown below. Would I need to declare @top with a "my @top =" statement?

#!/usr/bin/perl
use strict;
use warnings;

my $len = 0; # hack global because it's simpler

##########################################################################
#Script to identify animal species using monoisotopic peak markers against
#MS data
##########################################################################

# forward slashes in dir name should work
my $dir = 'C:/Users/Stephen/Desktop/test/relmonopeaklists';
chdir $dir or die "could not cd to $dir: $!";

# create or overwrite SpeciesId
open my $out, '>', 'SpeciesId' or die "could not write SpeciesId: $!";

##########################################################################
#FILE HANDLING
##########################################################################

# get the list of csv files
opendir DIR, '.' or die "could not open dir: $!";
my @files = sort grep m{^\d+_\w+_[A-P]\d+\.csv$}, readdir DIR;
closedir DIR;

####################
#FOR EACH CSV FILE:
####################

foreach my $fil (@files) {
    # get wellposition from filename
    my ($wellposition) = $fil =~ m{^\d+_\w+_([A-P]\d+)\.csv$};
    open my $in, '<', $fil or die "could not open $fil: $!";
    
# record all masses from the file
#    my %masses;
#    while (<$in>) {
#        chomp;
#        # skip header line
#        next if m{mass.*intensity};
#        my ($mass) = split /,/;
#        unless ($mass =~ m{^\d+(?:\.\d+)$}) {
#            warn "mass ($mass) not a recognized number - #skipping";
#            next;
#        }
#        $mass = round($mass);
#        $masses{$mass}++;
#    }
#    close $in;
#    # pass masses hash to subroutine
#    my $data = analyze(\%masses);
#    output($wellposition, $data);
#}
#
#close $out;

# record all masses from the file
    my %masses;
    while (<$in>) {
        chomp;
        # skip header line
        next if m{mass.*intensity};
        my ($mass,$intensity) = split /,/;
        # accept whole numbers (e.g. 4) as well as decimals (e.g. 1.9)
        unless ($mass =~ m{^\d+(?:\.\d+)?$}) {
            warn "mass ($mass) not a recognized number - skipping";
            next;
        }
       push @top,[$mass,$intensity];

    }
    close $in;
   $masses{round($_->[0])}++ for (sort{$b->[1]<=>$a->[1]}@top[0..49])[0..4];
    # pass masses hash to subroutine
   
    my $data = analyze(\%masses);
    output($wellposition, $data);
}

close $out;

##########################################################################
#SUB-ROUTINES
##########################################################################

sub round {
    my ($num) = @_;
    my ($start, $dig) = $num =~ m{^(\d+(?:\.\d)?)(\d)?};
    $start += 0.1 if (defined $dig and $dig >= 5);
    # XXX - you probably want one of these two uncommented
    # remove .0 from end of number
    # $start =~ s{\.0$}{};
    #add .0 to end of number if no decimal
    $start .= '.0' unless ($start =~ m{\.\d$});
    return $start;
}

# main sub
{ # closure
# keep %species local to sub-routine but only init it once
my %species;

my $Z='Z';
sub _init {

    open my $in, '<', 'Species_Int.txt' or die "could not open Species_Int.txt: $!";
    my $spec;
    while (<$in>) {
        chomp;
        next if /^\s*$/; # skip blank lines
        if (m{^([A-Z]?)\s*=?\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?\s*$}) {
            # handle letter = lines
            push @{$species{$spec}{$1||++$Z}}, $2;
            push @{$species{$spec}{$1||$Z}}, $3 if $3;
        } else {
            # handle species name lines
            $spec = $_;
            $len = length($spec) if (length($spec) > $len);
        }
    }
    close $in;
}

sub analyze {
    my ($masses) = @_;
    _init() unless %species;
    my %data;
    # loop over species entries
SPEC:
    foreach my $spec (keys %species) {
        # loop over each letter of a species
LTR:
        foreach my $ltr (keys %{$species{$spec}}) {
            # loop over each mass for a letter
            foreach my $mass (@{$species{$spec}{$ltr}}) {
                # skip to next letter if it is not found
                next LTR unless exists($masses->{$mass});
            }
            # if we get here, all mass values were found for the species/letter
            $data{$spec}{cnt}++;
        }
    }
    # add percentages
    foreach my $spec (keys %data) {
        $data{$spec}{pct} = round($data{$spec}{cnt} / scalar(keys %{$species{$spec}}) * 100);
    }
    return \%data;
}
} # end closure

##########################################################################
#RESULTS
##########################################################################

{ # begin closure
my $data;
sub _cust_sort {
    if ($data->{$b}{pct} == $data->{$a}{pct}) {
        return $data->{$b}{cnt} <=> $data->{$a}{cnt};
    }
    return $data->{$b}{pct} <=> $data->{$a}{pct};
}
sub output {
    my $wellposition = shift;
    $data = shift;
    my @order = sort _cust_sort keys %$data;
    print {$out} "Wellposition ($wellposition) Results:\n\n",
                 "Top 5 Species Identities:\n";
    # print out the top 5
    for my $i (0..4) {
        my $spec = $order[$i];
        unless ($order[$i]) {
            print "no more matches\n";
            last; # exit loop
        }
        printf {$out} "%d) %-${len}s  %d matches  %0.1f%%\n", $i+1, $spec, $data->{$spec}{cnt}, $data->{$spec}{pct};
    }
}
} # end closure

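
On the question of declaring @top: under "use strict" a declaration such as "my @top;" is needed, and placing it next to "my %masses;" inside the foreach loop would give each CSV file a fresh, empty list. Separately, "@top[0..49]" inside the sort appears to take the first 50 rows in file order before sorting, and the trailing "[0..4]" then keeps only five of them; sorting the whole list and slicing afterwards seems closer to the stated goal of using the 50 most intense peaks. The sketch below shows both changes in isolation. The sub name tally_top_masses, the in-memory sample data, and the use of sprintf in place of the script's own round() are assumptions made to keep the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: read "mass,intensity" lines from $fh and count the
# rounded masses of the $n most intense rows.
sub tally_top_masses {
    my ($fh, $n) = @_;
    my @top;       # declared per call, so each file starts with an empty list
    my %masses;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ m{mass.*intensity};    # skip header line
        my ($mass, $intensity) = split /,/, $line;
        # skip rows where either column is missing or not a plain number
        next unless defined $mass      && $mass      =~ m{^\d+(?:\.\d+)?$};
        next unless defined $intensity && $intensity =~ m{^\d+(?:\.\d+)?$};
        push @top, [$mass, $intensity];
    }
    # sort ALL rows by intensity first, then slice
    my @sorted = sort { $b->[1] <=> $a->[1] } @top;
    $n = @sorted if $n > @sorted;               # file may have fewer than $n rows
    # sprintf stands in for the script's round() here (assumption)
    $masses{ sprintf '%.1f', $_->[0] }++ for @sorted[0 .. $n - 1];
    return \%masses;
}

# In-memory sample standing in for one CSV file (made-up data).
my $csv = "mass,intensity\n0.3,7\n4,0.8\n5,0.1\n0.8,4\n1.9,9\n2.6,2\n3,5.6\n2,3.2\n";
open my $in, '<', \$csv or die "could not open in-memory file: $!";
my $masses = tally_top_masses($in, 5);
close $in;

print join(' ', sort { $a <=> $b } keys %$masses), "\n";
```

In the full script, the body of this sub would correspond to the inside of the "foreach my $fil (@files)" loop, with 50 passed as the count and the resulting hash handed to analyze() as before.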