Avatar of StephenMcGowan
StephenMcGowan
 asked on

Perl Script to input data from an XLS file

Hi,

I'm trying to write a script which takes .xls column entries (from a table of peptides), creates ranges of +/- 3 for each value, and then matches this range against another .xls file to return label information.

I've put together a step-by-step procedure for what I'm trying to achieve in "script_procedure.txt".
I've also attached sample files: "cropped_MB4_Elephant.xls", "cropped_MB10_Dolphin.xls"

Finally, Label_List is used for the script to extract the label information for START and END +/- 3 ranges and input these into a new column.

The attached text file should make clearer sense.

Thanks,

Stephen.
cropped-MB4-Elephant.xls
cropped-MB10-Dolphin.xls
Label-List.xls
script-procedure.txt
PerlProgramming Languages-Other

Avatar of undefined
Last Comment
jb1dev

8/22/2022 - Mon
ASKER CERTIFIED SOLUTION
jb1dev

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
GET A PERSONALIZED SOLUTION
Ask your own question & get feedback from real experts
Find out why thousands trust the EE community with their toughest problems.
jb1dev

Sorry about that. This fixes the issue I mentioned above.

NOTE
Be sure to change the SRCDIR and DSTDIR appropriately. They are located at the top of the script.
Be sure to have the perl modules Spreadsheet::ParseExcel and Spreadsheet::WriteExcel installed.

#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir here
#
my $VOLUME = 'C';   # Ignored on Unix. 
                    # Probably needs to be "C" or "C:" on windows.
my $SRCDIR = '/home/exch/20131114/scale';
my $DSTDIR = '/home/exch/20131114/output'; 

#
# Read files matching "cropped-MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped-MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls 
# but file is actually called Label-List.xls 
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label-List.xls') );

#
# Populate label maps for which we will check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();
       
        # print "DEBUG Adding label $label $start $end\n";

        if($start ne "") {
            $startLabelMap{$start}  = $label;
        }

        if($end ne "") {
            $endLabelMap{$end}      = $label;
        }

    }
}

#
# Function for checking ranges.
# Find number key within range in hash.
#
# @param num        The number to lookup in the hash. (Find within +/- 3)
# @param hashref    A reference to the hash
#
sub rangeLookup {
    my $num = shift;
    my $hashRef = shift;
     
    # print "DEBUG Checking map...\n";

    foreach my $key (sort (keys %{$hashRef})) {
        my $label = $hashRef->{$key};

        # print "DEBUG Checking $num against $key: $label\n";

        if( $key > $num-3 && $key < $num+3) {
            # print "DEBUG Found match $label\n";
            return $label;
        }
    }

    return "No label for $num";
}


#
# Process each file
#
foreach my $file (@filelist) {
    print "INFO Processing file $file\n";
    
    # print "INFO Creating parser\n";
    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse(
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                $file ) );
    # print "INFO Creating writer\n";
    my $newWorkbook  = Spreadsheet::WriteExcel->new(
                                    File::Spec->catpath(    $VOLUME, 
                                                            $DSTDIR, 
                                                            "$file".".new") );
    my $newWorksheet = $newWorkbook->add_worksheet();

    # print "INFO Iterating over worksheets...\n";

    #
    # Spec doesn't mention if these columns are always at the same index.
    # If yes, then we can just hard code these rather than inspecting
    # the header names.
    #
    my $pStartCol;
    my $pEndCol;

    #
    # Spec doesn't mention if these new columns will always be at the same 
    # index, so keep track of their index when we create them.
    #
    my $lStartCol;
    my $lEndCol; 
    my $labelCol;

    for my $worksheet ( $workbook->worksheets() ) {
        my ( $row_min, $row_max ) = $worksheet->row_range();
        my ( $col_min, $col_max ) = $worksheet->col_range();

        for my $row ( $row_min .. $row_max ) {

            #
            # Derived start and end labels for this row.
            #
            my $sLabel;
            my $eLabel;

            for my $col ( $col_min .. $col_max ) {
                my $cell = $worksheet->get_cell( $row, $col );
                
                #
                # Copy the existing cell to the new doc
                #
                if(defined($cell)) {
                    $newWorksheet->write($row, $col, $cell->unformatted());
                }

                if($row == 0) {
                    
                    #
                    # We are reading the headers. This will tell us
                    # column indexes of pep_start/pep_end
                    #

                    if(defined($cell)) {
                        if($cell->unformatted() eq "pep_start") {
                            $pStartCol = $col;
                        }
                        if($cell->unformatted() eq "pep_end") {
                            $pEndCol = $col;
                        }
                    }

                    #
                    # if we are at the end of the first row of columns, 
                    # add the new columns for L_Start L_End and LabeList
                    #
                    if($col == $col_max) {
                        $lStartCol  = $col + 1;
                        $lEndCol    = $col + 2;
                        $labelCol   = $col + 3;
                        $newWorksheet->write($row, $lStartCol,  "L_Start");
                        $newWorksheet->write($row, $lEndCol,    "L_End");
                        $newWorksheet->write($row, $labelCol,   "LabelList");
                    }

                } else {
                    
                    #
                    # We are reading past the header row. 
                    # Check if we have a column which needs analysis 
                    #

                    if(defined($cell)) {
                        if($col == $pStartCol) {
                            my $start = $cell->unformatted();
                            $sLabel = rangeLookup($start, \%startLabelMap);
                        }
                    
                        if($col == $pEndCol) {
                            my $end = $cell->unformatted();
                            $eLabel = rangeLookup($end, \%endLabelMap);
                        }
                    }

                    #
                    # We have read all of the existing columns.
                    # Concat the start and end labels as necessary and
                    # add them
                    #
                    if($col == $col_max) {
                        if(!defined($sLabel) && !defined($eLabel)) {
                            print "WARNING No label for row ".$row."\n";    
                            $newWorksheet->write($row, $labelCol, "No Label");
                        } else {
                            $newWorksheet->write($row, $lStartCol, $sLabel);
                            $newWorksheet->write($row, $lEndCol, $eLabel);
                            if($sLabel eq $eLabel) {
                                $newWorksheet->write($row, $labelCol, $sLabel);
                            } else {
                                $newWorksheet->write($row, $labelCol, 
                                                    "$sLabel"."_"."$eLabel");
                            }
                        }
                    }
                }

                # DEBUG
                # print "Row, Col    = ($row, $col)\n";
                # print "Value       = ", $cell->value(),       "\n";
                # print "Unformatted = ", $cell->unformatted(), "\n";
                # print "\n";
            }
        } 

    }
}

Open in new window

StephenMcGowan

ASKER
Hi jb1dev,

Thanks a lot for getting back to me with this! very much appreciated!

I've tried running the script you've written but I'm currently coming across this error message:

Can't call method "worksheets" on an undefined value @ line 45

which is here...

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

Open in new window


Thanks again,

Stephen.
jb1dev

So line 45 is where it is trying to open the worksheets in the "label" file:

for my $worksheet ( $labelWorkbook->worksheets() ) {

Open in new window


Is the label file located in the SRCDIR?

Is the label filename correct? Check my comments in the code here:

#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls
# but file is actually called Label-List.xls
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse(
                        File::Spec->catpath(    $VOLUME,
                                                $SRCDIR,
                                                'Label-List.xls') );

Open in new window

All of life is about relationships, and EE has made a viirtual community a real community. It lifts everyone's boat
William Peck
jb1dev

Can you post what you have changed the SRCDIR value to?

If you are on windows, are you using \\ for directory separator?

Sorry I do not have a Windows platform handy on which to try this.

I can add additional logging and checking for presence of the specified directories and files, if that would help.

Let me know.
StephenMcGowan

ASKER
Hi jb1,

This is how I modified the beginning section of code:


#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir setting
#
my $VOLUME = 'C';
my $SRCDIR = '/Users/Stephen/Desktop/LCMS';
my $DSTDIR = '/Users/Stephen/Desktop/output'; 

#
# Read files matching "cropped_MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped_MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label_List.xls') );

#
# Populate label maps to check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();

Open in new window


I'm guessing that the SRCDIR and DSTDIR are set wrong?

Thanks again,

Stephen.
jb1dev

Can you try setting:

my $VOLUME = 'C:\\';
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
jb1dev

Or try with the following:

my $VOLUME = '';
my $SRCDIR = 'C:/Users/Stephen/Desktop/LCMS';
my $DSTDIR = 'C:/Users/Stephen/Desktop/output';

I realize, my dir listing is not using the File::Spec so the above should hopefully work for finding your "LCMS" dir.
jb1dev

Check if the below filetest.pl works.
If it does, I will adjust the code to use this path instead.

#!/usr/bin/perl

my $SRCDIR = 'C:/Users/Stephen/Desktop/LCMS';
my $file = $SRCDIR.'/'.'Label-List.xls';

if(-e $file) {
    print "File exists\n";
}

Open in new window