?
Solved

Perl Script to input data from an XLS file

Posted on 2013-11-14
9
Medium Priority
?
206 Views
Last Modified: 2014-11-04
Hi,

I'm trying to write a script which takes .xls column entries (from a table of peptides), creates ranges of +/- 3 for each value, and then matches this range against another .xls file to return label information.

I've put together a step-by-step procedure for what I'm trying to achieve in "script_procedure.txt".
I've also attached sample files: "cropped_MB4_Elephant.xls", "cropped_MB10_Dolphin.xls"

Finally, Label_List is used for the script to extract the label information for START and END +/- 3 ranges and input these into a new column.

The attached text file should make clearer sense.

Thanks,

Stephen.
cropped-MB4-Elephant.xls
cropped-MB10-Dolphin.xls
Label-List.xls
script-procedure.txt
0
Comment
Question by:StephenMcGowan
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 7
  • 2
9 Comments
 
LVL 14

Accepted Solution

by:
jb1dev earned 2000 total points
ID: 39649728
I have written the following which *should* work. It works on smaller files than those you have provided. However when run against the full-sized files (~28M) I start seeing strange errors from the underlying Spreadsheet::WriteExcel library, and the new columns are not added to the output. I am still investigating that.

You can see a working example in the attached files.

NOTE set SRCDIR and DSTDIR accordingly.

#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir here
#
my $VOLUME = 'C';   # Ignored on Unix. 
                    # Probably needs to be "C" or "C:" on windows.
my $SRCDIR = '/home/exch/20131114/small';
my $DSTDIR = '/home/exch/20131114/output'; 

#
# Read files matching "cropped-MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped-MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls 
# but file is actually called Label-List.xls 
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label-List.xls') );

#
# Populate label maps for which we will check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();
       
        # print "DEBUG Adding label $label $start $end\n";

        if($start ne "") {
            $startLabelMap{$start}  = $label;
        }

        if($end ne "") {
            $endLabelMap{$end}      = $label;
        }

    }
}

#
# Function for checking ranges.
# Find number key within range in hash.
#
# @param num        The number to lookup in the hash. (Find within +/- 3)
# @param hashref    A reference to the hash
#
sub rangeLookup {
    my $num = shift;
    my $hashRef = shift;
     
    # print "DEBUG Checking map...\n";

    foreach my $key (sort (keys %{$hashRef})) {
        my $label = $hashRef->{$key};

        # print "DEBUG Checking $num against $key: $label\n";

        if( $key > $num-3 && $key < $num+3) {
            # print "DEBUG Found match $label\n";
            return $label;
        }
    }

    return "No label for $num";
}


#
# Process each file
#
foreach my $file (@filelist) {
    print "INFO Processing file $file\n";
    
    # print "INFO Creating parser\n";
    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse(
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                $file ) );
    # print "INFO Creating writer\n";
    my $newWorkbook  = Spreadsheet::WriteExcel->new(
                                    File::Spec->catpath(    $VOLUME, 
                                                            $DSTDIR, 
                                                            "$file".".new") );
    my $newWorksheet = $newWorkbook->add_worksheet();

    # print "INFO Iterating over worksheets...\n";

    #
    # Spec doesn't mention if these columns are always at the same index.
    # If yes, then we can just hard code these rather than inspecting
    # the header names.
    #
    my $pStartCol;
    my $pEndCol;

    #
    # Spec doesn't mention if these new columns will always be at the same 
    # index, so keep track of their index when we create them.
    #
    my $lStartCol;
    my $lEndCol; 
    my $labelCol;

    for my $worksheet ( $workbook->worksheets() ) {
        my ( $row_min, $row_max ) = $worksheet->row_range();
        my ( $col_min, $col_max ) = $worksheet->col_range();

        for my $row ( $row_min .. $row_max ) {

            #
            # Derived start and end labels for this row.
            #
            my $sLabel;
            my $eLabel;

            for my $col ( $col_min .. $col_max ) {
                my $cell = $worksheet->get_cell( $row, $col );
                next unless $cell;
            
                #
                # Copy the existing cell to the new doc
                #
                $newWorksheet->write($row, $col, $cell->unformatted());

                if($row == 0) {
                    
                    #
                    # We are reading the headers. This will tell us
                    # column indexes of pep_start/pep_end
                    #

                    if($cell->unformatted() eq "pep_start") {
                        $pStartCol = $col;
                    }
                    if($cell->unformatted() eq "pep_end") {
                        $pEndCol = $col;
                    }

                    #
                    # if we are at the end of the first row of columns, 
                    # add the new columns for L_Start L_End and LabeList
                    #
                    if($col == $col_max) {
                        $lStartCol  = $col + 1;
                        $lEndCol    = $col + 2;
                        $labelCol   = $col + 3;
                        $newWorksheet->write($row, $lStartCol,  "L_Start");
                        $newWorksheet->write($row, $lEndCol,    "L_End");
                        $newWorksheet->write($row, $labelCol,   "LabelList");
                    }

                } else {
                    
                    #
                    # We are reading past the header tow. 
                    # Check if we have a column which needs analysis 
                    #

                    if($col == $pStartCol) {
                        my $start = $cell->unformatted();
                        $sLabel = rangeLookup($start, \%startLabelMap);
                    }
                    
                    if($col == $pEndCol) {
                        my $end = $cell->unformatted();
                        $eLabel = rangeLookup($end, \%endLabelMap);
                    }

                    #
                    # We have read all of the existing columns.
                    # Concat the start and end labels as necessary and
                    # add them
                    #
                    if($col == $col_max) {
                        if(!defined($sLabel) && !defined($eLabel)) {
                            print "WARNING No label for row ".$row."\n";    
                            $newWorksheet->write($row, $labelCol, "No Label");
                        } else {
                            $newWorksheet->write($row, $lStartCol, $sLabel);
                            $newWorksheet->write($row, $lEndCol, $eLabel);
                            if($sLabel eq $eLabel) {
                                $newWorksheet->write($row, $labelCol, $sLabel);
                            } else {
                                $newWorksheet->write($row, $labelCol, 
                                                    "$sLabel"."_"."$eLabel");
                            }
                        }
                    }
                }

                # DEBUG
                # print "Row, Col    = ($row, $col)\n";
                # print "Value       = ", $cell->value(),       "\n";
                # print "Unformatted = ", $cell->unformatted(), "\n";
                # print "\n";
            }
        } 

    }
}

Open in new window

cropped-MB4-Elephant2.xls
cropped-MB4-Elephant2-new.xls
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39649911
Sorry about that. This fixes the issue I mentioned above.

NOTE
Be sure to change the SRCDIR and DSTDIR appropriately. They are located at the top of the script.
Be sure to have the perl modules Spreadsheet::ParseExcel and Spreadsheet::WriteExcel installed.

#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir here
#
my $VOLUME = 'C';   # Ignored on Unix. 
                    # Probably needs to be "C" or "C:" on windows.
my $SRCDIR = '/home/exch/20131114/scale';
my $DSTDIR = '/home/exch/20131114/output'; 

#
# Read files matching "cropped-MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped-MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls 
# but file is actually called Label-List.xls 
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label-List.xls') );

#
# Populate label maps for which we will check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();
       
        # print "DEBUG Adding label $label $start $end\n";

        if($start ne "") {
            $startLabelMap{$start}  = $label;
        }

        if($end ne "") {
            $endLabelMap{$end}      = $label;
        }

    }
}

#
# Function for checking ranges.
# Find number key within range in hash.
#
# @param num        The number to lookup in the hash. (Find within +/- 3)
# @param hashref    A reference to the hash
#
sub rangeLookup {
    my $num = shift;
    my $hashRef = shift;
     
    # print "DEBUG Checking map...\n";

    foreach my $key (sort (keys %{$hashRef})) {
        my $label = $hashRef->{$key};

        # print "DEBUG Checking $num against $key: $label\n";

        if( $key > $num-3 && $key < $num+3) {
            # print "DEBUG Found match $label\n";
            return $label;
        }
    }

    return "No label for $num";
}


#
# Process each file
#
foreach my $file (@filelist) {
    print "INFO Processing file $file\n";
    
    # print "INFO Creating parser\n";
    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse(
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                $file ) );
    # print "INFO Creating writer\n";
    my $newWorkbook  = Spreadsheet::WriteExcel->new(
                                    File::Spec->catpath(    $VOLUME, 
                                                            $DSTDIR, 
                                                            "$file".".new") );
    my $newWorksheet = $newWorkbook->add_worksheet();

    # print "INFO Iterating over worksheets...\n";

    #
    # Spec doesn't mention if these columns are always at the same index.
    # If yes, then we can just hard code these rather than inspecting
    # the header names.
    #
    my $pStartCol;
    my $pEndCol;

    #
    # Spec doesn't mention if these new columns will always be at the same 
    # index, so keep track of their index when we create them.
    #
    my $lStartCol;
    my $lEndCol; 
    my $labelCol;

    for my $worksheet ( $workbook->worksheets() ) {
        my ( $row_min, $row_max ) = $worksheet->row_range();
        my ( $col_min, $col_max ) = $worksheet->col_range();

        for my $row ( $row_min .. $row_max ) {

            #
            # Derived start and end labels for this row.
            #
            my $sLabel;
            my $eLabel;

            for my $col ( $col_min .. $col_max ) {
                my $cell = $worksheet->get_cell( $row, $col );
                
                #
                # Copy the existing cell to the new doc
                #
                if(defined($cell)) {
                    $newWorksheet->write($row, $col, $cell->unformatted());
                }

                if($row == 0) {
                    
                    #
                    # We are reading the headers. This will tell us
                    # column indexes of pep_start/pep_end
                    #

                    if(defined($cell)) {
                        if($cell->unformatted() eq "pep_start") {
                            $pStartCol = $col;
                        }
                        if($cell->unformatted() eq "pep_end") {
                            $pEndCol = $col;
                        }
                    }

                    #
                    # if we are at the end of the first row of columns, 
                    # add the new columns for L_Start L_End and LabeList
                    #
                    if($col == $col_max) {
                        $lStartCol  = $col + 1;
                        $lEndCol    = $col + 2;
                        $labelCol   = $col + 3;
                        $newWorksheet->write($row, $lStartCol,  "L_Start");
                        $newWorksheet->write($row, $lEndCol,    "L_End");
                        $newWorksheet->write($row, $labelCol,   "LabelList");
                    }

                } else {
                    
                    #
                    # We are reading past the header row. 
                    # Check if we have a column which needs analysis 
                    #

                    if(defined($cell)) {
                        if($col == $pStartCol) {
                            my $start = $cell->unformatted();
                            $sLabel = rangeLookup($start, \%startLabelMap);
                        }
                    
                        if($col == $pEndCol) {
                            my $end = $cell->unformatted();
                            $eLabel = rangeLookup($end, \%endLabelMap);
                        }
                    }

                    #
                    # We have read all of the existing columns.
                    # Concat the start and end labels as necessary and
                    # add them
                    #
                    if($col == $col_max) {
                        if(!defined($sLabel) && !defined($eLabel)) {
                            print "WARNING No label for row ".$row."\n";    
                            $newWorksheet->write($row, $labelCol, "No Label");
                        } else {
                            $newWorksheet->write($row, $lStartCol, $sLabel);
                            $newWorksheet->write($row, $lEndCol, $eLabel);
                            if($sLabel eq $eLabel) {
                                $newWorksheet->write($row, $labelCol, $sLabel);
                            } else {
                                $newWorksheet->write($row, $labelCol, 
                                                    "$sLabel"."_"."$eLabel");
                            }
                        }
                    }
                }

                # DEBUG
                # print "Row, Col    = ($row, $col)\n";
                # print "Value       = ", $cell->value(),       "\n";
                # print "Unformatted = ", $cell->unformatted(), "\n";
                # print "\n";
            }
        } 

    }
}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39650814
Hi jb1dev,

Thanks a lot for getting back to me with this! very much appreciated!

I've tried running the script you've written but I'm currently coming across this error message:

Can't call method "worksheets" on an undefined value @ line 45

which is here...

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

Open in new window


Thanks again,

Stephen.
0
Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

 
LVL 14

Expert Comment

by:jb1dev
ID: 39651929
So line 45 is where it is trying to open the worksheets in the "label" file:

for my $worksheet ( $labelWorkbook->worksheets() ) {

Open in new window


Is the label file located in the SRCDIR?

Is the label filename correct? Check my comments in the code here:

#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls
# but file is actually called Label-List.xls
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse(
                        File::Spec->catpath(    $VOLUME,
                                                $SRCDIR,
                                                'Label-List.xls') );

Open in new window

0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39654788
Can you post what you have changed the SRCDIR value to?

If you are on windows, are you using \\ for directory separator?

Sorry I do not have a Windows platform handy on which to try this.

I can add additional logging and checking for presence of the specified directories and files, if that would help.

Let me know.
0
 

Author Comment

by:StephenMcGowan
ID: 39656157
Hi jb1,

This is how I modified the beginning section of code:


#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir setting
#
my $VOLUME = 'C';
my $SRCDIR = '/Users/Stephen/Desktop/LCMS';
my $DSTDIR = '/Users/Stephen/Desktop/output'; 

#
# Read files matching "cropped_MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped_MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label_List.xls') );

#
# Populate label maps to check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();

Open in new window


I'm guessing that the SRCDIR and DSTDIR are set wrong?

Thanks again,

Stephen.
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39657128
Can you try setting:

my $VOLUME = 'C:\\';
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39657199
Or try with the following:

my $VOLUME = '';
my $SRCDIR = 'C:/Users/Stephen/Desktop/LCMS';
my $DSTDIR = 'C:/Users/Stephen/Desktop/output';

I realize, my dir listing is not using the File::Spec so the above should hopefully work for finding your "LCMS" dir.
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39657275
Check if the below filetest.pl works.
If it does, I will adjust the code to use this path instead.

#!/usr/bin/perl

my $SRCDIR = 'C:/Users/Stephen/Desktop/LCMS';
my $file = $SRCDIR.'/'.'Label-List.xls';

if(-e $file) {
    print "File exists\n";
}

Open in new window

0

Featured Post

Free Tool: SSL Checker

Scans your site and returns information about your SSL implementation and certificate. Helpful for debugging and validating your SSL configuration.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Checking the Alert Log in AWS RDS Oracle can be a pain through their user interface.  I made a script to download the Alert Log, look for errors, and email me the trace files.  In this article I'll describe what I did and share my script.
Whether you’re a college noob or a soon-to-be pro, these tips are sure to help you in your journey to becoming a programming ninja and stand out from the crowd.
This tutorial explains how to use the VisualVM tool for the Java platform application. This video goes into detail on the Threads, Sampler, and Profiler tabs.
The viewer will learn additional member functions of the vector class. Specifically, the capacity and swap member functions will be introduced.

718 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question