Solved

Perl Script to input data from an XLS file

Posted on 2013-11-14
9
198 Views
Last Modified: 2014-11-04
Hi,

I'm trying to write a script which takes .xls column entries (from a table of peptides), creates ranges of +/- 3 for each value, and then matches this range against another .xls file to return label information.

I've put together a step-by-step procedure for what I'm trying to achieve in "script_procedure.txt".
I've also attached sample files: "cropped_MB4_Elephant.xls", "cropped_MB10_Dolphin.xls"

Finally, Label_List is used for the script to extract the label information for START and END +/- 3 ranges and input these into a new column.

The attached text file should make clearer sense.

Thanks,

Stephen.
cropped-MB4-Elephant.xls
cropped-MB10-Dolphin.xls
Label-List.xls
script-procedure.txt
0
Comment
Question by:StephenMcGowan
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 7
  • 2
9 Comments
 
LVL 14

Accepted Solution

by:
jb1dev earned 500 total points
ID: 39649728
I have written the following which *should* work. It works on smaller files than those you have provided. However when run against the full-sized files (~28M) I start seeing strange errors from the underlying Spreadsheet::WriteExcel library, and the new columns are not added to the output. I am still investigating that.

You can see a working example in the attached files.

NOTE set SRCDIR and DSTDIR accordingly.

#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir here
#
my $VOLUME = 'C';   # Ignored on Unix. 
                    # Probably needs to be "C" or "C:" on windows.
my $SRCDIR = '/home/exch/20131114/small';
my $DSTDIR = '/home/exch/20131114/output'; 

#
# Read files matching "cropped-MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped-MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls 
# but file is actually called Label-List.xls 
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label-List.xls') );

#
# Populate label maps for which we will check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();
       
        # print "DEBUG Adding label $label $start $end\n";

        if($start ne "") {
            $startLabelMap{$start}  = $label;
        }

        if($end ne "") {
            $endLabelMap{$end}      = $label;
        }

    }
}

#
# Function for checking ranges.
# Find number key within range in hash.
#
# @param num        The number to lookup in the hash. (Find within +/- 3)
# @param hashref    A reference to the hash
#
sub rangeLookup {
    my $num = shift;
    my $hashRef = shift;
     
    # print "DEBUG Checking map...\n";

    foreach my $key (sort (keys %{$hashRef})) {
        my $label = $hashRef->{$key};

        # print "DEBUG Checking $num against $key: $label\n";

        if( $key > $num-3 && $key < $num+3) {
            # print "DEBUG Found match $label\n";
            return $label;
        }
    }

    return "No label for $num";
}


#
# Process each file
#
foreach my $file (@filelist) {
    print "INFO Processing file $file\n";
    
    # print "INFO Creating parser\n";
    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse(
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                $file ) );
    # print "INFO Creating writer\n";
    my $newWorkbook  = Spreadsheet::WriteExcel->new(
                                    File::Spec->catpath(    $VOLUME, 
                                                            $DSTDIR, 
                                                            "$file".".new") );
    my $newWorksheet = $newWorkbook->add_worksheet();

    # print "INFO Iterating over worksheets...\n";

    #
    # Spec doesn't mention if these columns are always at the same index.
    # If yes, then we can just hard code these rather than inspecting
    # the header names.
    #
    my $pStartCol;
    my $pEndCol;

    #
    # Spec doesn't mention if these new columns will always be at the same 
    # index, so keep track of their index when we create them.
    #
    my $lStartCol;
    my $lEndCol; 
    my $labelCol;

    for my $worksheet ( $workbook->worksheets() ) {
        my ( $row_min, $row_max ) = $worksheet->row_range();
        my ( $col_min, $col_max ) = $worksheet->col_range();

        for my $row ( $row_min .. $row_max ) {

            #
            # Derived start and end labels for this row.
            #
            my $sLabel;
            my $eLabel;

            for my $col ( $col_min .. $col_max ) {
                my $cell = $worksheet->get_cell( $row, $col );
                next unless $cell;
            
                #
                # Copy the existing cell to the new doc
                #
                $newWorksheet->write($row, $col, $cell->unformatted());

                if($row == 0) {
                    
                    #
                    # We are reading the headers. This will tell us
                    # column indexes of pep_start/pep_end
                    #

                    if($cell->unformatted() eq "pep_start") {
                        $pStartCol = $col;
                    }
                    if($cell->unformatted() eq "pep_end") {
                        $pEndCol = $col;
                    }

                    #
                    # if we are at the end of the first row of columns, 
                    # add the new columns for L_Start L_End and LabeList
                    #
                    if($col == $col_max) {
                        $lStartCol  = $col + 1;
                        $lEndCol    = $col + 2;
                        $labelCol   = $col + 3;
                        $newWorksheet->write($row, $lStartCol,  "L_Start");
                        $newWorksheet->write($row, $lEndCol,    "L_End");
                        $newWorksheet->write($row, $labelCol,   "LabelList");
                    }

                } else {
                    
                    #
                    # We are reading past the header tow. 
                    # Check if we have a column which needs analysis 
                    #

                    if($col == $pStartCol) {
                        my $start = $cell->unformatted();
                        $sLabel = rangeLookup($start, \%startLabelMap);
                    }
                    
                    if($col == $pEndCol) {
                        my $end = $cell->unformatted();
                        $eLabel = rangeLookup($end, \%endLabelMap);
                    }

                    #
                    # We have read all of the existing columns.
                    # Concat the start and end labels as necessary and
                    # add them
                    #
                    if($col == $col_max) {
                        if(!defined($sLabel) && !defined($eLabel)) {
                            print "WARNING No label for row ".$row."\n";    
                            $newWorksheet->write($row, $labelCol, "No Label");
                        } else {
                            $newWorksheet->write($row, $lStartCol, $sLabel);
                            $newWorksheet->write($row, $lEndCol, $eLabel);
                            if($sLabel eq $eLabel) {
                                $newWorksheet->write($row, $labelCol, $sLabel);
                            } else {
                                $newWorksheet->write($row, $labelCol, 
                                                    "$sLabel"."_"."$eLabel");
                            }
                        }
                    }
                }

                # DEBUG
                # print "Row, Col    = ($row, $col)\n";
                # print "Value       = ", $cell->value(),       "\n";
                # print "Unformatted = ", $cell->unformatted(), "\n";
                # print "\n";
            }
        } 

    }
}

Open in new window

cropped-MB4-Elephant2.xls
cropped-MB4-Elephant2-new.xls
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39649911
Sorry about that. This fixes the issue I mentioned above.

NOTE
Be sure to change the SRCDIR and DSTDIR appropriately. They are located at the top of the script.
Be sure to have the perl modules Spreadsheet::ParseExcel and Spreadsheet::WriteExcel installed.

#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir here
#
my $VOLUME = 'C';   # Ignored on Unix. 
                    # Probably needs to be "C" or "C:" on windows.
my $SRCDIR = '/home/exch/20131114/scale';
my $DSTDIR = '/home/exch/20131114/output'; 

#
# Read files matching "cropped-MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped-MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls 
# but file is actually called Label-List.xls 
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label-List.xls') );

#
# Populate label maps for which we will check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();
       
        # print "DEBUG Adding label $label $start $end\n";

        if($start ne "") {
            $startLabelMap{$start}  = $label;
        }

        if($end ne "") {
            $endLabelMap{$end}      = $label;
        }

    }
}

#
# Function for checking ranges.
# Find number key within range in hash.
#
# @param num        The number to lookup in the hash. (Find within +/- 3)
# @param hashref    A reference to the hash
#
sub rangeLookup {
    my $num = shift;
    my $hashRef = shift;
     
    # print "DEBUG Checking map...\n";

    foreach my $key (sort (keys %{$hashRef})) {
        my $label = $hashRef->{$key};

        # print "DEBUG Checking $num against $key: $label\n";

        if( $key > $num-3 && $key < $num+3) {
            # print "DEBUG Found match $label\n";
            return $label;
        }
    }

    return "No label for $num";
}


#
# Process each file
#
foreach my $file (@filelist) {
    print "INFO Processing file $file\n";
    
    # print "INFO Creating parser\n";
    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse(
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                $file ) );
    # print "INFO Creating writer\n";
    my $newWorkbook  = Spreadsheet::WriteExcel->new(
                                    File::Spec->catpath(    $VOLUME, 
                                                            $DSTDIR, 
                                                            "$file".".new") );
    my $newWorksheet = $newWorkbook->add_worksheet();

    # print "INFO Iterating over worksheets...\n";

    #
    # Spec doesn't mention if these columns are always at the same index.
    # If yes, then we can just hard code these rather than inspecting
    # the header names.
    #
    my $pStartCol;
    my $pEndCol;

    #
    # Spec doesn't mention if these new columns will always be at the same 
    # index, so keep track of their index when we create them.
    #
    my $lStartCol;
    my $lEndCol; 
    my $labelCol;

    for my $worksheet ( $workbook->worksheets() ) {
        my ( $row_min, $row_max ) = $worksheet->row_range();
        my ( $col_min, $col_max ) = $worksheet->col_range();

        for my $row ( $row_min .. $row_max ) {

            #
            # Derived start and end labels for this row.
            #
            my $sLabel;
            my $eLabel;

            for my $col ( $col_min .. $col_max ) {
                my $cell = $worksheet->get_cell( $row, $col );
                
                #
                # Copy the existing cell to the new doc
                #
                if(defined($cell)) {
                    $newWorksheet->write($row, $col, $cell->unformatted());
                }

                if($row == 0) {
                    
                    #
                    # We are reading the headers. This will tell us
                    # column indexes of pep_start/pep_end
                    #

                    if(defined($cell)) {
                        if($cell->unformatted() eq "pep_start") {
                            $pStartCol = $col;
                        }
                        if($cell->unformatted() eq "pep_end") {
                            $pEndCol = $col;
                        }
                    }

                    #
                    # if we are at the end of the first row of columns, 
                    # add the new columns for L_Start L_End and LabeList
                    #
                    if($col == $col_max) {
                        $lStartCol  = $col + 1;
                        $lEndCol    = $col + 2;
                        $labelCol   = $col + 3;
                        $newWorksheet->write($row, $lStartCol,  "L_Start");
                        $newWorksheet->write($row, $lEndCol,    "L_End");
                        $newWorksheet->write($row, $labelCol,   "LabelList");
                    }

                } else {
                    
                    #
                    # We are reading past the header row. 
                    # Check if we have a column which needs analysis 
                    #

                    if(defined($cell)) {
                        if($col == $pStartCol) {
                            my $start = $cell->unformatted();
                            $sLabel = rangeLookup($start, \%startLabelMap);
                        }
                    
                        if($col == $pEndCol) {
                            my $end = $cell->unformatted();
                            $eLabel = rangeLookup($end, \%endLabelMap);
                        }
                    }

                    #
                    # We have read all of the existing columns.
                    # Concat the start and end labels as necessary and
                    # add them
                    #
                    if($col == $col_max) {
                        if(!defined($sLabel) && !defined($eLabel)) {
                            print "WARNING No label for row ".$row."\n";    
                            $newWorksheet->write($row, $labelCol, "No Label");
                        } else {
                            $newWorksheet->write($row, $lStartCol, $sLabel);
                            $newWorksheet->write($row, $lEndCol, $eLabel);
                            if($sLabel eq $eLabel) {
                                $newWorksheet->write($row, $labelCol, $sLabel);
                            } else {
                                $newWorksheet->write($row, $labelCol, 
                                                    "$sLabel"."_"."$eLabel");
                            }
                        }
                    }
                }

                # DEBUG
                # print "Row, Col    = ($row, $col)\n";
                # print "Value       = ", $cell->value(),       "\n";
                # print "Unformatted = ", $cell->unformatted(), "\n";
                # print "\n";
            }
        } 

    }
}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39650814
Hi jb1dev,

Thanks a lot for getting back to me with this! very much appreciated!

I've tried running the script you've written but I'm currently coming across this error message:

Can't call method "worksheets" on an undefined value @ line 45

which is here...

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

Open in new window


Thanks again,

Stephen.
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 14

Expert Comment

by:jb1dev
ID: 39651929
So line 45 is where it is trying to open the worksheets in the "label" file:

for my $worksheet ( $labelWorkbook->worksheets() ) {

Open in new window


Is the label file located in the SRCDIR?

Is the label filename correct? Check my comments in the code here:

#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls
# but file is actually called Label-List.xls
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse(
                        File::Spec->catpath(    $VOLUME,
                                                $SRCDIR,
                                                'Label-List.xls') );

Open in new window

0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39654788
Can you post what you have changed the SRCDIR value to?

If you are on windows, are you using \\ for directory separator?

Sorry I do not have a Windows platform handy on which to try this.

I can add additional logging and checking for presence of the specified directories and files, if that would help.

Let me know.
0
 

Author Comment

by:StephenMcGowan
ID: 39656157
Hi jb1,

This is how I modified the beginning section of code:


#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir setting
#
my $VOLUME = 'C';
my $SRCDIR = '/Users/Stephen/Desktop/LCMS';
my $DSTDIR = '/Users/Stephen/Desktop/output'; 

#
# Read files matching "cropped_MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped_MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label_List.xls') );

#
# Populate label maps to check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();

Open in new window


I'm guessing that the SRCDIR and DSTDIR are set wrong?

Thanks again,

Stephen.
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39657128
Can you try setting:

my $VOLUME = 'C:\\';
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39657199
Or try with the following:

my $VOLUME = '';
my $SRCDIR = 'C:/Users/Stephen/Desktop/LCMS';
my $DSTDIR = 'C:/Users/Stephen/Desktop/output';

I realize, my dir listing is not using the File::Spec so the above should hopefully work for finding your "LCMS" dir.
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39657275
Check if the below filetest.pl works.
If it does, I will adjust the code to use this path instead.

#!/usr/bin/perl

my $SRCDIR = 'C:/Users/Stephen/Desktop/LCMS';
my $file = $SRCDIR.'/'.'Label-List.xls';

if(-e $file) {
    print "File exists\n";
}

Open in new window

0

Featured Post

[Webinar] Code, Load, and Grow

Managing multiple websites, servers, applications, and security on a daily basis? Join us for a webinar on May 25th to learn how to simplify administration and management of virtual hosts for IT admins, create a secure environment, and deploy code more effectively and frequently.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Adobe Customization Wizard XI issues 26 279
Choosing the right language for new project 8 68
Perl Frameworks 1 119
Problem to error 4 81
Navigation is an important part of web design from a usability perspective. But it is often a pain when it comes to a developer’s perspective. By navigation, it often means menuing. This is less theory and more practical of how to get a specific gro…
How to remove superseded packages in windows w60 or w61 installation media (.wim) or online system to prevent unnecessary space. w60 means Windows Vista or Windows Server 2008. w61 means Windows 7 or Windows Server 2008 R2. There are various …
The viewer will learn additional member functions of the vector class. Specifically, the capacity and swap member functions will be introduced.

739 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question