Solved

Perl Script to input data from an XLS file

Posted on 2013-11-14
9
193 Views
Last Modified: 2014-11-04
Hi,

I'm trying to write a script which takes .xls column entries (from a table of peptides), creates ranges of +/- 3 for each value, and then matches this range against another .xls file to return label information.

I've put together a step-by-step procedure for what I'm trying to achieve in "script_procedure.txt".
I've also attached sample files: "cropped_MB4_Elephant.xls", "cropped_MB10_Dolphin.xls"

Finally, Label_List is used for the script to extract the label information for START and END +/- 3 ranges and input these into a new column.

The attached text file should make clearer sense.

Thanks,

Stephen.
cropped-MB4-Elephant.xls
cropped-MB10-Dolphin.xls
Label-List.xls
script-procedure.txt
0
Comment
Question by:StephenMcGowan
  • 7
  • 2
9 Comments
 
LVL 14

Accepted Solution

by:
jb1dev earned 500 total points
ID: 39649728
I have written the following which *should* work. It works on smaller files than those you have provided. However when run against the full-sized files (~28M) I start seeing strange errors from the underlying Spreadsheet::WriteExcel library, and the new columns are not added to the output. I am still investigating that.

You can see a working example in the attached files.

NOTE set SRCDIR and DSTDIR accordingly.

#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir here
#
my $VOLUME = 'C';   # Ignored on Unix. 
                    # Probably needs to be "C" or "C:" on windows.
my $SRCDIR = '/home/exch/20131114/small';
my $DSTDIR = '/home/exch/20131114/output'; 

#
# Read files matching "cropped-MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped-MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls 
# but file is actually called Label-List.xls 
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label-List.xls') );

#
# Populate label maps for which we will check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();
       
        # print "DEBUG Adding label $label $start $end\n";

        if($start ne "") {
            $startLabelMap{$start}  = $label;
        }

        if($end ne "") {
            $endLabelMap{$end}      = $label;
        }

    }
}

#
# Function for checking ranges.
# Find number key within range in hash.
#
# @param num        The number to lookup in the hash. (Find within +/- 3)
# @param hashref    A reference to the hash
#
sub rangeLookup {
    my $num = shift;
    my $hashRef = shift;
     
    # print "DEBUG Checking map...\n";

    foreach my $key (sort (keys %{$hashRef})) {
        my $label = $hashRef->{$key};

        # print "DEBUG Checking $num against $key: $label\n";

        if( $key > $num-3 && $key < $num+3) {
            # print "DEBUG Found match $label\n";
            return $label;
        }
    }

    return "No label for $num";
}


#
# Process each file
#
foreach my $file (@filelist) {
    print "INFO Processing file $file\n";
    
    # print "INFO Creating parser\n";
    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse(
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                $file ) );
    # print "INFO Creating writer\n";
    my $newWorkbook  = Spreadsheet::WriteExcel->new(
                                    File::Spec->catpath(    $VOLUME, 
                                                            $DSTDIR, 
                                                            "$file".".new") );
    my $newWorksheet = $newWorkbook->add_worksheet();

    # print "INFO Iterating over worksheets...\n";

    #
    # Spec doesn't mention if these columns are always at the same index.
    # If yes, then we can just hard code these rather than inspecting
    # the header names.
    #
    my $pStartCol;
    my $pEndCol;

    #
    # Spec doesn't mention if these new columns will always be at the same 
    # index, so keep track of their index when we create them.
    #
    my $lStartCol;
    my $lEndCol; 
    my $labelCol;

    for my $worksheet ( $workbook->worksheets() ) {
        my ( $row_min, $row_max ) = $worksheet->row_range();
        my ( $col_min, $col_max ) = $worksheet->col_range();

        for my $row ( $row_min .. $row_max ) {

            #
            # Derived start and end labels for this row.
            #
            my $sLabel;
            my $eLabel;

            for my $col ( $col_min .. $col_max ) {
                my $cell = $worksheet->get_cell( $row, $col );
                next unless $cell;
            
                #
                # Copy the existing cell to the new doc
                #
                $newWorksheet->write($row, $col, $cell->unformatted());

                if($row == 0) {
                    
                    #
                    # We are reading the headers. This will tell us
                    # column indexes of pep_start/pep_end
                    #

                    if($cell->unformatted() eq "pep_start") {
                        $pStartCol = $col;
                    }
                    if($cell->unformatted() eq "pep_end") {
                        $pEndCol = $col;
                    }

                    #
                    # if we are at the end of the first row of columns, 
                    # add the new columns for L_Start L_End and LabeList
                    #
                    if($col == $col_max) {
                        $lStartCol  = $col + 1;
                        $lEndCol    = $col + 2;
                        $labelCol   = $col + 3;
                        $newWorksheet->write($row, $lStartCol,  "L_Start");
                        $newWorksheet->write($row, $lEndCol,    "L_End");
                        $newWorksheet->write($row, $labelCol,   "LabelList");
                    }

                } else {
                    
                    #
                    # We are reading past the header tow. 
                    # Check if we have a column which needs analysis 
                    #

                    if($col == $pStartCol) {
                        my $start = $cell->unformatted();
                        $sLabel = rangeLookup($start, \%startLabelMap);
                    }
                    
                    if($col == $pEndCol) {
                        my $end = $cell->unformatted();
                        $eLabel = rangeLookup($end, \%endLabelMap);
                    }

                    #
                    # We have read all of the existing columns.
                    # Concat the start and end labels as necessary and
                    # add them
                    #
                    if($col == $col_max) {
                        if(!defined($sLabel) && !defined($eLabel)) {
                            print "WARNING No label for row ".$row."\n";    
                            $newWorksheet->write($row, $labelCol, "No Label");
                        } else {
                            $newWorksheet->write($row, $lStartCol, $sLabel);
                            $newWorksheet->write($row, $lEndCol, $eLabel);
                            if($sLabel eq $eLabel) {
                                $newWorksheet->write($row, $labelCol, $sLabel);
                            } else {
                                $newWorksheet->write($row, $labelCol, 
                                                    "$sLabel"."_"."$eLabel");
                            }
                        }
                    }
                }

                # DEBUG
                # print "Row, Col    = ($row, $col)\n";
                # print "Value       = ", $cell->value(),       "\n";
                # print "Unformatted = ", $cell->unformatted(), "\n";
                # print "\n";
            }
        } 

    }
}

Open in new window

cropped-MB4-Elephant2.xls
cropped-MB4-Elephant2-new.xls
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39649911
Sorry about that. This fixes the issue I mentioned above.

NOTE
Be sure to change the SRCDIR and DSTDIR appropriately. They are located at the top of the script.
Be sure to have the perl modules Spreadsheet::ParseExcel and Spreadsheet::WriteExcel installed.

#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir here
#
my $VOLUME = 'C';   # Ignored on Unix. 
                    # Probably needs to be "C" or "C:" on windows.
my $SRCDIR = '/home/exch/20131114/scale';
my $DSTDIR = '/home/exch/20131114/output'; 

#
# Read files matching "cropped-MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped-MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls 
# but file is actually called Label-List.xls 
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label-List.xls') );

#
# Populate label maps for which we will check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();
       
        # print "DEBUG Adding label $label $start $end\n";

        if($start ne "") {
            $startLabelMap{$start}  = $label;
        }

        if($end ne "") {
            $endLabelMap{$end}      = $label;
        }

    }
}

#
# Function for checking ranges.
# Find number key within range in hash.
#
# @param num        The number to lookup in the hash. (Find within +/- 3)
# @param hashref    A reference to the hash
#
sub rangeLookup {
    my $num = shift;
    my $hashRef = shift;
     
    # print "DEBUG Checking map...\n";

    foreach my $key (sort (keys %{$hashRef})) {
        my $label = $hashRef->{$key};

        # print "DEBUG Checking $num against $key: $label\n";

        if( $key > $num-3 && $key < $num+3) {
            # print "DEBUG Found match $label\n";
            return $label;
        }
    }

    return "No label for $num";
}


#
# Process each file
#
foreach my $file (@filelist) {
    print "INFO Processing file $file\n";
    
    # print "INFO Creating parser\n";
    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse(
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                $file ) );
    # print "INFO Creating writer\n";
    my $newWorkbook  = Spreadsheet::WriteExcel->new(
                                    File::Spec->catpath(    $VOLUME, 
                                                            $DSTDIR, 
                                                            "$file".".new") );
    my $newWorksheet = $newWorkbook->add_worksheet();

    # print "INFO Iterating over worksheets...\n";

    #
    # Spec doesn't mention if these columns are always at the same index.
    # If yes, then we can just hard code these rather than inspecting
    # the header names.
    #
    my $pStartCol;
    my $pEndCol;

    #
    # Spec doesn't mention if these new columns will always be at the same 
    # index, so keep track of their index when we create them.
    #
    my $lStartCol;
    my $lEndCol; 
    my $labelCol;

    for my $worksheet ( $workbook->worksheets() ) {
        my ( $row_min, $row_max ) = $worksheet->row_range();
        my ( $col_min, $col_max ) = $worksheet->col_range();

        for my $row ( $row_min .. $row_max ) {

            #
            # Derived start and end labels for this row.
            #
            my $sLabel;
            my $eLabel;

            for my $col ( $col_min .. $col_max ) {
                my $cell = $worksheet->get_cell( $row, $col );
                
                #
                # Copy the existing cell to the new doc
                #
                if(defined($cell)) {
                    $newWorksheet->write($row, $col, $cell->unformatted());
                }

                if($row == 0) {
                    
                    #
                    # We are reading the headers. This will tell us
                    # column indexes of pep_start/pep_end
                    #

                    if(defined($cell)) {
                        if($cell->unformatted() eq "pep_start") {
                            $pStartCol = $col;
                        }
                        if($cell->unformatted() eq "pep_end") {
                            $pEndCol = $col;
                        }
                    }

                    #
                    # if we are at the end of the first row of columns, 
                    # add the new columns for L_Start L_End and LabeList
                    #
                    if($col == $col_max) {
                        $lStartCol  = $col + 1;
                        $lEndCol    = $col + 2;
                        $labelCol   = $col + 3;
                        $newWorksheet->write($row, $lStartCol,  "L_Start");
                        $newWorksheet->write($row, $lEndCol,    "L_End");
                        $newWorksheet->write($row, $labelCol,   "LabelList");
                    }

                } else {
                    
                    #
                    # We are reading past the header row. 
                    # Check if we have a column which needs analysis 
                    #

                    if(defined($cell)) {
                        if($col == $pStartCol) {
                            my $start = $cell->unformatted();
                            $sLabel = rangeLookup($start, \%startLabelMap);
                        }
                    
                        if($col == $pEndCol) {
                            my $end = $cell->unformatted();
                            $eLabel = rangeLookup($end, \%endLabelMap);
                        }
                    }

                    #
                    # We have read all of the existing columns.
                    # Concat the start and end labels as necessary and
                    # add them
                    #
                    if($col == $col_max) {
                        if(!defined($sLabel) && !defined($eLabel)) {
                            print "WARNING No label for row ".$row."\n";    
                            $newWorksheet->write($row, $labelCol, "No Label");
                        } else {
                            $newWorksheet->write($row, $lStartCol, $sLabel);
                            $newWorksheet->write($row, $lEndCol, $eLabel);
                            if($sLabel eq $eLabel) {
                                $newWorksheet->write($row, $labelCol, $sLabel);
                            } else {
                                $newWorksheet->write($row, $labelCol, 
                                                    "$sLabel"."_"."$eLabel");
                            }
                        }
                    }
                }

                # DEBUG
                # print "Row, Col    = ($row, $col)\n";
                # print "Value       = ", $cell->value(),       "\n";
                # print "Unformatted = ", $cell->unformatted(), "\n";
                # print "\n";
            }
        } 

    }
}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39650814
Hi jb1dev,

Thanks a lot for getting back to me with this! very much appreciated!

I've tried running the script you've written but I'm currently coming across this error message:

Can't call method "worksheets" on an undefined value @ line 45

which is here...

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

Open in new window


Thanks again,

Stephen.
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39651929
So line 45 is where it is trying to open the worksheets in the "label" file:

for my $worksheet ( $labelWorkbook->worksheets() ) {

Open in new window


Is the label file located in the SRCDIR?

Is the label filename correct? Check my comments in the code here:

#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls
# but file is actually called Label-List.xls
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse(
                        File::Spec->catpath(    $VOLUME,
                                                $SRCDIR,
                                                'Label-List.xls') );

Open in new window

0
Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

 
LVL 14

Expert Comment

by:jb1dev
ID: 39654788
Can you post what you have changed the SRCDIR value to?

If you are on windows, are you using \\ for directory separator?

Sorry I do not have a Windows platform handy on which to try this.

I can add additional logging and checking for presence of the specified directories and files, if that would help.

Let me know.
0
 

Author Comment

by:StephenMcGowan
ID: 39656157
Hi jb1,

This is how I modified the beginning section of code:


#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir setting
#
my $VOLUME = 'C';
my $SRCDIR = '/Users/Stephen/Desktop/LCMS';
my $DSTDIR = '/Users/Stephen/Desktop/output'; 

#
# Read files matching "cropped_MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped_MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label_List.xls') );

#
# Populate label maps to check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();

Open in new window


I'm guessing that the SRCDIR and DSTDIR are set wrong?

Thanks again,

Stephen.
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39657128
Can you try setting:

my $VOLUME = 'C:\\';
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39657199
Or try with the following:

my $VOLUME = '';
my $SRCDIR = 'C:/Users/Stephen/Desktop/LCMS';
my $DSTDIR = 'C:/Users/Stephen/Desktop/output';

I realize, my dir listing is not using the File::Spec so the above should hopefully work for finding your "LCMS" dir.
0
 
LVL 14

Expert Comment

by:jb1dev
ID: 39657275
Check if the below filetest.pl works.
If it does, I will adjust the code to use this path instead.

#!/usr/bin/perl

my $SRCDIR = 'C:/Users/Stephen/Desktop/LCMS';
my $file = $SRCDIR.'/'.'Label-List.xls';

if(-e $file) {
    print "File exists\n";
}

Open in new window

0

Featured Post

Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
either24  challenge 19 90
recursion example 16 111
Programatically extract date from website 8 65
What language/protocol is the Angular Chat? 2 53
In the distant past (last year) I hacked together a little toy that would allow a couple of Manager types to query, preview, and extract data from a number of MongoDB instances, to their tool of choice: Excel (http://dilbert.com/strips/comic/2007-08…
Windows Script Host (WSH) has been part of Windows since Windows NT4. Windows Script Host provides architecture for building dynamic scripts that consist of a core object model, scripting hosts, and scripting engines. The key components of Window…
The goal of this video is to provide viewers with basic examples to understand and use conditional statements in the C programming language.
Viewers will learn how to properly install Eclipse with the necessary JDK, and will take a look at an introductory Java program. Download Eclipse installation zip file: Extract files from zip file: Download and install JDK 8: Open Eclipse and …

947 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

21 Experts available now in Live!

Get 1:1 Help Now