Go Premium for a chance to win a PS4. Enter to Win

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 215
  • Last Modified:

Perl Script to input data from an XLS file

Hi,

I'm trying to write a script which takes .xls column entries (from a table of peptides), creates ranges of +/- 3 for each value, and then matches this range against another .xls file to return label information.

I've put together a step-by-step procedure for what I'm trying to achieve in "script_procedure.txt".
I've also attached sample files: "cropped_MB4_Elephant.xls", "cropped_MB10_Dolphin.xls"

Finally, Label_List is used for the script to extract the label information for START and END +/- 3 ranges and input these into a new column.

The attached text file should make clearer sense.

Thanks,

Stephen.
cropped-MB4-Elephant.xls
cropped-MB10-Dolphin.xls
Label-List.xls
script-procedure.txt
0
StephenMcGowan
Asked:
StephenMcGowan
  • 7
  • 2
1 Solution
 
jb1devCommented:
I have written the following which *should* work. It works on smaller files than those you have provided. However when run against the full-sized files (~28M) I start seeing strange errors from the underlying Spreadsheet::WriteExcel library, and the new columns are not added to the output. I am still investigating that.

You can see a working example in the attached files.

NOTE set SRCDIR and DSTDIR accordingly.

#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir here
#
my $VOLUME = 'C';   # Ignored on Unix. 
                    # Probably needs to be "C" or "C:" on windows.
my $SRCDIR = '/home/exch/20131114/small';
my $DSTDIR = '/home/exch/20131114/output'; 

#
# Read files matching "cropped-MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped-MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls 
# but file is actually called Label-List.xls 
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label-List.xls') );

#
# Populate label maps for which we will check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();
       
        # print "DEBUG Adding label $label $start $end\n";

        if($start ne "") {
            $startLabelMap{$start}  = $label;
        }

        if($end ne "") {
            $endLabelMap{$end}      = $label;
        }

    }
}

#
# Function for checking ranges.
# Find number key within range in hash.
#
# @param num        The number to lookup in the hash. (Find within +/- 3)
# @param hashref    A reference to the hash
#
sub rangeLookup {
    my $num = shift;
    my $hashRef = shift;
     
    # print "DEBUG Checking map...\n";

    foreach my $key (sort (keys %{$hashRef})) {
        my $label = $hashRef->{$key};

        # print "DEBUG Checking $num against $key: $label\n";

        if( $key > $num-3 && $key < $num+3) {
            # print "DEBUG Found match $label\n";
            return $label;
        }
    }

    return "No label for $num";
}


#
# Process each file
#
foreach my $file (@filelist) {
    print "INFO Processing file $file\n";
    
    # print "INFO Creating parser\n";
    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse(
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                $file ) );
    # print "INFO Creating writer\n";
    my $newWorkbook  = Spreadsheet::WriteExcel->new(
                                    File::Spec->catpath(    $VOLUME, 
                                                            $DSTDIR, 
                                                            "$file".".new") );
    my $newWorksheet = $newWorkbook->add_worksheet();

    # print "INFO Iterating over worksheets...\n";

    #
    # Spec doesn't mention if these columns are always at the same index.
    # If yes, then we can just hard code these rather than inspecting
    # the header names.
    #
    my $pStartCol;
    my $pEndCol;

    #
    # Spec doesn't mention if these new columns will always be at the same 
    # index, so keep track of their index when we create them.
    #
    my $lStartCol;
    my $lEndCol; 
    my $labelCol;

    for my $worksheet ( $workbook->worksheets() ) {
        my ( $row_min, $row_max ) = $worksheet->row_range();
        my ( $col_min, $col_max ) = $worksheet->col_range();

        for my $row ( $row_min .. $row_max ) {

            #
            # Derived start and end labels for this row.
            #
            my $sLabel;
            my $eLabel;

            for my $col ( $col_min .. $col_max ) {
                my $cell = $worksheet->get_cell( $row, $col );
                next unless $cell;
            
                #
                # Copy the existing cell to the new doc
                #
                $newWorksheet->write($row, $col, $cell->unformatted());

                if($row == 0) {
                    
                    #
                    # We are reading the headers. This will tell us
                    # column indexes of pep_start/pep_end
                    #

                    if($cell->unformatted() eq "pep_start") {
                        $pStartCol = $col;
                    }
                    if($cell->unformatted() eq "pep_end") {
                        $pEndCol = $col;
                    }

                    #
                    # if we are at the end of the first row of columns, 
                    # add the new columns for L_Start L_End and LabeList
                    #
                    if($col == $col_max) {
                        $lStartCol  = $col + 1;
                        $lEndCol    = $col + 2;
                        $labelCol   = $col + 3;
                        $newWorksheet->write($row, $lStartCol,  "L_Start");
                        $newWorksheet->write($row, $lEndCol,    "L_End");
                        $newWorksheet->write($row, $labelCol,   "LabelList");
                    }

                } else {
                    
                    #
                    # We are reading past the header tow. 
                    # Check if we have a column which needs analysis 
                    #

                    if($col == $pStartCol) {
                        my $start = $cell->unformatted();
                        $sLabel = rangeLookup($start, \%startLabelMap);
                    }
                    
                    if($col == $pEndCol) {
                        my $end = $cell->unformatted();
                        $eLabel = rangeLookup($end, \%endLabelMap);
                    }

                    #
                    # We have read all of the existing columns.
                    # Concat the start and end labels as necessary and
                    # add them
                    #
                    if($col == $col_max) {
                        if(!defined($sLabel) && !defined($eLabel)) {
                            print "WARNING No label for row ".$row."\n";    
                            $newWorksheet->write($row, $labelCol, "No Label");
                        } else {
                            $newWorksheet->write($row, $lStartCol, $sLabel);
                            $newWorksheet->write($row, $lEndCol, $eLabel);
                            if($sLabel eq $eLabel) {
                                $newWorksheet->write($row, $labelCol, $sLabel);
                            } else {
                                $newWorksheet->write($row, $labelCol, 
                                                    "$sLabel"."_"."$eLabel");
                            }
                        }
                    }
                }

                # DEBUG
                # print "Row, Col    = ($row, $col)\n";
                # print "Value       = ", $cell->value(),       "\n";
                # print "Unformatted = ", $cell->unformatted(), "\n";
                # print "\n";
            }
        } 

    }
}

Open in new window

cropped-MB4-Elephant2.xls
cropped-MB4-Elephant2-new.xls
0
 
jb1devCommented:
Sorry about that. This fixes the issue I mentioned above.

NOTE
Be sure to change the SRCDIR and DSTDIR appropriately. They are located at the top of the script.
Be sure to have the perl modules Spreadsheet::ParseExcel and Spreadsheet::WriteExcel installed.

#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir here
#
my $VOLUME = 'C';   # Ignored on Unix. 
                    # Probably needs to be "C" or "C:" on windows.
my $SRCDIR = '/home/exch/20131114/scale';
my $DSTDIR = '/home/exch/20131114/output'; 

#
# Read files matching "cropped-MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped-MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls 
# but file is actually called Label-List.xls 
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label-List.xls') );

#
# Populate label maps for which we will check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();
       
        # print "DEBUG Adding label $label $start $end\n";

        if($start ne "") {
            $startLabelMap{$start}  = $label;
        }

        if($end ne "") {
            $endLabelMap{$end}      = $label;
        }

    }
}

#
# Function for checking ranges.
# Find number key within range in hash.
#
# @param num        The number to lookup in the hash. (Find within +/- 3)
# @param hashref    A reference to the hash
#
sub rangeLookup {
    my $num = shift;
    my $hashRef = shift;
     
    # print "DEBUG Checking map...\n";

    foreach my $key (sort (keys %{$hashRef})) {
        my $label = $hashRef->{$key};

        # print "DEBUG Checking $num against $key: $label\n";

        if( $key > $num-3 && $key < $num+3) {
            # print "DEBUG Found match $label\n";
            return $label;
        }
    }

    return "No label for $num";
}


#
# Process each file
#
foreach my $file (@filelist) {
    print "INFO Processing file $file\n";
    
    # print "INFO Creating parser\n";
    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse(
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                $file ) );
    # print "INFO Creating writer\n";
    my $newWorkbook  = Spreadsheet::WriteExcel->new(
                                    File::Spec->catpath(    $VOLUME, 
                                                            $DSTDIR, 
                                                            "$file".".new") );
    my $newWorksheet = $newWorkbook->add_worksheet();

    # print "INFO Iterating over worksheets...\n";

    #
    # Spec doesn't mention if these columns are always at the same index.
    # If yes, then we can just hard code these rather than inspecting
    # the header names.
    #
    my $pStartCol;
    my $pEndCol;

    #
    # Spec doesn't mention if these new columns will always be at the same 
    # index, so keep track of their index when we create them.
    #
    my $lStartCol;
    my $lEndCol; 
    my $labelCol;

    for my $worksheet ( $workbook->worksheets() ) {
        my ( $row_min, $row_max ) = $worksheet->row_range();
        my ( $col_min, $col_max ) = $worksheet->col_range();

        for my $row ( $row_min .. $row_max ) {

            #
            # Derived start and end labels for this row.
            #
            my $sLabel;
            my $eLabel;

            for my $col ( $col_min .. $col_max ) {
                my $cell = $worksheet->get_cell( $row, $col );
                
                #
                # Copy the existing cell to the new doc
                #
                if(defined($cell)) {
                    $newWorksheet->write($row, $col, $cell->unformatted());
                }

                if($row == 0) {
                    
                    #
                    # We are reading the headers. This will tell us
                    # column indexes of pep_start/pep_end
                    #

                    if(defined($cell)) {
                        if($cell->unformatted() eq "pep_start") {
                            $pStartCol = $col;
                        }
                        if($cell->unformatted() eq "pep_end") {
                            $pEndCol = $col;
                        }
                    }

                    #
                    # if we are at the end of the first row of columns, 
                    # add the new columns for L_Start L_End and LabeList
                    #
                    if($col == $col_max) {
                        $lStartCol  = $col + 1;
                        $lEndCol    = $col + 2;
                        $labelCol   = $col + 3;
                        $newWorksheet->write($row, $lStartCol,  "L_Start");
                        $newWorksheet->write($row, $lEndCol,    "L_End");
                        $newWorksheet->write($row, $labelCol,   "LabelList");
                    }

                } else {
                    
                    #
                    # We are reading past the header row. 
                    # Check if we have a column which needs analysis 
                    #

                    if(defined($cell)) {
                        if($col == $pStartCol) {
                            my $start = $cell->unformatted();
                            $sLabel = rangeLookup($start, \%startLabelMap);
                        }
                    
                        if($col == $pEndCol) {
                            my $end = $cell->unformatted();
                            $eLabel = rangeLookup($end, \%endLabelMap);
                        }
                    }

                    #
                    # We have read all of the existing columns.
                    # Concat the start and end labels as necessary and
                    # add them
                    #
                    if($col == $col_max) {
                        if(!defined($sLabel) && !defined($eLabel)) {
                            print "WARNING No label for row ".$row."\n";    
                            $newWorksheet->write($row, $labelCol, "No Label");
                        } else {
                            $newWorksheet->write($row, $lStartCol, $sLabel);
                            $newWorksheet->write($row, $lEndCol, $eLabel);
                            if($sLabel eq $eLabel) {
                                $newWorksheet->write($row, $labelCol, $sLabel);
                            } else {
                                $newWorksheet->write($row, $labelCol, 
                                                    "$sLabel"."_"."$eLabel");
                            }
                        }
                    }
                }

                # DEBUG
                # print "Row, Col    = ($row, $col)\n";
                # print "Value       = ", $cell->value(),       "\n";
                # print "Unformatted = ", $cell->unformatted(), "\n";
                # print "\n";
            }
        } 

    }
}

Open in new window

0
 
StephenMcGowanAuthor Commented:
Hi jb1dev,

Thanks a lot for getting back to me with this! very much appreciated!

I've tried running the script you've written but I'm currently coming across this error message:

Can't call method "worksheets" on an undefined value @ line 45

which is here...

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

Open in new window


Thanks again,

Stephen.
0
VIDEO: THE CONCERTO CLOUD FOR HEALTHCARE

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

 
jb1devCommented:
So line 45 is where it is trying to open the worksheets in the "label" file:

for my $worksheet ( $labelWorkbook->worksheets() ) {

Open in new window


Is the label file located in the SRCDIR?

Is the label filename correct? Check my comments in the code here:

#
# Read Label List (Label_List.xls)
# Spec calls this Label_List.xls
# but file is actually called Label-List.xls
# Please note dash in filename versus underscore and change accordingly
# when/if appropriate
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse(
                        File::Spec->catpath(    $VOLUME,
                                                $SRCDIR,
                                                'Label-List.xls') );

Open in new window

0
 
jb1devCommented:
Can you post what you have changed the SRCDIR value to?

If you are on windows, are you using \\ for directory separator?

Sorry I do not have a Windows platform handy on which to try this.

I can add additional logging and checking for presence of the specified directories and files, if that would help.

Let me know.
0
 
StephenMcGowanAuthor Commented:
Hi jb1,

This is how I modified the beginning section of code:


#!/usr/bin/perl -w

use strict;
use File::Spec;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;

#
# Set source dir /dst dir setting
#
my $VOLUME = 'C';
my $SRCDIR = '/Users/Stephen/Desktop/LCMS';
my $DSTDIR = '/Users/Stephen/Desktop/output'; 

#
# Read files matching "cropped_MB" from DIR
#
opendir DIR, $SRCDIR or die "Could not open dir: $!\n";
my @filelist = grep(/cropped_MB/i, readdir DIR);
closedir DIR;


#
# Read Label List (Label_List.xls)
#
my $labelParser   = Spreadsheet::ParseExcel->new();
my $labelWorkbook = $labelParser->parse( 
                        File::Spec->catpath(    $VOLUME, 
                                                $SRCDIR, 
                                                'Label_List.xls') );

#
# Populate label maps to check ranges against.
#
my %startLabelMap;
my %endLabelMap;

# print "INFO Populating label maps...\n";

for my $worksheet ( $labelWorkbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        if($row == 0) {
            # Skip first row of label list, these are headers
            # print "DEBUG Skipping ".$worksheet->get_cell( $row, 0 )->unformatted()."\n";
            next;
        }
        
        my $label   = $worksheet->get_cell( $row, 0 )->unformatted();
        my $start   = $worksheet->get_cell( $row, 1 )->unformatted();
        my $end     = $worksheet->get_cell( $row, 2 )->unformatted();

Open in new window


I'm guessing that the SRCDIR and DSTDIR are set wrong?

Thanks again,

Stephen.
0
 
jb1devCommented:
Can you try setting:

my $VOLUME = 'C:\\';
0
 
jb1devCommented:
Or try with the following:

my $VOLUME = '';
my $SRCDIR = 'C:/Users/Stephen/Desktop/LCMS';
my $DSTDIR = 'C:/Users/Stephen/Desktop/output';

I realize, my dir listing is not using the File::Spec so the above should hopefully work for finding your "LCMS" dir.
0
 
jb1devCommented:
Check if the below filetest.pl works.
If it does, I will adjust the code to use this path instead.

#!/usr/bin/perl

my $SRCDIR = 'C:/Users/Stephen/Desktop/LCMS';
my $file = $SRCDIR.'/'.'Label-List.xls';

if(-e $file) {
    print "File exists\n";
}

Open in new window

0

Featured Post

Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

  • 7
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now