Solved

CSV file script for cropping and matching

Posted on 2013-10-29
9
294 Views
Last Modified: 2013-11-11
Hi,

I have a folder of .CSV files which I'd like a script to process.

The files have the following naming convention:

MB12_Mouse.csv
MB13_Chicken.csv
MB14_Pig.csv

etc

And I have located these to the following directory: C:\Users\Stephen\Desktop\LCMS

I've attached an example .csv file (Mouse) to show the format and style of the CSV file I'm currently working with.

Roughly 100 rows into each file is a table, row headings beginning with "prot_hit_num, prot acc" etc.
I basically need to modify each .csv file to keep this table but remove all prior information for each file.
Therefore this table header would move from row 99 to row 1, and the rest of the table would move up.

The second thing I'd like to do with this table, is to add a new column called "pep_label" and insert this one column to the right hand side of the "pep_end" column in the table.

For each row of the table, I would like to match up "pep_start" and "pep_end" with "Start" and "End" of Label_List.xls (attached) and if:

pep_start matches Start
and
pep_end matches End

return "Label" from Label_List.xls and put this into the"pep_label" column on the appropriate row in the .csv table.
MB12-Mouse.csv
Label-List.xls
0
Comment
Question by:StephenMcGowan
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 5
  • 3
9 Comments
 
LVL 45

Expert Comment

by:aikimark
ID: 39611713
What do you want to happen with the second table (peptide matches not assigned to protein hits)?

This question is currently in the Perl zone.  Are you only accepting Perl solutions?

Once the data is 'cleaned up', what are you going to do with it?
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39615937
Give this a shot...
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Tie::File;

my $dir = 'C:\Users\Stephen\Desktop\LCMS';
chdir $dir or die "could not cd to $dir: $!";

opendir DIR, '.' or die "could not open dir: $!";
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$} readdir DIR;
closedir DIR;

my $list = get_xls_data('Label-List.xls');

foreach my $fil (@files) {
    tie my @lines, 'Tie::File', $fil or die "could not tie $fil: $!";
    while ($lines[0] !~ m{^prot_hit_num,prot_acc}) {
        shift @lines;
    }
    die "never found prot_hit_num table indicator in $fil"
        unless ($lines[0] =~ m{^prot_hit_num,prot_acc});
    $lines[0] .= 'pep_label';
    my ($st, $en) = get_cols($lines[0]);
    my $i = 1;
    while ($lines[$i]) {
        my @cols = split m{\s*,\s*}, $lines[$i];
        if (exists $list->{$cols[$st]}{$cols[$en]}) {
            $lines[$i] .= $list->{$cols[$st]}{$cols[$en]};
        }
    }
    untie @lines;
}

sub get_cols {
    my ($ln) = @_;
    my @cols = split m{\s*,\s*}, $ln;
    my ($st, $en);
    for my $i (0..@cols-1) {
        if ($cols[$i] eq 'pep_start') {
            $st = $i;
            last if $en;
        } elsif ($cols[$i] eq 'pep_end') {
            $en = $i;
            last if $st;
        }
    }
    die "could not locate pep_start and pep_end in $ln" unless ($st and $en);
    return ($st, $en);
}

sub get_xls_data {
    my ($fil) = @_;
    my $parser = Spreadsheet::ParseExcel->new();
    my $xls = $parser->parse($fil) or die $parser->error();
    my $sheet = $xls->worksheet(0);
    my ($row_min, $row_max) = $sheet->row_range();
    my %data;
    for my $row ($row_min+1..$row_max) {
        my $lbl = $sheet->get_cell($row, 2);
        my $st = $sheet->get_cell($row, 4);
        my $en = $sheet->get_cell($row, 5);
        next unless (defined $lbl and defined $st and defined $en);
        $data{$st->unformatted()}{$en->unformatted()} = $lbl->unformatted();
    }
    return \%data;
}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39616575
Hey wilcoxon,

Gave it a shot, cheers for getting back to me.

At the moment I'm receiving the following errors:

Not enough arguments for grep at stephen_label_script.pl line 10, near "m{^[A-Z]+\d+[-_]\w+\.csv$} readdir"
syntax error at stephen_label_script.pl line 10, near "m{^[A-Z]+\d+[-_]\w+\.csv$} readdir"
Execution of stephen_label_script.pl aborted due to compilation errors.


Cheers again,
errors2.jpg
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 26

Expert Comment

by:wilcoxon
ID: 39618205
Sigh - stupid grep syntax - block format does not need a comma but regex format does.  Add a comma after the regex and before readdir on the grep/readdir line.  That should fix the problem.

To be specific, line 10 should be:
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$}, readdir DIR;

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39628990
Hi wilcoxon,

Thanks for getting back to me, sorry it's taken so long for me to get back to you with this.
I've tried modifying the script with the comma inserted into line 10 above.

However when I now try to run the script, the cmd prompt cursor moves to the next line and flashes to indicate that the script is running, however, this has been continuously flashing now for the past 3 days with "MB-12-Mouse.csv" being "locked for editing". I wouldn't expect this to be a three day script-run job? ;)

Would you be able to test this script at your end using the mouse.csv and label-list and let me know if it works for you?

Thanks,

Stephen.
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39629185
Sure.  I'll test it sometime in the next few days (possibly tonight)...
0
 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 39635892
Sorry it took me this long to get back to you...

Here's a revised script with more debugging and more output.  The critical fix was adding the $i++ to the @lines loop (so, when you ran it, it was reprocessing the same line for 3 days).
#!/usr/bin/perl
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Tie::File;

#my $dir = 'C:\Users\Stephen\Desktop\LCMS';
my $dir = '.';
chdir $dir or die "could not cd to $dir: $!";

opendir DIR, '.' or die "could not open dir: $!";
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$}, readdir DIR;
closedir DIR;

my $list = get_xls_data('Label-List.xls');

foreach my $fil (@files) {
    print "processing $fil...\n";
    tie my @lines, 'Tie::File', $fil or die "could not tie $fil: $!";
    while ($lines[0] !~ m{^prot_hit_num,prot_acc}) {
        shift @lines;
    }
    die "never found prot_hit_num table indicator in $fil"
        unless ($lines[0] =~ m{^prot_hit_num,prot_acc});
    $lines[0] .= 'pep_label';
    my ($st, $en) = get_cols($lines[0]);
    my $i = 1;
    while ($lines[$i]) {
        #print "\ttable line $i\n";
        my @cols = split m{\s*,\s*}, $lines[$i];
        if (not defined $cols[$st] or not defined $cols[$en]) {
            warn "line $i missing desired columns ($st,$en) - skipping\n$lines[$i]\n";
            $i++;
            next;
        }
        if (exists $list->{$cols[$st]}{$cols[$en]}) {
            $lines[$i] .= $list->{$cols[$st]}{$cols[$en]};
        }
        $i++;
    }
    untie @lines;
}

sub get_cols {
    my ($ln) = @_;
    my @cols = split m{\s*,\s*}, $ln;
    my ($st, $en);
    for my $i (0..@cols-1) {
        if ($cols[$i] eq 'pep_start') {
            $st = $i;
            last if $en;
        } elsif ($cols[$i] eq 'pep_end') {
            $en = $i;
            last if $st;
        }
    }
    die "could not locate pep_start and pep_end in $ln" unless ($st and $en);
    return ($st, $en);
}

sub get_xls_data {
    my ($fil) = @_;
    my $parser = Spreadsheet::ParseExcel->new();
    my $xls = $parser->parse($fil) or die $parser->error();
    my $sheet = $xls->worksheet(0);
    my ($row_min, $row_max) = $sheet->row_range();
    my %data;
    for my $row ($row_min+1..$row_max) {
        my $lbl = $sheet->get_cell($row, 2);
        my $st = $sheet->get_cell($row, 4);
        my $en = $sheet->get_cell($row, 5);
        next unless (defined $lbl and defined $st and defined $en);
        $data{$st->unformatted()}{$en->unformatted()} = $lbl->unformatted();
    }
    return \%data;
}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39639566
Thanks again wilcoxon,

No worries about the time taken :)

Just a quick question, is it possible to write a simpler script which basically extracts the table for each CSV file, and outputs this table as a separate CSV file,  using "cropped_" as the filename precursor.

Say:

"cropped_MB12_Mouse.csv" etc?


Is it also possible to write that into this script as well? so that there would be two versions of each CSV file:

MB12_Mouse.csv
cropped_MB12_Mouse.csv

Thanks again,

Stephen.
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39639599
Sure.  It would be possible.  It wouldn't really be much simpler than the current one.

I'm unclear on your second question.  Would you want to add the functionality to this script as it is (eg the files would be identical) or would you want changes made to this script before adding in the logic to create cropped_*.csv?
0

Featured Post

Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
java constructor error 8 133
matchUp  challenge 9 125
Python 3.5.2 32 virtualenv problems 3 87
sort Multi-dimensional array 6 16
If you haven’t already, I encourage you to read the first article (http://www.experts-exchange.com/articles/18680/An-Introduction-to-R-Programming-and-R-Studio.html) in my series to gain a basic foundation of R and R Studio.  You will also find the …
When we want to run, execute or repeat a statement multiple times, a loop is necessary. This article covers the two types of loops in Python: the while loop and the for loop.
Viewers will learn how to properly install Eclipse with the necessary JDK, and will take a look at an introductory Java program. Download Eclipse installation zip file: Extract files from zip file: Download and install JDK 8: Open Eclipse and …
The viewer will learn how to pass data into a function in C++. This is one step further in using functions. Instead of only printing text onto the console, the function will be able to perform calculations with argumentents given by the user.

726 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question