Solved

CSV file script for cropping and matching

Posted on 2013-10-29
9
292 Views
Last Modified: 2013-11-11
Hi,

I have a folder of .CSV files which I'd like a script to process.

The files have the following naming convention:

MB12_Mouse.csv
MB13_Chicken.csv
MB14_Pig.csv

etc

And I have located these to the following directory: C:\Users\Stephen\Desktop\LCMS

I've attached an example .csv file (Mouse) to show the format and style of the CSV file I'm currently working with.

Roughly 100 rows into each file is a table, row headings beginning with "prot_hit_num, prot acc" etc.
I basically need to modify each .csv file to keep this table but remove all prior information for each file.
Therefore this table header would move from row 99 to row 1, and the rest of the table would move up.

The second thing I'd like to do with this table, is to add a new column called "pep_label" and insert this one column to the right hand side of the "pep_end" column in the table.

For each row of the table, I would like to match up "pep_start" and "pep_end" with "Start" and "End" of Label_List.xls (attached) and if:

pep_start matches Start
and
pep_end matches End

return "Label" from Label_List.xls and put this into the"pep_label" column on the appropriate row in the .csv table.
MB12-Mouse.csv
Label-List.xls
0
Comment
Question by:StephenMcGowan
  • 5
  • 3
9 Comments
 
LVL 45

Expert Comment

by:aikimark
ID: 39611713
What do you want to happen with the second table (peptide matches not assigned to protein hits)?

This question is currently in the Perl zone.  Are you only accepting Perl solutions?

Once the data is 'cleaned up', what are you going to do with it?
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39615937
Give this a shot...
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Tie::File;

my $dir = 'C:\Users\Stephen\Desktop\LCMS';
chdir $dir or die "could not cd to $dir: $!";

opendir DIR, '.' or die "could not open dir: $!";
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$} readdir DIR;
closedir DIR;

my $list = get_xls_data('Label-List.xls');

foreach my $fil (@files) {
    tie my @lines, 'Tie::File', $fil or die "could not tie $fil: $!";
    while ($lines[0] !~ m{^prot_hit_num,prot_acc}) {
        shift @lines;
    }
    die "never found prot_hit_num table indicator in $fil"
        unless ($lines[0] =~ m{^prot_hit_num,prot_acc});
    $lines[0] .= 'pep_label';
    my ($st, $en) = get_cols($lines[0]);
    my $i = 1;
    while ($lines[$i]) {
        my @cols = split m{\s*,\s*}, $lines[$i];
        if (exists $list->{$cols[$st]}{$cols[$en]}) {
            $lines[$i] .= $list->{$cols[$st]}{$cols[$en]};
        }
    }
    untie @lines;
}

sub get_cols {
    my ($ln) = @_;
    my @cols = split m{\s*,\s*}, $ln;
    my ($st, $en);
    for my $i (0..@cols-1) {
        if ($cols[$i] eq 'pep_start') {
            $st = $i;
            last if $en;
        } elsif ($cols[$i] eq 'pep_end') {
            $en = $i;
            last if $st;
        }
    }
    die "could not locate pep_start and pep_end in $ln" unless ($st and $en);
    return ($st, $en);
}

sub get_xls_data {
    my ($fil) = @_;
    my $parser = Spreadsheet::ParseExcel->new();
    my $xls = $parser->parse($fil) or die $parser->error();
    my $sheet = $xls->worksheet(0);
    my ($row_min, $row_max) = $sheet->row_range();
    my %data;
    for my $row ($row_min+1..$row_max) {
        my $lbl = $sheet->get_cell($row, 2);
        my $st = $sheet->get_cell($row, 4);
        my $en = $sheet->get_cell($row, 5);
        next unless (defined $lbl and defined $st and defined $en);
        $data{$st->unformatted()}{$en->unformatted()} = $lbl->unformatted();
    }
    return \%data;
}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39616575
Hey wilcoxon,

Gave it a shot, cheers for getting back to me.

At the moment I'm receiving the following errors:

Not enough arguments for grep at stephen_label_script.pl line 10, near "m{^[A-Z]+\d+[-_]\w+\.csv$} readdir"
syntax error at stephen_label_script.pl line 10, near "m{^[A-Z]+\d+[-_]\w+\.csv$} readdir"
Execution of stephen_label_script.pl aborted due to compilation errors.


Cheers again,
errors2.jpg
0
Are your AD admin tools letting you down?

Managing Active Directory can get complicated.  Often, the native tools for managing AD are just not up to the task.  The largest Active Directory installations in the world have relied on one tool to manage their day-to-day administration tasks: Hyena. Start your trial today.

 
LVL 26

Expert Comment

by:wilcoxon
ID: 39618205
Sigh - stupid grep syntax - block format does not need a comma but regex format does.  Add a comma after the regex and before readdir on the grep/readdir line.  That should fix the problem.

To be specific, line 10 should be:
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$}, readdir DIR;

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39628990
Hi wilcoxon,

Thanks for getting back to me, sorry it's taken so long for me to get back to you with this.
I've tried modifying the script with the comma inserted into line 10 above.

However when I now try to run the script, the cmd prompt cursor moves to the next line and flashes to indicate that the script is running, however, this has been continuously flashing now for the past 3 days with "MB-12-Mouse.csv" being "locked for editing". I wouldn't expect this to be a three day script-run job? ;)

Would you be able to test this script at your end using the mouse.csv and label-list and let me know if it works for you?

Thanks,

Stephen.
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39629185
Sure.  I'll test it sometime in the next few days (possibly tonight)...
0
 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 39635892
Sorry it took me this long to get back to you...

Here's a revised script with more debugging and more output.  The critical fix was adding the $i++ to the @lines loop (so, when you ran it, it was reprocessing the same line for 3 days).
#!/usr/bin/perl
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Tie::File;

#my $dir = 'C:\Users\Stephen\Desktop\LCMS';
my $dir = '.';
chdir $dir or die "could not cd to $dir: $!";

opendir DIR, '.' or die "could not open dir: $!";
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$}, readdir DIR;
closedir DIR;

my $list = get_xls_data('Label-List.xls');

foreach my $fil (@files) {
    print "processing $fil...\n";
    tie my @lines, 'Tie::File', $fil or die "could not tie $fil: $!";
    while ($lines[0] !~ m{^prot_hit_num,prot_acc}) {
        shift @lines;
    }
    die "never found prot_hit_num table indicator in $fil"
        unless ($lines[0] =~ m{^prot_hit_num,prot_acc});
    $lines[0] .= 'pep_label';
    my ($st, $en) = get_cols($lines[0]);
    my $i = 1;
    while ($lines[$i]) {
        #print "\ttable line $i\n";
        my @cols = split m{\s*,\s*}, $lines[$i];
        if (not defined $cols[$st] or not defined $cols[$en]) {
            warn "line $i missing desired columns ($st,$en) - skipping\n$lines[$i]\n";
            $i++;
            next;
        }
        if (exists $list->{$cols[$st]}{$cols[$en]}) {
            $lines[$i] .= $list->{$cols[$st]}{$cols[$en]};
        }
        $i++;
    }
    untie @lines;
}

sub get_cols {
    my ($ln) = @_;
    my @cols = split m{\s*,\s*}, $ln;
    my ($st, $en);
    for my $i (0..@cols-1) {
        if ($cols[$i] eq 'pep_start') {
            $st = $i;
            last if $en;
        } elsif ($cols[$i] eq 'pep_end') {
            $en = $i;
            last if $st;
        }
    }
    die "could not locate pep_start and pep_end in $ln" unless ($st and $en);
    return ($st, $en);
}

sub get_xls_data {
    my ($fil) = @_;
    my $parser = Spreadsheet::ParseExcel->new();
    my $xls = $parser->parse($fil) or die $parser->error();
    my $sheet = $xls->worksheet(0);
    my ($row_min, $row_max) = $sheet->row_range();
    my %data;
    for my $row ($row_min+1..$row_max) {
        my $lbl = $sheet->get_cell($row, 2);
        my $st = $sheet->get_cell($row, 4);
        my $en = $sheet->get_cell($row, 5);
        next unless (defined $lbl and defined $st and defined $en);
        $data{$st->unformatted()}{$en->unformatted()} = $lbl->unformatted();
    }
    return \%data;
}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39639566
Thanks again wilcoxon,

No worries about the time taken :)

Just a quick question, is it possible to write a simpler script which basically extracts the table for each CSV file, and outputs this table as a separate CSV file,  using "cropped_" as the filename precursor.

Say:

"cropped_MB12_Mouse.csv" etc?


Is it also possible to write that into this script as well? so that there would be two versions of each CSV file:

MB12_Mouse.csv
cropped_MB12_Mouse.csv

Thanks again,

Stephen.
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39639599
Sure.  It would be possible.  It wouldn't really be much simpler than the current one.

I'm unclear on your second question.  Would you want to add the functionality to this script as it is (eg the files would be identical) or would you want changes made to this script before adding in the logic to create cropped_*.csv?
0

Featured Post

Optimizing Cloud Backup for Low Bandwidth

With cloud storage prices going down a growing number of SMBs start to use it for backup storage. Unfortunately, business data volume rarely fits the average Internet speed. This article provides an overview of main Internet speed challenges and reveals backup best practices.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

On Microsoft Windows, if  when you click or type the name of a .pl file, you get an error "is not recognized as an internal or external command, operable program or batch file", then this means you do not have the .pl file extension associated with …
The purpose of this article is to demonstrate how we can use conditional statements using Python.
The goal of the video will be to teach the user the concept of local variables and scope. An example of a locally defined variable will be given as well as an explanation of what scope is in C++. The local variable and concept of scope will be relat…
The viewer will learn how to use the return statement in functions in C++. The video will also teach the user how to pass data to a function and have the function return data back for further processing.

770 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question