Solved

CSV file script for cropping and matching

Posted on 2013-10-29
9
291 Views
Last Modified: 2013-11-11
Hi,

I have a folder of .CSV files which I'd like a script to process.

The files have the following naming convention:

MB12_Mouse.csv
MB13_Chicken.csv
MB14_Pig.csv

etc

And I have located these to the following directory: C:\Users\Stephen\Desktop\LCMS

I've attached an example .csv file (Mouse) to show the format and style of the CSV file I'm currently working with.

Roughly 100 rows into each file is a table, row headings beginning with "prot_hit_num, prot acc" etc.
I basically need to modify each .csv file to keep this table but remove all prior information for each file.
Therefore this table header would move from row 99 to row 1, and the rest of the table would move up.

The second thing I'd like to do with this table, is to add a new column called "pep_label" and insert this one column to the right hand side of the "pep_end" column in the table.

For each row of the table, I would like to match up "pep_start" and "pep_end" with "Start" and "End" of Label_List.xls (attached) and if:

pep_start matches Start
and
pep_end matches End

return "Label" from Label_List.xls and put this into the"pep_label" column on the appropriate row in the .csv table.
MB12-Mouse.csv
Label-List.xls
0
Comment
Question by:StephenMcGowan
  • 5
  • 3
9 Comments
 
LVL 45

Expert Comment

by:aikimark
ID: 39611713
What do you want to happen with the second table (peptide matches not assigned to protein hits)?

This question is currently in the Perl zone.  Are you only accepting Perl solutions?

Once the data is 'cleaned up', what are you going to do with it?
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39615937
Give this a shot...
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Tie::File;

my $dir = 'C:\Users\Stephen\Desktop\LCMS';
chdir $dir or die "could not cd to $dir: $!";

opendir DIR, '.' or die "could not open dir: $!";
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$} readdir DIR;
closedir DIR;

my $list = get_xls_data('Label-List.xls');

foreach my $fil (@files) {
    tie my @lines, 'Tie::File', $fil or die "could not tie $fil: $!";
    while ($lines[0] !~ m{^prot_hit_num,prot_acc}) {
        shift @lines;
    }
    die "never found prot_hit_num table indicator in $fil"
        unless ($lines[0] =~ m{^prot_hit_num,prot_acc});
    $lines[0] .= 'pep_label';
    my ($st, $en) = get_cols($lines[0]);
    my $i = 1;
    while ($lines[$i]) {
        my @cols = split m{\s*,\s*}, $lines[$i];
        if (exists $list->{$cols[$st]}{$cols[$en]}) {
            $lines[$i] .= $list->{$cols[$st]}{$cols[$en]};
        }
    }
    untie @lines;
}

sub get_cols {
    my ($ln) = @_;
    my @cols = split m{\s*,\s*}, $ln;
    my ($st, $en);
    for my $i (0..@cols-1) {
        if ($cols[$i] eq 'pep_start') {
            $st = $i;
            last if $en;
        } elsif ($cols[$i] eq 'pep_end') {
            $en = $i;
            last if $st;
        }
    }
    die "could not locate pep_start and pep_end in $ln" unless ($st and $en);
    return ($st, $en);
}

sub get_xls_data {
    my ($fil) = @_;
    my $parser = Spreadsheet::ParseExcel->new();
    my $xls = $parser->parse($fil) or die $parser->error();
    my $sheet = $xls->worksheet(0);
    my ($row_min, $row_max) = $sheet->row_range();
    my %data;
    for my $row ($row_min+1..$row_max) {
        my $lbl = $sheet->get_cell($row, 2);
        my $st = $sheet->get_cell($row, 4);
        my $en = $sheet->get_cell($row, 5);
        next unless (defined $lbl and defined $st and defined $en);
        $data{$st->unformatted()}{$en->unformatted()} = $lbl->unformatted();
    }
    return \%data;
}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39616575
Hey wilcoxon,

Gave it a shot, cheers for getting back to me.

At the moment I'm receiving the following errors:

Not enough arguments for grep at stephen_label_script.pl line 10, near "m{^[A-Z]+\d+[-_]\w+\.csv$} readdir"
syntax error at stephen_label_script.pl line 10, near "m{^[A-Z]+\d+[-_]\w+\.csv$} readdir"
Execution of stephen_label_script.pl aborted due to compilation errors.


Cheers again,
errors2.jpg
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39618205
Sigh - stupid grep syntax - block format does not need a comma but regex format does.  Add a comma after the regex and before readdir on the grep/readdir line.  That should fix the problem.

To be specific, line 10 should be:
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$}, readdir DIR;

Open in new window

0
Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

 

Author Comment

by:StephenMcGowan
ID: 39628990
Hi wilcoxon,

Thanks for getting back to me, sorry it's taken so long for me to get back to you with this.
I've tried modifying the script with the comma inserted into line 10 above.

However when I now try to run the script, the cmd prompt cursor moves to the next line and flashes to indicate that the script is running, however, this has been continuously flashing now for the past 3 days with "MB-12-Mouse.csv" being "locked for editing". I wouldn't expect this to be a three day script-run job? ;)

Would you be able to test this script at your end using the mouse.csv and label-list and let me know if it works for you?

Thanks,

Stephen.
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39629185
Sure.  I'll test it sometime in the next few days (possibly tonight)...
0
 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 39635892
Sorry it took me this long to get back to you...

Here's a revised script with more debugging and more output.  The critical fix was adding the $i++ to the @lines loop (so, when you ran it, it was reprocessing the same line for 3 days).
#!/usr/bin/perl
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Tie::File;

#my $dir = 'C:\Users\Stephen\Desktop\LCMS';
my $dir = '.';
chdir $dir or die "could not cd to $dir: $!";

opendir DIR, '.' or die "could not open dir: $!";
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$}, readdir DIR;
closedir DIR;

my $list = get_xls_data('Label-List.xls');

foreach my $fil (@files) {
    print "processing $fil...\n";
    tie my @lines, 'Tie::File', $fil or die "could not tie $fil: $!";
    while ($lines[0] !~ m{^prot_hit_num,prot_acc}) {
        shift @lines;
    }
    die "never found prot_hit_num table indicator in $fil"
        unless ($lines[0] =~ m{^prot_hit_num,prot_acc});
    $lines[0] .= 'pep_label';
    my ($st, $en) = get_cols($lines[0]);
    my $i = 1;
    while ($lines[$i]) {
        #print "\ttable line $i\n";
        my @cols = split m{\s*,\s*}, $lines[$i];
        if (not defined $cols[$st] or not defined $cols[$en]) {
            warn "line $i missing desired columns ($st,$en) - skipping\n$lines[$i]\n";
            $i++;
            next;
        }
        if (exists $list->{$cols[$st]}{$cols[$en]}) {
            $lines[$i] .= $list->{$cols[$st]}{$cols[$en]};
        }
        $i++;
    }
    untie @lines;
}

sub get_cols {
    my ($ln) = @_;
    my @cols = split m{\s*,\s*}, $ln;
    my ($st, $en);
    for my $i (0..@cols-1) {
        if ($cols[$i] eq 'pep_start') {
            $st = $i;
            last if $en;
        } elsif ($cols[$i] eq 'pep_end') {
            $en = $i;
            last if $st;
        }
    }
    die "could not locate pep_start and pep_end in $ln" unless ($st and $en);
    return ($st, $en);
}

sub get_xls_data {
    my ($fil) = @_;
    my $parser = Spreadsheet::ParseExcel->new();
    my $xls = $parser->parse($fil) or die $parser->error();
    my $sheet = $xls->worksheet(0);
    my ($row_min, $row_max) = $sheet->row_range();
    my %data;
    for my $row ($row_min+1..$row_max) {
        my $lbl = $sheet->get_cell($row, 2);
        my $st = $sheet->get_cell($row, 4);
        my $en = $sheet->get_cell($row, 5);
        next unless (defined $lbl and defined $st and defined $en);
        $data{$st->unformatted()}{$en->unformatted()} = $lbl->unformatted();
    }
    return \%data;
}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39639566
Thanks again wilcoxon,

No worries about the time taken :)

Just a quick question, is it possible to write a simpler script which basically extracts the table for each CSV file, and outputs this table as a separate CSV file,  using "cropped_" as the filename precursor.

Say:

"cropped_MB12_Mouse.csv" etc?


Is it also possible to write that into this script as well? so that there would be two versions of each CSV file:

MB12_Mouse.csv
cropped_MB12_Mouse.csv

Thanks again,

Stephen.
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39639599
Sure.  It would be possible.  It wouldn't really be much simpler than the current one.

I'm unclear on your second question.  Would you want to add the functionality to this script as it is (eg the files would be identical) or would you want changes made to this script before adding in the logic to create cropped_*.csv?
0

Featured Post

Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
string initialization in java 11 109
WPF - External data binding to controls inside a UserControl 2 54
noX challenge 17 91
allswap challenge 6 75
Navigation is an important part of web design from a usability perspective. But it is often a pain when it comes to a developer’s perspective. By navigation, it often means menuing. This is less theory and more practical of how to get a specific gro…
Windows Script Host (WSH) has been part of Windows since Windows NT4. Windows Script Host provides architecture for building dynamic scripts that consist of a core object model, scripting hosts, and scripting engines. The key components of Window…
This tutorial will introduce the viewer to VisualVM for the Java platform application. This video explains an example program and covers the Overview, Monitor, and Heap Dump tabs.
The viewer will be introduced to the technique of using vectors in C++. The video will cover how to define a vector, store values in the vector and retrieve data from the values stored in the vector.

932 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now