Solved

CSV file script for cropping and matching

Posted on 2013-10-29
9
295 Views
Last Modified: 2013-11-11
Hi,

I have a folder of .CSV files which I'd like a script to process.

The files have the following naming convention:

MB12_Mouse.csv
MB13_Chicken.csv
MB14_Pig.csv

etc

And I have located these to the following directory: C:\Users\Stephen\Desktop\LCMS

I've attached an example .csv file (Mouse) to show the format and style of the CSV file I'm currently working with.

Roughly 100 rows into each file is a table, row headings beginning with "prot_hit_num, prot acc" etc.
I basically need to modify each .csv file to keep this table but remove all prior information for each file.
Therefore this table header would move from row 99 to row 1, and the rest of the table would move up.

The second thing I'd like to do with this table, is to add a new column called "pep_label" and insert this one column to the right hand side of the "pep_end" column in the table.

For each row of the table, I would like to match up "pep_start" and "pep_end" with "Start" and "End" of Label_List.xls (attached) and if:

pep_start matches Start
and
pep_end matches End

return "Label" from Label_List.xls and put this into the"pep_label" column on the appropriate row in the .csv table.
MB12-Mouse.csv
Label-List.xls
0
Comment
Question by:StephenMcGowan
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 5
  • 3
9 Comments
 
LVL 45

Expert Comment

by:aikimark
ID: 39611713
What do you want to happen with the second table (peptide matches not assigned to protein hits)?

This question is currently in the Perl zone.  Are you only accepting Perl solutions?

Once the data is 'cleaned up', what are you going to do with it?
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39615937
Give this a shot...
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Tie::File;

my $dir = 'C:\Users\Stephen\Desktop\LCMS';
chdir $dir or die "could not cd to $dir: $!";

opendir DIR, '.' or die "could not open dir: $!";
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$} readdir DIR;
closedir DIR;

my $list = get_xls_data('Label-List.xls');

foreach my $fil (@files) {
    tie my @lines, 'Tie::File', $fil or die "could not tie $fil: $!";
    while ($lines[0] !~ m{^prot_hit_num,prot_acc}) {
        shift @lines;
    }
    die "never found prot_hit_num table indicator in $fil"
        unless ($lines[0] =~ m{^prot_hit_num,prot_acc});
    $lines[0] .= 'pep_label';
    my ($st, $en) = get_cols($lines[0]);
    my $i = 1;
    while ($lines[$i]) {
        my @cols = split m{\s*,\s*}, $lines[$i];
        if (exists $list->{$cols[$st]}{$cols[$en]}) {
            $lines[$i] .= $list->{$cols[$st]}{$cols[$en]};
        }
    }
    untie @lines;
}

sub get_cols {
    my ($ln) = @_;
    my @cols = split m{\s*,\s*}, $ln;
    my ($st, $en);
    for my $i (0..@cols-1) {
        if ($cols[$i] eq 'pep_start') {
            $st = $i;
            last if $en;
        } elsif ($cols[$i] eq 'pep_end') {
            $en = $i;
            last if $st;
        }
    }
    die "could not locate pep_start and pep_end in $ln" unless ($st and $en);
    return ($st, $en);
}

sub get_xls_data {
    my ($fil) = @_;
    my $parser = Spreadsheet::ParseExcel->new();
    my $xls = $parser->parse($fil) or die $parser->error();
    my $sheet = $xls->worksheet(0);
    my ($row_min, $row_max) = $sheet->row_range();
    my %data;
    for my $row ($row_min+1..$row_max) {
        my $lbl = $sheet->get_cell($row, 2);
        my $st = $sheet->get_cell($row, 4);
        my $en = $sheet->get_cell($row, 5);
        next unless (defined $lbl and defined $st and defined $en);
        $data{$st->unformatted()}{$en->unformatted()} = $lbl->unformatted();
    }
    return \%data;
}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39616575
Hey wilcoxon,

Gave it a shot, cheers for getting back to me.

At the moment I'm receiving the following errors:

Not enough arguments for grep at stephen_label_script.pl line 10, near "m{^[A-Z]+\d+[-_]\w+\.csv$} readdir"
syntax error at stephen_label_script.pl line 10, near "m{^[A-Z]+\d+[-_]\w+\.csv$} readdir"
Execution of stephen_label_script.pl aborted due to compilation errors.


Cheers again,
errors2.jpg
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 26

Expert Comment

by:wilcoxon
ID: 39618205
Sigh - stupid grep syntax - block format does not need a comma but regex format does.  Add a comma after the regex and before readdir on the grep/readdir line.  That should fix the problem.

To be specific, line 10 should be:
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$}, readdir DIR;

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39628990
Hi wilcoxon,

Thanks for getting back to me, sorry it's taken so long for me to get back to you with this.
I've tried modifying the script with the comma inserted into line 10 above.

However when I now try to run the script, the cmd prompt cursor moves to the next line and flashes to indicate that the script is running, however, this has been continuously flashing now for the past 3 days with "MB-12-Mouse.csv" being "locked for editing". I wouldn't expect this to be a three day script-run job? ;)

Would you be able to test this script at your end using the mouse.csv and label-list and let me know if it works for you?

Thanks,

Stephen.
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39629185
Sure.  I'll test it sometime in the next few days (possibly tonight)...
0
 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 39635892
Sorry it took me this long to get back to you...

Here's a revised script with more debugging and more output.  The critical fix was adding the $i++ to the @lines loop (so, when you ran it, it was reprocessing the same line for 3 days).
#!/usr/bin/perl
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Tie::File;

#my $dir = 'C:\Users\Stephen\Desktop\LCMS';
my $dir = '.';
chdir $dir or die "could not cd to $dir: $!";

opendir DIR, '.' or die "could not open dir: $!";
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$}, readdir DIR;
closedir DIR;

my $list = get_xls_data('Label-List.xls');

foreach my $fil (@files) {
    print "processing $fil...\n";
    tie my @lines, 'Tie::File', $fil or die "could not tie $fil: $!";
    while ($lines[0] !~ m{^prot_hit_num,prot_acc}) {
        shift @lines;
    }
    die "never found prot_hit_num table indicator in $fil"
        unless ($lines[0] =~ m{^prot_hit_num,prot_acc});
    $lines[0] .= 'pep_label';
    my ($st, $en) = get_cols($lines[0]);
    my $i = 1;
    while ($lines[$i]) {
        #print "\ttable line $i\n";
        my @cols = split m{\s*,\s*}, $lines[$i];
        if (not defined $cols[$st] or not defined $cols[$en]) {
            warn "line $i missing desired columns ($st,$en) - skipping\n$lines[$i]\n";
            $i++;
            next;
        }
        if (exists $list->{$cols[$st]}{$cols[$en]}) {
            $lines[$i] .= $list->{$cols[$st]}{$cols[$en]};
        }
        $i++;
    }
    untie @lines;
}

sub get_cols {
    my ($ln) = @_;
    my @cols = split m{\s*,\s*}, $ln;
    my ($st, $en);
    for my $i (0..@cols-1) {
        if ($cols[$i] eq 'pep_start') {
            $st = $i;
            last if $en;
        } elsif ($cols[$i] eq 'pep_end') {
            $en = $i;
            last if $st;
        }
    }
    die "could not locate pep_start and pep_end in $ln" unless ($st and $en);
    return ($st, $en);
}

sub get_xls_data {
    my ($fil) = @_;
    my $parser = Spreadsheet::ParseExcel->new();
    my $xls = $parser->parse($fil) or die $parser->error();
    my $sheet = $xls->worksheet(0);
    my ($row_min, $row_max) = $sheet->row_range();
    my %data;
    for my $row ($row_min+1..$row_max) {
        my $lbl = $sheet->get_cell($row, 2);
        my $st = $sheet->get_cell($row, 4);
        my $en = $sheet->get_cell($row, 5);
        next unless (defined $lbl and defined $st and defined $en);
        $data{$st->unformatted()}{$en->unformatted()} = $lbl->unformatted();
    }
    return \%data;
}

Open in new window

0
 

Author Comment

by:StephenMcGowan
ID: 39639566
Thanks again wilcoxon,

No worries about the time taken :)

Just a quick question, is it possible to write a simpler script which basically extracts the table for each CSV file, and outputs this table as a separate CSV file,  using "cropped_" as the filename precursor.

Say:

"cropped_MB12_Mouse.csv" etc?


Is it also possible to write that into this script as well? so that there would be two versions of each CSV file:

MB12_Mouse.csv
cropped_MB12_Mouse.csv

Thanks again,

Stephen.
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39639599
Sure.  It would be possible.  It wouldn't really be much simpler than the current one.

I'm unclear on your second question.  Would you want to add the functionality to this script as it is (eg the files would be identical) or would you want changes made to this script before adding in the logic to create cropped_*.csv?
0

Featured Post

What Is Transaction Monitoring and who needs it?

Synthetic Transaction Monitoring that you need for the day to day, which ensures your business website keeps running optimally, and that there is no downtime to impact your customer experience.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Go is an acronym of golang, is a programming language developed Google in 2007. Go is a new language that is mostly in the C family, with significant input from Pascal/Modula/Oberon family. Hence Go arisen as low-level language with fast compilation…
This article will show, step by step, how to integrate R code into a R Sweave document
The viewer will learn additional member functions of the vector class. Specifically, the capacity and swap member functions will be introduced.
This video will show you how to get GIT to work in Eclipse.   It will walk you through how to install the EGit plugin in eclipse and how to checkout an existing repository.

724 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question