Link to home
Start Free TrialLog in
Avatar of StephenMcGowan
StephenMcGowan

asked on

CSV file script for cropping and matching

Hi,

I have a folder of .CSV files which I'd like a script to process.

The files have the following naming convention:

MB12_Mouse.csv
MB13_Chicken.csv
MB14_Pig.csv

etc

And I have located these to the following directory: C:\Users\Stephen\Desktop\LCMS

I've attached an example .csv file (Mouse) to show the format and style of the CSV file I'm currently working with.

Roughly 100 rows into each file is a table, row headings beginning with "prot_hit_num, prot acc" etc.
I basically need to modify each .csv file to keep this table but remove all prior information for each file.
Therefore this table header would move from row 99 to row 1, and the rest of the table would move up.

The second thing I'd like to do with this table, is to add a new column called "pep_label" and insert this one column to the right hand side of the "pep_end" column in the table.

For each row of the table, I would like to match up "pep_start" and "pep_end" with "Start" and "End" of Label_List.xls (attached) and if:

pep_start matches Start
and
pep_end matches End

return "Label" from Label_List.xls and put this into the"pep_label" column on the appropriate row in the .csv table.
MB12-Mouse.csv
Label-List.xls
Avatar of aikimark
aikimark
Flag of United States of America image

What do you want to happen with the second table (peptide matches not assigned to protein hits)?

This question is currently in the Perl zone.  Are you only accepting Perl solutions?

Once the data is 'cleaned up', what are you going to do with it?
Give this a shot...
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Tie::File;

my $dir = 'C:\Users\Stephen\Desktop\LCMS';
chdir $dir or die "could not cd to $dir: $!";

opendir DIR, '.' or die "could not open dir: $!";
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$} readdir DIR;
closedir DIR;

my $list = get_xls_data('Label-List.xls');

foreach my $fil (@files) {
    tie my @lines, 'Tie::File', $fil or die "could not tie $fil: $!";
    while ($lines[0] !~ m{^prot_hit_num,prot_acc}) {
        shift @lines;
    }
    die "never found prot_hit_num table indicator in $fil"
        unless ($lines[0] =~ m{^prot_hit_num,prot_acc});
    $lines[0] .= 'pep_label';
    my ($st, $en) = get_cols($lines[0]);
    my $i = 1;
    while ($lines[$i]) {
        my @cols = split m{\s*,\s*}, $lines[$i];
        if (exists $list->{$cols[$st]}{$cols[$en]}) {
            $lines[$i] .= $list->{$cols[$st]}{$cols[$en]};
        }
    }
    untie @lines;
}

sub get_cols {
    my ($ln) = @_;
    my @cols = split m{\s*,\s*}, $ln;
    my ($st, $en);
    for my $i (0..@cols-1) {
        if ($cols[$i] eq 'pep_start') {
            $st = $i;
            last if $en;
        } elsif ($cols[$i] eq 'pep_end') {
            $en = $i;
            last if $st;
        }
    }
    die "could not locate pep_start and pep_end in $ln" unless ($st and $en);
    return ($st, $en);
}

sub get_xls_data {
    my ($fil) = @_;
    my $parser = Spreadsheet::ParseExcel->new();
    my $xls = $parser->parse($fil) or die $parser->error();
    my $sheet = $xls->worksheet(0);
    my ($row_min, $row_max) = $sheet->row_range();
    my %data;
    for my $row ($row_min+1..$row_max) {
        my $lbl = $sheet->get_cell($row, 2);
        my $st = $sheet->get_cell($row, 4);
        my $en = $sheet->get_cell($row, 5);
        next unless (defined $lbl and defined $st and defined $en);
        $data{$st->unformatted()}{$en->unformatted()} = $lbl->unformatted();
    }
    return \%data;
}

Open in new window

Avatar of StephenMcGowan
StephenMcGowan

ASKER

Hey wilcoxon,

Gave it a shot, cheers for getting back to me.

At the moment I'm receiving the following errors:

Not enough arguments for grep at stephen_label_script.pl line 10, near "m{^[A-Z]+\d+[-_]\w+\.csv$} readdir"
syntax error at stephen_label_script.pl line 10, near "m{^[A-Z]+\d+[-_]\w+\.csv$} readdir"
Execution of stephen_label_script.pl aborted due to compilation errors.


Cheers again,
errors2.jpg
Sigh - stupid grep syntax - block format does not need a comma but regex format does.  Add a comma after the regex and before readdir on the grep/readdir line.  That should fix the problem.

To be specific, line 10 should be:
my @files = grep m{^[A-Z]+\d+[-_]\w+\.csv$}, readdir DIR;

Open in new window

Hi wilcoxon,

Thanks for getting back to me, sorry it's taken so long for me to get back to you with this.
I've tried modifying the script with the comma inserted into line 10 above.

However when I now try to run the script, the cmd prompt cursor moves to the next line and flashes to indicate that the script is running, however, this has been continuously flashing now for the past 3 days with "MB-12-Mouse.csv" being "locked for editing". I wouldn't expect this to be a three day script-run job? ;)

Would you be able to test this script at your end using the mouse.csv and label-list and let me know if it works for you?

Thanks,

Stephen.
Sure.  I'll test it sometime in the next few days (possibly tonight)...
ASKER CERTIFIED SOLUTION
Avatar of wilcoxon
wilcoxon
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanks again wilcoxon,

No worries about the time taken :)

Just a quick question, is it possible to write a simpler script which basically extracts the table for each CSV file, and outputs this table as a separate CSV file,  using "cropped_" as the filename precursor.

Say:

"cropped_MB12_Mouse.csv" etc?


Is it also possible to write that into this script as well? so that there would be two versions of each CSV file:

MB12_Mouse.csv
cropped_MB12_Mouse.csv

Thanks again,

Stephen.
Sure.  It would be possible.  It wouldn't really be much simpler than the current one.

I'm unclear on your second question.  Would you want to add the functionality to this script as it is (eg the files would be identical) or would you want changes made to this script before adding in the logic to create cropped_*.csv?