Solved

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt

Posted on 2009-02-14
Last Modified: 2012-05-06
I have the flat file refGene.txt. I need to read this file with a Perl script and split each line into columns delimited by whitespace, giving an array of columns indexed from 0 to 15.
Column C[2] is the name of the chromosome, C[9] holds the beginning of each exon (comma-delimited), and C[10] holds the end of each exon (comma-delimited).
The last column, C[L], holds values such as -1,-1,0,2,...
The number of entries in C[L] equals the number of entries in C[9] and in C[10].
I need to parse the data in the last column and drop all the -1 values from the beginning, plus the number that follows them. The next position is the one I need to remember, say in a variable x (the start position).
Then I need to parse the same data from the end toward the beginning and drop all the -1 values from the end, plus the number next to them. The next position is remembered in a variable y (the end position).
Using x (the start index), I will select the entry at the same index in C[9] and associate that value with the chromosome name from C[2].
Using y (the end index), I will select the entry at the same index in C[10] and associate it with the chromosome name from C[2].
In the end I need a file with one entry per line of the form chromosomeName_valueFromC[9]_valueFromC[10].
I need the script to read refGene.txt from my working directory on Unix and create a new file in the same directory with the chromosome names and associated values.
I was thinking that to parse the last column I need to do something like this:
i = 0; while (C[L][i] == -1) { i++; if (C[L][i] != -1) { i++ } }; x = i
and from the end: i = (length of C[L]) - 1; while (C[L][i] == -1) { i--; if (C[L][i] != -1) { i-- } }; y = i
Ch_start_end = C[2] _ C[9][x] _ C[10][y]
If the last column contains only -1 values, we just drop the line and go on to the next line in the file, with the next chromosome.
Thank you very much; if you have further questions I will answer them. Viewed from a Unix terminal, refGene.txt contains columns indexed from 0 to 15. Each line needs to be split on whitespace first, since whitespace is the column delimiter, and then the entries within a column need to be split on commas. I need your assistance as soon as possible.

Question by:Malna2009

Expert Comment

by:oleber
ID: 23643807
Let's do it like this: I'm not a bioinformatics person, so consider me a complete beginner in that area. Give me one worked example of all this cleaning, and then maybe you will get your support.

Expert Comment

by:oleber
ID: 23644184
I didn't understand the whole algorithm. The lines seem to be in a different order, and some of the remaining fields are missing.

Test my code and tell me what is missing.

Keep the discussion here; other experts may be able to help you too.
use strict;
use warnings;
 
use Data::Dumper;
 
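# Given the values of the last column (index 15), return the indices that
# survive trimming: leading -1 values are dropped together with the value
# that follows them, and the same is done from the end.
sub find_index {
    my @c15 = @_;
    my @index = (0 .. $#c15);
    # Trim from the front.
    while (@c15 and $c15[0] == -1) {
        shift(@c15);
        shift(@index);
        if (@c15 and $c15[0] != -1) {
            shift(@c15);
            shift(@index);
        }
    }
    # Trim from the back.
    while (@c15 and $c15[-1] == -1) {
        pop(@c15);
        pop(@index);
        if (@c15 and $c15[-1] != -1) {
            pop(@c15);
            pop(@index);
        }
    }
    return @index;    # empty list when nothing is left
}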
 
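while (my $line = <DATA>) {
    chomp $line;
    my @columns = split(/\s+/, $line);
    # Skip blank or short lines that do not have all 16 columns (0..15);
    # without this check the blank line in the data triggers warnings.
    next unless defined $columns[15];
    # Turn the comma-delimited columns into array references.
    foreach my $i (9,10,15) {
        $columns[$i] = [ split(/,/, $columns[$i]) ];
    }

    my @index = find_index(@{$columns[15]});
    if ( @index ) {
        print "${columns[9][$index[0]]}_${columns[10][$index[-1]]}\n";
    }
}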
 
__DATA__
1643	NM_016459	chr5	+	138751155	138753504	138751352	138753444	4	138751155,138751608,138752048,138753267,	           138751509,138751719,138752173,138753504,	0	MGC29506	cmpl	cmpl	        -1,2,0,0,
977	NM_057088	chr12	-	51469735	51476159	51470092	51476093	9	51469735,51470888,51471256,51471741,51472289,51472761,51473213,51474161,51475448,	51470409,51470923,51471477,51471867,51472454,51472857,51473274,51474382,51476159,	0	KRT3	cmpl	cmpl	-1,-1,2,0,0,2,0,0,-1,
959	M_138450	chr13	+	49100624	49105732	49102584	49103175	2	49100624,49102566,	                                    49100752,49105732,	                         0	ARL11	        cmpl	cmpl	         -1,0,
 
984	NM_005176	chr12	-	52345210	52356376	52345364	52356243	5	52345210,52349198,52349921,52352626,52356103,	52345479,52349392,52349999,52352696,52356376,	         0	ATP5G2	          cmpl	cmpl	         2,0,0,2,0,


Expert Comment

by:oleber
ID: 23644193
Maybe the print should be more like:
            print "_${columns[9][$index[0]]}_${columns[10][$index[-1]]}_$columns[3].fna\n";

But I don't get the first field; how do you construct it?

Accepted Solution

by:
oleber earned 2000 total points
ID: 23644276
Call the script like this:

perl script data_file.txt

Then give me some feedback.


use strict;
use warnings;
 
use Data::Dumper;
 
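# Return the indices of the last column's values that survive trimming
# the leading and trailing -1 values (plus the value next to each run).
sub find_index {
    my @c15 = @_;
    my @index = (0 .. $#c15);
    # Trim from the front.
    while (@c15 and $c15[0] == -1) {
        shift(@c15);
        shift(@index);
        if (@c15 and $c15[0] != -1) {
            shift(@c15);
            shift(@index);
        }
    }
    # Trim from the back.
    while (@c15 and $c15[-1] == -1) {
        pop(@c15);
        pop(@index);
        if (@c15 and $c15[-1] != -1) {
            pop(@c15);
            pop(@index);
        }
    }
    return @index;    # empty list when nothing is left
}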
 
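# Read the file(s) named on the command line, one line at a time.
while (my $line = <>) {
    chomp $line;
    my @columns = split(/\s+/, $line);
    # Skip blank or short lines that do not have all 16 columns (0..15).
    next unless defined $columns[15];
    # Turn the comma-delimited columns into array references.
    foreach my $i (9,10,15) {
        $columns[$i] = [ split(/,/, $columns[$i]) ];
    }

    my @index = find_index(@{$columns[15]});
    if ( @index ) {
        # Output format: chromosome_start_end_strand.fna
        print "$columns[2]_${columns[9][$index[0]]}_${columns[10][$index[-1]]}_$columns[3].fna\n";
    }
}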
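
As a rough check of what the script above produces, the following sketch applies the same trimming rule to two of the sample rows posted earlier in this thread. The helper name region_name is hypothetical and only used for illustration; find_index is repeated so the example runs on its own.

```perl
use strict;
use warnings;

# Same trimming rule as find_index above, repeated so this example is self-contained.
sub find_index {
    my @frames = @_;
    my @index  = (0 .. $#frames);
    while (@frames and $frames[0] == -1) {
        shift @frames; shift @index;
        if (@frames and $frames[0] != -1) { shift @frames; shift @index; }
    }
    while (@frames and $frames[-1] == -1) {
        pop @frames; pop @index;
        if (@frames and $frames[-1] != -1) { pop @frames; pop @index; }
    }
    return @index;
}

# region_name (hypothetical helper): one refGene line in, one name out.
# Returns undef when the line is short or nothing survives the trimming.
sub region_name {
    my ($line) = @_;
    my @c = split /\s+/, $line;
    return unless defined $c[15];
    my @starts = split /,/, $c[9];
    my @ends   = split /,/, $c[10];
    my @index  = find_index(split /,/, $c[15]);
    return unless @index;
    return "$c[2]_$starts[$index[0]]_$ends[$index[-1]]_$c[3].fna";
}

# Two rows copied from the sample data in this thread (tab-separated).
my @sample = (
    join("\t", '977', 'NM_057088', 'chr12', '-', '51469735', '51476159',
         '51470092', '51476093', '9',
         '51469735,51470888,51471256,51471741,51472289,51472761,51473213,51474161,51475448,',
         '51470409,51470923,51471477,51471867,51472454,51472857,51473274,51474382,51476159,',
         '0', 'KRT3', 'cmpl', 'cmpl', '-1,-1,2,0,0,2,0,0,-1,'),
    join("\t", '959', 'M_138450', 'chr13', '+', '49100624', '49105732',
         '49102584', '49103175', '2',
         '49100624,49102566,', '49100752,49105732,',
         '0', 'ARL11', 'cmpl', 'cmpl', '-1,0,'),
);

for my $line (@sample) {
    my $name = region_name($line);
    print defined $name ? "$name\n" : "(skipped)\n";
}
```

For the KRT3 row the surviving indices are 3 to 6, so the sketch prints chr12_51471741_51473274_-.fna. For the ARL11 row nothing survives the trimming, so the row is skipped.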


Author Comment

by:Malna2009
ID: 23644340
So my complete code, selecting the working directory of refGene.txt, should be...?
 $WorkDir="/import/.......regGene.txt
$FileList=$WorkDir."/  .../...       /refGene.txt";
open(File,$FileList);


and I need the results printed to the new file...?
open(OUT,">".$FileShell) or die "no output $FileShell\n";
print OUT "mkdir $NewDir\n";
....
close(OUT);
}
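
One possible way to handle the input and output files is sketched below. It assumes the per-line logic is wrapped in a code reference; the name write_names and both file paths are hypothetical. Since the accepted script already reads the file named on the command line and prints to standard output, a simpler route may be shell redirection, for example perl script refGene.txt > output.txt (the output file name is only an example).

```perl
use strict;
use warnings;

# write_names (hypothetical helper): read $in_path line by line, pass each
# non-blank line to the code reference $to_name, and write every defined
# result to $out_path, one per line. Returns the number of lines written.
sub write_names {
    my ($in_path, $out_path, $to_name) = @_;
    # Three-argument open with lexical filehandles; die with the OS error.
    open(my $in,  '<', $in_path)  or die "cannot read $in_path: $!\n";
    open(my $out, '>', $out_path) or die "cannot write $out_path: $!\n";
    my $written = 0;
    while (my $line = <$in>) {
        chomp $line;
        next unless $line =~ /\S/;      # skip blank lines
        my $name = $to_name->($line);
        next unless defined $name;      # skip lines that produce no name
        print {$out} "$name\n";
        $written++;
    }
    close($in);
    close($out) or die "cannot close $out_path: $!\n";
    return $written;
}

# Example call (paths are assumptions; adjust them to the real locations):
# write_names('refGene.txt', 'regions.txt', sub { my ($line) = @_; ... });
```

The code reference passed as the third argument would hold the split, find_index, and name-building steps from the accepted solution, returning undef for lines that should be dropped.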