Okay, this one is driving me nuts -- I just can't get it exactly right.
I have a multi-level hash of things, each with an associated numerical start and stop (among other attributes). I have created an array from this hash such that the array contains the (second level) keys of the hash in ascending numerical order:
foreach my $orf (sort {$contigs{$contig}{$a}{sta
rt} <=> $contigs{$contig}{$b}{star
t}} (keys %{$contigs{$contig}})) {
push (@ordered, $orf);
}
Some of these things (called "orfs", for open reading frames, they are contained within "contigs", which are the primary keys of the hash) are small, and contained entirely within larger orfs (of the same contig). I'd like to ignore these small open reading frames. So, I thought I'd cycle through the array, and compare the orfs to each other:
my ($i, $j, %rejected, %accepted);
for ($i = 1; $i <= $#ordered; $i++) {
next if ((exists $rejected{$ordered[$i]}) || (exists $accepted{$ordered[$i]}));
for ($j = $i+1; $j <= $#ordered; $j++) {
last if ($contigs{$contig}{$ordere
d[$j]}{sta
rt} > $contigs{$contig}{$ordered
[$i]}{end}
);
if (($contigs{$contig}{$order
ed[$j]}{st
art} > $contigs{$contig}{$ordered
[$i]}{star
t}) && ($contigs{$contig}{$ordere
d[$j]}{end
} < $contigs{$contig}{$ordered
[$i]}{end}
)) {
$rejected{$ordered[$j]} = '1';
} else {
push (@non_overlap, $j) unless exists $accepted{$ordered[$j]};
$accepted{$ordered[$j]} = '1';
}
}
}
foreach (@non_overlap) {
# do stuff
}
But it doesn't work. My output contains, for example:
>Contig2635_30 [1870 - 3501] (neg) 1632 bp (46.7% GC) 543 aa (62.9 kDa)
...data...
>Contig2635_1 [1891 - 2238] (pos) 348 bp (48.3% GC) 115 aa (13.6 kDa)
...data...
>Contig2635_2 [2429 - 2734] (pos) 306 bp (50.3% GC) 101 aa (11.7 kDa)
...data...
>Contig2635_3 [2596 - 3072] (pos) 477 bp (48.0% GC) 158 aa (17.6 kDa)
...data...
>Contig2635_4 [3431 - 3748] (pos) 318 bp (39.0% GC) 105 aa (12.5 kDa)
...data...
One can see that orf 1 (_1) of contig 2635 (occupying 1891 - 2238, start and end, respectively) is contained completely within orf 30 of the same contig (1870 - 3501), as are orfs 2 & 3 -- they should all (1, 2 & 3) be removed, because they're all completely contained within the larger orf 30. Note that orf 4 should be retained at this stage because, while it does overlap orf 30 (and is probably not real), it's not completely contained within it. These data also serve to demonstrate that there may be more than one small orf within a single larger one (there also, of course, may be none, or one).
The structure of the hash containing all these data is, for example:
$VAR1 = 'Contig2635';
$VAR2 = {
'1' => {
'aa' => '115',
'bp' => 348,
'end' => 2238,
'hdr' => '>Contig2635_1 [1891 - 2238] 348 bp, 115 aa',
'num' => '1',
'ori' => 'pos',
'start' => 1891
},
'30' => {
'aa' => '543',
'bp' => 1632,
'end' => 3501,
'hdr' => '>Contig2635_30 [1870 - 3501] 1632 bp, 543 aa',
'num' => '30',
'ori' => 'neg',
'start' => 1870
},
# etc.
I have tried many variations on the loop code shown above, but none has succeeded. Any help would be appreciated.
Note that the @ordered array contains at least hundreds (more likely thousands) of entries, thus I've attempted to limit the work done by the nested loops by making $j dependent on $i, so that you're not needlessly checking $j = entry 12 against $i = entry 500, because they're so far apart from each other there's no chance one is contained within the other. In like manner, I've tried to limit comparing $j = 500 against $i = 12, with the line "last if ($contigs{$contig}{$ordere
d[$j]}{sta
rt} > $contigs{$contig}{$ordered
[$i]}{end}
);", because if the start of $j is already greater than the end of $i, there's also no chance one contains the other.
I've also included hashes (%rejected, %accepted) to allow past history to be checked so an orf (hash key) doesn't get put in the @non_overlap array more than once, and to allow bypassing $i if the orf was already rejected as a prior $j (at some point in my trying to get this loop to work I thought this was needed, it may not be...).
But, as I said, it's not working, so anything is subject to change. Can someone see the logic error? Is there a better way of doing this?
Start Free Trial