travisbickle34
Retrieve entries from a text file
I'm in pretty desperate need of a working Perl script. I know exactly what the code needs to do but I just don't know Perl!
Here's the problem:
I have a text file consisting of thousands of entries in the following format:
>Entry 1
blahblahblahblahblahblahbl ahblahblah blahblahbl a
blahblahblahblahblahblahbl ahblahblah blahblahbl a
blahblahblahblahblahblahbl ahblahblah blahblahbl ah
>Entry 2
etcetcetcetcetcetcetcetcet cetcetcetc etcetcetce tcetcet
etcetcetcetcetcetcetcetcet cetcetcetc etcetcetce tcetc
>Entry 1000
blahetcblahetcblahetcblahe tcblahetcb lahetcblah etcb
ahetcblahetcblahetcblahetc blahetcbla hetcblahet cbl
hetcblahetc
I need to cut some entries from the file. The best way for me to do this would be to use an input file listing the entry names I want removed.
eg.
>Entry1
>Entry 1000
>Entry Infinity
Thus I would provide the entry names as an input file, and the Perl script would iteratively search for each Entry name provided. When it finds an entry it is searching for it would cut the complete entry from the file. The '>' sign is the handiest delimiter to use since every entry name is preceded by it.
Any help would be SINCERELY appreciated.
Cheers in Advance,
tb34
SOLUTION
ASKER
I've adapted and run the script. It runs ok but the output file produced is empty.
Any suggestions??
ASKER
I'm increasing the points for a working solution to this problem - as I said it's pretty important! :)
SOLUTION
I forgot, you need to keep the > at the beginning of each section.
So, change:
$keep{$1} = $_ if /^(Entry.*?\n)/i;
to this:
$keep{$1} = ">$_" if /^(Entry.*?\n)/i;
If you have a large number of sections that need to be removed, it might be faster if we iterate over each element of the @delete array instead of joining the array.
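As an alternative to either joining the array or looping over it, here is a minimal sketch that stores the names in a hash and looks up each record's header line. This is a different technique from the scripts posted in this thread; the names filter_records and %delete and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of all records whose header line
# is NOT in the delete list.
# $text  is the whole data file as one string.
# $names is an array ref of headers such as ">Entry 1" (leading ">" included).
sub filter_records {
    my ($text, $names) = @_;
    my %delete = map { $_ => 1 } @$names;    # one hash lookup per record
    my @keep;
    # Split before every ">" that starts a line, so each chunk is one record.
    for my $record (split /^(?=>)/m, $text) {
        my ($header) = $record =~ /^(.*)$/m;  # first line of the record
        next unless defined $header;
        next if $delete{$header};             # exact match on the whole header
        push @keep, $record;
    }
    return join '', @keep;
}

my $data = ">Entry 1\naaa\naaa\n>Entry 2\nbbb\n>Entry 1000\nccc\n";
print filter_records($data, [ '>Entry 1', '>Entry 2' ]);   # prints only the Entry 1000 record
```

Because the lookup compares the whole header line, ">Entry 1" cannot accidentally match ">Entry 1000", and the cost per record does not grow with the length of the delete list.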
ASKER
Something's not right.
Using this as a file for processing:
>Entry 1
blahblahblahblahblahblahbl ahblahblah blahblahbl a
blahblahblahblahblahblahbl ahblahblah blahblahbl a
blahblahblahblahblahblahbl ahblahblah blahblahbl ah
>Entry 2
etcetcetcetcetcetcetcetcet cetcetcetc etcetcetce tcetcet
etcetcetcetcetcetcetcetcet cetcetcetc etcetcetce tcetc
>Entry 1000
blahetcblahetcblahetcblahe tcblahetcb lahetcblah etcb
ahetcblahetcblahetcblahetc blahetcbla hetcblahet cbl
hetcblahetc
And using this as a delete list:
>Entry 1
>Entry 2
The output I get is:
>Entry 1
blahblahblahblahblahblahbl ahblahblah blahblahbl a
blahblahblahblahblahblahbl ahblahblah blahblahbl a
blahblahblahblahblahblahbl ahblahblah blahblahbl ah
>Entry 1000
blahetcblahetcblahetcblahe tcblahetcb lahetcblah etcb
ahetcblahetcblahetcblahetc blahetcbla hetcblahet cbl
hetcblahetc
>Entry 2
etcetcetcetcetcetcetcetcet cetcetcetc etcetcetce tcetcet
etcetcetcetcetcetcetcetcet cetcetcetc etcetcetce tcetc
I dunno what's happening :-\
Hmm, I must have changed something between testing and posting.
Change these 2 lines:
next if (/($delete)/ or /^$/);
$keep{$1} = ">$_" if /^(Entry.*?\n)/i;
to this:
next if (/^($delete)\n/);
$keep{$1} = ">$_" if (/^(Entry[^\n]+)/i);
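# Read the list of entry names to delete, one per line (each starts with ">").
open DEL, "<delete.txt" or die $!;
@delete = <DEL>;
chomp @delete;
close DEL;
# Build one regex that matches any listed name as a complete line.
$delete = join('|', map "\Q$_\E", @delete);
$delete = qr/^($delete)$/m;
open IN, "<travis.txt" or die $!;
{
    local $/ = ">";         # read one ">"-delimited entry at a time
    while (<IN>) {
        chomp;              # strip the trailing ">" separator
        next unless length;
        s/^/>/;             # restore the leading ">"
        next if /$delete/;  # skip entries on the delete list
        push @keep, $_;
    }
}
close IN;
# Overwrite the original file with the entries that were kept.
open OUT, ">travis.txt" or die $!;
print OUT @keep;
close OUT;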
ozo's right, using an array would be better than the hash I used. And, I'm sure that his method of constructing the regex is better, but I'm not exactly sure why.
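One likely reason the qr/^($delete)$/m form behaves better: the ^ and $ anchors force a match against a whole line, so ">Entry 1" cannot match inside ">Entry 1000". A small sketch of the anchoring part, with variable names and sample data made up for illustration:

```perl
use strict;
use warnings;

my @delete = ('>Entry 1', '>Entry 2');
my $alternation = join '|', map { quotemeta } @delete;   # same effect as "\Q$_\E"

my $unanchored = qr/($alternation)/;       # matches anywhere in the record
my $anchored   = qr/^($alternation)$/m;    # must equal one complete line

# A record that is NOT on the delete list.
my $record = ">Entry 1000\nblahetcblahetc\n";

my $unanchored_hit = $record =~ $unanchored ? 1 : 0;   # 1: ">Entry 1" is a prefix of ">Entry 1000"
my $anchored_hit   = $record =~ $anchored   ? 1 : 0;   # 0: no line equals ">Entry 1" or ">Entry 2"

print "unanchored=$unanchored_hit anchored=$anchored_hit\n";   # prints "unanchored=1 anchored=0"
```

With the unanchored pattern, the Entry 1000 record would be treated as if it were on the delete list.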
ASKER
Both scripts are working perfectly now - thanks guys!
One last question - is there a simple way to make ozo's script output the deleted sequences to a second file?
Also - how can I split the points between you two? It's only fair I think...
Hi travis, I did not forget you; we are just not working the same hours.
open (IN_FILE,"in_file"); # the file with the
@in_file_lines = <IN_FILE>;
close(IN_FILE);
open (BIG_FILE,"my_big_file");
@big_file_lines = <BIG_FILE>;
close(BIG_FILE);
open(NEW_FILE,">new_big_file");
To create another file:
my @del_lines;
open (DEL_LINES,">deleted_lines.txt"); # put this line near the "open" of the other file
$big_file_index = 0;
# note: the entry number is used directly as a line index into the big file
foreach $bad_entry (@in_file_lines) {
    $bad_entry =~ /Entry\s*(\d*)/;
    $bad_line = $1;
    while ($big_file_index < $bad_line) {
        print NEW_FILE $big_file_lines[$big_file_index];
        $big_file_index++;
    }
    push @del_lines,$big_file_lines[$big_file_index];
    $big_file_index++;
}
# write out whatever follows the last deleted line, then the deleted lines
print NEW_FILE @big_file_lines[$big_file_index .. $#big_file_lines];
print DEL_LINES @del_lines;
close(NEW_FILE);
close(DEL_LINES);
tal
if( /$delete/ ){
print SECOND_FILE;
}else{
push @keep,$_;
}
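For readers following along, here is a minimal sketch of how the if/else above might slot into the record loop. It reads from a string through an in-memory filehandle so that it runs without any files. The sub name split_records and the sample data are assumptions for illustration, the delete list is assumed to be non-empty, and this is not the members-only accepted solution.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (\@keep, \@removed) for the given text and delete list.
sub split_records {
    my ($text, @delete) = @_;
    my $alternation = join '|', map { quotemeta } @delete;
    my $delete = qr/^($alternation)$/m;

    my (@keep, @removed);
    open my $in, '<', \$text or die $!;    # in-memory filehandle
    local $/ = '>';                         # read one ">"-delimited record at a time
    while (my $record = <$in>) {
        chomp $record;                      # strip the trailing ">" separator
        next unless length $record;         # skip the empty chunk before the first ">"
        $record = ">$record";               # put the leading ">" back
        if ($record =~ $delete) {
            push @removed, $record;         # these would go to SECOND_FILE
        } else {
            push @keep, $record;
        }
    }
    close $in;
    return (\@keep, \@removed);
}

# ">Entry 2" is deliberately a single-line entry with no body.
my $data = ">Entry 1\naaa\n>Entry 2\n>Entry 1000\nccc\n";
my ($keep, $removed) = split_records($data, '>Entry 1', '>Entry 2');
print "KEEP:\n", @$keep, "REMOVED:\n", @$removed;
```

In a file-based script the two arrays would be printed to two separate output filehandles instead of being returned.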
ASKER
Ozo - can you edit your alteration into this piece of script? I seem to be making a bollocks of it somehow :(
open DEL, "<list" or die $!;
@delete = <DEL>;
chomp @delete;
close DEL;
$delete = join('|', map"\Q$_\E",@delete);
$delete =qr/^($delete)$/m;
open IN, "<input" or die $!;
{ local $/ = ">";
while (<IN>) {
chomp;
next unless length;
s/^/>/;
next if /$delete/;
push @keep, $_;
}
}
open OUT, ">output" or die $!;
print OUT @keep;
ASKER CERTIFIED SOLUTION
ASKER
Ok - I don't know what has happened but the above script now just seems to output all entries to both output files. My head hurts...
ASKER
Actually - it seems to be working fine now!
I don't know what was happening there. By any chance do entries consisting of only a single line mess up the process somehow?
Regardless - I'm allocating points now.
Thanks guys.
ASKER
I should probably point out that the actual syntax of the entry names is as follows:
>ADXCAPD.x.C.y
where x and y are numbers...
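Names of that shape contain dots, which a regex treats as "any character", so the \Q...\E quoting used when building the delete pattern matters here. A short sketch; the concrete names below are invented to fit the >ADXCAPD.x.C.y pattern and are not taken from the real data.

```perl
use strict;
use warnings;

my $name = '>ADXCAPD.1.C.2';              # invented name of the stated shape

my $raw    = qr/^$name$/m;                # unquoted: each "." is a wildcard
my $quoted = qr/^\Q$name\E$/m;            # quoted: each "." must be a literal dot

# A different (also invented) entry: it has "1" where the last dot would be.
my $other = ">ADXCAPD.1.C12\nblah\n";

my $raw_hit    = $other =~ $raw    ? 1 : 0;   # 1: the unquoted "." matched the "1"
my $quoted_hit = $other =~ $quoted ? 1 : 0;   # 0: the quoted pattern needs a real "."

print "raw=$raw_hit quoted=$quoted_hit\n";    # prints "raw=1 quoted=0"
```

Without the quoting, an entry that merely resembles a listed name could be removed by mistake.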