Retrieve entries from a text file

I'm in pretty desperate need of a working Perl script. I know exactly what the code needs to do but I just don't know Perl!

Here's the problem:

I have a text file consisting of thousands of entries in the following format:

>Entry 1
blahblahblahblahblahblahblahblahblahblahblahbla
blahblahblahblahblahblahblahblahblahblahblahbla
blahblahblahblahblahblahblahblahblahblahblahblah
>Entry 2
etcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcet
etcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetc
>Entry 1000
blahetcblahetcblahetcblahetcblahetcblahetcblahetcb
ahetcblahetcblahetcblahetcblahetcblahetcblahetcbl
hetcblahetc

I need to cut some entries from the file. The best way for me to do this would be to use an input file listing the entry names I want removed.

e.g.
>Entry 1
>Entry 1000
>Entry Infinity

Thus I would provide the entry names as an input file, and the Perl script would iteratively search for each entry name provided. When it finds an entry it is searching for, it would cut the complete entry from the file. The '>' sign is the handiest delimiter to use, since every entry name is preceded by it.

Any help would be SINCERELY appreciated.

Cheers in Advance,

tb34

SOLUTION

Talmash

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

travisbickle34

ASKER

I'll probably have to alter the script slightly from time to time and I'm just more comfortable with Perl!

I should probably point out that the actual syntax of the entry names is as follows:

>ADXCAPD.x.C.y

where x and y are numbers...

travisbickle34

ASKER

I've adapted and run the script. It runs ok but the output file produced is empty.

Any suggestions??

travisbickle34

ASKER

I'm increasing the points for a working solution to this problem - as I said it's pretty important! :)

SOLUTION

FishMonger

FishMonger

I forgot, you need to keep the > at the beginning of each section.

So, change:
$keep{$1} = $_ if /^(Entry.*?\n)/i;

to this:
$keep{$1} = ">$_" if /^(Entry.*?\n)/i;

FishMonger

If you have a large number of sections that need to be removed, it might be faster if we iterate over each element of the @delete array instead of joining the array.
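
For a long delete list, another option is to skip the regex altogether. The sketch below is a swapped-in technique, not FishMonger's script: it keeps the delete names in a hash and looks up the first line of each entry by exact match. The sub name filter_entries and the in-memory sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into ">"-led records and drop those
# whose header line appears in @$names (exact match, no regex on the names).
sub filter_entries {
    my ($text, $names) = @_;
    my %delete = map { $_ => 1 } @$names;    # header line => flag
    my @keep;
    for my $record (split /^(?=>)/m, $text) {
        my ($header) = $record =~ /^(.*)$/m; # first line of the record
        next if defined $header && $delete{$header};
        push @keep, $record;
    }
    return join '', @keep;
}

my $text = ">Entry 1\nblah\n>Entry 2\netc\netc\n>Entry 1000\nblahetc\n";
print filter_entries($text, ['>Entry 1', '>Entry 2']);
```

Because the lookup is by whole header line, ">Entry 1" in the list cannot accidentally remove ">Entry 1000".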

travisbickle34

ASKER

Something's not right.

Using this as a file for processing:

>Entry 1
blahblahblahblahblahblahblahblahblahblahblahbla
blahblahblahblahblahblahblahblahblahblahblahbla
blahblahblahblahblahblahblahblahblahblahblahblah
>Entry 2
etcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcet
etcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetc
>Entry 1000
blahetcblahetcblahetcblahetcblahetcblahetcblahetcb
ahetcblahetcblahetcblahetcblahetcblahetcblahetcbl
hetcblahetc

And using this as a delete list:

>Entry 1
>Entry 2

The output I get is:

>Entry 1
blahblahblahblahblahblahblahblahblahblahblahbla
blahblahblahblahblahblahblahblahblahblahblahbla
blahblahblahblahblahblahblahblahblahblahblahblah
>Entry 1000
blahetcblahetcblahetcblahetcblahetcblahetcblahetcb
ahetcblahetcblahetcblahetcblahetcblahetcblahetcbl
hetcblahetc

>Entry 2
etcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcet
etcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetc

I dunno what's happening :-\

FishMonger

Hmm, I must have changed something between testing and posting.

Change these 2 lines:
next if (/($delete)/ or /^$/);
$keep{$1} = ">$_" if /^(Entry.*?\n)/i;

to this:
next if (/^($delete)\n/);
$keep{$1} = ">$_" if (/^(Entry[^\n]+)/i);

ozo

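# Read the names to delete, one per line, and drop the newlines.
open DEL, "<delete.txt" or die $!;
@delete = <DEL>;
chomp @delete;
close DEL;

# Build one pattern that matches any of the names as a whole line.
$delete = join('|', map "\Q$_\E", @delete);
$delete = qr/^($delete)$/m;

open IN, "<travis.txt" or die $!;
{
    local $/ = ">";          # read one ">"-delimited entry at a time
    while (<IN>) {
        chomp;               # strip the trailing ">"
        next unless length;  # skip the empty chunk before the first ">"
        s/^/>/;              # put the leading ">" back
        next if /$delete/;   # drop entries named in delete.txt
        push @keep, $_;
    }
}
close IN;

# Overwrite the input file with the entries that were kept.
open OUT, ">travis.txt" or die $!;
print OUT @keep;
close OUT;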

FishMonger

ozo's right, using an array would be better than the hash I used. And, I'm sure that his method of constructing the regex is better, but I'm not exactly sure why.
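
Two details of ozo's pattern probably account for the difference. \Q...\E escapes regex metacharacters in the names, so the dots in a name like >ADXCAPD.1.C.2 match only literal dots. The ^...$ anchors with /m force a whole-line match, so >Entry 1 does not also hit >Entry 1000. A small sketch of both effects follows; the helper name build_delete_regex and the sample names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build one whole-line pattern from a list of names,
# along the lines of ozo's script.
sub build_delete_regex {
    my @names = @_;
    my $alt = join '|', map { quotemeta($_) } @names;  # same effect as "\Q$_\E"
    return qr/^(?:$alt)$/m;
}

my $strict = build_delete_regex('>Entry 1', '>ADXCAPD.1.C.2');
my $loose  = qr/>Entry 1|>ADXCAPD.1.C.2/;   # unescaped and unanchored

for my $header ('>Entry 1', '>Entry 1000', '>ADXCAPD.1.C.2',
                '>ADXCAPD.1.C.23', '>ADXCAPDx1yCz2') {
    printf "%-18s strict=%d loose=%d\n", $header,
        ($header =~ $strict ? 1 : 0), ($header =~ $loose ? 1 : 0);
}
```

The strict pattern accepts only the two exact names, while the loose one accepts all five headers.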

travisbickle34

ASKER

Both scripts are working perfectly now - thanks guys!

One last question - is there a simple way to make ozo's script output the deleted sequences to a second file?

Also - how can I split the points between you two? It's only fair I think...

Talmash

Hi travis, I did not forget you; we just aren't working the same hours.

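open(IN_FILE, "in_file") or die $!;    # the file listing the entries to remove
@in_file_lines = <IN_FILE>;
close(IN_FILE);

open(BIG_FILE, "my_big_file") or die $!;
@big_file_lines = <BIG_FILE>;
close(BIG_FILE);

open(NEW_FILE, ">new_big_file") or die $!;

To create another file:
my @del_lines;
open(DEL_LINES, ">deleted_lines.txt") or die $!;    # put this line near the "open" of the other file

# Note: this treats the number in each entry name as a 0-based line index
# into the big file, and expects the list to be in ascending order.
$big_file_index = 0;
foreach $bad_entry (@in_file_lines) {
    next unless $bad_entry =~ /Entry\s*(\d+)/;
    $bad_line = $1;
    while ($big_file_index < $bad_line) {
        print NEW_FILE $big_file_lines[$big_file_index];
        $big_file_index++;
    }
    push @del_lines, $big_file_lines[$big_file_index];
    $big_file_index++;
}
# Copy whatever follows the last removed line, then write out the removed lines.
print NEW_FILE @big_file_lines[$big_file_index .. $#big_file_lines];
print DEL_LINES @del_lines;
close(NEW_FILE);
close(DEL_LINES);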

tal

ozo

if( /$delete/ ){
print SECOND_FILE;
}else{
push @keep,$_;
}

travisbickle34

ASKER

Ozo - can you edit your alteration into this piece of script? I seem to be making a bollocks of it somehow :(

open DEL, "<list" or die $!;
@delete = <DEL>;
chomp @delete;
close DEL;
$delete = join('|', map "\Q$_\E", @delete);
$delete = qr/^($delete)$/m;
open IN, "<input" or die $!;
{
    local $/ = ">";
    while (<IN>) {
        chomp;
        next unless length;
        s/^/>/;
        next if /$delete/;
        push @keep, $_;
    }
}
open OUT, ">output" or die $!;
print OUT @keep;
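
One way the if/else branch might be folded into this script is sketched below. It is an illustration under stated assumptions, not necessarily what the accepted answer contains: the sub name split_entries, the fourth file argument for the removed entries, and the guard against an empty list are all additions made here.

```perl
use strict;
use warnings;

# Hypothetical wrapper: entries named in $list_file go to $removed_file,
# everything else goes to $out_file.
sub split_entries {
    my ($list_file, $in_file, $out_file, $removed_file) = @_;

    open my $del, '<', $list_file or die "$list_file: $!";
    chomp(my @delete = <$del>);
    close $del;
    @delete = grep { length } @delete;               # ignore blank lines
    my $alt = join '|', map { quotemeta($_) } @delete;
    # With an empty list, use a pattern that can never match.
    my $delete = @delete ? qr/^(?:$alt)$/m : qr/(?!)/;

    open my $in,      '<', $in_file      or die "$in_file: $!";
    open my $out,     '>', $out_file     or die "$out_file: $!";
    open my $removed, '>', $removed_file or die "$removed_file: $!";

    local $/ = ">";                 # read one ">"-delimited entry at a time
    while (my $entry = <$in>) {
        chomp $entry;               # strip the trailing ">"
        next unless length $entry;  # skip the empty chunk before the first ">"
        $entry = ">$entry";         # put the leading ">" back
        if ($entry =~ $delete) {
            print {$removed} $entry;
        } else {
            print {$out} $entry;
        }
    }
    close $_ for $in, $out, $removed;
    return;
}

split_entries(@ARGV) if @ARGV == 4;
```

Saved under a hypothetical name such as split_entries.pl, it could be run as: perl split_entries.pl list input output removed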

ASKER CERTIFIED SOLUTION

ozo

travisbickle34

ASKER

Ok - I don't know what has happened but the above script now just seems to output all entries to both output files. My head hurts...

travisbickle34

ASKER

Actually - it seems to be working fine now!

I don't know what was happening there. By any chance, do entries consisting of only a single line mess up the process somehow?

Regardless - I'm allocating points now.

Thanks guys.