travisbickle34

Retrieve entries from a text file

I'm in pretty desperate need of a working Perl script.  I know exactly what the code needs to do but I just don't know Perl!

Here's the problem:

I have a text file consisting of thousands of entries in the following format:

>Entry 1
blahblahblahblahblahblahblahblahblahblahblahbla
blahblahblahblahblahblahblahblahblahblahblahbla
blahblahblahblahblahblahblahblahblahblahblahblah
>Entry 2
etcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcet
etcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetc
>Entry 1000
blahetcblahetcblahetcblahetcblahetcblahetcblahetcb
ahetcblahetcblahetcblahetcblahetcblahetcblahetcbl
hetcblahetc


I need to cut some entries from the file.  The best way for me to do this would be to use an input file listing the entry names I want removed.

eg.
>Entry1
>Entry 1000
>Entry Infinity

Thus I would provide the entry names as an input file, and the Perl script would iteratively search for each Entry name provided.  When it finds an entry it is searching for it would cut the complete entry from the file.  The '>' sign is the handiest delimiter to use since every entry name is preceded by it.

Any help would be SINCERELY appreciated.

Cheers in Advance,

tb34

SOLUTION
Talmash

travisbickle34

ASKER

I'll probably have to alter the script slightly from time to time and I'm just more comfortable with perl!

I should probably point out that the actual syntax of the entry names is as follows:

>ADXCAPD.x.C.y

where x and y are numbers...

I've adapted and run the script.  It runs OK, but the output file it produces is empty.

Any suggestions?
I'm increasing the points for a working solution to this problem - as I said, it's pretty important! :)

SOLUTION
FishMonger

I forgot, you need to keep the > at the beginning of each section.

So, change:
$keep{$1} = $_ if /^(Entry.*?\n)/i;

to this:
$keep{$1} = ">$_" if /^(Entry.*?\n)/i;
If you have a large number of sections that need to be removed, it might be faster if we iterate over each element of the @delete array instead of joining the array.
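For a large delete list, a hash lookup on the header line is another option that avoids building one big regex. This is a different technique from the one suggested above, shown only as a minimal sketch: `filter_entries` is a hypothetical helper, and it assumes the whole file fits in memory and that each name in the delete list is the complete header line, including the leading `>`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the entries whose
# header line is NOT in the delete list. $text is the whole file as a string.
sub filter_entries {
    my ($text, @delete) = @_;
    my %delete = map { $_ => 1 } @delete;        # header line => flag
    my @keep;
    # Split in front of every ">" that starts a line, keeping the ">".
    for my $entry (split /^(?=>)/m, $text) {
        my ($header) = $entry =~ /\A([^\n]*)/;   # first line, without newline
        push @keep, $entry unless $delete{$header};
    }
    return @keep;
}

# Sample data for illustration only.
my $text = ">Entry 1\nblah\nblah\n>Entry 2\netc\n>Entry 1000\nblahetc\n";
print filter_entries($text, '>Entry 1', '>Entry 2');
```

Because the lookup is an exact string comparison, `>Entry 1` cannot accidentally match `>Entry 1000`.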
Something's not right.

Using this as a file for processing:

>Entry 1
blahblahblahblahblahblahblahblahblahblahblahbla
blahblahblahblahblahblahblahblahblahblahblahbla
blahblahblahblahblahblahblahblahblahblahblahblah
>Entry 2
etcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcet
etcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetc
>Entry 1000
blahetcblahetcblahetcblahetcblahetcblahetcblahetcb
ahetcblahetcblahetcblahetcblahetcblahetcblahetcbl
hetcblahetc


And using this as a delete list:

>Entry 1
>Entry 2

The output I get is:

>Entry 1
blahblahblahblahblahblahblahblahblahblahblahbla
blahblahblahblahblahblahblahblahblahblahblahbla
blahblahblahblahblahblahblahblahblahblahblahblah
>Entry 1000
blahetcblahetcblahetcblahetcblahetcblahetcblahetcb
ahetcblahetcblahetcblahetcblahetcblahetcblahetcbl
hetcblahetc

>Entry 2
etcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcet
etcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetcetc



I dunno what's happening  :-\



Hmm, I must have changed something between the time I tested and the posting.

Change these 2 lines:
      next if (/($delete)/ or /^$/);
      $keep{$1} = ">$_" if /^(Entry.*?\n)/i;

to this:
      next if (/^($delete)\n/);
      $keep{$1} = ">$_" if (/^(Entry[^\n]+)/i);
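# Read the list of entry names to delete, one per line.
open DEL, "<delete.txt" or die $!;
@delete = <DEL>;
chomp @delete;
close DEL;
# Escape regex metacharacters in each name and build one alternation
# that must match a whole header line.
$delete = join('|', map"\Q$_\E",@delete);
$delete =qr/^($delete)$/m;
open IN, "<travis.txt" or die $!;
{ local $/ = ">";   # read one ">"-delimited entry at a time
   while (<IN>) {
      chomp;               # strip the trailing ">"
      next unless length;  # skip the empty chunk before the first ">"
      s/^/>/;              # restore the leading ">"
      next if /$delete/;
      push @keep, $_;
   }
}
close IN;
# Overwrite the original file with the entries that were kept.
open OUT, ">travis.txt" or die $!;
print OUT @keep;
close OUT;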
ozo's right, using an array would be better than the hash I used.  And, I'm sure that his method of constructing the regex is better, but I'm not exactly sure why.
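One likely reason the regex construction matters, sketched here with made-up sample names: escaping each name with `\Q...\E` (equivalent to `quotemeta`) stops characters such as `.` from acting as wildcards, and anchoring with `^...$` forces the whole header line to match rather than just a prefix. The variable names below are for illustration only.

```perl
use strict;
use warnings;

# Sample delete list, for illustration only.
my @delete = ('>Entry 1', '>ADXCAPD.1.C.2');

# Escaped and anchored: the whole header line must match one of the names.
my $alt      = join '|', map { quotemeta($_) } @delete;
my $anchored = qr/^(?:$alt)$/m;

# No escaping and no anchors: "." matches any character, and a name
# can match in the middle of a longer header.
my $loose = join '|', @delete;

for my $header ('>Entry 1', '>Entry 1000', '>ADXCAPD.1.C.2', '>ADXCAPDx1yCz2') {
    printf "%-16s loose=%d anchored=%d\n", $header,
        ($header =~ /$loose/  ? 1 : 0),
        ($header =~ $anchored ? 1 : 0);
}
```

The loose pattern matches all four headers, including `>Entry 1000` and `>ADXCAPDx1yCz2`, while the anchored, escaped pattern matches only the two names that are actually in the list.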
Both scripts are working perfectly now - thanks guys!

One last question - is there a simple way to make ozo's script output the deleted sequences to a second file?

Also - how can I split the points between you two?  It's only fair I think...
Hi travis, I didn't forget you; we just don't work the same hours.


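open (IN_FILE,"in_file") or die $!; # the file with the entries to delete
@in_file_lines = <IN_FILE>;
close(IN_FILE);

open (BIG_FILE,"my_big_file") or die $!;
@big_file_lines = <BIG_FILE>;
close(BIG_FILE);

open(NEW_FILE,">new_big_file") or die $!;

To create another file:
my @del_lines;
open (DEL_LINES,">deleted_lines.txt") or die $!; # put this line near the "open" of the other file.

# Note: the entry number is used as a 0-based line index into the big file,
# so this removes one line per listed entry, not a whole multi-line entry.
# It also assumes the delete list is sorted in ascending order.
$big_file_index = 0;
foreach $bad_entry (@in_file_lines) {
        next unless $bad_entry =~ /Entry\s*(\d+)/;
        $bad_line = $1;
        while ($big_file_index < $bad_line) {
            print NEW_FILE $big_file_lines[$big_file_index];
            $big_file_index++;
         }
         push @del_lines,$big_file_lines[$big_file_index];
         $big_file_index++;
}
# copy whatever is left after the last deleted line
print NEW_FILE @big_file_lines[$big_file_index .. $#big_file_lines];
print DEL_LINES @del_lines;
close(NEW_FILE);
close(DEL_LINES);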

tal

  if( /$delete/ ){
        print SECOND_FILE;
    }else{
        push @keep,$_;
    }
Ozo - can you edit your alteration into this piece of script?  I seem to be making a bollocks of it somehow :(


open DEL, "<list" or die $!;
@delete = <DEL>;
chomp @delete;
close DEL;
$delete = join('|', map"\Q$_\E",@delete);
$delete =qr/^($delete)$/m;
open IN, "<input" or die $!;
{ local $/ = ">";
   while (<IN>) {
      chomp;
      next unless length;
      s/^/>/;
      next if /$delete/;
      push @keep, $_;

   }
}
   open OUT, ">output" or die $!;
   print OUT @keep;
ASKER CERTIFIED SOLUTION
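The if/else shown earlier can be folded into the script roughly as follows. This is a minimal sketch and not necessarily the accepted answer: `split_entries` is a hypothetical helper, and the demo uses in-memory strings in place of the `list`, `input` and `output` files so that it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: reads ">"-delimited entries from $in and prints each
# one to $removed_out if its header line is in @delete, else to $keep_out.
sub split_entries {
    my ($in, $keep_out, $removed_out, @delete) = @_;
    my $alt = join '|', map { quotemeta($_) } @delete;
    # With an empty delete list, use a pattern that never matches.
    my $delete_re = @delete ? qr/^(?:$alt)$/m : qr/(?!)/;
    local $/ = ">";                  # read one entry at a time
    while (my $entry = <$in>) {
        chomp $entry;                # strip the trailing ">"
        next unless length $entry;   # skip the empty chunk before the first ">"
        $entry = ">$entry";          # restore the leading ">"
        print { $entry =~ $delete_re ? $removed_out : $keep_out } $entry;
    }
}

# Demo with in-memory filehandles; sample data for illustration only.
my $text = ">Entry 1\nblah\n>Entry 2\netc\n>Entry 1000\nblahetc\n";
my ($kept, $removed) = ('', '');
open my $in, '<', \$text    or die $!;
open my $k,  '>', \$kept    or die $!;
open my $r,  '>', \$removed or die $!;
split_entries($in, $k, $r, '>Entry 1', '>Entry 2');
close $_ for $in, $k, $r;
print "kept:\n$kept", "removed:\n$removed";
```

For real files, the three in-memory handles would be replaced by handles opened on the input file, the output file and a second output file for the deleted entries, with the delete list read from `list` and chomped as in the script above.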
Ok - I don't know what has happened but the above script now just seems to output all entries to both output files.  My head hurts...
Actually - it seems to be working fine now!

I don't know what was happening there.  By any chance, do entries consisting of only a single line mess up the process somehow?

Regardless - I'm allocating points now.

Thanks guys.