Unix Shell or Perl script to facilitate splitting a file based on a footer separator


I currently have to use an editor to manually split a file & would like a script to facilitate
the splitting, so that I can just run:
./splitting_script  Inputfile


There's a separator that tells us where to split the Inputfile.
A file with 3 separators will be split into 3 files,
a file with n separators will be split into n files.
Eg of an Inputfile:
record1 .....
record2 .....
...............
recordX
<YYYYMMDDhhmm_abcd>      <== this is a separator
recordX+1
........
recordY
<YYYYMMDDhhmm_abcd>      <== this is another separator
recordY+1
........
recordZ
<YYYYMMDDhhmm_abcd>      <== this is the last separator

where YYYYMMDD is the numeric date, hhmm is hour_minute
while abcd is a variable number (can be a 3 or 4 or 5 digit number).

Since the date, time & variable number are non-constant,
  the  <......>  is the separator to look for.
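
A minimal sketch of how the separator lines described above could be recognised from the shell. It assumes each separator is exactly 12 digits, an underscore, then 3 to 5 digits, all between angle brackets. The function name count_separators is hypothetical and only used for illustration.

```shell
# count_separators FILE
# Print how many lines of FILE contain a <YYYYMMDDhhmm_abcd> style separator.
# Assumption: 12 digits for date and time, then 3 to 5 digits after the underscore.
count_separators() {
    grep -cE '<[0-9]{12}_[0-9]{3,5}>' "$1"
}
```

Running count_separators Inputfile on the example above should report 3, one per separator line.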


So in the above example, the InputFile would be split into the 3 files below :
File1:
====
record1 .....
record2 .....
...............
recordX
<YYYYMMDDhhmm_abcd>

File2:
====
recordX+1
........
recordY
<YYYYMMDDhhmm_abcd>

File3:
====
recordY+1
........
recordZ
<YYYYMMDDhhmm_abcd>


sunhux Asked:
zlobcho Commented:
open (FILE,">>test/file$number.txt") or die "can not create a new file\n";

or better

open (FILE,">>file$number.txt") or die "can not create a new file\n";

#!/usr/bin/perl
use strict;

my $file=$ARGV[0];
my $number=1;
my $separator_count=0;
my $rec=();
open (DATA,"$file") or die "can not open the inputfile\n";
while(<DATA>){
 if ($_=~/\<\d{12}_\d{3,5}\>/){ $separator_count++; }
 }
close (DATA);

if ($separator_count < 2){
print "The inputfile has only 1 day's data, no splitting needed\n";}

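if ($separator_count > 1)  {
 open (DATA,"$file") or die "can not open the inputfile\n";
   while(<DATA>){
     # range (flip-flop) test: true for every line here, so each line is appended to $rec
     if ($_=~/^.*/.../^\<\d{12}_\d{3,5}\>$/) { $rec.=$_; }
     # on a separator line, write the accumulated records to the next output file
     if ($_=~/\<\d{12}_\d{3,5}\>/) {
       open (FILE,">>file$number.txt") or die "can not create a new file\n";
         print FILE $rec;
       close (FILE);
       $number++;
       undef $rec;
      }
   }
 # lines after the last separator are never written, so no extra (N+1) file appears
 close (DATA);
}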

 
zlobcho Commented:
Try this:
#!/usr/bin/perl
use strict;
my $file=$ARGV[0];
open (DATA,"$file") or die "can not open the inputfile\n";
my $number=1;
while(<DATA>){
   open (FILE,">>file$number.txt") or die "can not create a new file\n";
   if ($_!~/\<\d{12}_\d{3,5}\>/){
    print FILE $_;
   }
   if ($_=~/\<\d{12}_\d{3,5}\>/){
       print FILE $_;
       close (FILE);
       $number++;
   }
}
close (DATA);

 
sunhux (Author) Commented:
Thanks. Does anyone have an equivalent shell script, in case Perl is not present on CentOS Linux?

 
sunhux (Author) Commented:

Hi Zlobcho,

The Perl interpreter is present on our CentOS as /usr/local/bin/perl,
so should the first line of the Perl script be:
#!/usr/local/bin/perl


I will need help to put 2 enhancements into your Perl script:

1) if the inputfile contains only one  "<......>", then don't split it but
    echo a message "The inputfile has only 1 day's data, no splitting needed".
    Loosely, in Shell script, my code would be
        separator_count=`grep "<" inputfile | grep ">" | wc -l`
        if [ "$separator_count" -lt 2 ]
        then
           echo "The inputfile has only 1 day's data, no splitting needed"
        else
           .... split the file as per your Perl script ...
        fi

2) I've tested the Perl script & if there are N separators, it produces "N+1" split files,
    with the last (ie the N+1) file containing a line with either a <CR> or
    <LF> or <EOF> character.  I think this is because my inputfile's last
    line has these character(s).  Can you enhance your script NOT to produce the
    (N+1) file?  Perhaps add a check for the last line/record of the input file: if it has
    less than 3 characters, then it should not be processed & output to a file
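
The two enhancements above, together with the earlier request for a shell equivalent, could be sketched roughly as follows with sh and awk. This is only a sketch under stated assumptions, not a port of zlobcho's script: the function name split_on_footer, the optional output prefix argument, and the loosened separator pattern (any digits, an underscore, any digits, chosen because some awk versions lack {n,m} repetition) are all assumptions.

```shell
# split_on_footer FILE [PREFIX]
# Write PREFIX1.txt, PREFIX2.txt, ... each ending with its separator line.
# PREFIX defaults to "file" (hypothetical naming, matching file$number.txt).
split_on_footer() {
    input=$1
    prefix=${2:-file}
    [ -r "$input" ] || { echo "can not open the inputfile" >&2; return 1; }

    # Enhancement 1: fewer than 2 separators means nothing to split.
    count=$(grep -cE '<[0-9]+_[0-9]+>' "$input")
    if [ "$count" -lt 2 ]; then
        echo "The inputfile has only 1 day's data, no splitting needed"
        return 0
    fi

    # Enhancement 2: lines are buffered and only written when a separator
    # is seen, so anything after the last separator is silently dropped
    # and no (N+1) file is created.
    awk -v prefix="$prefix" '
        { buf = buf $0 "\n" }
        /<[0-9]+_[0-9]+>/ {
            out = prefix (++n) ".txt"
            printf "%s", buf > out
            close(out)
            buf = ""
        }
    ' "$input"
}
```

Called as split_on_footer Inputfile, it would write file1.txt, file2.txt and so on into the current directory. An existing file of the same name is overwritten rather than appended to, which differs from the ">>" used in the Perl script.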
 
zlobcho Commented:
Try this here:
#!/usr/bin/perl
use strict;

my $file=$ARGV[0];
my $number=1;
my $separator_count=0;
my $rec=();
open (DATA,"$file") or die "can not open the inputfile\n";
while(<DATA>){
 if ($_=~/\<\d{12}_\d{3,5}\>/){ $separator_count++; }
 }
close (DATA);

if ($separator_count < 2){
print "The inputfile has only 1 day's data, no splitting needed\n";}

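if ($separator_count > 1)  {
 open (DATA,"$file") or die "can not open the inputfile\n";
   while(<DATA>){
     # range (flip-flop) test: true for every line here, so each line is appended to $rec
     if ($_=~/^.*/.../^\<\d{12}_\d{3,5}\>$/) { $rec.=$_; }
     # on a separator line, write the accumulated records to the next output file
     if ($_=~/\<\d{12}_\d{3,5}\>/) {
       open (FILE,">>test/file$number.txt") or die "can not create a new file\n";
         print FILE $rec;
       close (FILE);
       $number++;
       undef $rec;
      }
   }
 # lines after the last separator are never written, so no extra (N+1) file appears
 close (DATA);
}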

 
sunhux (Author) Commented:

Excellent,  the script tested ok.

Thanks vm zlobcho,