glennstewart
asked on
Simply awk/sed solution required - Split a file on fields - another additional question
Hi All,
A nice easy one being an extension of two previously answered questions. Should be nice easy points up for grabs.
For reference see the following two PAQ's:
First Question:
https://www.experts-exchange.com/questions/21763311/Simply-awk-sed-solution-required-Split-a-file-on-fields.html
Second Question:
https://www.experts-exchange.com/questions/21764891/Simply-awk-sed-solution-required-Split-a-file-on-fields-additional-question.html
The input is ever so slightly different, and the output required is ever so slightly different.
Ozo's answer from the second PAQ is great - it works.
So you can base a solution off this, or re-work it if you prefer.
--------------------------
INPUT ATTRIBUTES:
* Text file to be split into multiple files
* Delimiter lines start with two consecutive pipes and contain seven pipes in total (a sketch of a matching pattern follows the examples below)
DELIMITER EXAMPLES:
a) ||XX|123456|A111|||
b) ||XX|123456|A111|0212345678||
c) ||XX|123456|A111|(02) 12345678||
d) ||XX|123456|A111|(02) 12345678|test@email.com|
a) The first example contains only the compulsory fields:
XX=company code
123456=booking number, which can be 5 or 6 digits
A111=series, which is a letter-and-number combination
b) Also contains a phone number
c) The phone number can contain spaces and brackets
d) Also can contain email address
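For reference, a delimiter line can be picked out with a pattern along these lines (a sketch only, with inputfile as a placeholder; the solutions further down simply key off the two leading pipes):
# print every line that looks like ||XX|12345(6)|A111|...
awk '/^\|\|[A-Z][A-Z]\|[0-9][0-9][0-9][0-9][0-9][0-9]?\|[A-Z][0-9]/ { print "delimiter: " $0 }' inputfile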
AIM:
* To split the text file into multiple text files, named as shown in the output example below
* Filenames should be unique
* Filenames should include YYYYMMDD_HHMM to help ensure uniqueness
* Each file produced should keep its pipe-delimited header as line one
* Junk text found in the file prior to the first delimiter should be discarded
INPUT FILE EXAMPLE:
VVVVVVVVVVVVVVVVVVVVVVVVVV
Possible junk text to be discarded
||AA|111111|E123|||
Contents of file one
||XX|222222|A111|(02) 12345678||
Contents of file two
||XX|222222|A111|0212345678||
Contents of file three
^^^^^^^^^^^^^^^^^^^^^^^
OUTPUT FILES PRODUCED:
AA111111_20060518_1614_001.DOC
XX222222_20060518_1614_002.DOC
XX222222_20060518_1614_003.DOC
NOTE:
AA111111_20060518_1614_001.DOC would contain:
VVVVVVVVVVVVVVVVVVVVVVVVVV
||AA|111111|E123|||
Contents of file one
^^^^^^^^^^^^^^^^^^^^^^^
Thanks in advance,
gawk -F'|' '{if(x==$4){c++}else{c=0}f=sprintf("%s%s_20060518_1614_%03d.DOC",$3,$4,c);x=$4;print $0 > f}' file
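Broken out with comments, the same one-liner reads as follows (note the filename is recomputed on every input line; combined with a /^\|\|/ guard, as done further down the thread, content lines stay in the file opened by the most recent delimiter):
gawk -F'|' '{
    if (x == $4) { c++ } else { c = 0 }                     # bump the counter while the booking field repeats, otherwise reset it
    f = sprintf("%s%s_20060518_1614_%03d.DOC", $3, $4, c)   # company code + booking + fixed timestamp + counter
    x = $4                                                  # remember the booking field for the next line
    print $0 > f                                            # write the current line to that file
}' file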
ASKER
Not sure if GNU awk is installed on the client systems, but for the sake of keeping with what we have - okay to use nawk or awk.
The above gawk has the fixed string 20060518_1614. This was purely for example; it needs to be the actual time at which the script is run. Therefore, I guess my only dilemma was figuring out how to get a ksh string variable into the awk/nawk.
The following appears in the script:
typeset -r DateTime=$(date \+%Y%m%d_%H%M)
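One common way to hand that ksh variable to awk/nawk is the -v option (a sketch only, assuming nawk is available, and reusing the dummy-file trick already in the script):
typeset -r DateTime=$(date +%Y%m%d_%H%M)
nawk -v dt="$DateTime" -F'|' 'BEGIN { f = "dummy" }
/^\|\|/ { f = $3 $4 sprintf("_%s_%03d.DOC", dt, ++count) }   # e.g. AU11111_20060519_0916_001.DOC
{ print >> f }' $CAL_REPTDIR/$SplitFile.OLD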
My first (albeit very clumsy) attempt was to alter the original solution by:
awk -F'|' 'BEGIN{f="dummy"} \
/^\|\|/{f=$3 $4 sprintf("_%03d",++count)}{print >> f_$DateTime}' \
$CAL_REPTDIR/$SplitFile.OLD
The file GLENN.TST contains:
||AU|11111|E111|(11) 11111111||
TEST 1
||AU|22222|E222|||
TEST 2
||AU|33333|E333|3333333333|test@email.com|
TEST 3
This is split up into 3 files:
||AU|11111|E111|(11) 11111111||
||AU|22222|E222|||
||AU|33333|E333|3333333333|test@email.com|
Instead, if run at, let's say, 19th May 2006 @ 0916, it should produce:
AU11111_20060519_0916_001.DOC
AU22222_20060519_0916_002.DOC
AU33333_20060519_0916_003.DOC
i.e.
AU11111_${DateTime}_001.DOC
AU22222_${DateTime}_002.DOC
AU33333_${DateTime}_003.DOC
ASKER
Have increased points due to no replies
ASKER CERTIFIED SOLUTION
ASKER
This last answer was very helpful with the file names.
Unfortunately the files no longer contain contents.
I received files:
AU11111060523.1634000.DOC
AU11111060523.1634001.DOC
AU33333060523.1634000.DOC
Each of these contained just the header line, but none of the content that follows it.
The traditional awk solution I have been using did redirect to different files:
awk -F'|' 'BEGIN{f="dummy"}/^\|\|/{f=$3 $4 sprintf("_%03d",++count)}{print >> f}'
The dummy was used to discard any text prior to the first delimiter.
Need a solution still, so have increased points yet again :)
Thanks for help so far.
Hmm, no idea why your file is empty. As you say your awk can redirect to different files, I guess it is nawk (or similar) anyway. Can you get some more information about your awk version? Which OS?
ASKER
I have changed the nawk to now look like this:
nawk -F'|' 'BEGIN{f="dummy"}/^\|\|/{i f(x==$4){c ++} \
else{c=0}f=sprintf("%s%s'$ DateTime'% 03d.DOC",$ 3,$4,c); \
x=$4}{print >> f}' \
$CAL_REPTDIR/$SplitFile.OL D
Using a bit of the old and a bit of the new.
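For reference, the '$DateTime' in the middle of the program works because the surrounding single quotes are closed and reopened, so ksh splices the expanded value into the awk source before nawk ever sees it. A minimal illustration:
DateTime=$(date +%Y%m%d_%H%M)
nawk 'BEGIN { print "spliced in by the shell: '$DateTime'" }'
# prints, for example: spliced in by the shell: 20060519_0916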
And to get rid of the dummy file:
nawk -F'|' 'BEGIN{f="/dev/null"}/^\|\ |/{if(x==$ 4){c++} \
else{c=0}f=sprintf("%s%s'$ DateTime'% 03d.DOC",$ 3,$4,c); \
x=$4}{print >> f}' \
$CAL_REPTDIR/$SplitFile.OL D
So this above answer is very good.....
One small problem:
In testing I decided to throw a curve ball. I noted that the c++ counter resets whenever the booking field ($4) changes.
For example:
INPUT:
TEST 0
||AU|11111|E111|(11) 11111111||
TEST 1
||AU|11111|E111|||
TEST 2
||AU|33333|E333|3333333333|test@email.com|
TEST 3
||AU|11111|E111|||
TEST 4
Should produce 4 files. The "TEST 0" prior to the first delimiter is correctly going to /dev/null.
The 4th file is overwriting the second file.
OUTPUT FILES:
AU11111060523.1652000.DOC
AU11111060523.1652001.DOC
AU33333060523.1652000.DOC
Two improvements that would be nice:
Counter could be 000 through 999
So I changed this to:
nawk -F'|' 'BEGIN{f="/dev/null"}/^\|\ |/{if(x==$ 4){c++} \
else{c++}f=sprintf("%s%s'$ DateTime'% 03d.DOC",$ 3,$4,c); \
x=$4}{print >> f}' \
$CAL_REPTDIR/$SplitFile.OL D
So now output files are:
AU11111060523.1658001.DOC
AU11111060523.1658002.DOC
AU33333060523.1658003.DOC
AU11111060523.1658004.DOC
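Since both branches of the if now increment, the test can be dropped altogether; a simpler equivalent (an untested sketch) is a single ever-increasing counter, which also makes the x=$4 bookkeeping unnecessary:
nawk -F'|' 'BEGIN{f="/dev/null"}/^\|\|/{f=sprintf("%s%s'$DateTime'%03d.DOC",$3,$4,++c)} \
{print >> f}' \
$CAL_REPTDIR/$SplitFile.OLD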
Case closed :)
> The 4th file is overwriting the second file.
what's the problem with that? they are identical anyway (in your example:)
Your last "improvement" no longer ensures that data goes to the same file when the 4th field is identical.
ASKER
I'm sorry I didn't make that clear in my original question.
The files all had to be unique, even if any of the fields repeated.
I made slight changes to your solution to suit - but because your solution was ultimately what I used, I increased the points, accepted your answer and graded it with an A.
So... no problems - the nawk is working as I required it to.
I thank you very much for taking the time to answer my question.
Glenn Stewart
Thanks, good luck.