glennstewart

asked on

Simply awk/sed solution required - Split a file on fields - another additional question

Hi All,

A nice easy one, an extension of two previously answered questions. Should be nice easy points up for grabs.

For reference, see the following two PAQs:

First Question:
https://www.experts-exchange.com/questions/21763311/Simply-awk-sed-solution-required-Split-a-file-on-fields.html
Second Question:
https://www.experts-exchange.com/questions/21764891/Simply-awk-sed-solution-required-Split-a-file-on-fields-additional-question.html

The input is ever so slightly different, and the output required is ever so slightly different.
Ozo's answer from the second PAQ is great - it works.

So you can base a solution off this, or re-work it if you like.

------------------------------------------
INPUT ATTRIBUTES:

* Text file to be split into multiple files
* Delimiters start with two consecutive pipes and contain seven pipes in total

DELIMITER EXAMPLES:

a) ||XX|123456|A111|||
b) ||XX|123456|A111|0212345678||
c) ||XX|123456|A111|(02) 12345678||
d) ||XX|123456|A111|(02) 12345678|test@email.com|

a) The first example shows the only compulsory fields:
XX=company code
123456=booking number, which can be 5 or 6 digits
A111=series, which is a letter-and-number combination

b) Also contains a phone number

c) The phone number can contain spaces and brackets

d) Can also contain an email address (a pattern matching all four forms is sketched just below)
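
For illustration, a sketch of an extended regular expression that would match all four delimiter forms above. The character classes are assumptions drawn from the field descriptions, and it presumes an egrep that supports {m,n} intervals:

    egrep '^\|\|[A-Z][A-Z]\|[0-9]{5,6}\|[A-Z][0-9]+\|[^|]*\|[^|]*\|$' file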

AIM:

* To convert the text file into multiple text files, named from the delimiter fields as shown below
* Filenames should be unique
* Filenames should include YYYYMMDD_HHMM to ensure uniqueness
* Each file produced should keep its pipe delimiter line as line number one
* Junk text found in the file prior to the first pipe should be discarded

INPUT FILE EXAMPLE:
VVVVVVVVVVVVVVVVVVVVVVVVVV
Possible junk text to be discarded
||AA|111111|E123|||
Contents of file one
||XX|222222|A111|(02) 12345678||
Contents of file two
||XX|222222|A111|0212345678||
Contents of file three
^^^^^^^^^^^^^^^^^^^^^^^

OUTPUT FILES PRODUCED:

AA111111_20060518_1614_001.DOC
XX222222_20060518_1614_002.DOC
XX222222_20060518_1614_003.DOC

NOTE:

AA111111_20060518_1614_001.DOC would contain:
VVVVVVVVVVVVVVVVVVVVVVVVVV
||AA|111111|E123|||
Contents of file one
^^^^^^^^^^^^^^^^^^^^^^^

Thanks in advance,
ahoffmann

gawk -F'|' '{if(x==$4){c++}else{c=0}f=sprintf("%s%s_20060518_1614_%03d.DOC",$3,$4,c);x=$4;print $0 > f}' file
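
Spelled out for readability, that one-liner is equivalent to the following (note the block runs for every input line, headers and contents alike):

    gawk -F'|' '{
        if (x == $4) { c++ } else { c = 0 }   # restart the suffix counter when field 4 changes
        f = sprintf("%s%s_20060518_1614_%03d.DOC", $3, $4, c)
        x = $4                                # remember field 4 for the next line
        print $0 > f
    }' file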
glennstewart

ASKER

Not sure if GNU awk is installed on the client systems, but for the sake of keeping with what we have, it's okay to use nawk or awk.

The above gawk has the fixed string 20060518_1614. That was purely for the example; it needs to be the time at which the script is run. So my only dilemma is figuring out how to get a ksh string variable into the awk/nawk (a sketch of one fix follows the examples below).

The following appears in the script:

typeset -r DateTime=$(date +%Y%m%d_%H%M)

My first (albeit very clumsy) attempt was to alter the original solution by:

awk -F'|' 'BEGIN{f="dummy"} \
/^\|\|/{f=$3 $4 sprintf("_%03d",++count)}{print >>f_$DateTime}' \
$CAL_REPTDIR/$SplitFile.OLD

The file GLENN.TST contains:
||AU|11111|E111|(11) 11111111||
TEST 1
||AU|22222|E222|||
TEST 2
||AU|33333|E333|3333333333|test@email.com|
TEST 3

This is split up into 3 files, whose header lines are:
||AU|11111|E111|(11) 11111111||
||AU|22222|E222|||
||AU|33333|E333|3333333333|test@email.com|

Instead, if run at, say, 0916 on 19th May 2006, the filenames should be:

AU11111_20060519_0916_001.DOC
AU22222_20060519_0916_002.DOC
AU33333_20060519_0916_003.DOC

i.e.
AU11111_${DateTime}_001.DOC
AU22222_${DateTime}_002.DOC
AU33333_${DateTime}_003.DOC
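
(In the attempt above, $DateTime sits inside the awk program's single quotes, so the shell never expands it and awk sees it literally. A minimal sketch of one common fix, assuming the nawk at hand supports the -v option for passing shell variables in:)

    typeset -r DateTime=$(date +%Y%m%d_%H%M)

    nawk -v dt="$DateTime" -F'|' '
        BEGIN { f = "dummy" }   # lines before the first header land in a throwaway file
        /^\|\|/ { f = $3 $4 sprintf("_%s_%03d.DOC", dt, ++count) }
        { print >> f }
    ' $CAL_REPTDIR/$SplitFile.OLD

Run at 0916 on 19th May 2006, this would name the files AU11111_20060519_0916_001.DOC and so on.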
Have increased points due to no replies
ASKER CERTIFIED SOLUTION
ahoffmann

This last answer was very helpful with the file names.
Unfortunately the files no longer contain contents.

I received files:

AU11111060523.1634000.DOC
AU11111060523.1634001.DOC
AU33333060523.1634000.DOC

Each of these contained just the header line, and none of the text that follows it.

The traditional awk solution I have been using did redirect to different files:
awk -F'|' 'BEGIN{f="dummy"}/^\|\|/{f=$3 $4 sprintf("_%03d",++count)}{print >>f}'

The dummy was used to discard any text prior to the first delimiter (those lines just land in a throwaway file named dummy).

Need a solution still, so have increased points yet again :)

Thanks for help so far.
Hmm, no idea why your files are empty. As you say your awk can redirect to different files, I guess it is nawk (or similar) anyway. Can you get some more information about your awk version? Which OS?
I have changed the nawk to now look like this:

    nawk -F'|' 'BEGIN{f="dummy"}/^\|\|/{if(x==$4){c++} \
    else{c=0}f=sprintf("%s%s'$DateTime'%03d.DOC",$3,$4,c); \
    x=$4}{print >> f}' \
    $CAL_REPTDIR/$SplitFile.OLD

Using a bit of the old and a bit of the new. (Closing the single quotes around $DateTime lets the shell expand it before awk ever sees the program.)

And to get rid of the dummy file:

    nawk -F'|' 'BEGIN{f="/dev/null"}/^\|\|/{if(x==$4){c++} \
    else{c=0}f=sprintf("%s%s'$DateTime'%03d.DOC",$3,$4,c); \
    x=$4}{print >> f}' \
    $CAL_REPTDIR/$SplitFile.OLD

So the above answer is very good...

One small problem:

In testing I decided to throw a curve ball. I noted that the counter c resets whenever the booking number in field 4 changes.

For example:

INPUT:
TEST 0
||AU|11111|E111|(11) 11111111||
TEST 1
||AU|11111|E111|||
TEST 2
||AU|33333|E333|3333333333|test@email.com|
TEST 3
||AU|11111|E111|||
TEST 4

Should produce 4 files. The "TEST 0" line prior to the first delimiter is correctly going to /dev/null.
The 4th file is overwriting the second file.

OUTPUT FILES:
AU11111060523.1652000.DOC
AU11111060523.1652001.DOC
AU33333060523.1652000.DOC

An improvement that would be nice: the counter could run 000 through 999 across the whole file instead of resetting.

So I changed this to:

    nawk -F'|' 'BEGIN{f="/dev/null"}/^\|\|/{if(x==$4){c++} \
    else{c++}f=sprintf("%s%s'$DateTime'%03d.DOC",$3,$4,c); \
    x=$4}{print >> f}' \
    $CAL_REPTDIR/$SplitFile.OLD
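
(As an aside: since both branches now perform the same increment, the if/else collapses and x is no longer needed. An equivalent sketch:)

    nawk -F'|' 'BEGIN{f="/dev/null"} \
    /^\|\|/{f=sprintf("%s%s'$DateTime'%03d.DOC",$3,$4,++c)} \
    {print >> f}' \
    $CAL_REPTDIR/$SplitFile.OLD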

So now output files are:

AU11111060523.1658001.DOC
AU11111060523.1658002.DOC
AU33333060523.1658003.DOC
AU11111060523.1658004.DOC

Case closed :)
> The 4th file is overwriting the second file.
What's the problem with that? They are identical anyway (in your example :)
Your last "improvement" now no longer ensures that data goes to the same file when the 4th field is identical.
I'm sorry I didn't make that clear in my original question.
The files all had to be unique, even if any of the fields repeated.

I made slight changes to your solution to suit - but because your solution was ultimately what I used, I increased the points, accepted your answer and graded it with an A.

So... no problems - the nawk is working as I required it to.

I thank you very much for taking the time to answer my question.

Glenn Stewart
thanks, good luck