glennstewart
asked on
Simply awk/sed solution required - Split a file on fields - another additional question
Hi All,
A nice easy one being an extension of two previously answered questions. Should be nice easy points up for grabs.
For reference see the following two PAQ's:
First Question:
https://www.experts-exchange.com/questions/21763311/Simply-awk-sed-solution-required-Split-a-file-on-fields.html
Second Question:
https://www.experts-exchange.com/questions/21764891/Simply-awk-sed-solution-required-Split-a-file-on-fields-additional-question.html
The input is ever so slightly different, and the output required is ever so slightly different.
Ozo's answer from the second PAQ is great - it works.
So you can base a solution off this, or re-work it if you prefer.
--------------------------
INPUT ATTRIBUTES:
* Text file to be split into multiple files
* Delimiter lines start with two consecutive pipes and contain seven pipes in total (a sketch of a matching pattern follows the examples below)
DELIMITER EXAMPLES:
a) ||XX|123456|A111|||
b) ||XX|123456|A111|0212345678||
c) ||XX|123456|A111|(02) 12345678||
d) ||XX|123456|A111|(02) 12345678|test@email.com|
a) The first example contains only the compulsory fields:
XX=company code
123456=booking number, which can be 5 or 6 digits
A111=series, which is a letter-and-number combination
b) Also contains a phone number
c) The phone number can contain spaces and brackets
d) Also can contain email address
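For reference, a delimiter line can be picked out with a pattern along these lines (a sketch only, with inputfile as a placeholder; the solutions further down simply key off the two leading pipes):
# print every line that looks like ||XX|12345(6)|A111|...
awk '/^\|\|[A-Z][A-Z]\|[0-9][0-9][0-9][0-9][0-9][0-9]?\|[A-Z][0-9]/ { print "delimiter: " $0 }' inputfile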
AIM:
* To split the text file into multiple text files, named as shown in the output example below
* Filenames should be unique
* Filenames should include YYYYMMDD_HHMM to help ensure uniqueness
* Each file produced should keep its pipe-delimited header as line one
* Junk text found in the file prior to the first delimiter should be discarded
INPUT FILE EXAMPLE:
VVVVVVVVVVVVVVVVVVVVVVVVVV
Possible junk text to be discarded
||AA|111111|E123|||
Contents of file one
||XX|222222|A111|(02) 12345678||
Contents of file two
||XX|222222|A111|0212345678||
Contents of file three
^^^^^^^^^^^^^^^^^^^^^^^
OUTPUT FILES PRODUCED:
AA111111_20060518_1614_001.DOC
XX222222_20060518_1614_002.DOC
XX222222_20060518_1614_003.DOC
NOTE:
AA111111_20060518_1614_001.DOC would contain:
VVVVVVVVVVVVVVVVVVVVVVVVVV
||AA|111111|E123|||
Contents of file one
^^^^^^^^^^^^^^^^^^^^^^^
Thanks in advance,
gawk -F'|' '{if(x==$4){c++}else{c=0}f=sprintf("%s%s_20060518_1614_%03d.DOC",$3,$4,c);x=$4;print $0 > f}' file
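Broken out with comments, the same one-liner reads as follows (note the filename is recomputed on every input line; combined with a /^\|\|/ guard, as done further down the thread, content lines stay in the file opened by the most recent delimiter):
gawk -F'|' '{
    if (x == $4) { c++ } else { c = 0 }                     # bump the counter while the booking field repeats, otherwise reset it
    f = sprintf("%s%s_20060518_1614_%03d.DOC", $3, $4, c)   # company code + booking + fixed timestamp + counter
    x = $4                                                  # remember the booking field for the next line
    print $0 > f                                            # write the current line to that file
}' file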
ASKER
Not sure if GNU awk is installed on the client systems, but for the sake of keeping with what we have - okay to use nawk or awk.
The above gawk has the fixed string 20060518_1614. This was purely for example; it needs to be the actual time at which the script is run. Therefore, I guess my only dilemma was figuring out how to get a ksh string variable into the awk/nawk.
The following appears in the script:
typeset -r DateTime=$(date \+%Y%m%d_%H%M)
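One common way to hand that ksh variable to awk/nawk is the -v option (a sketch only, assuming nawk is available, and reusing the dummy-file trick already in the script):
typeset -r DateTime=$(date +%Y%m%d_%H%M)
nawk -v dt="$DateTime" -F'|' 'BEGIN { f = "dummy" }
/^\|\|/ { f = $3 $4 sprintf("_%s_%03d.DOC", dt, ++count) }   # e.g. AU11111_20060519_0916_001.DOC
{ print >> f }' $CAL_REPTDIR/$SplitFile.OLD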
My first (albeit very clumsy) attempt was to alter the original solution by:
awk -F'|' 'BEGIN{f="dummy"} \
/^\|\|/{f=$3 $4 sprintf("_%03d",++count)}{print >> f_$DateTime}' \
$CAL_REPTDIR/$SplitFile.OLD
The file GLENN.TST contains:
||AU|11111|E111|(11) 11111111||
TEST 1
||AU|22222|E222|||
TEST 2
||AU|33333|E333|3333333333|test@email.com|
TEST 3
This is split up into 3 files:
||AU|11111|E111|(11) 11111111||
||AU|22222|E222|||
||AU|33333|E333|3333333333|test@email.com|
Instead, if run at, let's say, 19th May 2006 @ 0916, it should produce:
AU11111_20060519_0916_001.DOC
AU22222_20060519_0916_002.DOC
AU33333_20060519_0916_003.DOC
i.e.
AU11111_${DateTime}_001.DOC
AU22222_${DateTime}_002.DOC
AU33333_${DateTime}_003.DOC
ASKER
Have increased points due to no replies
ASKER CERTIFIED SOLUTION
ASKER
This last answer was very helpful with the file names.
Unfortunately the files no longer contain contents.
I received files:
AU11111060523.1634000.DOC
AU11111060523.1634001.DOC
AU33333060523.1634000.DOC
Each of these contained just the header line, but none of the content that follows it.
The traditional awk solution I have been using did redirect to different files:
awk -F'|' 'BEGIN{f="dummy"}/^\|\|/{f=$3 $4 sprintf("_%03d",++count)}{print >> f}'
The dummy was used to discard any text prior to the first delimiter.
Need a solution still, so have increased points yet again :)
Thanks for help so far.
Hmm, no idea why your file is empty. As you say your awk can redirect to different files, I guess it is nawk (or similar) anyway. Can you get some more information about your awk version? Which OS?
ASKER
I have changed the nawk to now look like this:
nawk -F'|' 'BEGIN{f="dummy"}/^\|\|/{i f(x==$4){c ++} \
else{c=0}f=sprintf("%s%s'$ DateTime'% 03d.DOC",$ 3,$4,c); \
x=$4}{print >> f}' \
$CAL_REPTDIR/$SplitFile.OL D
Using a bit of the old and a bit of the new.
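For reference, the '$DateTime' in the middle of the program works because the surrounding single quotes are closed and reopened, so ksh splices the expanded value into the awk source before nawk ever sees it. A minimal illustration:
DateTime=$(date +%Y%m%d_%H%M)
nawk 'BEGIN { print "spliced in by the shell: '$DateTime'" }'
# prints, for example: spliced in by the shell: 20060519_0916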
And to get rid of the dummy file:
nawk -F'|' 'BEGIN{f="/dev/null"}/^\|\ |/{if(x==$ 4){c++} \
else{c=0}f=sprintf("%s%s'$ DateTime'% 03d.DOC",$ 3,$4,c); \
x=$4}{print >> f}' \
$CAL_REPTDIR/$SplitFile.OL D
So this above answer is very good.....
One small problem:
In testing I decided to throw a curve ball. I noted that the c++ counter resets whenever the booking field ($4) changes.
For example:
INPUT:
TEST 0
||AU|11111|E111|(11) 11111111||
TEST 1
||AU|11111|E111|||
TEST 2
||AU|33333|E333|3333333333|test@email.com|
TEST 3
||AU|11111|E111|||
TEST 4
Should produce 4 files. The "TEST 0" prior to the first delimiter is correctly going to /dev/null.
The 4th file is overwriting the second file.
OUTPUT FILES:
AU11111060523.1652000.DOC
AU11111060523.1652001.DOC
AU33333060523.1652000.DOC
Two improvements that would be nice:
Counter could be 000 through 999
So I changed this to:
nawk -F'|' 'BEGIN{f="/dev/null"}/^\|\ |/{if(x==$ 4){c++} \
else{c++}f=sprintf("%s%s'$ DateTime'% 03d.DOC",$ 3,$4,c); \
x=$4}{print >> f}' \
$CAL_REPTDIR/$SplitFile.OL D
So now output files are:
AU11111060523.1658001.DOC
AU11111060523.1658002.DOC
AU33333060523.1658003.DOC
AU11111060523.1658004.DOC
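Since both branches of the if now increment, the test can be dropped altogether; a simpler equivalent (an untested sketch) is a single ever-increasing counter, which also makes the x=$4 bookkeeping unnecessary:
nawk -F'|' 'BEGIN{f="/dev/null"}/^\|\|/{f=sprintf("%s%s'$DateTime'%03d.DOC",$3,$4,++c)} \
{print >> f}' \
$CAL_REPTDIR/$SplitFile.OLD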
Case closed :)
> The 4th file is overwriting the second file.
what's the problem with that? they are identical anyway (in your example:)
Your last "improvement" no longer ensures that data goes to the same file when the 4th field is identical.
ASKER
I'm sorry I didn't make that clear in my original question.
The files all had to be unique, even if any of the fields repeated.
I made slight changes to your solution to suit - but because your solution was ultimately what I used, I increased the points, accepted your answer and graded it with an A.
So... no problems - the nawk is working as I required it to.
I thank you very much for taking the time to answer my question.
Glenn Stewart
Thanks, good luck.