Parsing another program's log file

I've been collecting log files from a program I'm using, and 90% of the time I can parse the log correctly with my current script. Unfortunately, I can find no rhyme or reason to the header they use for the log file: it never has exactly the same characters or bits, and it never ends in the same character. The data I want is always in the following form:

11,22,3333,4444,555555,666666,77,88,99,10,11,1212121212121212121212

The problem I am having is that sometimes the header has a comma "," in it, and the cut command then keeps that line of the header. So far I've only seen this happen on one line at the top of the file, but I would like to check every line against this common format and remove the lines that do not follow it. I am new to bash scripting and cannot figure out how to do this.

NOTE: the 12th comma-delimited field could itself contain a comma, which is another problem I can't figure out. If there is a comma in the 12th field, I could end up with fields 13, 14, 15, etc. when I use the cut command. Is there a way to have the last field include everything after the 11th comma, or does it already do that? The cut command I've been using is:

cut -f "1 2 3 4 5 6 7 8 9 10 11 12" -d , -s < file.log

Thanks for any help.

Tintin

Can you show us a real sample header and data line? There must be something else that can be used to distinguish them.

BTW, your current cut command can be reduced to

cut -f1-12 -d, -s <file.log
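On the question about the last field: cut splits on every comma, so -f1-12 drops anything after a 12th comma. An open-ended range such as -f12- should keep everything from field 12 to the end of the line, commas included. A minimal sketch, using a made-up sample line rather than real log data:

```shell
# Hypothetical sample line: the message (field 12) itself contains a comma.
line='11,22,3333,4444,555555,666666,77,88,99,10,11,hello, world'

# Fields 1-12 only: the text after the 12th comma is lost.
truncated=$(printf '%s\n' "$line" | cut -d, -f1-12)

# Field 12 through end of line: embedded commas survive.
message=$(printf '%s\n' "$line" | cut -d, -f12-)

printf '%s\n' "$truncated" "$message"
```

If the whole record is wanted rather than just the message, -f1- keeps every field.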

cpwems

ASKER

Below are two examples of the beginning of the log file. In the first example the data line starts at '0f,00,00,806050a0', and in the second example it starts at 'c8,00,00,ff5020ff,'. Each data line is separated by the NUL character '^@'. I have many log files, each about 4.5k, so I can send whole files if anyone needs them.

Sample 1:
^@^@M^@¹^@^H^Ah^AÈ^A^W^B<85>^BÖ^BE^C<94>^C^B^DS^Dª^D^A^Eq^EÀ^E.^F^?^Fï^F@^G°^G^C^Hs^Hã^H5 <82> î =
<8a>
ö
E^K<96>^K^F^Lq^LÄ^L/^M<9a>^M
^Nz^Nê^N<^O^O^A^Pq^Pá^PQ^QÁ^Q"^R|^R0f,00,00,806050a0,000001ab,000001c2,0014,00,01,01,00,^^^A^_^OJet examines you.^?1^@79,03,00,80c0c050,000001ac,000001c3,0035,00,01,02,00,^^^AJet begins to browse the merchandise in your bazaar.^@

Sample 2:
^@^@S^@<8c>^@à^@6^A<87>^A÷^AJ^B<97>^B^C^CR^Cº^C^V^Dg^D×^D*^Ew^Eã^E2^F<86>^Fù^FO^G<9d>^G
^HZ^HÉ^H^Y k ¹ &
v
Ò
#^K<93>^Kæ^K7^L§^Lú^LJ^M¹^M^K^NZ^NÈ^N^Y^Om^Oà^O6^P<85>^Pó^PD^Qc8,00,00,ff5020ff,00000000,00000000,001c,00,01,00,00,^^^A<<< Welcome to Phoenix! >>>^@c8,00,00,ff5020ff,00000001,00000001,0002,00,01,00,00,^^^A ^@00,00,00,80808080,00000002,00000002,001d,00,01,00,00,^^^A=== Area: Bastok Markets ===^@

This is the script I have written so far:
for i in "$log_dir"/*; do
    if [ -f "$i" ]; then
        # if the file is there
        filename=${i#$log_dir/}
        tr '\0' '\n' < "$log_dir/$filename" > temp.log
        csplit -s temp.log "/\0/"
        if [ -f xx01 ]; then
            cut -f1-12 -d , -s xx01 > "$clean_dir/$filename"
        fi
        rm -rf temp.log
        rm -rf xx*
    fi
done

Thanks for the short command Tintin.

ASKER CERTIFIED SOLUTION

Tintin

cpwems

ASKER

The data does have control characters like ^A and ^B, which indicate what color the text is supposed to be from that point on.

Tintin

Instead of your cut, try:

sed "s/[^a-f0-9]*\([a-f0-9][a-f0-9],.*\)/\1/" xx01 >$clean_dir/$filename

cpwems

ASKER

OK, that suggestion helped as long as I piped it through the cut, but I've found other problems. Here are some more raw sample files:

sample3:
^@^@B^@<9b>^@^A^AV^A¯^A^B^BR^B´^Bü^BY^C«^C^\^Dÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ0d,00,00,8020c0a0,0000015d,00000190,000b,00,01,01,00,^^^A(Goduro) k^@00,00,00,80808080,0000015e,00000191,0022,00,01,00,00,^^^ACash's title: Black Dragon Slayer^@

sample4:
^@^@q^@â^@3^A×^A^T^Be^B®^B^F^Cc^C¾^C0^Ds^D¶^Dü^D<8e>^Eä^EU^FØ^F:^GÆ^G^L^Hj^Hµ^H& t ¿ K
®
^N^Kj^KÐ^KB^L ^Lø^L{^MÊ^MV^N¼^N^P^Oe^Oº^O,^Po^Pè^Pt^QÔ^Q>^RÁ^R'^S79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.^@79,03,00,80c0c050,0000008d,00000097,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.^@

The problem with sample4 is that the header contains a comma, so a stray line is left over in the output:
,
79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
79,03,00,80c0c050,0000008d,00000097,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.

Tintin thanks again for all your help. I've been trying for weeks to do this myself.

Tintin

You should be able to just run:

grep , xx01 | sed "s/[^a-f0-9]*\([a-f0-9][a-f0-9],.*\)/\1/"

It works for me (note I have changed the control characters to X's and added newlines for the nulls to make it easier to test).

$ cat file
^@^@q^@â^@3^A×^A^T^Be^B®^B^F^Cc^C¾^C0^Ds^D¶^Dü^D<8e>^Eä^EU^FØ^F:^GÆ^G^L^Hj^Hµ^H& t ¿ K
®
XX,YY,ZZ79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
79,03,00,80c0c050,0000008d,00000097,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.

$ grep , file | sed "s/[^a-f0-9]*\([a-f0-9][a-f0-9],.*\)/\1/"

79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
79,03,00,80c0c050,0000008d,00000097,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
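If the short pattern ever matches inside the header, a stricter one that spells out the leading fixed-width hex fields may be more robust. The sketch below assumes the layout seen in the samples (three 2-digit fields, three 8-digit fields, then a 4-digit field) and a grep that supports -a and -o (GNU and BSD grep do). The name extract_records is only illustrative:

```shell
# Keep only text that starts with the fixed-width hex prefix seen in the samples:
# xx,xx,xx,xxxxxxxx,xxxxxxxx,xxxxxxxx,xxxx,
extract_records() {
    h='[0-9a-f]'
    pat="${h}{2},${h}{2},${h}{2},${h}{8},${h}{8},${h}{8},${h}{4},.*"
    # tr turns the NUL record separators into newlines; LC_ALL=C and -a make
    # grep treat the binary header as plain bytes; -o prints only the match,
    # which drops any header bytes glued to the front of the first record.
    tr '\0' '\n' < "$1" | LC_ALL=C grep -aoE "$pat"
}
```

It takes the raw log file directly, for example: extract_records "$log_dir/$filename" > "$clean_dir/$filename"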

aib_42

This looks like a job for awk, I wonder why it hasn't been suggested...
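For reference, an awk version might look like the sketch below. It is only an illustration and assumes the fixed-width hex layout seen in the samples (three 2-digit fields, three 8-digit fields, then a 4-digit field). The name clean_log is made up, and the pattern is built from strings so that it does not rely on {n} interval support, which some older awks lack:

```shell
clean_log() {
    tr '\0' '\n' < "$1" | LC_ALL=C awk '
        BEGIN {
            h2 = "[0-9a-f][0-9a-f]"
            h4 = h2 h2
            h8 = h4 h4
            # xx,xx,xx,xxxxxxxx,xxxxxxxx,xxxxxxxx,xxxx,
            pat = h2 "," h2 "," h2 "," h8 "," h8 "," h8 "," h4 ","
        }
        # Print from where the record prefix starts; lines without it
        # (pure header fragments) are skipped.
        match($0, pat) { print substr($0, RSTART) }
    '
}
```

Called as clean_log somefile.log, it writes one record per line to standard output, with any header bytes in front of the first record removed.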

cpwems

ASKER

Well, Tintin pointed me in the right direction and I was able to solve it myself. I found a huge Unix book, looked through all the commands in it, and stumbled on the dd command. The code below works on my log files. I never realized that each control character is shown as two characters (like ^A) but takes up only one byte, which is why I couldn't figure out the size of the header. I would still love to know what is in the header, but it's not important.

log_dir='/home/tabber/ffxi_logs'
clean_dir='/home/tabber/clean'

for i in "$log_dir"/*; do
    if [ -f "$i" ]; then
        # if the file is there
        filename=${i#$log_dir/}
        dd bs=1 skip=100 < "$log_dir/$filename" | tr '\0' '\n' > "$clean_dir/$filename"
    fi
done
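The same idea can probably be done without bs=1, which makes dd copy one byte at a time. A possible variant, assuming the header really is always 100 bytes as found above, uses tail -c +101 to start the output at byte 101. The name strip_header is only illustrative:

```shell
# Skip the 100-byte header, then turn NUL record separators into newlines.
strip_header() {
    tail -c +101 "$1" | tr '\0' '\n'
}
```

Inside the loop it would replace the dd line, for example: strip_header "$i" > "$clean_dir/$filename"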

I'm not too sure how to assign points, but Tintin, you will get them all. Please let me know if I do it wrong.