cpwems
asked on
Parsing another program's log file
I've been collecting log files from a program I'm using, and 90% of the time I can correctly parse the log with my current script. Unfortunately I can find no rhyme or reason to the header they use for the log file: it is never exactly the same characters or bits, and it never ends in the same character. The data I want is always in the following form:
11,22,3333,4444,555555,666666,77,88,99,10,11,1212121212121212121212
Now the problem I am having is that sometimes the header has a comma "," in it, and the cut command will keep that line of the header. I've only seen one line with this problem, at the top of the file, but I would like to check all lines against this common format and remove the lines that do not follow it. I am new to bash scripting and cannot figure out how to do this.
NOTE: the 12th comma-delimited field could itself contain a comma, which is another problem I can't figure out. If there is a comma in the 12th field, I could end up with fields 13, 14, 15, etc. if I use the cut command. Is there a way to have the last field include everything after the 11th comma, or does it already do that? The cut command I've been using is:
cut -f "1 2 3 4 5 6 7 8 9 10 11 12" -d , -s < file.log
Thanks for any help.
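One way to drop lines that do not follow the format is to filter on the shape of the first 11 fields instead of counting commas. The following is only a minimal sketch: it assumes each record sits on its own line and that the first 11 fields contain nothing but hexadecimal digits (0-9, a-f), as in the samples posted further down. `keep_records` is a made-up name.

```shell
#!/bin/sh
# Keep only lines that begin with 11 comma-terminated hex fields.
# Assumption: the first 11 fields contain only 0-9 and a-f.
# Everything after the 11th comma, including further commas, is left intact.
keep_records() {
    LC_ALL=C grep -E '^([0-9a-f]+,){11}'
}
```

It would be used as `keep_records < file.log`. Because grep prints the whole matching line, a comma inside the 12th field does no harm.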
ASKER
Below are two examples of the beginning of the log file. In the first example the data line starts at '0f,00,00,806050a0', and in the second example the data line starts at 'c8,00,00,ff5020ff,'. Each data line is separated by the NUL character '^@'. I have many log files, each about 4.5k, so I can send whole files if anyone needs them.
Sample 1:
^@^@M^@¹^@^H^Ah^AÈ^A^W^B<85>^BÖ^BE^C<94>^C^B^DS^Dª^D^A^Eq^EÀ^E.^F^?^Fï^F@^G°^G^C^Hs^Hã^H5 <82> î =
<8a>
ö
E^K<96>^K^F^Lq^LÄ^L/^M<9a>^M
^Nz^Nê^N<^O^O^A^Pq^Pá^PQ^QÁ^Q"^R|^R0f,00,00,806050a0,000001ab,000001c2,0014,00,01,01,00,^^^A^_^OJet examines you.^?1^@79,03,00,80c0c050,000001ac,000001c3,0035,00,01,02,00,^^^AJet begins to browse the merchandise in your bazaar.^@
Sample 2:
^@^@S^@<8c>^@à^@6^A<87>^A÷^AJ^B<97>^B^C^CR^Cº^C^V^Dg^D×^D*^Ew^Eã^E2^F<86>^Fù^FO^G<9d>^G
^HZ^HÉ^H^Y k ¹ &
v
Ò
#^K<93>^Kæ^K7^L§^Lú^LJ^M¹^M^K^NZ^NÈ^N^Y^Om^Oà^O6^P<85>^Pó^PD^Qc8,00,00,ff5020ff,00000000,00000000,001c,00,01,00,00,^^^A<<< Welcome to Phoenix! >>>^@c8,00,00,ff5020ff,00000001,00000001,0002,00,01,00,00,^^^A ^@00,00,00,80808080,00000002,00000002,001d,00,01,00,00,^^^A=== Area: Bastok Markets ===^@
This is the script I have written so far:
for i in "$log_dir"/*; do
    if [ -f "$i" ]; then
        # if the file is there
        filename=${i#"$log_dir"/}
        # turn each NUL record separator into a newline
        tr '\0' '\n' < "$log_dir/$filename" > temp.log
        csplit -s temp.log "/\0/"
        if [ -f xx01 ]; then
            # keep the first 12 comma-separated fields; -s drops lines without a comma
            cut -f1-12 -d , -s xx01 > "$clean_dir/$filename"
        fi
        rm -f temp.log
        rm -f xx*
    fi
done
Thanks for the short command Tintin.
ASKER
The data does have control characters like ^A and ^B, which indicate what color the text is supposed to be from that point on.
Instead of your cut, try:
sed "s/[^a-f0-9]*\([a-f0-9][a-f0-9],.*\)/\1/" xx01 >$clean_dir/$filename
ASKER
OK, that suggestion helped as long as I piped its output through cut, but I've found other problems. Here are some more raw sample files:
sample3:
^@^@B^@<9b>^@^A^AV^A¯^A^B^BR^B´^Bü^BY^C«^C^\^Dÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ0d,00,00,8020c0a0,0000015d,00000190,000b,00,01,01,00,^^^A(Goduro) k^@00,00,00,80808080,0000015e,00000191,0022,00,01,00,00,^^^ACash's title: Black Dragon Slayer^@
sample4:
^@^@q^@â^@3^A×^A^T^Be^B®^B^F^Cc^C¾^C0^Ds^D¶^Dü^D<8e>^Eä^EU^FØ^F:^GÆ^G^L^Hj^Hµ^H& t ¿ K
®
^N^Kj^KÐ^KB^L ^Lø^L{^MÊ^MV^N¼^N^P^Oe^Oº^O,^Po^Pè^Pt^QÔ^Q>^RÁ^R'^S79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.^@79,03,00,80c0c050,0000008d,00000097,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.^@
The problem with sample4 is that in the header there is a comma and it is left over with:
,
79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
79,03,00,80c0c050,0000008d,00000097,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
Tintin thanks again for all your help. I've been trying for weeks to do this myself.
You should be able to just run:
grep , xx01 | sed "s/[^a-f0-9]*\([a-f0-9][a-f0-9],.*\)/\1/"
It works for me (note I have changed the control characters to X's and added newlines for the nulls to make it easier to test).
$ cat file
^@^@q^@â^@3^A×^A^T^Be^B®^B^F^Cc^C¾^C0^Ds^D¶^Dü^D<8e>^Eä^EU^FØ^F:^GÆ^G^L^Hj^Hµ^H& t ¿ K
®
XX,YY,ZZ79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
79,03,00,80c0c050,0000008d,00000097,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
$ grep , file | sed "s/[^a-f0-9]*\([a-f0-9][a-f0-9],.*\)/\1/"
79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
79,03,00,80c0c050,0000008d,00000097,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
This looks like a job for awk, I wonder why it hasn't been suggested...
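For the record, an awk version might look something like the sketch below. It is not taken from any answer in this thread. It assumes a record begins with three 2-digit hex fields followed by a longer hex field, as in the posted samples, and that a valid record has at least 12 comma-separated fields. `extract_records` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: print only the record part of each NUL-terminated entry.
# Assumption: a record starts with "hh,hh,hh," followed by one more hex field.
extract_records() {
    # Turn NUL separators into newlines so awk sees one entry per line.
    tr '\0' '\n' |
    LC_ALL=C awk '
        {
            # Locate the first "hh,hh,hh,h...," run; anything before it is header junk.
            if (match($0, /[0-9a-f][0-9a-f],[0-9a-f][0-9a-f],[0-9a-f][0-9a-f],[0-9a-f]+,/)) {
                rec = substr($0, RSTART)
                # Print the record only if it has at least 12 comma-separated fields.
                if (split(rec, f, ",") >= 12)
                    print rec
            }
        }'
}
```

It would be used as `extract_records < "$log_dir/$filename"`. Since the whole remainder of the line is printed, commas in the message text are kept, and header lines that merely contain a comma are dropped.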
ASKER
Well, Tintin set me thinking in the right direction to solve it myself. I found a huge Unix book, looked through all the commands in it, and stumbled on the dd command. Here is the code that works on my log files. I never realized that the control characters took up two bytes, which is why I couldn't figure out the size of the header. I would still love to know what is in the header, but it's not important.
log_dir='/home/tabber/ffxi_logs'
clean_dir='/home/tabber/clean'
for i in "$log_dir"/*; do
    if [ -f "$i" ]; then
        # if the file is there
        filename=${i#"$log_dir"/}
        # skip the 100-byte header, then turn NUL separators into newlines
        dd bs=1 skip=100 < "$log_dir/$filename" | tr '\0' '\n' > "$clean_dir/$filename"
    fi
done
I'm not too sure how to assign points, but Tintin, you will get them all. Please let me know if I do it wrong.
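As a side note, `dd bs=1` copies one byte at a time, which can be slow on larger files. A minimal sketch of an equivalent using `tail`, assuming the header really is always exactly 100 bytes as found above (`strip_header` is a made-up name):

```shell
#!/bin/sh
# Hypothetical helper: drop a fixed 100-byte header, then split records.
# Assumption: the header is always exactly 100 bytes long.
strip_header() {
    # tail -c +101 starts output at byte 101, i.e. it skips the first 100 bytes.
    tail -c +101 | tr '\0' '\n'
}
```

Inside the loop it would replace the dd pipeline: `strip_header < "$log_dir/$filename" > "$clean_dir/$filename"`.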
BTW, your current cut command can be reduced to
cut -f1-12 -d, -s <file.log
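On the earlier question of whether the last field keeps everything after the final comma: with `-f1-12` it does not, since cut prints only the fields listed. A small sketch that shows the difference on made-up input (the function names are only for illustration):

```shell
#!/bin/sh
# Demo input: one line without commas and one line with 13 comma-separated fields.
demo_input() {
    printf '%s\n' 'no delimiter here' 'a,b,c,d,e,f,g,h,i,j,k,l,m'
}

# -s drops the line with no comma; -f1-12 drops the 13th field ("m").
first_twelve() {
    cut -f1-12 -d, -s
}

# An open-ended range (1-) keeps every field, so extra commas at the end survive.
all_fields() {
    cut -f1- -d, -s
}
```

Running `demo_input | first_twelve` should print the line cut off after `l`, while `demo_input | all_fields` should print all 13 fields. In both cases the line without a comma is suppressed by `-s`.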