Parsing another program's log file

I've been collecting log files from a program I'm using, and 90% of the time I can correctly parse the log with my current script.  Unfortunately I can find no rhyme or reason to the header they use for the log file: it is never exactly the same characters, the same length, or ending in the same character.  The data I want is always in the following form:

11,22,3333,4444,555555,666666,77,88,99,10,11,1212121212121212121212

Now the problem I am having is that sometimes the header has a comma "," in it and the cut command will keep that line of the header.  I've only experienced one line having this problem at the top of the file but would like to check all lines for this common format and remove the lines that do not follow this format.  I am new at bash scripting and can not figure out how to do this.

NOTE: the 12th comma-delimited field could contain a comma, which is another problem I can't figure out.  If there is a comma in the 12th field, it could give me fields 13, 14, 15, etc. if I use the cut command.  Is there a way to have the last field include everything after the 11th comma, or does it already do that?  The cut command I've been using is:

cut -f "1 2 3 4 5 6 7 8 9 10 11 12" -d , -s < file.log

Thanks for any help.
cpwems Asked:
 
Tintin Commented:
Are the ^R, ^Q etc, actual control characters?

If so, it would appear (from the sample data) that each header line ends in a control character.  Assuming the data has no control characters, you could strip out all lines that end in a control character, which would leave you with just the data.
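
A minimal sketch of that idea, assuming the records have already been split one per line, that the header sits on its own lines, and that no data line ends in a control character. The function name drop_ctrl_terminated is made up for illustration. The -a flag tells grep to treat binary input as text, and LC_ALL=C keeps the character class byte-oriented.

```shell
# Hypothetical helper: drop every line whose last byte is a control character.
drop_ctrl_terminated() {
  LC_ALL=C grep -av '[[:cntrl:]]$'
}

# Made-up input: one header-like line ending in ^R (octal 022), one data line.
printf 'junk\022\n0f,00,00,data\n' | drop_ctrl_terminated
```

If a header line happens to end in a printable byte, or a record is glued onto the end of the header, this filter alone will not be enough.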
 
Tintin Commented:
Can you show us a real sample header and data line?  There must be something else that can be used to distinguish them.

BTW, your current cut command can be reduced to

cut -f1-12 -d, -s <file.log
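
On the question of commas in the last field: a field list of 1-12 stops at the 12th field, so anything after a comma inside the message is cut off, while an open-ended range such as 12- runs to the end of the line and keeps those commas. The sketch below shows both on a made-up record; the sample line and the function names are assumptions for illustration only.

```shell
# Hypothetical record whose 12th field contains commas.
line='0f,00,00,806050a0,000001ab,000001c2,0014,00,01,01,00,Hello, world, again'

# -f1-12 stops at the 12th field, so the text after "Hello" is lost.
first_twelve() { printf '%s\n' "$1" | cut -d, -f1-12; }

# -f12- keeps field 12 and everything after it, commas included.
message_only() { printf '%s\n' "$1" | cut -d, -f12-; }

first_twelve "$line"
message_only "$line"
```

By the same logic, cut -d, -f1- -s would keep whole lines while still dropping lines that contain no comma at all.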

 
cpwems (Author) Commented:
Below are two examples of the beginning of the log file.  In the first example the data starts at '0f,00,00,806050a0', and in the second it starts at 'c8,00,00,ff5020ff,'.  Each data line is separated by the NUL character '^@'.  I have many log files, each about 4.5k, so I can send whole files if anyone needs them.

Sample 1:
^@^@M^@¹^@^H^Ah^AÈ^A^W^B<85>^BÖ^BE^C<94>^C^B^DS^Dª^D^A^Eq^EÀ^E.^F^?^Fï^F@^G°^G^C^Hs^Hã^H5       <82>    î       =
<8a>
ö
E^K<96>^K^F^Lq^LÄ^L/^M<9a>^M
^Nz^Nê^N<^O­^O^A^Pq^Pá^PQ^QÁ^Q"^R|^R0f,00,00,806050a0,000001ab,000001c2,0014,00,01,01,00,^^^A^_^OJet examines you.^?1^@79,03,00,80c0c050,000001ac,000001c3,0035,00,01,02,00,^^^AJet begins to browse the merchandise in your bazaar.^@

Sample 2:
^@^@S^@<8c>^@à^@6^A<87>^A÷^AJ^B<97>^B^C^CR^Cº^C^V^Dg^D×^D*^Ew^Eã^E2^F<86>^Fù^FO^G<9d>^G
^HZ^HÉ^H^Y      k       ¹       &
v
Ò
#^K<93>^Kæ^K7^L§^Lú^LJ^M¹^M^K^NZ^NÈ^N^Y^Om^Oà^O6^P<85>^Pó^PD^Qc8,00,00,ff5020ff,00000000,00000000,001c,00,01,00,00,^^^A<<< Welcome to Phoenix! >>>^@c8,00,00,ff5020ff,00000001,00000001,0002,00,01,00,00,^^^A ^@00,00,00,80808080,00000002,00000002,001d,00,01,00,00,^^^A=== Area: Bastok Markets ===^@

This is the script I have written so far:
for i in $log_dir/*; do
  if [ -f $i ]; then
  # if the file is there
    filename=${i#$log_dir/}
    tr '\0' '\n' < $log_dir/$filename > temp.log
    csplit -s temp.log "/\0/"
    if [ -f xx01 ]; then
      cut -f1-12 -d , -s xx01 > $clean_dir/$filename
    fi
    rm -rf temp.log
    rm -rf xx*
  fi
done


Thanks for the short command Tintin.
 
cpwems (Author) Commented:
The data does have control characters like ^A and ^B, which indicate what color the text is supposed to be from that point on.
 
Tintin Commented:
Instead of your cut, try:

sed "s/[^a-f0-9]*\([a-f0-9][a-f0-9],.*\)/\1/" xx01 >$clean_dir/$filename
 
cpwems (Author) Commented:
Ok that suggestion helped out as long as I piped it with the cut, but I've found other problems, here are some more raw sample files:

sample3:
^@^@B^@<9b>^@^A^AV^A¯^A^B^BR^B´^Bü^BY^C«^C^\^Dÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ0d,00,00,8020c0a0,0000015d,00000190,000b,00,01,01,00,^^^A(Goduro) k^@00,00,00,80808080,0000015e,00000191,0022,00,01,00,00,^^^ACash's title: Black Dragon Slayer^@

sample4:
^@^@q^@â^@3^A×^A^T^Be^B®^B^F^Cc^C¾^C0^Ds^D¶^Dü^D<8e>^Eä^EU^FØ^F:^GÆ^G^L^Hj^Hµ^H&        t       ¿       K
®
^N^Kj^KÐ^KB^L ^Lø^L{^MÊ^MV^N¼^N^P^Oe^Oº^O,^Po^Pè^Pt^QÔ^Q>^RÁ^R'^S79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.^@79,03,00,80c0c050,0000008d,00000097,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.^@

The problem with sample4 is that in the header there is a comma and it is left over with:
,
79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
79,03,00,80c0c050,0000008d,00000097,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.


Tintin thanks again for all your help.  I've been trying for weeks to do this myself.
 
Tintin Commented:
You should be able to just run:

grep , xx01 | sed "s/[^a-f0-9]*\([a-f0-9][a-f0-9],.*\)/\1/"

It works for me (note I have changed the control characters to X's and added newlines for the nulls to make it easier to test).

$ cat file
^@^@q^@â^@3^A×^A^T^Be^B®^B^F^Cc^C¾^C0^Ds^D¶^Dü^D<8e>^Eä^EU^FØ^F:^GÆ^G^L^Hj^Hµ^H&        t       ¿       K
®
XX,YY,ZZ79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
79,03,00,80c0c050,0000008d,00000097,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.

$ grep , file | sed "s/[^a-f0-9]*\([a-f0-9][a-f0-9],.*\)/\1/"

79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
79,03,00,80c0c050,0000008d,00000097,003a,00,01,02,00,^^^ASearch result: Only one person found in the entire world.
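
A stricter variant, as a sketch only: instead of keeping any line with a comma, match the record layout seen in the samples (three 2-digit hex fields, three 8-digit, one 4-digit, four 2-digit, then free text) and print just the matching part of each line. The field widths are inferred from the posted samples and may not hold for every log; extract_records is a made-up name.

```shell
# Hypothetical helper: print only the part of each line that looks like a record.
# -a treats binary input as text, -o prints just the match, -E enables {n} counts.
extract_records() {
  LC_ALL=C grep -aoE \
    '([0-9a-f]{2},){3}([0-9a-f]{8},){3}[0-9a-f]{4},([0-9a-f]{2},){4}.*'
}

# Made-up input: a stray "," line and a record with header junk glued in front.
printf ',\nXX,YY,ZZ79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,Search result\n' |
  extract_records
```

Because the pattern ends in .*, commas inside the message text are kept, and a leftover line containing only "," never matches.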



 
aib_42 Commented:
This looks like a job for awk, I wonder why it hasn't been suggested...
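
For what it's worth, one way an awk filter could look, assuming the NULs have already been turned into newlines with tr. It keeps a line only when it has at least 12 comma-separated fields, the first is two hex digits, and the fourth is eight characters long. The checks and the name records_only are illustrative guesses based on the samples in this thread, not a known specification of the log format.

```shell
# Hypothetical filter: keep lines that look like complete records.
records_only() {
  LC_ALL=C awk -F, 'NF >= 12 && $1 ~ /^[0-9a-f][0-9a-f]$/ && length($4) == 8'
}

# Made-up input: a stray comma, one valid record, one short junk line.
printf ',\n79,03,00,80c0c050,0000008c,00000096,003a,00,01,02,00,a, b\njunk,1\n' |
  records_only
```

Unlike the sed approach, this drops a line outright rather than trimming junk from its front, so a record glued to the end of the header would be lost.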
 
cpwems (Author) Commented:
Well Tintin set me in the right direction in thought for me to solve it myself.  Found some huge unix book and just looked through it at all the commands and stumbled on the dd command.  So here is the following code that works on my log files.  I never realized that the control characters took up two bytes, hence why I couldn't figure out the size of the header.  Would still love to know what is in the header but it's not important.

log_dir='/home/tabber/ffxi_logs'
clean_dir='/home/tabber/clean'
                                                                               
for i in $log_dir/*; do
  if [ -f $i ]; then
  # if the file is there
    filename=${i#$log_dir/}
    dd bs=1 skip=100 < $log_dir/$filename | tr '\0' '\n' > $clean_dir/$filename
  fi
done
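
The same idea can be sketched with tail -c, which starts output at a given byte and avoids the one-byte-at-a-time reads of dd bs=1. This assumes, as in the script above, that the header is always exactly 100 bytes; that figure comes from this thread and has not been checked against other logs. The function names are made up, and the variables are quoted so file names with spaces survive.

```shell
# Hypothetical helper: skip the first 100 bytes, then put each record on its own line.
strip_header() {
  tail -c +101 -- "$1" | tr '\0' '\n'
}

# Hypothetical helper: clean every regular file in directory $1 into directory $2.
clean_logs() {
  src=$1
  dst=$2
  for f in "$src"/*; do
    [ -f "$f" ] || continue            # skip directories and unmatched globs
    strip_header "$f" > "$dst/${f##*/}"
  done
}

# Example call with the directories from the script above:
# clean_logs /home/tabber/ffxi_logs /home/tabber/clean
```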


I'm not too sure how to assign points, but Tintin, you will get them all.  Please let me know if I do it wrong.
Question has a verified solution.
