amigan_99 asked:

Using sort and uniq to pare down large syslog

I have a very large syslog file with many duplicate records sharing the same source IP, destination IP, and destination port number. What I want is to abbreviate it so that I get just one line for each unique source IP, destination IP, and destination port, along with a count of how many times each occurs. It looks to me like the unix sort and uniq commands might be a start, but syslog records contain more fields than just the ones I mentioned, so the example here, while intriguing, makes it look like they might not be up to the task on their own: http://www.computerhope.com/unix/uuniq.htm Perhaps I could run sed and awk first to limit the fields in play and *then* run sort and finally uniq, and then get a count somehow? Any thoughts on the approach would be welcome.
Abhimanyu Suri:

Please share an excerpt from the file.

Thanks
@Amigan,

Without an input file, we cannot propose the exact command needed to get the expected result.
A sample command to start with:
   /bin/sort -u inputfile.txt
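Note that sort -u collapses duplicates but drops the counts; uniq -c keeps a count per unique line. A quick comparison (extracted.txt here is a stand-in for whatever file holds your already-extracted fields):
   sort -u extracted.txt          # one copy of each unique line, no counts
   sort extracted.txt | uniq -c   # each unique line prefixed with its count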
You are right, sort and uniq will give the count of unique entries:
   awk '{extract just the source IP, destination IP and destination port}' syslog | sort | uniq -c
will produce one line for each unique IP address/port set, with the number of instances of that line, followed by the line itself, e.g.
  5   1.2.3.4 6.7.8.9 4000
  7   192.1.0.3 10.1.3.70 443
  1   192.168.0.33 172.34.0.3 53
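Without seeing an excerpt of your file I can only guess at the field positions, but as a hypothetical sketch - here assuming the source IP is field 8, the destination IP field 10, and the destination port field 12; adjust the $N's to your actual syslog format:
   # field positions are hypothetical - adjust to match your syslog layout
   awk '{print $8, $10, $12}' /var/log/syslog | sort | uniq -c | sort -rn
The trailing sort -rn is optional; it orders the output numerically by count, busiest source/destination pair first.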
ASKER CERTIFIED SOLUTION
MURUGESAN N

This solution is only available to members of Experts Exchange.
amigan_99 (Asker):

Great TY
I'm a little bemused by the answer you've accepted - it doesn't seem to answer your query. Nor does their earlier posting (sort -u gives one copy of each unique entry, but doesn't tell you how many of each).

Yes, my solution used three programs, but a) they are all quite small (30 to 90 kilobytes on my system), and b) they are written in C, so they are blindingly fast. You said that the source files were very large, so the time taken by an interpreter such as awk will greatly exceed the extra time taken to load three small programs. The unix philosophy is "do one thing and do it well" - and that's exactly what sort and uniq do here.
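By way of illustration (not my exact commands - the field numbers are made up and depend on the log format), cut, sort and uniq are all small, fast C binaries:
   # hypothetical field positions; assumes single-space-separated fields
   cut -d' ' -f5,7,9 syslog.txt | sort | uniq -c
(If the fields are separated by runs of spaces rather than single spaces, squeeze them first with tr -s ' '.)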

And I didn't put the full path in because if /usr/bin is not in your PATH, or someone using your system decides to define an "awk" function, then more things will go wrong than you can imagine.
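For example, a function definition silently shadows any awk found via PATH (a contrived illustration):
   # defining a function called awk hijacks every bare "awk" in the session
   awk() { echo "not the real awk"; }
   printf 'a b\n' | awk '{print $2}'           # runs the function, not awk
   printf 'a b\n' | command awk '{print $2}'   # "command" bypasses the function, prints "b"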