I currently edit my text files by hand. They contain data in the following format.
1085961616.474 172 126.96.36.199 TCP_MISS/200 2146 GET http://cfg.mywebsearch.com/mysaconfg.jsp?[07fiG10lKmU4M3R:IQ.TCH] - DIRECT/188.8.131.52 text/html
1085961622.602 60 184.108.40.206 TCP_HIT/200 9476 GET http://image.linkexchange.com/01/73/69/31/banner468x60.gif - NONE/- image/gif
1085961627.502 159 220.127.116.11 TCP_REFRESH_HIT/304 339 GET http://scripts.lycos.com/catman/login.mail.lycos.com.cm/logout.js
1085961648.792 6 18.104.22.168 TCP_MISS/503 1585 GET http://erp.water.com:9930/ - NONE/- text/html
I edit these files so that only the URLs remain.
I filter the URLs in two steps:
1) I look for every line that has anything other than _hit/200 or _miss/200 and remove it from the file.
2) Then I take the lines with _hit/200 or _miss/200 and copy only the URL from each.
Currently I only have 10 records in the text file, but down the line I'll be receiving files that are megabytes in size. So I am trying to make my life easier beforehand by writing a shell script.
Here is my algorithm: find the lines that contain neither _hit/200 nor _miss/200 and delete them. Then, for the lines that are left, strip everything before http: and everything after .js, .jsp, .html or .gif. That leaves me with only the URLs.
I am not very good at script syntax, but here's what I've come up with so far.
sed -n -d '/!_hit/200 /='log.txt -- search the log.txt file for the lines with the given expression and delete them. It's not correct, as it doesn't give me the expected result, i.e. deleting the other lines.
sed -n -d '/!_miss/200 /='log.txt -- same here
sed -n -d '/^* /200 /='log.txt -- for this one I checked the man pages, and I believe I am supposed to supply the source and destination text, but I don't know what expression would fit.
Could someone tell me what I am doing wrong?
Thanks in advance
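
For reference, the two steps described above (keep only the _HIT/200 and _MISS/200 lines, then keep only the URL) can be sketched as a small shell function. This is only a rough sketch under a few assumptions, not a definitive solution: the function name extract_urls is made up for illustration, each log record is assumed to sit on a single line, the result codes are assumed to be upper case as in the sample data, and everything from the first "?" onward is dropped instead of matching on the .js/.jsp/.html/.gif endings.

```sh
#!/bin/sh
# extract_urls is a hypothetical helper name, not something from the post.
# It reads log lines on stdin and prints one URL per matching line.
extract_urls() {
    # Step 1: keep only lines whose result code ends in _HIT/200 or _MISS/200.
    # Note that this also keeps codes such as TCP_REFRESH_HIT/200.
    grep -E '_(HIT|MISS)/200 ' |
    # Step 2: print just the URL. The pattern takes the field that starts
    # with "http:" and stops at the first space or "?", so any query string
    # is cut off. -n plus the trailing p prints only lines where this matched.
    sed -n 's/^.* \(http:[^ ?]*\).*$/\1/p'
}

# Assumed usage: extract_urls < log.txt > urls.txt
```

Splitting the work between grep (selecting lines) and sed (rewriting them) keeps each pattern short. The slash in the grep pattern needs no escaping, whereas inside a sed /.../ address it would have to be written as \/.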