Geoff Millikan
asked on
Regex for parsing out Apache common log (plus one extra field)
The pattern below parses the fields out of $str1 just fine, but it doesn't work on the other strings. Can you fix the regex pattern so it works on all of them?
Thanks, http://www.t1shopper.com/
PS: We've been working to get this regex right for a few years, but exceptions keep coming up. Here are the threads with past answers:
https://www.experts-exchange.com/questions/26028925/Help-fixing-my-Regex-pattern.html
https://www.experts-exchange.com/questions/26184204/Parsing-text-string-with-regex-preg-match.html
https://www.experts-exchange.com/questions/25968268/Help-fixing-my-Regex-pattern.html
https://www.experts-exchange.com/questions/23544714/Simple-Regex-Question.html
$str1='67.195.37.124 - - [20/Apr/2010:05:32:22 +0000] "GET /us/ga/b.html HTTP/1.0" 200 1761 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)" "-"';
$str2='67.225.164.12 - - [03/Jan/2011:21:15:39 +0000] "GET / HTTP/1.1" 200 8973 "" "\"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)\"" "-"';
$str3='77.238.196.184 - - [01/Jan/2011:20:54:07 +0000] "GET /tools/port-scan/result/?scan_host=cairosat.zapto.org&ports=2000&portscansubmit=Scan&port_start=&port_end= HTTP/1.1" 200 9640 "http://www.t1shopper.com/tools/port-scan/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.2.8) Gecko/20100722 Firefox/2.0.0.3 \"MEGAUPLOAD 1.0\"" "-"';
$str4='91.55.106.90 - - [27/Dec/2010:17:16:07 +0000] "GET /ssi/t1shopper.js HTTP/1.1" 200 2012 "http://www.t1shopper.com/tools/port-scan/" "\"Bundestrojaner 2.0 - www.rettedeinefreiheit.de\"" "-"';
$str5='72.37.171.76 - - [22/Dec/2010:17:55:51 +0000] "GET /tools/calculate/ HTTP/1.0" 200 37778 "http://www.bing.com/search?q=\"kilobyte+to+megabyte\"+\"converter\"&src=IE-Address" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; InfoPath.2; MS-RTC LM 8)" "-"';
$log_pattern = '#^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^ ]+) ([^ ]+) ([^/]+)/([^"]+)" ([^ ]+) ([^ ]+) "([^"]*)" "([^"]+)" "([^"]+)"#';
preg_match($log_pattern, $str2, $matches);
print_r($matches);
Here's an example of accessing the named capture groups:
print "Access Date: " . $matches['accessdate'] . "<br />";
print "Request: " . $matches['request'] . "<br />";
print "Response: " . $matches['response'] . "<br />";
ASKER
Touche! :-)
We shouldn't need two patterns because there are always 10 fields to parse out of the text string. Sometimes a field will be empty, like "", but there will always be 10 fields, as listed below.
That said, my version of PHP (5.1.6) isn't liking the question marks or angle brackets (I think that's the named capture group part?).
PHP Warning: preg_match(): Compilation failed: unrecognized character after (?< at offset 4
For the named capture groups, the fields in the text string are:
1. %h Remote host
2. %l Remote logname (from identd, if supplied). This will return a dash unless mod_ident is present and IdentityCheck is set On.
3. %u Remote user (from auth; may be bogus if return status (%s) is 401)
4. %t Time the request was received (standard English format)
5. \"%r\" First line of request
6. %>s Status. For requests that got internally redirected, %s is the status of the *original* request and %>s is the status of the last.
7. %b Size of response in bytes, excluding HTTP headers. In CLF format, i.e. a '-' rather than a 0 when no bytes are sent.
8. \"%{Referer}i\" Referer request header
9. \"%{User-Agent}i\" User-Agent request header
10. \"%{PHPSESSID}C\" Contents of the PHPSESSID cookie
I'm using the standard combined log format, as described on the Apache site, plus one extra field:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{PHPSESSID}C\"" combinedcookie
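For readers hitting the same warning: the `(?<name>...)` spelling of a named group only arrived in later PCRE releases, and older builds such as the one that appears to ship with PHP 5.1.6 accept only the `(?P<name>...)` spelling. The sketch below is one possible pattern for the ten fields above, not the pattern from the hidden answers. It assumes Apache writes an embedded quote as `\"`, so each quoted field is matched as a run of ordinary characters or backslash-escaped ones. The helper `quoted_field()` and every group name other than `accessdate`, `request` and `response` are made up for illustration.

```php
<?php
// Build a named group for one double-quoted field. The body accepts either
// an ordinary character or a backslash-escaped one, so \" stays inside the field.
function quoted_field($name)
{
    return '"(?P<' . $name . '>(?:[^"\\\\]|\\\\.)*)"';
}

// Group names other than accessdate, request and response are hypothetical.
$log_pattern = '#^(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    . '\[(?P<accessdate>[^\]]+)\] '
    . quoted_field('request') . ' '
    . '(?P<response>\S+) (?P<bytes>\S+) '
    . quoted_field('referer') . ' '
    . quoted_field('useragent') . ' '
    . quoted_field('session') . '$#';

$str2 = '67.225.164.12 - - [03/Jan/2011:21:15:39 +0000] "GET / HTTP/1.1" 200 8973 "" "\"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)\"" "-"';

if (preg_match($log_pattern, $str2, $matches)) {
    print "Access Date: " . $matches['accessdate'] . "\n";
    print "Referer: [" . $matches['referer'] . "]\n";
    print "User agent: " . $matches['useragent'] . "\n";
}
```

The request line is kept as a single field here rather than split into method, path and protocol, which avoids making assumptions about malformed requests. The captured values still contain the backslash escapes exactly as they were logged.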
ASKER
Ray_Paseur: You and I came to the same result at the same time. You hit the nail on the head, again.
Seems we were not the first ones to think of this:
http://serverfault.com/questions/192654/potential-issues-with-tab-delimited-logformat-for-apache-2-2x
For the record, we ended up solving this by making TAB the new delimiter and explode()'ing just like Ray suggested. So the new config in Apache is:
LogFormat "%h\t%l\t%u\t%t\t%r\t%>s\t%b\t%{Referer}i\t%{User-Agent}i\t%{PHPSESSID}C" combinedcookie
Or, for posterity, the official NCSA extended/combined log format with TAB as the delimiter would be:
LogFormat "%h\t%l\t%u\t%t\t\"%r\"\t%>s\t%b\t\"%{Referer}i\"\t\"%{User-agent}i\""
http://httpd.apache.org/docs/current/mod/mod_log_config.html#logformat
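A minimal sketch of the explode() approach against the tab-delimited combinedcookie format above. The sample line is $str1 rewritten by hand into that format, and the function name `parse_tab_log_line()` and the field names are hypothetical, chosen only for illustration.

```php
<?php
// Field names are hypothetical; they follow the order of the LogFormat directive.
$names = array('host', 'logname', 'user', 'accessdate', 'request',
               'response', 'bytes', 'referer', 'useragent', 'session');

// Split one tab-delimited log line into a name => value array.
// Returns false when the line does not contain exactly one value per name.
function parse_tab_log_line($line, $names)
{
    $fields = explode("\t", rtrim($line, "\r\n"));
    if (count($fields) !== count($names)) {
        return false;
    }
    return array_combine($names, $fields);
}

// Hand-converted sample line (an assumption, not real log output).
$line = "67.195.37.124\t-\t-\t[20/Apr/2010:05:32:22 +0000]\tGET /us/ga/b.html HTTP/1.0\t200\t1761\t-\tMozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)\t-\n";

$entry = parse_tab_log_line($line, $names);
if ($entry !== false) {
    // %t still arrives wrapped in square brackets, so strip them here.
    print "Access Date: " . trim($entry['accessdate'], '[]') . "\n";
    print "Request: " . $entry['request'] . "\n";
    print "Response: " . $entry['response'] . "\n";
}
```

Checking the field count before combining means a truncated or otherwise malformed line is rejected instead of being silently misaligned.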
Thanks for the points. Glad you're onto a good solution, ~Ray
This works with the above cases. Note, I've also included named capture groups. It should make extraction a tad more intuitive :)