Geoff Millikan


Regex for parsing out Apache common log (plus one extra field)

The pattern below parses the fields out of $str1 just fine, but it doesn't work on the other strings.  Can you fix the regex pattern so it works on all of them?

Thanks, http://www.t1shopper.com/

$str1='67.195.37.124 - - [20/Apr/2010:05:32:22 +0000] "GET /us/ga/b.html HTTP/1.0" 200 1761 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)" "-"';

$str2='67.225.164.12 - - [03/Jan/2011:21:15:39 +0000] "GET / HTTP/1.1" 200 8973 "" "\"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)\"" "-"';

$str3='77.238.196.184 - - [01/Jan/2011:20:54:07 +0000] "GET /tools/port-scan/result/?scan_host=cairosat.zapto.org&ports=2000&portscansubmit=Scan&port_start=&port_end= HTTP/1.1" 200 9640 "http://www.t1shopper.com/tools/port-scan/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.2.8) Gecko/20100722 Firefox/2.0.0.3 \"MEGAUPLOAD 1.0\"" "-"';

$str4='91.55.106.90 - - [27/Dec/2010:17:16:07 +0000] "GET /ssi/t1shopper.js HTTP/1.1" 200 2012 "http://www.t1shopper.com/tools/port-scan/" "\"Bundestrojaner 2.0 - www.rettedeinefreiheit.de\"" "-"';

$str5='72.37.171.76 - - [22/Dec/2010:17:55:51 +0000] "GET /tools/calculate/ HTTP/1.0" 200 37778 "http://www.bing.com/search?q=\"kilobyte+to+megabyte\"+\"converter\"&src=IE-Address" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; InfoPath.2; MS-RTC LM 8)" "-"';

$log_pattern = '#^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^ ]+) ([^ ]+) ([^/]+)/([^"]+)" ([^ ]+) ([^ ]+) "([^"]*)" "([^"]+)" "([^"]+)"#';

preg_match($log_pattern, $str2, $matches);

print_r($matches);



PS: We've been working to get this Regex right for a few years but we keep having exceptions come up.  Here's the thread of past answers:
https://www.experts-exchange.com/questions/26028925/Help-fixing-my-Regex-pattern.html
https://www.experts-exchange.com/questions/26184204/Parsing-text-string-with-regex-preg-match.html
https://www.experts-exchange.com/questions/25968268/Help-fixing-my-Regex-pattern.html
https://www.experts-exchange.com/questions/23544714/Simple-Regex-Question.html
kaufmed

Well we can only go by what sample data you provide us  :)

This works with the above cases. Note, I've also included named capture groups. It should make extraction a tad more intuitive  :)
// Without "referrer"
$log_pattern = '#^(?<ipaddress>[^\s]+)\s*-\s*-\s*(?<accessdate>\[[^\]]+\])\s*(?<request>"[^"]+")\s*(?<response>[^\s]+)\s*([^\s]+)\s*"(?:(?<=\\\\)"|[^"])*"\s*(?<useragent>"(?:(?<=\\\\)"|[^"])*")\s*"[^"]*"$#';

// With "referrer"
$log_pattern = '#^(?<ipaddress>[^\s]+)\s*-\s*-\s*(?<accessdate>\[[^\]]+\])\s*(?<request>"[^"]+")\s*(?<response>[^\s]+)\s*([^\s]+)\s*"(?<referrer>(?:(?<=\\\\)"|[^"])*)"\s*(?<useragent>"(?:(?<=\\\\)"|[^"])*")\s*"[^"]*"$#';


Here's an example of accessing the named capture groups:

print "Access Date: " . $matches['accessdate'] . "<br />";
print "Request: " . $matches['request'] . "<br />";
print "Response: " . $matches['response'] . "<br />";


Geoff Millikan

ASKER

Touche! :-)

We shouldn't need two patterns because there are always 10 fields to parse out of the text string. Sometimes a field will be empty, like "", but there will always be 10 fields, as listed below.

That said, my version of PHP (5.1.6) isn't liking the question marks or angle brackets (I think that's the named capture group part?).
PHP Warning:  preg_match(): Compilation failed: unrecognized character after (?< at offset 4
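
A note on that warning: per the PHP manual, the (?<name>...) form of named subpatterns is accepted from PHP 5.2.2 onward, while earlier versions accept only (?P<name>...). The following is a minimal sketch of the older syntax, not a fix for the full pattern; the shortened pattern, the sample line, and the group names logname and user are chosen here purely for illustration.

```php
<?php
// Illustrative only: a shortened pattern that captures the first four fields.
// (?P<name>...) is the older named-group syntax; newer PCRE builds accept it too.
$line = '67.195.37.124 - - [20/Apr/2010:05:32:22 +0000] "GET /us/ga/b.html HTTP/1.0" 200 1761';
$pattern = '#^(?P<ipaddress>\S+) (?P<logname>\S+) (?P<user>\S+) \[(?P<accessdate>[^\]]+)\]#';

if (preg_match($pattern, $line, $matches)) {
    print "IP Address: " . $matches['ipaddress'] . "\n";
    print "Access Date: " . $matches['accessdate'] . "\n";
}
```

If named groups turn out to be unavailable altogether, plain numbered groups ($matches[1], $matches[2], ...) carry the same information.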



For the named capture groups, the fields in the text string are:
1. %h Remote host
2. %l Remote logname (from identd, if supplied). This will return a dash unless mod_ident is present and IdentityCheck is set On.
3. %u Remote user (from auth; may be bogus if return status (%s) is 401)
4. %t Time the request was received (standard English format)
5. \"%r\"  First line of request
6. %>s Status. For requests that got internally redirected, %s is the status of the *original* request and %>s is the status of the last.
7. %b Size of response in bytes, excluding HTTP headers. In CLF format, i.e. a '-' rather than a 0 when no bytes are sent.
8. \"%{Referer}i\"
9. \"%{User-Agent}i\"
10. \"%{PHPSESSID}C\"

I'm using the standard combined log format plus one extra field described on the Apache site here.

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{PHPSESSID}C\"" combinedcookie
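
Given that every line has exactly these ten fields, one possible approach (a sketch, not necessarily the accepted solution in this thread) is to use plain numbered groups, which avoids the named-group syntax problem, and to let each quoted field contain backslash-escaped characters such as \". The $quoted helper variable is introduced here for illustration; the pattern should match the five sample strings above, but it has not been tested against real log traffic.

```php
<?php
// One double-quoted field: any mix of characters that are neither a quote nor
// a backslash, or a backslash followed by any character (so \" does not end the field).
$quoted = '"((?:[^"\\\\]|\\\\.)*)"';

// Ten captures in LogFormat order:
// %h %l %u %t "%r" %>s %b "Referer" "User-Agent" "PHPSESSID"
$log_pattern = '#^(\S+) (\S+) (\S+) \[([^\]]+)\] ' . $quoted . ' (\S+) (\S+) '
             . $quoted . ' ' . $quoted . ' ' . $quoted . '$#';

$str2 = '67.225.164.12 - - [03/Jan/2011:21:15:39 +0000] "GET / HTTP/1.1" 200 8973 "" "\"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)\"" "-"';

if (preg_match($log_pattern, $str2, $matches)) {
    // $matches[1] is the remote host, $matches[10] is the PHPSESSID cookie value
    print_r($matches);
}
```

Unlike the original pattern, this keeps the whole request line in one capture ($matches[5]) rather than splitting it into method, path, and protocol; the escaped quotes stay in the captured values as \" and can be unescaped afterwards if needed.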
SOLUTION
kaufmed
ASKER CERTIFIED SOLUTION
Ray_Paseur: You and I came to the same result at the same time.  You hit the nail on the head, again.

Seems we were not the first ones to think of this:
http://serverfault.com/questions/192654/potential-issues-with-tab-delimited-logformat-for-apache-2-2x

For the record, we ended up solving this by making TAB the new delimiter and explode()'ing just like Ray suggested.  So the new config in Apache is:
LogFormat "%h\t%l\t%u\t%t\t%r\t%>s\t%b\t%{Referer}i\t%{User-Agent}i\t%{PHPSESSID}C" combinedcookie
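
For anyone following along, here is a minimal sketch of what the explode() side might look like for that tab-delimited format. The function name parse_tab_log_line, the field names, and the sample line are all made up for illustration. As far as I know, Apache 2.0.46 and later write whitespace inside %r and %{...}i values in C-style notation (a backslash followed by t), so a literal tab should not appear inside a field, but verify that against your own version.

```php
<?php
// Hypothetical helper: split one tab-delimited log line into named fields.
// Returns null when the line does not have exactly ten fields.
function parse_tab_log_line($line)
{
    $names = array('host', 'logname', 'user', 'time', 'request',
                   'status', 'bytes', 'referer', 'user_agent', 'phpsessid');
    $fields = explode("\t", rtrim($line, "\r\n"));
    if (count($fields) !== count($names)) {
        return null;
    }
    return array_combine($names, $fields);
}

// Made-up sample line in the tab-delimited layout (empty Referer field).
$line = "67.225.164.12\t-\t-\t[03/Jan/2011:21:15:39 +0000]\tGET / HTTP/1.1\t200\t8973\t\t"
      . "\\\"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)\\\"\t-\n";

$entry = parse_tab_log_line($line);
if ($entry !== null) {
    print $entry['host'] . " requested: " . $entry['request'] . "\n";
}
```

Because no regex is involved, embedded quotes in the Referer or User-Agent no longer matter; only the field count needs checking.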



Or for posterity, a tab-delimited version of the official NCSA extended/combined log format would be:
LogFormat "%h\t%l\t%u\t%t\t\"%r\"\t%>s\t%b\t\"%{Referer}i\"\t\"%{User-agent}i\""



http://httpd.apache.org/docs/current/mod/mod_log_config.html#logformat
Thanks for the points.  Glad you're onto a good solution, ~Ray