Solved

Regular Expression problem

Posted on 2006-03-30
Last Modified: 2008-01-09
Hi,
I have a string like this:
192.168.31.6 - - [27/Feb/2006:17:09:16 -0800] "GET /HealthCheck HTTP/1.1" 200 77 "-" "-" 1

I want to split it so that any space is treated as a delimiter, unless it is inside [ ] or "".
For the line above, when I call str.split("---some reg exp----");

I must get a string array with:
arr[0] = 192.168.31.6
arr[1] = -
arr[2] = -
arr[3] = 27/Feb/2006:17:09:16 -0800
arr[4] = GET /HealthCheck HTTP/1.1

Can someone provide the regular expression? I can't seem to find it using the Regex Coach.
Question by:kannan_ekanath
 
LVL 5

Author Comment

by:kannan_ekanath
ID: 16339943
arr[5] = 200
arr[6] = 77
arr[7] = -
arr[8] = -
arr[9] = 1
 
LVL 2

Expert Comment

by:JoshdanG
ID: 16340316
I believe this would require a variable length negative lookbehind, which is unlikely to be supported by your Java.

I'm not a master of regexes, so don't take me as the last word, but I am relatively confident.  What you want to do is match the space as the thing to break on, but once you've found one, look back and make sure you haven't passed an opening character (i.e. [ or ") without the corresponding closing character.  This would be a lookbehind assertion, and of variable length, as there could be any number of characters between the [ or " and the closing " or ].

I think the following regex is "correct"; now I'm just trying to find someplace to test it... The outer quotation marks are not part of the regex, but are used to delimit it.

"(?<!(?:\[[^\]]*|"[^"]*)) "
 
LVL 5

Author Comment

by:kannan_ekanath
ID: 16340336
Hmm, in case you need sample code:

String str = "192.168.31.6 - - [27/Feb/2006:17:09:16 -0800] \"GET /HealthCheck HTTP/1.1\" 200 77 \"-\" \"-\" 1";
String regEx = "(?<!(?:\\[[^\\]]*|\"[^\"]*))";
String[] array = str.split(regEx);
System.out.println(array);

It says "java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 22
(?<!(?:\[[^\]]*|"[^"]*))"

Can you help me?
 
LVL 2

Expert Comment

by:JoshdanG
ID: 16340707
Hmm... that error actually sounds promising.  It seems Java supports variable-length lookbehind so long as there is a character cap.  You could try:

"(?<!(?:\\[[^\\]]{0,255}|\"[^\"]{0,255})) "

(btw, good catch on the missing backslashes, I was testing in Perl and totally forgot about that).

If that doesn't work (and even if it does), then here is a link to somebody way smarter than me giving a solution for the same problem in Java, but with just the brackets and not the quotes:

http://saloon.javaranch.com/cgi-bin/ubb/ultimatebb.cgi?ubb=get_topic&f=1&t=007779#000004

Actually as I look over that, I realize that it couldn't be extended to quotes because it relies on the difference between an opening and a closing brace.

Also, as I stare at it more, it becomes clear that the regex I gave must have some serious flaws dealing with the quotes as well.  I'll have to regroup and attack it again tomorrow...
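For anyone following along, here is a minimal sketch for checking whether a given lookbehind is accepted by java.util.regex at all. The class name LookbehindCheck and the helper compiles() are made up for illustration. The sketch only reports whether each pattern compiles; it says nothing about whether the split result is correct. The unbounded form was rejected in the asker's environment, and other Java versions may behave differently, so the outcome is printed rather than assumed.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class LookbehindCheck {
    // Hypothetical helper: true if java.util.regex accepts the pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() gives the reason without the full pattern dump.
            System.out.println("Rejected: " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        // Unbounded lookbehind as first posted (Java string escaping added).
        String unbounded = "(?<!(?:\\[[^\\]]*|\"[^\"]*)) ";
        // Bounded variant: {0,255} gives each alternative a maximum length.
        String bounded = "(?<!(?:\\[[^\\]]{0,255}|\"[^\"]{0,255})) ";

        System.out.println("unbounded compiles: " + compiles(unbounded));
        System.out.println("bounded compiles:   " + compiles(bounded));
    }
}
```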
 
LVL 23

Expert Comment

by:Siva Prasanna Kumar
ID: 16344555
 
LVL 2

Accepted Solution

by:
JoshdanG earned 2000 total points
ID: 16348514
hmm... well I got ahold of Java and hacked at it a bit...  this seems to work:

String regEx = " (?!(?:[^\\[\\]]*\\]|(?:[^\"]*\"[^\"]*\")*[^\"]*\"[^\"]*$))";

This one has all the backslashes and such, so be sure to use it as-is so you don't miss anything. I know it looks like a mess, but it's actually all there for a reason.  The code before the bar "|" is the stuff from the link to eliminate the space inside the brackets.  The stuff after the bar looks for an odd number of quote marks after the current space by pulling out pairs and then finding only one left.  There's probably a shorter way to do it, but it's not coming to me.

Also, for your print statement in your test program, I think you need to use:

    for(String piece:array)
      System.out.println(piece);
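Putting the pieces above together, a small self-contained sketch might look like the following. The regex is the one given above, unchanged. The class name LogSplitDemo and the strip() helper are additions for illustration: split() leaves the enclosing [ ] and " " on each token, so strip() removes one enclosing pair to get the array the question asked for.

```java
import java.util.regex.Pattern;

class LogSplitDemo {
    // A space that is NOT followed by (a) a closing ']' with no bracket in
    // between, or (b) an odd number of quote marks before the end of the line.
    static final Pattern DELIM = Pattern.compile(
        " (?!(?:[^\\[\\]]*\\]|(?:[^\"]*\"[^\"]*\")*[^\"]*\"[^\"]*$))");

    // Hypothetical helper: drop one enclosing pair of [ ] or " " if present.
    static String strip(String s) {
        if (s.length() >= 2) {
            char first = s.charAt(0);
            char last = s.charAt(s.length() - 1);
            if ((first == '[' && last == ']') || (first == '"' && last == '"')) {
                return s.substring(1, s.length() - 1);
            }
        }
        return s;
    }

    static String[] split(String line) {
        String[] parts = DELIM.split(line);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = strip(parts[i]);
        }
        return parts;
    }

    public static void main(String[] args) {
        String str = "192.168.31.6 - - [27/Feb/2006:17:09:16 -0800] "
            + "\"GET /HealthCheck HTTP/1.1\" 200 77 \"-\" \"-\" 1";
        String[] array = split(str);
        for (int i = 0; i < array.length; i++) {
            System.out.println("arr[" + i + "] = " + array[i]);
        }
    }
}
```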
 
LVL 5

Author Comment

by:kannan_ekanath
ID: 16357996
Works :) I know life will get difficult the moment my requirements change even *SLIGHTLY*.
 
LVL 5

Author Comment

by:kannan_ekanath
ID: 16358257
Oops, is it possible to fix one more problem? The idea is that a quote inside a quoted section is not a delimiter. For example:
String line = "205.179.143.234 - - [17/Mar/2006:12:40:19 -0800] \"GET /goto/orderpreview?orderIdParam=8340973848805245&configurationNumber=255281&navPurchaseOrderNo=78\"FG%20-%20DOYLE HTTP/1.1\" 302 481 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)\" 0";


The line above throws an index-out-of-bounds exception. I would expect:
array[0] = 205.179.143.234
array[1] = -
array[2] = -
array[3] = 17/Mar/2006:12:40:19 -0800
array[4] = GET /goto/orderpreview?orderIdParam=8340973848805245&configurationNumber=255281&navPurchaseOrderNo=78"FG%20-%20DOYLE HTTP/1.1
array[5] = 302
array[6] = 481
array[7] = -
array[8] = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
array[9] = 0

The problem here is with array[4], because it contains a quote internally :( I think it is kind of difficult, but just give it a try.
 
LVL 2

Expert Comment

by:JoshdanG
ID: 16397606
Hey, sorry for the slow reply, I wasn't checking up on EE.

I have two questions about this error:

1. Why is there a quote in the URL?  This should be converted to %22
2. How can I tell the quote that is a delimiter apart from the one that isn't?

Also, the matching isn't what causes the error; it just returns fewer elements than you expect, so if you try to read 10 items without checking how many are really there, you'll get the error.
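To make that last point concrete, here is a rough sketch of checking the token count before indexing. It reuses the split regex from earlier in the thread. The class name TokenCountCheck, the helper splitOrNull(), and the constant EXPECTED_FIELDS (10, taken from the expected arrays listed in the question) are assumptions for illustration.

```java
import java.util.regex.Pattern;

class TokenCountCheck {
    // Split regex from earlier in this thread, unchanged.
    static final Pattern DELIM = Pattern.compile(
        " (?!(?:[^\\[\\]]*\\]|(?:[^\"]*\"[^\"]*\")*[^\"]*\"[^\"]*$))");

    // Assumption: a well-formed line has 10 fields, as in the question.
    static final int EXPECTED_FIELDS = 10;

    // Returns the tokens, or null when the line does not split into the
    // expected number of fields (for example, a stray quote in the URL).
    static String[] splitOrNull(String line) {
        String[] tokens = DELIM.split(line);
        return tokens.length == EXPECTED_FIELDS ? tokens : null;
    }

    public static void main(String[] args) {
        String bad = "205.179.143.234 - - [17/Mar/2006:12:40:19 -0800] "
            + "\"GET /goto/orderpreview?orderIdParam=8340973848805245"
            + "&configurationNumber=255281&navPurchaseOrderNo=78\"FG%20-%20DOYLE HTTP/1.1\" "
            + "302 481 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; "
            + ".NET CLR 1.1.4322)\" 0";
        String[] tokens = splitOrNull(bad);
        if (tokens == null) {
            System.out.println("Unexpected field count, skipping line");
        } else {
            System.out.println("Status = " + tokens[5]);
        }
    }
}
```

Returning null keeps the sketch short; logging the offending line or counting rejects would work just as well.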

As a completely roundabout answer to the question, you might consider using a different approach that takes in a little more of the context.  This would've been my first thought in Perl, but it took a little looking around to figure out how to do it in Java.

The program below was able to successfully figure out that the enclosed quote wasn't a delimiter because the rest of the pattern wouldn't work out in that case.  Even if you put a space after the quote it is still clever/robust enough to figure it out. The other advantage is that it's a much more linear pattern, so it's far easier to break into pieces and comprehend. I think it's probably slower, but I couldn't say for sure.

Some extra random thoughts:
* Most of the pattern strings have a trailing space embedded in them because I thought the line putting them all together looked uglier if I had to put the spaces there.
* The overall pattern involves a fair amount of guessing.  For example, the 2nd and 3rd items are left as matching any string because I have no idea what they are.

import java.util.regex.*;
public class Test2
{
 
  public static void main(String[] args){  
    String str = "205.179.143.234 - - [17/Mar/2006:12:40:19 -0800] \"GET /goto/orderpreview?orderIdParam=8340973848805245&configurationNumber=255281&navPurchaseOrderNo=78\"FG%20-%20DOYLE HTTP/1.1\" 302 481 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)\" 0";
    String regEx;

    String ipAddress, bracketed, quoted, any_text, number, number_nospace;
    ipAddress = "(\\d+\\.\\d+\\.\\d+\\.\\d+) ";
    bracketed = "(\\[.*\\]) ";
    quoted = "(\".*\") ";
    number = "(\\d+) ";
    number_nospace = "(\\d+)";    
    any_text = "(\\S+) ";

    regEx = ipAddress + any_text + any_text + bracketed + quoted + number + number + quoted + quoted + number_nospace;

    Pattern p = Pattern.compile(regEx);
    Matcher m = p.matcher(str);
   
    if(m.matches())
      for(int i=1; i<=m.groupCount(); i++)
        System.out.println(m.group(i));
  }

}
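One possible variation on the program above, in case you want the fields without the enclosing brackets and quotes (as in the expected array from the question): move the capture parentheses inside the delimiters. This is only a sketch along the same lines. The class name FieldMatchDemo and the helper fields() are made up, and the field layout is still the same guess as before.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldMatchDemo {
    // Same linear layout as the program above, but each group captures only
    // the text between the [ ] or " " delimiters.
    static final Pattern LINE = Pattern.compile(
          "(\\d+\\.\\d+\\.\\d+\\.\\d+) "  // IP address
        + "(\\S+) (\\S+) "                // two unknown fields
        + "\\[(.*)\\] "                   // bracketed timestamp
        + "\"(.*)\" "                     // quoted request
        + "(\\d+) (\\d+) "                // two numbers
        + "\"(.*)\" \"(.*)\" "            // two quoted fields
        + "(\\d+)");                      // trailing number

    // Returns the 10 captured fields, or null if the line does not match.
    static String[] fields(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "205.179.143.234 - - [17/Mar/2006:12:40:19 -0800] "
            + "\"GET /goto/orderpreview?orderIdParam=8340973848805245"
            + "&configurationNumber=255281&navPurchaseOrderNo=78\"FG%20-%20DOYLE HTTP/1.1\" "
            + "302 481 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; "
            + ".NET CLR 1.1.4322)\" 0";
        String[] f = fields(str);
        if (f != null) {
            for (int i = 0; i < f.length; i++) {
                System.out.println("array[" + i + "] = " + f[i]);
            }
        }
    }
}
```

Because matches() has to account for the whole line, the greedy .* in the request group backs off until the remaining fields fit, which is how the embedded quote ends up inside the request field.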
 
LVL 86

Expert Comment

by:CEHJ
ID: 16398436
JoshdanG is right - there shouldn't *be* a quote in the URL


            String LOG_FILE_RE = "^([\\d.]+) (.+) (.+) \\[(.+):(.+) (.+)\\] \"(.+) (.+) (.+)\" (\\d+) (\\d+).*";
            String s = "192.168.31.6 - - [27/Feb/2006:17:09:16 -0800] \"GET /HealthCheck HTTP/1.1\" 200 77 \"-\" \"-\" 1";
            Pattern p = Pattern.compile(LOG_FILE_RE);
            Matcher m = p.matcher(s);
            if (m.matches()) {
                  System.out.printf("Site = %s\n", m.group(1));
                  System.out.printf("Log name = %s\n", m.group(2));
                  System.out.printf("Full name = %s\n", m.group(3));
                  System.out.printf("Date = %s\n", m.group(4));
                  System.out.printf("Time = %s\n", m.group(5));
                  System.out.printf("GMT  offset = %s\n", m.group(6));
                  System.out.printf("Request = %s\n", m.group(7));
                  System.out.printf("File = %s\n", m.group(8));
                  System.out.printf("Protocol = %s\n", m.group(9));
                  System.out.printf("Status = %s\n", m.group(10));
                  System.out.printf("Length = %s\n", m.group(11));
            }
 
LVL 5

Author Comment

by:kannan_ekanath
ID: 16398499
Thanks JoshdanG,
I think the access logs are showing the quote in the URL for some reason, but the ratio is negligible (just 12 records out of 3 million).
Thanks :)