Regular Expression problem

Posted on 2006-03-30
Last Modified: 2008-01-09
I have a string like this: - - [27/Feb/2006:17:09:16 -0800] "GET /HealthCheck HTTP/1.1" 200 77 "-" "-" 1

I want to split it so that any space is treated as a delimiter, unless it is inside [ ] or "".
For the string above, when I call str.split("---some reg exp----");

I must get a string array with:
arr[0] =
arr[1] = -
arr[2] = -
arr[3] = 27/Feb/2006:17:09:16 -0800
arr[4] = GET /HealthCheck HTTP/1.1

Can someone provide the regular expression? I can't seem to work it out with the Regex Coach.
Question by:kannan_ekanath
    LVL 5

    Author Comment

    arr[5] = 200
    arr[6] = 77
    arr[7] = -
    arr[8] = -
    arr[9] = 1
    LVL 2

    Expert Comment

    I believe this would require a variable length negative lookbehind, which is unlikely to be supported by your Java.

    I'm not a master of regexes, so don't take this as the last word, but I am relatively confident.  What you want to do is match the space as the thing to break on, but once you've found one, look back and make sure you haven't passed an opening character (i.e. [ or ") without the corresponding closing character.  This would be a lookbehind assertion, and of variable length, as there could be any number of characters between the [ or " and the closing " or ].

    I think the following regex is "correct"; now I'm just trying to find someplace to test it... The outer quotation marks are not part of the regex, but are used to delimit it.

    "(?<!(?:\[[^\]]*|"[^"]*)) "
    LVL 5

    Author Comment

    Hmm, in case you need sample code:

    String str = " - - [27/Feb/2006:17:09:16 -0800] \"GET /HealthCheck HTTP/1.1\" 200 77 \"-\" \"-\" 1";
    String regEx = "(?<!(?:\\[[^\\]]*|\"[^\"]*))";
    String[] array = str.split(regEx);

    It says "java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 22".

    Can you help me?
    LVL 2

    Expert Comment

    hmm... that error actually sounds promising.  It seems like Java supports variable length lookbehind so long as there is a character cap.  You could try:


    (btw, good catch on the missing backslashes, I was testing in Perl and totally forgot about that).
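To illustrate the point about a character cap, here is a minimal sketch of a bounded lookbehind in Java. It handles the brackets only, not the quotes, and the cap of 100 characters, the class name and the method name are all assumptions made for the example.

```java
import java.util.Arrays;

class BoundedLookbehindSketch {
    // Split on a space unless an unclosed '[' sits at most 100 characters
    // before it. The cap of 100 is an assumed limit, not a value from the thread.
    static final String SPLIT_RE = "(?<!\\[[^\\]]{0,100}) ";

    static String[] splitOutsideBrackets(String line) {
        return line.split(SPLIT_RE);
    }

    public static void main(String[] args) {
        // The space inside the brackets is kept; the other two are delimiters.
        System.out.println(Arrays.toString(splitOutsideBrackets("a [b c] d")));
    }
}
```

The same trick does not carry over to quotes, because an opening and a closing quote are the same character and the lookbehind cannot tell them apart.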

    If that doesn't work (and even if it does), then here is a link to somebody way smarter than me giving a solution for the same problem in Java, but with just the brackets and not the quotes:

    Actually as I look over that, I realize that it couldn't be extended to quotes because it relies on the difference between an opening and a closing brace.

    Also, as I stare at it more, it becomes clear that the regex I gave must have some serious flaws dealing with the quotes as well.  I'll have to regroup and attack it again tomorrow...
    LVL 2

    Accepted Solution

    hmm... well I got ahold of Java and hacked at it a bit...  this seems to work:

    String regEx = " (?!(?:[^\\[\\]]*\\]|(?:[^\"]*\"[^\"]*\")*[^\"]*\"[^\"]*$))";

    This one has all the backslashes and such, so be sure to use it as-is so you don't miss anything. I know it looks like a mess, but it's actually all there for a reason.  The code before the bar "|" is the stuff from the link to eliminate the space inside the brackets.  The stuff after the bar looks for an odd number of quote marks after the current space by pulling out pairs and then finding only one left.  There's probably a shorter way to do it, but it's not coming to me.
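As a quick check of the explanation above, this sketch runs that regex against the sample line from the question. The class and method names are made up for the example. Note that the brackets and quotes stay attached to the pieces, since split only removes the delimiting spaces.

```java
class AcceptedSplitDemo {
    // The regex from this answer: a space that is NOT followed by a closing
    // bracket (with no bracket in between) or by an odd number of quotes.
    static final String REG_EX =
        " (?!(?:[^\\[\\]]*\\]|(?:[^\"]*\"[^\"]*\")*[^\"]*\"[^\"]*$))";

    static String[] splitLogLine(String line) {
        return line.split(REG_EX);
    }

    public static void main(String[] args) {
        String str = " - - [27/Feb/2006:17:09:16 -0800] \"GET /HealthCheck HTTP/1.1\" 200 77 \"-\" \"-\" 1";
        String[] array = splitLogLine(str);
        for (int i = 0; i < array.length; i++) {
            System.out.println("arr[" + i + "] = " + array[i]);
        }
    }
}
```

For this line the split yields ten pieces, with an empty arr[0] because the line starts with a delimiter.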

    Also, for your print statement in your test program, I think you need to use:

        for(String piece:array)
            System.out.println(piece);
    LVL 5

    Author Comment

    Works :) I know life will get difficult the moment my requirement changes even *SLIGHTLY*.
    LVL 5

    Author Comment

    Oops, is it possible to fix one more problem? The idea is that a quote inside a quoted field is not a delimiter. For example:
    String line = " - - [17/Mar/2006:12:40:19 -0800] \"GET /goto/orderpreview?orderIdParam=8340973848805245&configurationNumber=255281&navPurchaseOrderNo=78\"FG%20-%20DOYLE HTTP/1.1\" 302 481 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)\" 0";

    The above line throws an index out of bounds exception. I would expect:
    array[0] =
    array[1] = -
    array[2] = -
    array[3] = 17/Mar/2006:12:40:19 -0800
    array[4] = GET /goto/orderpreview?orderIdParam=8340973848805245&configurationNumber=255281&navPurchaseOrderNo=78"FG%20-%20DOYLE HTTP/1.1
    array[5] = 302
    array[6] = 481
    array[7] = -
    array[8] = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
    array[9] = 0

    The problem here is with array[4], because it contains a quote internally :( I think it is kind of difficult, but please give it a try.
    LVL 2

    Expert Comment

    Hey, sorry for the slow reply, I wasn't checking up on EE.

    I have two questions about this error:

    1. Why is there a quote in the URL?  This should be converted to %22
    2. How can I tell the quote that is a delimiter apart from the one that isn't?

    Also, the matching isn't what causes the error; the split just returns fewer elements than you expect, so if you try to read 10 items without counting how many are really there, you'll get the error.
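A minimal sketch of that point, assuming the split regex from the accepted answer and an expected count of ten fields (the class name, method name and constant are made up for the example): check the length before indexing, and set aside lines that do not split as expected.

```java
class SafeFieldAccess {
    static final String REG_EX =
        " (?!(?:[^\\[\\]]*\\]|(?:[^\"]*\"[^\"]*\")*[^\"]*\"[^\"]*$))";
    // Assumed number of fields in a well-formed line, taken from the question.
    static final int EXPECTED_FIELDS = 10;

    // Returns the fields, or null when the line does not split into the
    // expected number of pieces (for example because of a stray quote).
    static String[] splitOrNull(String line) {
        String[] fields = line.split(REG_EX);
        return fields.length == EXPECTED_FIELDS ? fields : null;
    }

    public static void main(String[] args) {
        String line = " - - [17/Mar/2006:12:40:19 -0800] \"GET /goto/orderpreview?orderIdParam=8340973848805245&configurationNumber=255281&navPurchaseOrderNo=78\"FG%20-%20DOYLE HTTP/1.1\" 302 481 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)\" 0";
        String[] fields = splitOrNull(line);
        if (fields == null) {
            // The stray quote changes the quote parity, so the count is off.
            System.out.println("Skipping malformed line, got " + line.split(REG_EX).length + " pieces");
        } else {
            System.out.println("Status = " + fields[5]);
        }
    }
}
```

With only a handful of such lines in a large log, skipping or logging them may be simpler than making the regex cope with them.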

    As a completely roundabout answer to the question, you might consider using a different approach that takes in a little more of the context.  This would've been my first thought in Perl, but it took a little looking around to figure out how to do it in Java.

    The program below was able to successfully figure out that the enclosed quote wasn't a delimiter because the rest of the pattern wouldn't work out in that case.  Even if you put a space after the quote it is still clever/robust enough to figure it out. The other advantage is that it's a much more linear pattern, so it's far easier to break into pieces and comprehend. I think it's probably slower, but I couldn't say for sure.

    Some extra random thoughts:
    * Most of the pattern strings have a trailing space embedded in them because I thought the line putting them all together looked uglier if I had to put the spaces there.
    * The overall pattern involves a fair amount of guessing.  For example, the 2nd and 3rd items are left as matching any string because I have no idea what they are.

    import java.util.regex.*;

    public class Test2 {
      public static void main(String[] args) {
        String str = " - - [17/Mar/2006:12:40:19 -0800] \"GET /goto/orderpreview?orderIdParam=8340973848805245&configurationNumber=255281&navPurchaseOrderNo=78\"FG%20-%20DOYLE HTTP/1.1\" 302 481 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)\" 0";
        String regEx;

        // Each piece captures one field; most end with the space that separates fields.
        String ipAddress, bracketed, quoted, any_text, number, number_nospace;
        // The address is optional here because the sample line starts with an empty first field.
        ipAddress = "((?:\\d+\\.\\d+\\.\\d+\\.\\d+)?) ";
        bracketed = "(\\[.*\\]) ";
        quoted = "(\".*\") ";
        number = "(\\d+) ";
        number_nospace = "(\\d+)";
        any_text = "(\\S+) ";

        regEx = ipAddress + any_text + any_text + bracketed + quoted + number + number + quoted + quoted + number_nospace;

        Pattern p = Pattern.compile(regEx);
        Matcher m = p.matcher(str);
        // Print every captured field if the whole pattern fits the line.
        if (m.find()) {
          for(int i=1; i<=m.groupCount(); i++)
            System.out.println(i + ": " + m.group(i));
        }
      }
    }

    LVL 86

    Expert Comment

    JoshdanG is right - there shouldn't *be* a quote in the URL

                // [\\d.]* (not +) lets the empty first field of the sample match;
                // ([^:]+) stops the date group from swallowing part of the time.
                String LOG_FILE_RE = "^([\\d.]*) (.+) (.+) \\[([^:]+):(.+) (.+)\\] \"(.+) (.+) (.+)\" (\\d+) (\\d+).*";
                String s = " - - [27/Feb/2006:17:09:16 -0800] \"GET /HealthCheck HTTP/1.1\" 200 77 \"-\" \"-\" 1";
                Pattern p = Pattern.compile(LOG_FILE_RE);
                Matcher m = p.matcher(s);
                if (m.matches()) {
                      System.out.printf("Site = %s\n", m.group(1));
                      System.out.printf("Log name = %s\n", m.group(2));
                      System.out.printf("Full name = %s\n", m.group(3));
                      System.out.printf("Date = %s\n", m.group(4));
                      System.out.printf("Time = %s\n", m.group(5));
                      System.out.printf("GMT  offset = %s\n", m.group(6));
                      System.out.printf("Request = %s\n", m.group(7));
                      System.out.printf("File = %s\n", m.group(8));
                      System.out.printf("Protocol = %s\n", m.group(9));
                      System.out.printf("Status = %s\n", m.group(10));
                      System.out.printf("Length = %s\n", m.group(11));
                }
    LVL 5

    Author Comment

    Thanks Joshdan G,
    I think the access logs are showing up the quote in the url for some reason. But the ratio is negligible (just 12 records out of 3 million)
    Thanks :)
