Regular Expression problem

Hi,
I have a string like this:
192.168.31.6 - - [27/Feb/2006:17:09:16 -0800] "GET /HealthCheck HTTP/1.1" 200 77 "-" "-" 1

I want to split it so that any space is treated as a delimiter, unless it is inside [ ] or "".
For the string above, when I call str.split("---some reg exp----");

I must get a string array with:
arr[0] = 192.168.31.6
arr[1] = -
arr[2] = -
arr[3] = 27/Feb/2006:17:09:16 -0800
arr[4] = GET /HealthCheck HTTP/1.1

Can someone provide the regular expression? I can't seem to work it out using The Regex Coach.
kannan_ekanath Asked:
kannan_ekanath (Author) Commented:
arr[5] = 200
arr[6] = 77
arr[7] = -
arr[8] = -
arr[9] = 1
JoshdanG Commented:
I believe this would require a variable length negative lookbehind, which is unlikely to be supported by your Java.

I'm not a master of regexes, so don't take this as the last word, but I am relatively confident.  What you want to do is match the space as the thing to break on, but once you've found one, look back and make sure you haven't passed an opening character (i.e. [ or ") without the corresponding closing character.  This would be a lookbehind assertion, and of variable length, as there could be any number of characters between the [ or " and the closing " or ].

I think the following regex is "correct"; now I'm just trying to find someplace to test it... The outer quotation marks are not part of the regex, but are used to delimit it.

"(?<!(?:\[[^\]]*|"[^"]*)) "
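For comparison, here is a minimal sketch of a different technique that sidesteps lookbehind altogether: instead of describing the delimiter for split(), describe the tokens themselves and collect them with Matcher.find(). The class name LogTokenizer and the method tokenize are hypothetical, and the sketch assumes brackets and quotes are never nested or escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogTokenizer {
    // Alternatives are tried left to right at each position:
    // [bracketed text], "quoted text", or a run of non-space characters.
    private static final Pattern TOKEN =
        Pattern.compile("\\[([^\\]]*)\\]|\"([^\"]*)\"|(\\S+)");

    // Hypothetical helper: returns the fields with enclosing [ ] or " " removed.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                tokens.add(m.group(1));      // contents of [ ]
            } else if (m.group(2) != null) {
                tokens.add(m.group(2));      // contents of " "
            } else {
                tokens.add(m.group(3));      // plain token
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String str = "192.168.31.6 - - [27/Feb/2006:17:09:16 -0800] "
            + "\"GET /HealthCheck HTTP/1.1\" 200 77 \"-\" \"-\" 1";
        for (String token : tokenize(str)) {
            System.out.println(token);
        }
    }
}
```

Because the capture groups sit inside the delimiters, the brackets and quotes are dropped from the results, which matches the array layout asked for in the question.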
kannan_ekanath (Author) Commented:
Hmm, in case you need sample code:

String str = "192.168.31.6 - - [27/Feb/2006:17:09:16 -0800] \"GET /HealthCheck HTTP/1.1\" 200 77 \"-\" \"-\" 1";
String regEx = "(?<!(?:\\[[^\\]]*|\"[^\"]*))";
String[] array = str.split(regEx);
System.out.println(array);

It fails with:
java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 22
(?<!(?:\[[^\]]*|"[^"]*))

Can you help me?
JoshdanG Commented:
hmm... that error actually sounds promising.  It seems like Java supports variable length lookbehind so long as there is a character cap.  You could try:

"(?<!(?:\\[[^\\]]{0,255}|\"[^\"]{0,255})) "

(btw, good catch on the missing backslashes, I was testing in Perl and totally forgot about that).

If that doesn't work (and even if it does), then here is a link to somebody way smarter than me giving a solution for the same problem in Java, but with just the brackets and not the quotes:

http://saloon.javaranch.com/cgi-bin/ubb/ultimatebb.cgi?ubb=get_topic&f=1&t=007779#000004

Actually as I look over that, I realize that it couldn't be extended to quotes because it relies on the difference between an opening and a closing brace.

Also, as I stare at it more, it becomes clear that the regex I gave must have some serious flaws dealing with the quotes as well.  I'll have to regroup and attack it again tomorrow...
JoshdanG Commented:
hmm... well I got ahold of Java and hacked at it a bit...  this seems to work:

String regEx = " (?!(?:[^\\[\\]]*\\]|(?:[^\"]*\"[^\"]*\")*[^\"]*\"[^\"]*$))";

This one has all the backslashes and such, so be sure to use it as-is so you don't miss anything. I know it looks like a mess, but it's actually all there for a reason.  The code before the bar "|" is the stuff from the link to eliminate the space inside the brackets.  The stuff after the bar looks for an odd number of quote marks after the current space by pulling out pairs and then finding only one left.  There's probably a shorter way to do it, but it's not coming to me.

Also, for your print statement in your test program, I think you need to use:

    for(String piece:array)
      System.out.println(piece);
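
To see the pattern in action, here is a small sketch that runs it on the sample line from the question. Note that split() leaves the enclosing [ ] and " " on the pieces, so the hypothetical helper strip (not part of the answer above) removes one enclosing pair to get the array the question describes. The class name SplitDemo and the method splitLine are also just illustrative.

```java
public class SplitDemo {
    // The pattern from the post above, unchanged: a space that is not followed by
    // a closing ] (inside brackets) or by an odd number of quotes (inside quotes).
    static final String REGEX =
        " (?!(?:[^\\[\\]]*\\]|(?:[^\"]*\"[^\"]*\")*[^\"]*\"[^\"]*$))";

    // Hypothetical helper: drop one enclosing [ ] or " " pair, if present.
    static String strip(String s) {
        int n = s.length();
        if (n >= 2) {
            char first = s.charAt(0);
            char last = s.charAt(n - 1);
            if ((first == '[' && last == ']') || (first == '"' && last == '"')) {
                return s.substring(1, n - 1);
            }
        }
        return s;
    }

    // Split the line with REGEX, then strip the delimiters from each piece.
    static String[] splitLine(String line) {
        String[] parts = line.split(REGEX);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = strip(parts[i]);
        }
        return parts;
    }

    public static void main(String[] args) {
        String str = "192.168.31.6 - - [27/Feb/2006:17:09:16 -0800] "
            + "\"GET /HealthCheck HTTP/1.1\" 200 77 \"-\" \"-\" 1";
        String[] array = splitLine(str);
        for (int i = 0; i < array.length; i++) {
            System.out.println("arr[" + i + "] = " + array[i]);
        }
    }
}
```

On the sample line this gives ten pieces, with arr[3] and arr[4] holding the date and the request without their brackets and quotes.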
kannan_ekanath (Author) Commented:
Works :) I know life will get difficult the moment my requirements change even *SLIGHTLY*.
kannan_ekanath (Author) Commented:
Oops, is it possible to fix one more problem? The idea is that a quote inside a quoted field is not a delimiter. For example:
String line = "205.179.143.234 - - [17/Mar/2006:12:40:19 -0800] \"GET /goto/orderpreview?orderIdParam=8340973848805245&configurationNumber=255281&navPurchaseOrderNo=78\"FG%20-%20DOYLE HTTP/1.1\" 302 481 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)\" 0";


The above line throws an index-out-of-bounds exception. I would expect:
array[0] = 205.179.143.234
array[1] = -
array[2] = -
array[3] = 17/Mar/2006:12:40:19 -0800
array[4] = GET /goto/orderpreview?orderIdParam=8340973848805245&configurationNumber=255281&navPurchaseOrderNo=78"FG%20-%20DOYLE HTTP/1.1
array[5] = 302
array[6] = 481
array[7] = -
array[8] = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
array[9] = 0

The problem here is with array[4], because it contains a quote internally :( I think this one is kind of difficult, but please give it a try.
JoshdanG Commented:
Hey, sorry for the slow reply, I wasn't checking up on EE.

I have two questions about this error:

1. Why is there a quote in the URL?  This should be converted to %22
2. How can I tell the quote that is a delimiter apart from the one that isn't?

Also, the matching isn't what causes the error. The split just returns fewer elements than you expect, so if you try to read 10 items without checking how many are really there, you'll get the exception.
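
One way to guard against that, sketched under the assumption that a well-formed line always has exactly 10 fields, is to check the length before indexing. The names LengthCheck, parseOrNull and EXPECTED_FIELDS are hypothetical; the pattern is the split regex from earlier in the thread.

```java
public class LengthCheck {
    // Split pattern from earlier in the thread.
    static final String REGEX =
        " (?!(?:[^\\[\\]]*\\]|(?:[^\"]*\"[^\"]*\")*[^\"]*\"[^\"]*$))";

    // Assumption: a well-formed log line has exactly this many fields.
    static final int EXPECTED_FIELDS = 10;

    // Returns the fields, or null when the line does not split into
    // the expected number of pieces (for example, a stray quote in the URL).
    static String[] parseOrNull(String line) {
        String[] parts = line.split(REGEX);
        return parts.length == EXPECTED_FIELDS ? parts : null;
    }

    public static void main(String[] args) {
        String line = "205.179.143.234 - - [17/Mar/2006:12:40:19 -0800] "
            + "\"GET /goto/orderpreview?orderIdParam=8340973848805245"
            + "&configurationNumber=255281&navPurchaseOrderNo=78\"FG%20-%20DOYLE HTTP/1.1\" "
            + "302 481 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; "
            + ".NET CLR 1.1.4322)\" 0";
        String[] fields = parseOrNull(line);
        if (fields == null) {
            // Report the line instead of indexing past the end of the array.
            System.out.println("Skipping malformed line: " + line);
        } else {
            System.out.println("Status = " + fields[5]);
        }
    }
}
```

The stray quote makes the total number of quotes odd, so the spaces before the first quote are no longer treated as delimiters and the line comes back with fewer than 10 pieces.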

As a completely roundabout answer to the question, you might consider using a different approach that takes in a little more of the context.  This would've been my first thought in Perl, but it took a little looking around to figure out how to do it in Java.

The program below was able to successfully figure out that the enclosed quote wasn't a delimiter because the rest of the pattern wouldn't work out in that case.  Even if you put a space after the quote it is still clever/robust enough to figure it out. The other advantage is that it's a much more linear pattern, so it's far easier to break into pieces and comprehend. I think it's probably slower, but I couldn't say for sure.

Some extra random thoughts:
* Most of the pattern strings have a trailing space embedded in them because I thought the line putting them all together looked uglier if I had to put the spaces there.
* The overall pattern involves a fair amount of guessing.  For example, the 2nd and 3rd items are left as matching any string because I have no idea what they are.

import java.util.regex.*;
public class Test2
{
 
  public static void main(String[] args){  
    String str = "205.179.143.234 - - [17/Mar/2006:12:40:19 -0800] \"GET /goto/orderpreview?orderIdParam=8340973848805245&configurationNumber=255281&navPurchaseOrderNo=78\"FG%20-%20DOYLE HTTP/1.1\" 302 481 \"-\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)\" 0";
    String regEx;

    String ipAddress, bracketed, quoted, any_text, number, number_nospace;
    ipAddress = "(\\d+\\.\\d+\\.\\d+\\.\\d+) ";
    bracketed = "(\\[.*\\]) ";
    quoted = "(\".*\") ";
    number = "(\\d+) ";
    number_nospace = "(\\d+)";    
    any_text = "(\\S+) ";

    regEx = ipAddress + any_text + any_text + bracketed + quoted + number + number + quoted + quoted + number_nospace;

    Pattern p = Pattern.compile(regEx);
    Matcher m = p.matcher(str);
   
    if(m.matches())
      for(int i=1; i<=m.groupCount(); i++)
        System.out.println(m.group(i));
  }

}
CEHJ Commented:
JoshdanG is right - there shouldn't *be* a quote in the URL


            String LOG_FILE_RE = "^([\\d.]+) (.+) (.+) \\[(.+):(.+) (.+)\\] \"(.+) (.+) (.+)\" (\\d+) (\\d+).*";
            String s = "192.168.31.6 - - [27/Feb/2006:17:09:16 -0800] \"GET /HealthCheck HTTP/1.1\" 200 77 \"-\" \"-\" 1";
            Pattern p = Pattern.compile(LOG_FILE_RE);
            Matcher m = p.matcher(s);
            if (m.matches()) {
                  System.out.printf("Site = %s\n", m.group(1));
                  System.out.printf("Log name = %s\n", m.group(2));
                  System.out.printf("Full name = %s\n", m.group(3));
                  System.out.printf("Date = %s\n", m.group(4));
                  System.out.printf("Time = %s\n", m.group(5));
                  System.out.printf("GMT  offset = %s\n", m.group(6));
                  System.out.printf("Request = %s\n", m.group(7));
                  System.out.printf("File = %s\n", m.group(8));
                  System.out.printf("Protocol = %s\n", m.group(9));
                  System.out.printf("Status = %s\n", m.group(10));
                  System.out.printf("Length = %s\n", m.group(11));
            }
kannan_ekanath (Author) Commented:
Thanks Joshdan G,
I think the access logs are recording the quote in the URL for some reason, but the ratio is negligible (just 12 records out of 3 million).
Thanks :)