Link to home
Start Free TrialLog in
Avatar of mbormann
mbormann

asked on

need help in fast parsing

hello,
I have got a file under 100KB guarenteed and I have got to pick up parts of it and store some parts,lets say structure is

START=TYPE1=BLAH
...
...
END=TYPE1=BLAH
...
...
START=TYPE2=BLAH
...
...
END=TYPE2=BLAH

It is always guaranteed that one type of block is contiguos i.e there will be no overlapping START,END blocks.
But there are millions of 'StART=' blocks

I have done using simple indexOf() and also teh Boyer-Moore algorithm from http://mindprod.com/products.html
search on keyword 'search' :-)

But i finally think that I will have to use something like this
http://www.cs.princeton.edu/~appel/modern/java/JLex/

Now has anybody used something like this?I think this is what Lex and Yacc r abt.

I was thinking of searching for these 'START=' using this RegEx stuff and then sometimes depending on some conditions, I have store this chunk of data between the delimiters in a Data Structure in Memory.
Right now I have done hard coding and storing then as a String in a pre defined string Array.
I feel that am going astray and would like some advice before going completely bonkers.
Can't use readLine() since I have to pick up the data and store it for later modification

Any suggestions in how to implement using the regex stuff would be welcome

is this of any help?
http://www.quateams.com/oro/software/OROMatcher1.1.html

the exact structure of record is something like this with attached algorithm using this regex code AND sample data file.

basically I have a record like this
START=TYPE1=200=2    ---->header
.......
.......
.......  -----> body
.......
END=TYPE1=200=2      ------> footer

which gives me output like
'START=TYPE1=' only doesnt give me the '200=2' at end

my code will go like

while(matcher doesnt rech end of file)
{
   (1)look at header of each record ,a sort of look ahead

   (2) if(Type 1)
        {
              if(body also present)
              {
                   pick up whole record & process
              }
              else
              {
                   discard record
              }
        }
        else if(Type 2)
        {
              if(body also present)
              {
                   pick up whole record & process
              }
              else
              {
                   discard record
              }
        }
        else if (type 3)
        {
              if(body also present)
              {
                   pick up whole record & process
              }
              else
              {
                   discard record
              }
        }
        ...and so on
}
 

Sample data file is


START=type1=200=0
END=type1=200=0

START=type1=200=0
Name=mickey
data=0xQv4gBjAAEIQwoAAQIAAAABABABqqqqqqqqqqqqqqqqqqqqIAAAAQEIATMzMzMzMzMAAAA
END=type1=200=0

START=type1=200=1
END=type1=200=1

START=type1=200=1
Name=GueesssMan
data=WL8vsAABCEMKAAECAAAAUQAQUaqqqqqqqqqqqqqqqqqqqiAAAFEBCFEzMzMzMzMzAAAAUQI
Q
END=type1=200=1

START=type2=200=0
Name=maximumsecurity
END=type2=200=0


START=type1=200=2
END=type1=200=2

START=type1=200=2
Name=mickey2
data=0xQv4gBjAAEIQwoAAQIAAAABABABqqqqqqqqqqqqqqqqqqqqIAAAAQEIATMzMzMzMzMAAAA
END=type1=200=2


START=type2=200=1
Name=conn2
END=type2=200=1

Avatar of mbormann
mbormann

ASKER

I have used the streamInputExample.java and it looks OK in the OROinc package but I have a problem with my RegEx

which is right now
String regex = "START=\\w+\\W+";

I want to pick up whole string from
START=........
....
....
....
END=........

what should that string be?
Thanks
the modified code if it's of any help

import java.io.*;
import com.oroinc.text.regex.*;

public final class streamInputExample
{
      public static final void main(String args[])      {            String regex = "START=\\w+\\W+";            Perl5Matcher matcher;
            Perl5Compiler compiler;
            Perl5Pattern pattern = null;
            Perl5StreamInput input;
            MatchResult result;
            Reader file = null;

            // Create Perl5Compiler and Perl5Matcher instances.
            compiler = new Perl5Compiler();
            matcher  = new Perl5Matcher();

            // Attempt to compile the pattern.  If the pattern is not valid,
            // report the error and exit.
            try            {                  pattern = (Perl5Pattern)compiler.compile(regex,Perl5Compiler.CASE_INSENSITIVE_MASK);
            }            catch(MalformedPatternException e)            {
                  System.err.println("Bad pattern.");
                  System.err.println(e.getMessage());
                  System.exit(1);
            }


            // Open input file.
            try            {
                  file = new BufferedReader(new FileReader("TestFile.txt"));
            }            catch(IOException e)            {
                  System.err.println("Error opening streamInputExample.txt.");
                  System.err.println(e.getMessage());
                  System.exit(1);
            }

            // Create a Perl5StreamInput instance to search the input stream.
            input   = new Perl5StreamInput(file);

            // We need to put the search loop in a try block because when searching
            // a Perl5StreamInput instance, an IOException may occur, and it must be
            // caught.
            try            {                  // Loop until there are no more matches left.
                  while(matcher.contains(input, pattern))                  {
                        // Since we're still in the loop, fetch match that was found.
                        result = matcher.getMatch();  
                        System.out.println("result.toString()="+result.toString());                        System.out.println("Begin offset: " + result.beginOffset(0));
                        System.out.println("End offset: " + result.endOffset(0));
                  }
        }        catch(IOException e)        {              System.err.println("Error occurred while reading file.");
              System.err.println(e.getMessage());
              System.exit(1);
        }
      }
}
hello mbormann,
I'm not used to Perl5 syntax but this pattern seems to work :
            String regex ="START=" + args[0] + "=" + args[1] + "(\\n|.)*END=" + args[0] + "=" + args[1];

with this TestFile :

START=TYPE1=BLAH
type one
hello world
123456
END=TYPE1=BLAH
START=TYPE2=BLAH
type two
9876545645343243
$$$$$$$$$$$$$thvr$hvr$$vthrzthvrhrs$vrthhrz$jh
END=TYPE2=BLAH
START=TYPE3=BLAH type 3 END=TYPE3=BLAH

I run : java streamInputExample TYPE2 BLAH
sorry pyxide I was not clear enough in my explanation ,perhaps u would check out my question in Perl section?
https://www.experts-exchange.com/jsp/qShow.jsp?ta=perl&qid=10244086 

and the BLAH is not constant it is a fixed number followed by a variable number.

like
START=TYPE1=200=2
another
START=TYPE1=200=6
etc...

thanks
P.s I too am not used to Perl5 Syntax
I have to make this stuff entirely dynamic and chain LIGHTNING,tell me am I on right track on this RegEx stuff?
mb, is there any reason you can't read the file something like this? You mentioned you can't use readline, but I'm not sure why?

import java.io.*;

public class te {

  public static void main(String args[])
            throws FileNotFoundException, IOException {

    if (args.length == 1) {
     
      BufferedReader in = new BufferedReader(new FileReader(args[0]));
      String s = "";
      String currentType = "";
      StringBuffer sb = new StringBuffer();
     
      while ( (s = in.readLine()) != null ) {
       
        if ( s.startsWith("START=") ) {
         
          boolean reachedEnd = false;
         
          // read data
          while (!reachedEnd && (s != null) ) {
           
            s = in.readLine();
           
            if ( (s != null) && (!s.startsWith("END=")) ) {
              sb.append("\n" + s);
            } else if (s == null) {
              reachedEnd = true;              
            }
          }
          sb.append("\n\n");
        }
      }
     
      System.out.println("\n\n" + sb.toString());
     
    } else {
      System.out.println("You must provide a file to search");
    }

  }
}

I used this test file....

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH

START=TYPE1=BLAH
lhglsdfjg
sl;kfgjsfd
;

;lskfhg
END=TYPE1=BLAH
Jod,
this has to be as lightning fast as possible in Java with very less garbage objects created ,u no that for that each readLine() a new strigBuffer is done.

Also i tried indexOf() and BoyerMoore's indexOf() ,but that hasn't got a facility of indexOf(string,int beginIndex) which would have been super for me.

My senior colleague said try to use the above stuff like i am doing,I saw that JLex is sort of complicated right now,OROinc's stuff is easier to implement in a shorter time which is a BIG constraint right now for us.
hope i get a solution soon.
I use OROMatcher in my grep tool and it works fine :)
What exactly u want to know?
vladi,
pls lookat my question in Perl section
https://www.experts-exchange.com/jsp/qShow.jsp?ta=perl&qid=10244086 

I have a problem with specifing the Regex string if u look at the code above which i have given with
String regex = "START=\\w+\\W+";

just tell me the regex string.if u have any clarfications pls ask,else am mailing u ,is it OK?
>> u no that for that each readLine() a new strigBuffer is done

Yes, I think we've been here before, with previous questions.

There is also JCC which is a Java program which can produce a praser based on a specification you give it. There are some details here:

http://www.suntest.com/JavaCC/

though how optimised the code it produces is, I am not to sure.

You do also have:

regionMatches
public boolean regionMatches(int toffset,
                             String other,
                             int ooffset,
                             int len)
Tests if two string regions are equal.
A substring of this String object is compared to a substring of the argument other. The result is true if these substrings represent identical character sequences. The substring of this String object to be compared begins at index toffset and has length len. The substring of other to be compared begins at index ooffset and has length len. The result is false if and only if at least one of the following is true:

toffset is negative.
ooffset is negative.
toffset+len is greater than the length of this String object.
ooffset+len is greater than the length of the other argument.
There is some nonnegative integer k less than len such that: this.charAt(toffset+k) != other.charAt(ooffset+k)
Parameters:
toffset - the starting offset of the subregion in this string.
other - the string argument.
ooffset - the starting offset of the subregion in the string argument.
len - the number of characters to compare.
Returns:
true if the specified subregion of this string exactly matches the specified subregion of the string argument; false otherwise.
Throws:
NullPointerException - if other is null.


for example...
Edited text of question.
Jod,
We had a long and heated discussion for abt 2 hrs and finally he concluded we need to do this kinda stuff after examinign ALL options.If I am not able to implement it quickly then its back to good old String and rest of the optimistaions involved but as u said I am also convinced that RegEx is the one to achieve lightning speed.

thanks
ASKER CERTIFIED SOLUTION
Avatar of pyxide
pyxide

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I haven't used it myself, but it looks promising - it seems to be particularly aimed at parsing info when you don't need or want to read the whole file in.

Don't use perl either so can't tell you much about the specifics of perl string matching either here...
pyxide ,
heartfelt thanks man for trying but I will definitely see along the pattern u pointed out and will then post final working code over here tommorrow.
amigo muchas gracias

Jod ,i too am learning.But believe me from what I have seen this is chain n greased lightning rolled into one.
I am trying with this

String st_regex_start_block ="START=(.+?)=([0-9]+)=([0-9]+)";

String st_regex_end_block ="((\\n.*?)*?)(\\nEND=(.+?)=[0-9]+=[0-9]+){1}?";

the empty body works pefectly man and I think you r a gennius in the way u picked the  RegEx String ,I can decide on the type which code to execute and the last variable constant number.

THANKS
it fails with only 1 thing ,is a little strange i think

INPUT
-----

START=TYPEa=200=0
DataLine 1
DataLine 2
DataLine 3
DataLine 4
DataLine 5
JavaWebServer version...izzitOK=Yes
END=TYPEa=200=0

OUTPUT
------

type     [TYPEa]
cons     [200]
variable [0]
--------------------
 body :[
DataLine 1
DataLine 2
DataLine 3
DataLine 4
DataLine 5
]avaWebServer version...izzitOK=Yes
--------------------

,as u can see the ']' which denotes the end of the record gets printed and then the last line minus the first char 'J' gets printed.

I am testing on WinNT4.0 which has line separator as \r\n,i think this is significant.

when i put
System.out.println(" body :[" + result.group(1) + "]***");
then the result is again of the last line
]***WebServer version...izzitOK=Yes

what is happening here?I have to stop for today and maybe come back later eveningish.

Thanks man Thanks
Adjusted points to 300
I noticed the little strange thing too:
When i look at output in DOS window (NT 4)
this happens :
]avaWebServer version...izzitOK=Yes

But when I send output in a file, it's OK
An hexadecimal editor shows that the group begins with a line feed character and ends with a return character, so I think DOS console puts the character(s) following 0x0D at the beginning of the current line, erasing initials character in the buffer, before printing it.

Anyway, this should'nt bother in a Java string.

type     [TYPEa]
cons     [200]
variable [0]
--------------------
 body :[ (0x0A) DataLine 1
....
....
JavaWebServer version...izzitOK=Yes (0x0D)]
--------------------

One thing : regexps work, but can be enhanced.
for example, i assume you don't need the "(\\nEND=(.+?)=[0-9]+=[0-9]+){1}?" group.
In Oro documentation, I saw an extended expression like this

(?:regexp) Groups things like "()" but doesn't cause the group match to be saved.

But I don't have time at the moment to play sorcerer's apprentice.

Good luck
I see that u can now collect a free T-Shirt since u have crossed 5000 .
Again muchas gracias amigo.
Free T Shirt? Wot free T Shirt....

How are you finding this package mb? Is it any good?
Jod,
U r entitled to a free T-Shirt after u reach 5000,15000,30000,50000,75000,100000
points .
This goes on & on after 100000 u dont get anything

Look at my question on T-Shirt here.
https://www.experts-exchange.com/jsp/qShow.jsp?ta=commspt&qid=10233717 

also look at
https://www.experts-exchange.com/info/tshirt.htm 

And yes I have been doing some other stuff but from what I heard from the others its faster ,as these com.oroinc Guys are not loading the whole file into memroy they r using a char[] buffer,and pyxide already parsed the way we want it.

Now we have to convince our clients to use this as they generally have a problem accepting third party stuff.

Again (sigh!) didja read the question on Multiple inheritance?

I brought Scott Meyers 2 books 'Effective C++'  & 'More 'Effective C++',am trying to get hold of Arthur J Riel's 'Obect Oriented Design Heuristics'

If u hvae time look at these questions i asked in the past week or so.

https://www.experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=10243109 
https://www.experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=10245184 
https://www.experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=10243045 

Also if u have anything to say pls say so over there!

Thanks n cheers
mb,

btw have you seen this:

http://www.cacas.org/java/gnu/regexp/

Just as an alternative to try perhaps.
Jod,

I surely dont like to try out the GNU Java code as it has been a bad experience for me.My colleagues who do C++ say it's good.
It used to break after 65536 bytes and used to gimme a Error packet everytime.

That's the reson I didn't even look at their implementation,yes if u want to tinker around it's fine.

Compare the TFTP implementation of theirs and NetComponents from OROInc ,ORO code is blazingly fast and stable ,though IBM has removed the 32MB limit from their implementation,I couldnt use it as its meant for a totally commandline based utility and what I wanted was satisfied mainly by ORO.

Jod ,this is the second time inside a couple of months that I have used ORO code and its been a good experience for me and I trust them and have started recommending that to all my friends who r intesredt.