Asked by Cumbrowski

Text Parser for Hierarchical Structured Data via RegEx (not necessary) and VB.NET

I am trying to use VB.NET to parse text data that is structured hierarchically, like a classic tree.

The obvious solution for me was to write a lot of code that traverses the text character by character and slowly builds the tree, with its data nodes, one byte at a time.

Well, I hope that there is a better way than that. I was thinking of regular expressions. I am familiar with them, but far from considering myself an expert. I am also new to the VB.NET flavor and don't know what it can and cannot do. I hope that somebody here can help me find an elegant and efficient solution.

Here are three examples of the text and its structure. I am using "{" as the begin-of-block marker and "}" as the end-of-block marker, but those could also be other characters or even keywords.


Example 1
{
"Block 1"
Any Characters, except for block marker,
Line Breaks, White Spaces, Letters, Numbers,
+-[]:;'?/\-()*&%$#@!.,<>A-Za-z0-9 TAB LF CR
}

--------------------------------

Example 2
{
"Block 1"
}
{
"Block 2"
}

Okay, those were the easy ones, which I don't have a problem solving. Here is the problematic one.

--------------------------------
Example 3
{
"Block 1"
 .. (maybe) Data ...
  {
  "Block 1.1"
  .. (maybe) Data ...         
    {
    "Block 1.1.1"
    .. Data ...
    }
    .. (maybe) Data (for Block 1.1)
    {
    "Block 1.1.N"
    .. Data ...
    }      
   .. (maybe) Data ...
  }
 .. (maybe) Data..
  {
  "Block 1.N"
   .. Data ..
  }
   .. (maybe) Data ...
}
{
"Block 2"
.. Data..
}

As you can see, there can be 0 to N levels of nesting within each block, regardless of the level at which the block is located: a typical tree hierarchy. And something like a tree is what I would like to get back. A reversed tree would be even better, starting with the deepest blocks and working up to the top level, because that is how I will have to process the data eventually anyway.

I am not sure whether this can be done with regular expressions. I thought it might, and I already use RegEx to parse the data within those blocks, but it does not have to be a RegEx solution. The only other requirement is that the block markers might be single characters, but they could also be multi-character keywords (e.g. "While" ... "EndWhile").

ASKER CERTIFIED SOLUTION
Todd Gerbert

This solution is only available to members of Experts Exchange.

PCRE (Perl Compatible Regular Expressions) has "recursive patterns".
.NET has no recursive patterns. Instead, it provides "balancing groups", a stack-based mechanism for matching simple nested patterns.
I'm not a .NET programmer, so try it for yourself.
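
For readers who want to try the balancing-groups idea, here is a minimal sketch. It assumes the single-character markers "{" and "}" from the question; the module name, function name, and the group name "Depth" are made up for illustration. The pattern pushes onto the "Depth" stack for every inner "{", pops for every inner "}", and only accepts the final "}" when the stack is empty again.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BalancingGroupsDemo
    ' Matches one complete top-level block, however deeply its children nest.
    '   \{(?<Depth>)     inner open marker: push onto the Depth stack
    '   \}(?<-Depth>)    inner close marker: pop (fails if the stack is empty)
    '   (?(Depth)(?!))   reject the match if anything is still on the stack
    Private Const BlockPattern As String =
        "\{(?>[^{}]+|\{(?<Depth>)|\}(?<-Depth>))*(?(Depth)(?!))\}"

    ' Returns the text of each top-level block, outer markers included.
    Function TopLevelBlocks(input As String) As String()
        Dim matches As MatchCollection = Regex.Matches(input, BlockPattern)
        Dim result(matches.Count - 1) As String
        For i As Integer = 0 To matches.Count - 1
            result(i) = matches(i).Value
        Next
        Return result
    End Function

    Sub Main()
        Dim text As String = "{ ""Block 1"" { ""Block 1.1"" } data } { ""Block 2"" }"
        For Each block As String In TopLevelBlocks(text)
            Console.WriteLine(block)
        Next
    End Sub
End Module
```

This only finds the outermost blocks. To get the nested ones, the same pattern would have to be applied again to the inside of each match, and keyword markers such as "While" ... "EndWhile" would need the character class replaced by something lookahead-based, so it may not be simpler than a hand-written parser.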

If it doesn't help, I think you have two options:
1. Hard-code it, probably with a recursive function.
2. Use yacc & lex, although that looks like overkill...
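
Option 1 can be sketched with an explicit stack instead of recursion. This is only an illustration under assumptions: the BlockNode class, the Parse and DeepestFirst names, and the virtual root node are all hypothetical and not taken from any post here. Markers are passed in as strings, so keywords work as well as single characters, but the sketch does not check word boundaries, and it tests the end marker first so that "EndWhile" is not mistaken for "While".

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

' Hypothetical node type: one per block, with its own text and its child blocks.
Public Class BlockNode
    Public ReadOnly Content As New StringBuilder()
    Public ReadOnly Children As New List(Of BlockNode)()
End Class

Public Module BlockParser
    ' Builds the tree in a single pass over the input.
    Public Function Parse(input As String, openMarker As String, closeMarker As String) As BlockNode
        Dim root As New BlockNode()                  ' virtual root above all top-level blocks
        Dim openBlocks As New Stack(Of BlockNode)()
        openBlocks.Push(root)
        Dim i As Integer = 0
        While i < input.Length
            If String.CompareOrdinal(input, i, closeMarker, 0, closeMarker.Length) = 0 Then
                If openBlocks.Count = 1 Then
                    Throw New FormatException("Unmatched end marker at position " & i.ToString())
                End If
                openBlocks.Pop()
                i += closeMarker.Length
            ElseIf String.CompareOrdinal(input, i, openMarker, 0, openMarker.Length) = 0 Then
                Dim child As New BlockNode()
                openBlocks.Peek().Children.Add(child)
                openBlocks.Push(child)
                i += openMarker.Length
            Else
                ' Ordinary data belongs to the innermost block that is still open.
                openBlocks.Peek().Content.Append(input(i))
                i += 1
            End If
        End While
        If openBlocks.Count > 1 Then Throw New FormatException("Missing end marker")
        Return root
    End Function

    ' Post-order walk: every block is visited only after all of its children.
    ' The virtual root is visited last.
    Public Sub DeepestFirst(node As BlockNode, visit As Action(Of BlockNode))
        For Each child As BlockNode In node.Children
            DeepestFirst(child, visit)
        Next
        visit(node)
    End Sub

    Sub Main()
        Dim tree As BlockNode = Parse("{A{B{C}}{D}}{E}", "{", "}")
        DeepestFirst(tree, Sub(n) Console.WriteLine(n.Content.ToString()))
    End Sub
End Module
```

The post-order walk gives the "children before parents" order the question asks for. One simplification to be aware of: data found before and after a child block is concatenated into the parent's Content, so its position relative to the children is lost.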
@jhp333

Nice. I hadn't heard of either recursive patterns or balanced groups. Mom was right, learning can be fun =)

Cumbrowski (ASKER)

tgerbert:

Yes, I can, and actually already do. I already reformat the input to unify line breaks to CR+LF (e.g. Unix LF -> CR+LF or Mac CR -> CR+LF), replace TAB characters with SPACE, convert all double SPACES to single SPACES, and remove all SPACES before and after any types of brackets "{}(){}", COMMAs and SEMICOLONs. I am also thinking about preformatting the text so that any BLOCK Begin and BLOCK End markers are on a separate line by themselves.
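
The clean-up steps described above could look roughly like this. It is a sketch only: the module and function names are invented, and the set of characters to trim around is assumed to be braces, parentheses, commas, and semicolons.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module InputNormalizer
    Function Normalize(raw As String) As String
        ' CR+LF, lone CR (Mac) and lone LF (Unix) all become CR+LF.
        Dim s As String = Regex.Replace(raw, "\r\n|\r|\n", vbCrLf)
        ' TAB -> SPACE
        s = s.Replace(vbTab, " ")
        ' Any run of two or more spaces -> a single space
        s = Regex.Replace(s, " {2,}", " ")
        ' Remove spaces directly before and after brackets, commas, semicolons.
        s = Regex.Replace(s, " *([{}(),;]) *", "$1")
        Return s
    End Function

    Sub Main()
        Console.Write(Normalize("a" & vbTab & "{  b ,  c }" & vbLf))
    End Sub
End Module
```

The order matters: line breaks are unified first so that the later steps only have to deal with spaces, and the space-collapsing step runs before the trimming step so that the last pattern stays simple.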

So again, the answer is yes. What do you have in mind?

Thanks
tgerbert: Your question gave me an idea, which might be the same one you had in mind.

Format the data as an XML file by replacing the block markers with BEGIN and END tags and enclosing the rest of the data in CDATA sections, then use VB's XmlTextReader or XmlDocument to traverse the XML tree. That seems like a good idea to me, but I ran into another problem there, which has to do with the nesting and proper formatting of the XML opening and closing tags. I posted another question here for this problem.
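
One possible shape of that conversion, as a sketch rather than the approach actually used in this thread: the element names "root" and "block" are made up, every block gets the same tag name, and the markers are assumed to be "{" and "}". Splitting on the markers in one pass (instead of separate replace calls) keeps each opening and closing tag in step with its marker.

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions
Imports System.Xml

Module BlocksToXml
    Function ToXml(input As String) As XmlDocument
        Dim sb As New StringBuilder("<root>")
        ' Regex.Split keeps the markers because the pattern is a capturing group.
        For Each piece As String In Regex.Split(input, "([{}])")
            Select Case piece
                Case "{"
                    sb.Append("<block>")
                Case "}"
                    sb.Append("</block>")
                Case ""
                    ' Nothing between two adjacent markers.
                Case Else
                    ' "]]>" would end the CDATA section early, so it is split in two.
                    sb.Append("<![CDATA[")
                    sb.Append(piece.Replace("]]>", "]]]]><![CDATA[>"))
                    sb.Append("]]>")
            End Select
        Next
        sb.Append("</root>")
        Dim doc As New XmlDocument()
        doc.LoadXml(sb.ToString())   ' throws XmlException if the markers are unbalanced
        Return doc
    End Function

    Sub Main()
        Dim doc As XmlDocument = ToXml("{A{B}C}{D}")
        Console.WriteLine(doc.OuterXml)
    End Sub
End Module
```

A side effect worth noting is that LoadXml doubles as a balance check. Data containing characters that are not legal in XML (most control characters) would still make the load fail, so this assumes reasonably clean text.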

https://www.experts-exchange.com/questions/27038891/Regular-Expression-Replace-using-Negative-lookahead-in-VB-NET.html

SOLUTION
This solution is only available to members of Experts Exchange.

One of the comments gave me new ideas, but none provided a solution for my problem. However, I want to credit the one who gave me the inspiration and give him some points for that. I came up with the final solution for this myself, so I credit myself for the rest :)