Text Parser for Hierarchical Structured Data via RegEx (not necessary) and VB.NET

Cumbrowski

I am trying to use VB.NET to parse text data that is structured hierarchically, like a classic tree.

The obvious solution for me was to write a lot of code that traverses and parses the text character by character and slowly builds the tree and its data nodes one byte at a time.

Well, I hope that there is a better way than that. I was thinking of regular expressions. I am familiar with them, but far from considering myself an expert. I am also new to the VB.NET flavor and don't know what it can and cannot do. I hope that somebody here can help me find an elegant and efficient solution.

Here are three examples of the text and its structure. I am using "{" as the begin-of-block marker and "}" as the end-of-block marker, but those could also be other characters or even keywords.


Example 1
{
"Block 1"
Any Characters, except for block marker,
Line Breaks, White Spaces, Letters, Numbers,
+-[]:;'?/\-()*&%$#@!.,<>A-Za-z0-9 TAB LF CR
}

--------------------------------

Example 2
{
"Block 1"
}
{
"Block 2"
}

Okay, those were the easy ones, which I don't have a problem solving. Here is the problematic one.

--------------------------------
Example 3
{
"Block 1"
 .. (maybe) Data ...
  {
  "Block 1.1"
  .. (maybe) Data ...         
    {
    "Block 1.1.1"
    .. Data ...
    }
    .. (maybe) Data (for Block 1.1)
    {
    "Block 1.1.N"
    .. Data ...
    }      
   .. (maybe) Data ...
  }
 .. (maybe) Data..
  {
  "Block 1.N"
   .. Data ..
  }
   .. (maybe) Data ...
}
{
"Block 2"
.. Data..
}

As you can see, there can be 0-N levels of nesting within each block, regardless of the level where the block is located: a typical tree hierarchy. And something like a tree is what I would like to get back. A reversed tree would be even better, starting with the deepest levels of blocks first and working my way up to the top level, because that is how I will have to process the data eventually anyway.

I am not sure if this can be done using regular expressions. I thought it might, and I already use RegEx to parse the data within those blocks anyway, but it does not have to be a RegEx solution. The only other requirement is that the block markers might be single characters, but they could also be multi-character (keywords, e.g. "While" ... "EndWhile").

[Image: Data Structure Graphical Illustration]

Commented:
Are you married to this text format for some reason, or do you have the option of changing the layout of the text file?  Seems like XML would be a good fit here...

Commented:
PCRE (Perl Compatible Regular Expressions) supports recursive patterns.
.NET has no recursive patterns. Instead, it provides "balancing groups", a stack-based mechanism for matching simple nested patterns.
I'm not a .NET programmer, so you will have to try it yourself.
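
For reference, here is a minimal sketch of what a balancing-group pattern could look like in VB.NET, assuming single-character "{" and "}" markers. The group name depth and the helper FindTopLevelBlocks are made up for illustration; this is not taken from any answer in the thread.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Public Module BalancedBlocks
    ' "depth" is a made-up group name that serves only as a counter:
    '   (?<depth>\{)    pushes a capture for every nested begin marker
    '   (?<-depth>\})   pops one capture for every nested end marker
    '   (?(depth)(?!))  fails the match if a begin marker is still open
    Private Const Pattern As String = "\{(?>[^{}]+|(?<depth>\{)|(?<-depth>\}))*(?(depth)(?!))\}"

    Private ReadOnly TopLevelBlock As New Regex(Pattern)

    ' Returns the text of each outermost block, markers included.
    Public Function FindTopLevelBlocks(source As String) As List(Of String)
        Dim blocks As New List(Of String)()
        For Each m As Match In TopLevelBlock.Matches(source)
            blocks.Add(m.Value)
        Next
        Return blocks
    End Function
End Module
```

The pattern only returns the outermost blocks. To reach the nested ones, the same function would have to be applied again to the inside of each match, so it does not hand back a tree by itself.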

If that doesn't help, I think you have two options:
1. Hand-code the parser, probably with a recursive function.
2. Use yacc & lex, although that looks like overkill...
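
Option 1 could be sketched roughly as follows. This is only an illustration under assumptions: BlockNode, RecursiveBlockParser and VisitDeepestFirst are hypothetical names, markers are compared as plain ordinal strings (so "While" / "EndWhile" work, but there is no word-boundary check), and the text around child blocks is simply concatenated into one buffer per node.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

' Hypothetical node type: the block's own text plus its nested blocks.
Public Class BlockNode
    Public ReadOnly Data As New StringBuilder()
    Public ReadOnly Children As New List(Of BlockNode)()
End Class

Public Module RecursiveBlockParser
    ' Fills node from source, starting at pos, until the matching end marker
    ' (or the end of the input for the root). pos is advanced as text is read.
    Private Sub ParseInto(node As BlockNode, source As String, ByRef pos As Integer,
                          beginMarker As String, endMarker As String, isRoot As Boolean)
        While pos < source.Length
            If String.CompareOrdinal(source, pos, beginMarker, 0, beginMarker.Length) = 0 Then
                ' Begin marker: recurse into a new child block.
                pos += beginMarker.Length
                Dim child As New BlockNode()
                node.Children.Add(child)
                ParseInto(child, source, pos, beginMarker, endMarker, False)
            ElseIf String.CompareOrdinal(source, pos, endMarker, 0, endMarker.Length) = 0 Then
                ' End marker: this block is complete.
                If isRoot Then Throw New FormatException("Unexpected end marker at " & pos.ToString())
                pos += endMarker.Length
                Return
            Else
                node.Data.Append(source(pos))
                pos += 1
            End If
        End While
        If Not isRoot Then Throw New FormatException("Missing end marker")
    End Sub

    ' Returns a synthetic root whose children are the top-level blocks.
    Public Function Parse(source As String, beginMarker As String, endMarker As String) As BlockNode
        Dim root As New BlockNode()
        Dim pos As Integer = 0
        ParseInto(root, source, pos, beginMarker, endMarker, True)
        Return root
    End Function

    ' Visits every child before its parent, so nested blocks are handled
    ' before the block that contains them. The synthetic root comes last.
    Public Sub VisitDeepestFirst(node As BlockNode, visit As Action(Of BlockNode))
        For Each child As BlockNode In node.Children
            VisitDeepestFirst(child, visit)
        Next
        visit(node)
    End Sub
End Module
```

For example, RecursiveBlockParser.Parse("{a{b}c}{d}", "{", "}") should give a root with two children, the first holding the data "ac" and one nested block "b".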
kaufmed

Commented:
@jhp333

Nice. I hadn't heard of either recursive patterns or balancing groups. Mom was right, learning can be fun   = )

Author

Commented:
tgerbert:

Yes, I can, and actually I already do. I reformat the input to unify line breaks to CR+LF (e.g. Unix LF -> CR+LF or Mac CR -> CR+LF), replace TAB characters with SPACE, convert all double SPACEs to single SPACEs, and remove all SPACEs before and after any type of bracket "{}(){}", COMMAs and SEMICOLONs. I am also thinking about preformatting the text so that every BLOCK Begin and BLOCK End marker is on a separate line by itself.
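
The preformatting steps described above might look something like this. It is a sketch only: InputNormalizer and Normalize are hypothetical names, and the character class for brackets, commas and semicolons is my assumption of what is meant.

```vbnet
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Public Module InputNormalizer
    Public Function Normalize(source As String) As String
        ' Unify Unix (LF) and old Mac (CR) line breaks to CR+LF.
        Dim result As String = Regex.Replace(source, "\r\n|\r|\n", vbCrLf)
        ' Replace TAB characters with a single space.
        result = result.Replace(vbTab, " ")
        ' Collapse runs of two or more spaces into one space.
        result = Regex.Replace(result, " {2,}", " ")
        ' Remove spaces directly before and after brackets, commas and semicolons.
        result = Regex.Replace(result, " *([{}(),;]) *", "$1")
        Return result
    End Function
End Module
```

The line-break step comes first, and "\r\n" is listed first in the alternation, so an existing CR+LF pair is kept as one unit instead of being doubled.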

So again, the answer is yes. What do you have in mind?

Thanks

Author

Commented:
tgerbert: Your question gave me an idea, which might be the same one you had in mind.

Convert the data to an XML file by replacing the block markers with BEGIN and END tags and enclosing the rest of the data in CDATA sections, then use .NET's XmlTextReader or XmlDocument to traverse the XML tree. That seems like a good idea to me, but I ran into another problem there, which has to do with the nesting and proper formatting of the XML opening and closing tags. I posted another question here for this problem.

http://www.experts-exchange.com/Programming/Misc/Q_27038891.html
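
To illustrate the XML idea, here is a rough sketch that builds the document through XmlDocument instead of writing the tags as text, which sidesteps the tag-matching issue as long as the markers are balanced. The element names "root" and "block" and the names BlocksToXml, ToXml and FlushData are assumptions for illustration, and single-character "{" and "}" markers are assumed.

```vbnet
Imports System
Imports System.Text
Imports System.Xml

Public Module BlocksToXml
    ' Writes any collected non-blank data into a CDATA section, then resets the buffer.
    Private Sub FlushData(doc As XmlDocument, target As XmlElement, pending As StringBuilder)
        If pending.ToString().Trim().Length > 0 Then
            target.AppendChild(doc.CreateCDataSection(pending.ToString()))
        End If
        pending.Length = 0
    End Sub

    ' Converts brace-delimited text into nested <block> elements under <root>.
    Public Function ToXml(source As String) As XmlDocument
        Dim doc As New XmlDocument()
        Dim current As XmlElement = doc.CreateElement("root")
        doc.AppendChild(current)
        Dim pending As New StringBuilder()

        For Each c As Char In source
            If c = "{"c Then
                FlushData(doc, current, pending)
                Dim child As XmlElement = doc.CreateElement("block")
                current.AppendChild(child)
                current = child            ' descend into the new block
            ElseIf c = "}"c Then
                FlushData(doc, current, pending)
                If current Is doc.DocumentElement Then
                    Throw New FormatException("End marker without begin marker")
                End If
                current = DirectCast(current.ParentNode, XmlElement)   ' back to the parent
            Else
                pending.Append(c)
            End If
        Next

        If current IsNot doc.DocumentElement Then Throw New FormatException("Missing end marker")
        FlushData(doc, current, pending)
        Return doc
    End Function
End Module
```

One caveat: data that itself contains "]]>" cannot go into a single CDATA section and would need extra handling that this sketch leaves out.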

Well, I came up with a different solution, since I started to doubt that it would work using RegEx alone. I am now parsing the data line by line, using regex to identify the markers and the TreeView control to collect the data in hierarchical format.
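
The actual code is not posted here, so the following is only a sketch of how such a line-by-line approach might look, not the real solution. It assumes every marker sits on a line of its own, and it uses a hypothetical LineNode class in place of the TreeView nodes to stay free of a Windows Forms dependency.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

' Hypothetical stand-in for a TreeView node.
Public Class LineNode
    Public ReadOnly Lines As New List(Of String)()
    Public ReadOnly Children As New List(Of LineNode)()
End Class

Public Module LineBlockParser
    ' A marker line contains only the marker, optionally surrounded by white space.
    ' For keyword markers the patterns could be changed, e.g. to "^\s*While\s*$".
    Private ReadOnly BeginLine As New Regex("^\s*\{\s*$")
    Private ReadOnly EndLine As New Regex("^\s*\}\s*$")

    Public Function Parse(source As String) As LineNode
        Dim root As New LineNode()
        ' The stack holds the chain of blocks that are currently open.
        Dim openNodes As New Stack(Of LineNode)()
        openNodes.Push(root)

        For Each currentLine As String In Regex.Split(source, "\r\n|\r|\n")
            If BeginLine.IsMatch(currentLine) Then
                Dim child As New LineNode()
                openNodes.Peek().Children.Add(child)
                openNodes.Push(child)
            ElseIf EndLine.IsMatch(currentLine) Then
                If openNodes.Count = 1 Then Throw New FormatException("End marker without begin marker")
                openNodes.Pop()
            ElseIf currentLine.Trim().Length > 0 Then
                ' Any other non-blank line is data of the innermost open block.
                openNodes.Peek().Lines.Add(currentLine.Trim())
            End If
        Next

        If openNodes.Count > 1 Then Throw New FormatException("Missing end marker")
        Return root
    End Function
End Module
```

With a real TreeView, TreeNode.Nodes.Add would take the place of Children.Add, and the stack would hold TreeNode objects instead.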

Author

Commented:
One of the comments gave me new ideas, but none provided a solution to my problem. However, I want to credit the one who gave me the inspiration and give him some points for that. I came up with the final solution myself, so I credit myself for the rest :)
