Solved

I have a file specification in BNF format.  How do I implement a parser that can read those files?

Posted on 2010-09-01
7
842 Views
Last Modified: 2013-11-18
I have a specification for a proprietary text-based file format.  The spec contains about 10 pages worth of BNF logic to describe the formatting rules for these files.  I would like to implement a program that can read files in this format.  My understanding is that I'll need some kind of tool set (like lex and yacc) to parse and validate the file, and then read the contents of the file to allocate and populate objects in my program.

There are commercial implementations of this spec, but the vendors that I have found charge *annual* licensing fees in the tens of thousands of dollars.  To my knowledge, the organization that devised the spec charges no such royalty fees--it sounds like the cost is just an issue of the vendors' business models, and the fact that we're a niche industry.

When we look at what we need to do with the files vs. how much the 3rd party licenses cost, the build vs. buy equation makes it seem much more practical to roll our own implementation.

Getting back to the BNF issue, I *vaguely* remember dealing with this in my compilers class in college, but that was several years ago, and I haven't used that stuff since that class.  Should I look into a tool set like lex and yacc?  Or are there other options available?  I'm open to a variety of options (C++, .NET, etc).
0
Comment
Question by:ThoughtProcess
7 Comments
 
LVL 53

Accepted Solution

by:
Infinity08 earned 84 total points
ID: 33578452
>> My understanding is that I'll need some kind of tool set (like lex and yacc) to parse and validate the file

lex/yacc (or one of their variants) would be my first choice.

        http://dinosaur.compilertools.net/

If the BNF specification is clean, there should not be too much issues generating a parser for it using those tools.

Do you have any specific questions about them ?
0
 

Author Comment

by:ThoughtProcess
ID: 33578597
As far as specific questions go, I'd like to build an in-memory representation of the data.

Here's a dumbed-down look at how the file is formatted:

/BEGIN Foo
[optional whitespace]
attribute 1
attribute 2
/BEGIN [child element]
...
/END [child element]
/END Foo


I plan to define a set of classes to match the possible elements.  Here's an example class that could work for the above file fragment:

class Foo : public Bar
   {
   public:
      Foo();
      virutal ~Foo();

   protected:
      ContainerClass<Attribute> attributeList;
      ContainerClass<Bar> children;
   };

I need a way to allocate and populate a series of objects to match the data in the file.  Then I'd like to perform some checks on the resulting data.

I imagine that yacc/lex have the logic to determine if a file is well-formeed, but do they have a callback mechanism that can call constructors when certain patterns are matched?  

I guess I'm unsure of how I can make the jump from "knowing that the file is well-formed" to "populating data from the file into a set of objects".

Thanks!
0
 

Author Comment

by:ThoughtProcess
ID: 33578683
Not-quite-ninja edit:

I did more reading on lex and yacc, and I think this fits the bill almost exactly.  Sorry for the previous "RTFM" comment.
0
3 Use Cases for Connected Systems

Our Dev teams are like yours. They’re continually cranking out code for new features/bugs fixes, testing, deploying, testing some more, responding to production monitoring events and more. It’s complex. So, we thought you’d like to see what’s working for us.

 
LVL 11

Assisted Solution

by:cup
cup earned 83 total points
ID: 33581180
There are several ways to parse BNF - if you want speed, use recrsive descent but you need to make sure you do not have any left recursive definitions.   If you want good error messages, use LALR - lex/yacc/bison.  Lex/yacc/bison is OK if you keep to using it as a tool and don't try to debug the generated code.  Note that lex and yacc/bison are separate tools with their own distinct syntax rules.
0
 
LVL 7

Assisted Solution

by:jhp333
jhp333 earned 83 total points
ID: 33581626
I recently used Flex/Bison, 15 years after I took compiler theory class. It was not too difficult to catch up, but it surely took some time especially when I wanted more accurate error messages. I used only C, not C++, as those tools looked more native to C, and built a syntax tree with C structure.

Looking at your example, I got an impression that its grammar is so simple (no operator precedence etc.) that it might be easily converted to XML using only lex. Yacc is far more complicated than lex.

It seems you're working on a Windows platform. Then you can find your tools from these links:
http://gnuwin32.sourceforge.net/packages/flex.htm
http://gnuwin32.sourceforge.net/packages/bison.htm
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 33584389
>> I imagine that yacc/lex have the logic to determine if a file is well-formeed,

They provide you the tools to implement this kind of check. You'll still have to add the logic yourself though. In the most basic way, you can show an error message whenever an unexpected token is encountered.


>> but do they have a callback mechanism that can call constructors when certain patterns are matched?  

Yes. Whenever a rule is matched, yacc allows you to execute arbitrary code. So, it's just a matter of specifying the right rules, and adding the right code to the rules.


>> /BEGIN Foo
>> [optional whitespace]
>> attribute 1
>> attribute 2
>> /BEGIN [child element]
>> ...
>> /END [child element]
>> /END Foo

That doesn't look too complicated, so it shouldn't be too difficult to parse it.
0
 

Author Closing Comment

by:ThoughtProcess
ID: 33694439
Thank you for your help, guys!  This is a great launching point for parsing these files.
0

Featured Post

Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

If you haven’t already, I encourage you to read the first article (http://www.experts-exchange.com/articles/18680/An-Introduction-to-R-Programming-and-R-Studio.html) in my series to gain a basic foundation of R and R Studio.  You will also find the …
Go is an acronym of golang, is a programming language developed Google in 2007. Go is a new language that is mostly in the C family, with significant input from Pascal/Modula/Oberon family. Hence Go arisen as low-level language with fast compilation…
The viewer will learn how to implement Singleton Design Pattern in Java.
The goal of the video will be to teach the user the difference and consequence of passing data by value vs passing data by reference in C++. An example of passing data by value as well as an example of passing data by reference will be be given. Bot…

920 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now