Best Way to parse SQL


I was working on a SQL query generator, and was wondering what'd be the best way to parse SQL using ANSI C.

Right now what i do is:

Tokenize the query based on spaces.
Determine which SQL statement it is based on these tokens.
Evaluate it accordingly.I have different syntax check mechanisms for different statements,

I dont think this is the best way to do it.

What would be a better,faster method of parsing SQL?
Who is Participating?

[Webinar] Streamline your web hosting managementRegister Today

Kent OlsenConnect With a Mentor Data Warehouse Architect / DBACommented:
Hi ankuratvb,

My, but this is an ambitious project.  (One that is certainly worthy of one of the board's stars!)

SQL is recursive, so tokenizing "on the fly" has its drawbacks.  Besides, it's pretty easy to generate a token list.  But you will have to base your tokens on more than spaces.  (Semicolons, parenthesis, equals, etc are all stand-alone tokens than don't require leading spaces.)

The next thing that you're going to need is a language definition.  Parsing the SQL will be much easier with a defined set of rules that your parser can traverse.  Again, the recursive process comes to mind.

In a lot of ways, there's some comfort in performing "brute force" syntax checking, but it gets very messy, particularly as language elements change.  It's much easier to modifhy the rules tables and insert exits at the appropriate spot for code generation.

Are you planning to support user functions and stored procedures, too?


Jaime OlivaresSoftware ArchitectCommented:
If you can parse it correctly you can execute it! (at least in theory), so I recommend you to use a genuine SQL parser from any SQL implementation.
ankuratvbAuthor Commented:
Hi Kent,

Basically,i look at the first word of the query string to determine which statement it is,then i extract the substrings required and process them.

For e.g.:
char query[]="insert into emp values('as',1234)";

By looking at the first word, i know that its an insert command,so i extract the content between "insert into" and "values" to get the table name,validate it and then move on to validating the values within the parentheses.

Right now,it works fine but i need to support sub-queries as well.Thats where the problem comes.

For instance,

char query[]="select * from emp where empno=(select empno from emp where empno=1)";

>Are you planning to support user functions and stored procedures, too?
I might, but sub-queries is first priority right now.

If i use an SQL parser written by somebody else, half my work would've been done by somebody else.Yes,i can definitely learn from their parser's implementation.
Do you have links to any simple SQL parser's implementation?
I've seen MySQL's code but it was way too complicated for me.
I've downloaded SQLite's code which is supposed to be a very small,simple and fast database engine(adopted by PHP in place of MySQL).I'll be studying the code in the next few days.
Live webcast with Pinal Dave

Pinal Dave will teach you tricks to help identify the real root cause of database problems rather than red herrings. Attendees will learn scripts that they can use in their environment to immediately figure out their performance Blame Shifters and fix them quickly.

Kent OlsenData Warehouse Architect / DBACommented:

Hi Ankur,

That's what I was referring to when I said that SQL is recursive.  Just like evaluating the expression (A*(B+C)), subqueries force you to process outward from the middle.

Unless you want to get REALLY ugly, you're going to have to utilize a stack and recursive evaluation techniques.  It adds a bunch of work to your project, but in the long run it's the only practical way to process a recursive language.

Jaime OlivaresConnect With a Mentor Software ArchitectCommented:
Parsers are generally written as "Finite State Machine" implementations. This is a very detailed subject. You may want to google abit.

gseidmanConnect With a Mentor Commented:
When creating a parser for use in a C program which parses any moderately complex language, you generally want to use lex and yacc. They are a lexical scanner and a parser generator, respectively, and are pretty standard for such things. I recommend using (and modifying, if necessary) the lex and yacc sources from PostgreSQL. It is under the BSD license, so they should be suitable for inclusion in your work. Looking at the source tree, you probably want scan.l and gram.y from src/backend/parser
ankuratvbAuthor Commented:
Hi gseidman,

I was looking at lex and yacc for doing the job,i even have the language definition for SQL for them.

But,i was aiming for a portable solution, and i guess there arent such tools as lex and yacc in Windoze.
Kent OlsenConnect With a Mentor Data Warehouse Architect / DBACommented:

Hi Ankur,

lex(1) and yaxx(1) don't exist on Winblows, but that doesn't mean that your solution can't exist there.  Use the tools to build the definitions on *nix and then simply ?:) recompile it in Windows.

gseidmanConnect With a Mentor Commented:
Though lex and yacc may not be available on Windows by default, there are implementations available for Windows. In particular, flex and bison have been ported. See:

All Courses

From novice to tech pro — start learning today.