Regular expression SQL syntax highlighting in .NET

I am trying the syntax highlight some SQL statements using VB.NET.

What I have got so far works (but I am sure that it isn't the most efficient)
One bug that I have is that if I put a single quote within a comment block my syntax checking thinks this is the start of a string so colours it red when in fact anything within a comment block should be ignored - some suggestions would be appreciated...
'set everything to black to start with
            txtRTB.SelectAll()
            txtRTB.SelectionColor = Color.Black

            'reserved words are all blue
            FormatWithRegEx("\b(ADD|ALL|ALLOCATE|ALTER|AND|ANY|ARE|ARRAY|AS|ASENSITIVE|ASYMMETRIC|AT|ATOMIC|AUTHORIZATION|BEGIN|BETWEEN|BIGINT|BINARY|BLOB|BOOLEAN|BOTH|BY|CALL|CALLED|CASCADED|CASE|CAST|CHAR|CHARACTER|CHECK|CLOB|CLOSE|COLLATE|COLUMN|COMMIT|CONDITION|CONNECT|CONSTRAINT|CONTINUE|CORRESPONDING|CREATE|CROSS|CUBE|CURRENT|CURRENT_DATE|CURRENT_DEFAULT_TRANSFORM_GROUP|CURRENT_PATH|CURRENT_ROLE|CURRENT_TIME|CURRENT_TIMESTAMP|CURRENT_TRANSFORM_GROUP_FOR_TYPE|CURRENT_USER|CURSOR|CYCLE|DATE|DAY|DEALLOCATE|DEC|DECIMAL|DECLARE|DEFAULT|DELETE|DEREF|DESCRIBE|DETERMINISTIC|DISCONNECT|DISTINCT|DO|DOUBLE|DROP|DYNAMIC|EACH|ELEMENT|ELSE|ELSEIF|END|ESCAPE|EXCEPT|EXEC|EXECUTE|EXISTS|EXIT|EXTERNAL|FETCH|FILTER|FLOAT|FOR|FOREIGN|FREE|FROM|FULL|FUNCTION|GET|GLOBAL|GRANT|GROUP|GROUPING|HANDLER|HAVING|HOLD|HOUR|IDENTITY|IF|IMMEDIATE|IN|INDICATOR|INNER|INOUT|INPUT|INSENSITIVE|INSERT|INT|INTEGER|INTERSECT|INTERVAL|INTO|IS|ITERATE|JOIN|LANGUAGE|LARGE|LATERAL|LEADING|LEAVE|LEFT|LIKE|LOCAL|LOCALTIME|LOCALTIMESTAMP|LOOP|MATCH|MEMBER|MERGE|METHOD|MINUTE|MODIFIES|MODULE|MONTH|MULTISET|NATIONAL|NATURAL|NCHAR|NCLOB|NEW|NO|NONE|NOT|NULL|NUMERIC|OF|OLD|ON|ONLY|OPEN|OR|ORDER|OUT|OUTER|OUTPUT|OVER|OVERLAPS|PARAMETER|PARTITION|PRECISION|PREPARE|PRIMARY|PROCEDURE|RANGE|READS|REAL|RECURSIVE|REF|REFERENCES|REFERENCING|RELEASE|REPEAT|RESIGNAL|RESULT|RETURN|RETURNS|REVOKE|RIGHT|ROLLBACK|ROLLUP|ROW|ROWS|SAVEPOINT|SCOPE|SCROLL|SEARCH|SECOND|SELECT|SENSITIVE|SESSION_USER|SET|SIGNAL|SIMILAR|SMALLINT|SOME|SPECIFIC|SPECIFICTYPE|SQL|SQLEXCEPTION|SQLSTATE|SQLWARNING|START|STATIC|SUBMULTISET|SYMMETRIC|SYSTEM|SYSTEM_USER|TABLE|TABLESAMPLE|THEN|TIME|TIMESTAMP|TIMEZONE_HOUR|TIMEZONE_MINUTE|TO|TRAILING|TRANSLATION|TREAT|TRIGGER|UNDO|UNION|UNIQUE|UNKNOWN|UNNEST|UNTIL|UPDATE|USER|USING|VALUE|VALUES|VARCHAR|VARYING|WHEN|WHENEVER|WHERE|WHILE|WINDOW|WITH|WITHIN|WITHOUT|YEAR)\b", txtRTB, Color.Blue)

            ' single quoted text 'blah' is red
            FormatWithRegEx("\'([^\']*)\'", txtRTB, Color.Red)

            'make sure [blah] is all black
            FormatWithRegEx("\[([^\]]*)\]", txtRTB, Color.Black)

            'double quoted strings are all black
            FormatWithRegEx("\" & Chr(34) & "([^\" & Chr(34) & "]*)\" & Chr(34), txtRTB, Color.Black)

            'multiline comments 
            FormatWithRegEx("\/*([^\']*)\*/", txtRTB, Color.Green)

            'single line comments 
            FormatWithRegEx("\--([^\" & vbLf & "]*)\" & vbLf, txtRTB, Color.Green)

            txtRTB.Select(selPos, selLen)
            txtRTB.SelectionColor = Color.Black



Private Sub FormatWithRegEx(ByVal strRegEx As String, ByRef txtRTB As RichTextBox, ByVal colour As System.Drawing.Color)
        Dim regex As New Regex(strRegEx, _
        RegexOptions.IgnoreCase _
        Or RegexOptions.Multiline _
        Or RegexOptions.IgnorePatternWhitespace _
        Or RegexOptions.Compiled)

        Dim myMatches As MatchCollection = regex.Matches(txtRTB.Text)
        For Each GoodMatch As Match In myMatches
            txtRTB.Select(GoodMatch.Index, GoodMatch.Length)
            txtRTB.SelectionColor = colour
        Next
    End Sub

Open in new window

MattWilkinsonAsked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

planoczCommented:
Can you tell me want you are trying to do  with the SQL statements?
Are you looking for a spaeial word  or what?
MattWilkinsonAuthor Commented:
I am trying to create syntax colour highlighting the same as SQL Server Management Studio. So part of it is looking for reserved words (these are blue) part is looking for strings enclosed in single quotes (these are red) and comments following two dashes (--) (these are green)

The problem I have is that if I put a single quote into a comment it thinks this is the start of the a string enclosed in single quotes whereas I want to only look for single quotes that are NOT in a comment block.
planoczCommented:
I have some code at home that might work.
I will check tonight and see what I can come up with.
PMI ACP® Project Management

Prepare for the PMI Agile Certified Practitioner (PMI-ACP)® exam, which formally recognizes your knowledge of agile principles and your skill with agile techniques.

käµfm³d 👽Commented:
Couldn't you just give the single-quote string a higher precedence than the multi-line comment by moving that particular function call above the one for multi-line comment?
käµfm³d 👽Commented:
P.S.

I actually got *way* better performance by removing the COMPILED option on the regex  :)
käµfm³d 👽Commented:
Please disregard my first comment. I was looking at lines 20 & 21 as the single-quoted string.

I have made a couple of modifications to your patterns and I am attaching before and after results when I tested the changes. If you do not experience similar results, can you post an example string?
    'set everything to black to start with
    txtRTB.SelectAll()
    txtRTB.SelectionColor = Color.Black

    'reserved words are all blue
    FormatWithRegEx("\b(?:ADD|ALL|ALLOCATE|ALTER|AND|ANY|ARE|ARRAY|AS|ASENSITIVE|ASYMMETRIC|AT|ATOMIC|AUTHORIZATION|BEGIN|BETWEEN|BIGINT|BINARY|BLOB|BOOLEAN|BOTH|BY|CALL|CALLED|CASCADED|CASE|CAST|CHAR|CHARACTER|CHECK|CLOB|CLOSE|COLLATE|COLUMN|COMMIT|CONDITION|CONNECT|CONSTRAINT|CONTINUE|CORRESPONDING|CREATE|CROSS|CUBE|CURRENT|CURRENT_DATE|CURRENT_DEFAULT_TRANSFORM_GROUP|CURRENT_PATH|CURRENT_ROLE|CURRENT_TIME|CURRENT_TIMESTAMP|CURRENT_TRANSFORM_GROUP_FOR_TYPE|CURRENT_USER|CURSOR|CYCLE|DATE|DAY|DEALLOCATE|DEC|DECIMAL|DECLARE|DEFAULT|DELETE|DEREF|DESCRIBE|DETERMINISTIC|DISCONNECT|DISTINCT|DO|DOUBLE|DROP|DYNAMIC|EACH|ELEMENT|ELSE|ELSEIF|END|ESCAPE|EXCEPT|EXEC|EXECUTE|EXISTS|EXIT|EXTERNAL|FETCH|FILTER|FLOAT|FOR|FOREIGN|FREE|FROM|FULL|FUNCTION|GET|GLOBAL|GRANT|GROUP|GROUPING|HANDLER|HAVING|HOLD|HOUR|IDENTITY|IF|IMMEDIATE|IN|INDICATOR|INNER|INOUT|INPUT|INSENSITIVE|INSERT|INT|INTEGER|INTERSECT|INTERVAL|INTO|IS|ITERATE|JOIN|LANGUAGE|LARGE|LATERAL|LEADING|LEAVE|LEFT|LIKE|LOCAL|LOCALTIME|LOCALTIMESTAMP|LOOP|MATCH|MEMBER|MERGE|METHOD|MINUTE|MODIFIES|MODULE|MONTH|MULTISET|NATIONAL|NATURAL|NCHAR|NCLOB|NEW|NO|NONE|NOT|NULL|NUMERIC|OF|OLD|ON|ONLY|OPEN|OR|ORDER|OUT|OUTER|OUTPUT|OVER|OVERLAPS|PARAMETER|PARTITION|PRECISION|PREPARE|PRIMARY|PROCEDURE|RANGE|READS|REAL|RECURSIVE|REF|REFERENCES|REFERENCING|RELEASE|REPEAT|RESIGNAL|RESULT|RETURN|RETURNS|REVOKE|RIGHT|ROLLBACK|ROLLUP|ROW|ROWS|SAVEPOINT|SCOPE|SCROLL|SEARCH|SECOND|SELECT|SENSITIVE|SESSION_USER|SET|SIGNAL|SIMILAR|SMALLINT|SOME|SPECIFIC|SPECIFICTYPE|SQL|SQLEXCEPTION|SQLSTATE|SQLWARNING|START|STATIC|SUBMULTISET|SYMMETRIC|SYSTEM|SYSTEM_USER|TABLE|TABLESAMPLE|THEN|TIME|TIMESTAMP|TIMEZONE_HOUR|TIMEZONE_MINUTE|TO|TRAILING|TRANSLATION|TREAT|TRIGGER|UNDO|UNION|UNIQUE|UNKNOWN|UNNEST|UNTIL|UPDATE|USER|USING|VALUE|VALUES|VARCHAR|VARYING|WHEN|WHENEVER|WHERE|WHILE|WINDOW|WITH|WITHIN|WITHOUT|YEAR)\b", txtRTB, Color.Blue)

    ' single quoted text 'blah' is red
    FormatWithRegEx("'(?!.*?\*/.*?').*?'", txtRTB, Color.Red)

    'make sure [blah] is all black
    FormatWithRegEx("\[[^\]]*\]", txtRTB, Color.Black)

    'double quoted strings are all black
    FormatWithRegEx("""[^""]*""", txtRTB, Color.Black)

    'single line comments 
    FormatWithRegEx("^--.+?$", txtRTB, Color.Green)

    'multiline comments 
    FormatWithRegEx("/\*.+?\*/", txtRTB, Color.Green)

    ''txtRTB.Select(selPos, selLen)
    ''txtRTB.SelectionColor = Color.Black


End Sub

Private Sub FormatWithRegEx(ByVal strRegEx As String, ByRef txtRTB As RichTextBox, ByVal colour As System.Drawing.Color)
    Dim regex As New Regex(strRegEx, _
    RegexOptions.IgnoreCase _
    Or RegexOptions.Multiline _
    Or RegexOptions.Singleline _
    Or RegexOptions.IgnorePatternWhitespace)

    Dim myMatches As MatchCollection = regex.Matches(txtRTB.Text)
    For Each GoodMatch As Match In myMatches
        txtRTB.Select(GoodMatch.Index, GoodMatch.Length)
        txtRTB.SelectionColor = colour
    Next
End Sub

Open in new window

untitled.JPG

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
käµfm³d 👽Commented:
If you copy the code from the window above, please uncomment lines 23 and 24--I had to comment them out for my testing  :)
MattWilkinsonAuthor Commented:
Thanks for the update - can you try and explain how it works? Although a seasoned VB programmer regex is relatively new ground for me!
käµfm³d 👽Commented:
Sorry, I just saw your request...   and it's not a problem  :)

I'll start with the patterns. For some of the patterns, they are very similar to yours--with some, the only difference is that I removed some of the extraneous escapes you placed in  :)

    \b(?:ADD|ALL ...  WITHOUT|YEAR)\

This is essentially the same pattern. The only difference is I made the group a non-capturing group using the question/colon syntax --  (?: )  This performs slightly better than a capturing group -- parentheses alone

    '(?!.*?\*/.*?').*?'

This one was fun! This pattern uses negative lookahead -- question/exclamation (?! ) -- to check that a multi-line comment closing-indicator does not come between two single-quotes. It does this by first finding a single-quote (not included in the negative lookahead), then, stepping into the negative lookahead, it steps through each character up to a single quote. If at some point during this stepping a closing multi-line comment symbol is encountered, the negative lookahead will fail, causing the match to fail. If the negative lookahead succeeds, meaning that the multi-line closer was not found before a closing single quote, then it matches all characters up to the first closing single quote.

    \[[^\]]*\]

This one matches an opening square bracket, followed by zero-or-more characters which are not a closing square bracket, followed by a closing square bracket.

    ""[^""]*""

Find a double-quote, followed by zero-or-more characters which are not a double-quote, followed by a double-quote. You see the double-quotes paired inside the pattern because this is how you escape a double-quote inside of a string in VB. I thought this was a little cleaner than your Char(34) constructions  :)

    ^--.+?$

This matches the start of the line (and not the start of the string becuase we turned on Multiline) followed by 2 dashes, then every character up to the end of the line

In hindsight, you may want to adjust this pattern as a comment which has spaces before it will not be matched by this. The following would probably be better:

    \s*--.+?$

This would match any number of whitespaces at the beginning of the comment; but we removed the start-of-line marker ( ^ ) so that we can have single-line comments at the end of the string.


    /\*.+?\*/

This matches the start of multi-line comment marker, followed by all characters up to, then including, the end of multi-line comment marker.

#########################################################################################
#########################################################################################

Now, for the logic. The keywords get highlighted first, as you already know. This is followed by the single-quote string. Because of the way we structured the single-quote pattern to check for the multi-line comment marker, only the strings which are truly single-quote strings should be matched (highlighted) by this pattern.

The next step is doing the bracketed expressions. This is pretty much the same as what you had before. If you think there is a chance you may have an opening square bracket inside of a multi-line comment without the accompanying closing bracket, the you will have the same problem you had with single-quotes. I leave it to you to judge how likely this could occur.

The last two steps include searching for the single-line comments and then the multi-line comments, as you know. The key to the overall process is making sure the multi-line comment comes last, since it is the pattern which can match the most, including subsets of the other patterns.

As to the performance gain with removing COMPILED (if your interested) -- compiled is good if you are going to run the SAME regex object multiple times. The way you have your logic structured, however, you are creating a new regex object every time the function is called. This means that every time your formatting function gets called, you are doing 6 compilations. If you wanted to stick with the COMPILED option, a better route would be to create 6 global (class-level) regex objects, all with COMPILED, as well as the other options and each with one of the different patterns, then pass those individual regex objects to your FormatWithRegEx() function. The COMPILED option should, in this case, yield slightly faster execution on subsequent executions, outside of the first.

I'm not sure how good of an explanation that is, but if anything is unclear, please let me know!
käµfm³d 👽Commented:
Here is an example of what I described in the last paragraph:
          'set everything to black to start with
    txtRTB.SelectAll()
    txtRTB.SelectionColor = Color.Black

    'reserved words are all blue
    FormatWithRegEx(Me.regKeyword, txtRTB, Color.Blue)

    ' single quoted text 'blah' is red
    FormatWithRegEx(Me.regSingleQuoteString, txtRTB, Color.Red)

    'make sure [blah] is all black
    FormatWithRegEx(Me.regBracketed, txtRTB, Color.Black)

    'double quoted strings are all black
    FormatWithRegEx(Me.regDoubleQuotedString, txtRTB, Color.Black)

    'single line comments
    FormatWithRegEx(Me.regSingleLineComment, txtRTB, Color.Green)

    'multiline comments
    FormatWithRegEx(Me.regMultiLineComment, txtRTB, Color.Green)

    ''txtRTB.Select(selPos, selLen)
    ''txtRTB.SelectionColor = Color.Black


End Sub

Private Sub FormatWithRegEx(ByVal regObj As Regex, ByRef txtRTB As RichTextBox, ByVal colour As System.Drawing.Color)
    Dim myMatches As MatchCollection = regObj.Matches(txtRTB.Text)

    For Each GoodMatch As Match In myMatches
        txtRTB.Select(GoodMatch.Index, GoodMatch.Length)
        txtRTB.SelectionColor = colour
    Next
End Sub

Private regKeyword As Regex("\b(?:ADD|ALL|ALLOCATE|ALTER|AND|ANY|ARE|ARRAY|AS|ASENSITIVE|ASYMMETRIC|AT|ATOMIC|AUTHORIZATION|BEGIN|BETWEEN|BIGINT|BINARY|BLOB|BOOLEAN|BOTH|BY|CALL|CALLED|CASCADED|CASE|CAST|CHAR|CHARACTER|CHECK|CLOB|CLOSE|COLLATE|COLUMN|COMMIT|CONDITION|CONNECT|CONSTRAINT|CONTINUE|CORRESPONDING|CREATE|CROSS|CUBE|CURRENT|CURRENT_DATE|CURRENT_DEFAULT_TRANSFORM_GROUP|CURRENT_PATH|CURRENT_ROLE|CURRENT_TIME|CURRENT_TIMESTAMP|CURRENT_TRANSFORM_GROUP_FOR_TYPE|CURRENT_USER|CURSOR|CYCLE|DATE|DAY|DEALLOCATE|DEC|DECIMAL|DECLARE|DEFAULT|DELETE|DEREF|DESCRIBE|DETERMINISTIC|DISCONNECT|DISTINCT|DO|DOUBLE|DROP|DYNAMIC|EACH|ELEMENT|ELSE|ELSEIF|END|ESCAPE|EXCEPT|EXEC|EXECUTE|EXISTS|EXIT|EXTERNAL|FETCH|FILTER|FLOAT|FOR|FOREIGN|FREE|FROM|FULL|FUNCTION|GET|GLOBAL|GRANT|GROUP|GROUPING|HANDLER|HAVING|HOLD|HOUR|IDENTITY|IF|IMMEDIATE|IN|INDICATOR|INNER|INOUT|INPUT|INSENSITIVE|INSERT|INT|INTEGER|INTERSECT|INTERVAL|INTO|IS|ITERATE|JOIN|LANGUAGE|LARGE|LATERAL|LEADING|LEAVE|LEFT|LIKE|LOCAL|LOCALTIME|LOCALTIMESTAMP|LOOP|MATCH|MEMBER|MERGE|METHOD|MINUTE|MODIFIES|MODULE|MONTH|MULTISET|NATIONAL|NATURAL|NCHAR|NCLOB|NEW|NO|NONE|NOT|NULL|NUMERIC|OF|OLD|ON|ONLY|OPEN|OR|ORDER|OUT|OUTER|OUTPUT|OVER|OVERLAPS|PARAMETER|PARTITION|PRECISION|PREPARE|PRIMARY|PROCEDURE|RANGE|READS|REAL|RECURSIVE|REF|REFERENCES|REFERENCING|RELEASE|REPEAT|RESIGNAL|RESULT|RETURN|RETURNS|REVOKE|RIGHT|ROLLBACK|ROLLUP|ROW|ROWS|SAVEPOINT|SCOPE|SCROLL|SEARCH|SECOND|SELECT|SENSITIVE|SESSION_USER|SET|SIGNAL|SIMILAR|SMALLINT|SOME|SPECIFIC|SPECIFICTYPE|SQL|SQLEXCEPTION|SQLSTATE|SQLWARNING|START|STATIC|SUBMULTISET|SYMMETRIC|SYSTEM|SYSTEM_USER|TABLE|TABLESAMPLE|THEN|TIME|TIMESTAMP|TIMEZONE_HOUR|TIMEZONE_MINUTE|TO|TRAILING|TRANSLATION|TREAT|TRIGGER|UNDO|UNION|UNIQUE|UNKNOWN|UNNEST|UNTIL|UPDATE|USER|USING|VALUE|VALUES|VARCHAR|VARYING|WHEN|WHENEVER|WHERE|WHILE|WINDOW|WITH|WITHIN|WITHOUT|YEAR)\b", _
    RegexOptions.IgnoreCase _
    Or RegexOptions.Multiline _
    Or RegexOptions.Singleline _
    Or RegexOptions.IgnorePatternWhitespace
    Or RegexOptions.Compiled)

Private regSingleQuoteString As Regex("'(?!.*?\*/.*?').*?'", _
    RegexOptions.IgnoreCase _
    Or RegexOptions.Multiline _
    Or RegexOptions.Singleline _
    Or RegexOptions.IgnorePatternWhitespace
    Or RegexOptions.Compiled)

Private regBracketed As New Regex("\[[^\]]*\]", _
    RegexOptions.IgnoreCase _
    Or RegexOptions.Multiline _
    Or RegexOptions.Singleline _
    Or RegexOptions.IgnorePatternWhitespace
    Or RegexOptions.Compiled)

Private regDoubleQuotedString As New Regex("""[^""]*""", _
    RegexOptions.IgnoreCase _
    Or RegexOptions.Multiline _
    Or RegexOptions.Singleline _
    Or RegexOptions.IgnorePatternWhitespace
    Or RegexOptions.Compiled)

Private regSingleLineComment As New Regex("^--.+?$", _
    RegexOptions.IgnoreCase _
    Or RegexOptions.Multiline _
    Or RegexOptions.Singleline _
    Or RegexOptions.IgnorePatternWhitespace
    Or RegexOptions.Compiled)

Private regMultiLineComment As New Regex("/\*.+?\*/", _
    RegexOptions.IgnoreCase _
    Or RegexOptions.Multiline _
    Or RegexOptions.Singleline _
    Or RegexOptions.IgnorePatternWhitespace
    Or RegexOptions.Compiled)

Open in new window

käµfm³d 👽Commented:
It's late...   and I missed a couple of "New" on the member declarations for the Regex objects. You would add those where appropriate  :)
MattWilkinsonAuthor Commented:
Many thanks
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Regular Expressions

From novice to tech pro — start learning today.