Solved

C# Regex removing VB(A) comments wanted.

Posted on 2014-02-22
Last Modified: 2014-03-05
Let the value of the variable code be a fragment of VB(A) code such as:
const string code = @"
Option explicit 'comment
sub t
dim X 'comment with cont _
inuation _
end of comment
X = ""It's mine!"" ' inside "" is not a comment designation
end sub
";


The following code shall remove all comments:
Console.WriteLine(Regex.Replace(code, <pattern I am asking about>, <null or $1 or what?>));


The result shall be as follows:
Option explicit 
sub t
dim X 
X = "It's mine!" 
end sub


I am hoping for something that works (a little bit) better than my own attempt:
Debug.Print(Regex.Replace(code, @"(?<![""].*)(?s)'.*?[^_](\r\n)", "$1"));
0
Question by:midfde
35 Comments
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 39879995
This is not a good problem to tackle with regex. You really need a parser to parse a grammar such as a programming language.
0
 
LVL 1

Assisted Solution

by:midfde
midfde earned 0 total points
ID: 39880097
No, I do not. I just want to know whether RE is powerful enough to solve the problem, hence my question.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 39880116
It might be powerful enough, but the pattern would be very complicated.
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39880198
This
Debug.Print(Regex.Replace(code, @"'[\w\s""]*\r\n", "\r\n"))


works well with your sample data.

HTH,
Dan
0
 
LVL 1

Author Comment

by:midfde
ID: 39880549
>>ID: 39880198 ...works well
No, it does not.

>>ID: 39880116: ...very complicated.
Something more specific than just emotions please?
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39880584
I tested my regex in RegexBuddy with .NET compatibility.
Where are you testing, so we can use the same software?
0
 
LVL 1

Author Comment

by:midfde
ID: 39880597
Dan:
The following code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.IO;
using System.Diagnostics;

namespace removeVBComments {
    class Program {
        const string VBA = @"
Option explicit 'comment
sub t
dim X 'comment with cont _
inuation _
end of comment
X = ""It's mine!"" ' inside "" is not a comment designation
end sub
";
        static void Main(string[] args) {
            string txt = VBA;
            Debug.Print("11)" + txt + Regex.Replace(txt,  @"'[\w\s""]*\r\n", "\r\n") + "-----");
        }
    }
}


results in
11)
Option explicit 'comment
sub t
dim X 'comment with cont _
inuation _
end of comment
X = "It's mine!" ' inside " is not a comment designation
end sub

Option explicit 
dim X 
X = "It's mine!" 
-----
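
A likely reason the `sub t` and `end sub` lines vanish in this output: in .NET, `\s` also matches `\r` and `\n`, and `\w` matches `_`, so the greedy class `[\w\s""]*` can run across several lines and then backtrack only as far as the last `\r\n` it swallowed. The sketch below contrasts the posted pattern with a variant confined to one line. The class name, method names, and the shortened input are hypothetical, and the variant is only an illustration: it still ignores continuations and would cut at an apostrophe inside a string literal.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyClassDemo
{
    // Pattern from comment ID 39880198: \s inside the class also matches \r and \n.
    public static string CrossLine(string vb)
    {
        return Regex.Replace(vb, @"'[\w\s""]*\r\n", "\r\n");
    }

    // Illustrative variant (an assumption, not from the thread):
    // [^\r\n] keeps each match on a single line.
    public static string SingleLine(string vb)
    {
        return Regex.Replace(vb, @"'[^\r\n]*\r\n", "\r\n");
    }

    static void Main()
    {
        string vb = "Option explicit 'comment\r\nsub t\r\ndim X 'note\r\nend sub\r\n";
        Console.Write(CrossLine(vb));   // "sub t" and "end sub" are swallowed
        Console.WriteLine("-----");
        Console.Write(SingleLine(vb));  // all four code lines survive
    }
}
```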


0
 
LVL 45

Expert Comment

by:aikimark
ID: 39880684
Your statement
X = ""It's mine!"" ' inside "" is not a comment designation


is not correct syntax.

Did you mean to use
X = """It's mine!""" ' inside "" is not a comment designation



Is this VBA or VBScript code?
0
 
LVL 74

Assisted Solution

by:käµfm³d 👽
käµfm³d   👽 earned 250 total points
ID: 39880693
The only one being emotional, unfortunately, is you. Ask around:  Regex is intended to parse a regular language, not a context-free language. Regex does not handle recursive constructs, which is what you would need in order to test for nested quotation marks. There are simply too many variations of code that you need to account for in order to capture every place where a comment can be.

The only reason that I say it "might be powerful enough" is because the regex engine that is built into .NET has additional functionality that is not a part of theoretical regular expressions. But as I said, even with these extra options, the pattern you write would be overly complicated, and damn near incomprehensible.
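
For comparison, the scanner-style approach recommended here can stay fairly small for this one task. What follows is a minimal sketch, not code from the thread. It assumes that a comment starts at an apostrophe outside a double-quoted string, that a comment line ending in `_` continues onto the next line (as in the sample), and that `Rem` comments and other VB(A) syntax can be ignored. `VbCommentScanner` and `StripComments` are hypothetical names.

```csharp
using System;
using System.Text;

static class VbCommentScanner
{
    public static string StripComments(string vb)
    {
        var output = new StringBuilder(vb.Length);
        bool inString = false;   // currently between double quotes
        bool inComment = false;  // currently after an apostrophe that opened a comment

        for (int i = 0; i < vb.Length; i++)
        {
            char c = vb[i];
            if (inComment)
            {
                if (c != '\n') continue;  // skip comment text
                // Index of the last character before the line break (\r\n or \n).
                int last = (i > 0 && vb[i - 1] == '\r') ? i - 2 : i - 1;
                if (last >= 0 && vb[last] == '_') continue;  // continuation: still a comment
                inComment = false;
                output.Append(vb, last + 1, i - last);       // keep the original line break
                continue;
            }

            if (c == '"')
            {
                inString = !inString;     // a doubled "" toggles twice, which is harmless
            }
            else if (c == '\'' && !inString)
            {
                inComment = true;         // drop the apostrophe and everything after it
                continue;
            }
            else if (c == '\n')
            {
                inString = false;         // VB string literals do not span lines
            }
            output.Append(c);
        }
        return output.ToString();
    }

    static void Main()
    {
        string vb = "dim X 'comment with cont _\r\ninuation _\r\nend of comment\r\n"
                  + "X = \"It's mine!\" ' inside \" is not a comment designation\r\n";
        Console.Write(StripComments(vb));
    }
}
```

Because the string state is tracked per line rather than counted from the start of the input, a stray double quote inside one comment cannot affect how later lines are read.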
0
 
LVL 1

Author Comment

by:midfde
ID: 39880752
To:ID: 39880684
Please see line 7 in results ID: 39880597

To: ID: 39880693
Dear kaufmed.
With all due respect I sincerely comprehend as (generally inevitable) emotions anything like "too many" and "overly complicated" in such a simple context as: "Is it possible? Prove it please."
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39880848
You might cover two of the three cases you posted with the following:
1. Use regexp (or equivalent string replace function) to remove the line continuations.
replace: _\r\n
with: "" -- empty string

2. Then replace with this pattern
replace: '[^"]*?[^_]\r\n
with: \r\n -- carriage return & line feed

==========================
that leaves you with the statement that I've identified as not syntactically correct.
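
The two steps above can be sketched in C# as follows. `TwoStepDemo` and `Strip` are hypothetical names, and the input is built with explicit `\r\n` so the result does not depend on the source file's line endings. Note that step 1 also joins continued code lines, not only continued comments, so it changes more than the comments.

```csharp
using System;
using System.Text.RegularExpressions;

static class TwoStepDemo
{
    public static string Strip(string vb)
    {
        // Step 1: remove line continuations ("_" followed by a line break).
        string joined = Regex.Replace(vb, @"_\r\n", "");
        // Step 2: remove an apostrophe comment that contains no double quote.
        return Regex.Replace(joined, @"'[^""]*?[^_]\r\n", "\r\n");
    }

    static void Main()
    {
        string vb = "Option explicit 'comment\r\n"
                  + "dim X 'comment with cont _\r\ninuation _\r\nend of comment\r\n"
                  + "X = \"It's mine!\" ' inside \" is not a comment designation\r\n";
        // The first two comments are removed; the last line is left untouched.
        Console.Write(Strip(vb));
    }
}
```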
0
 
LVL 1

Author Comment

by:midfde
ID: 39880905
Sorry, aikimark, I do not actually understand if your consideration has much to do with my initial request, which is to remove comments (without changing anything else) from syntactically correct VB(A) with a [one-liner] single C# regular expression. All comments. Nothing but comments please.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39881469
>>which is to remove comments (without changing anything else)
The requirement to not change anything else was not explicitly stated in your question.

As I have stated twice, your line 7 is not syntactically correct.  I have posted a correct version.
0
 
LVL 1

Author Comment

by:midfde
ID: 39882323
>>The requirement to not change anything else was not explicitly stated in your question.
Sorry, it's my fault. I thought if I ask e.g. "How to remove a file?" it by default means "not to change anything else" (or else DEL *.* or even FORMAT C: might do).

>>...line 7 is not syntactically correct
MS Access "thinks" otherwise -- please see the image and compare line 7 (results!) above with penultimate line in the code on the picture.
MS Access does not find any syntax errors.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 39882709
The code you originally posted is not the same code as what is displayed in your screenshot.
0
 
LVL 1

Author Comment

by:midfde
ID: 39882851
Originally I posted C# const code whose value is VBA code (oh, see line 7 above).
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39894097
@midfde

What version of the .Net framework are you using?  There is an updated compiler interface that should allow you to get a compiler's-eye-view of any VB.Net or C# code.
http://www.hanselman.com/blog/AnnouncingTheNewRoslynpoweredNETFrameworkReferenceSource.aspx
http://msdn.microsoft.com/en-us/vstudio/roslyn.aspx
0

 
LVL 1

Author Comment

by:midfde
ID: 39894633
Thank you, aikimark.

Q: How to... using RegEx?
A: There is an updated compiler...

Looks irrelevant to me, sorry, as well as "What version...?" (BTW, it's a normal one, ".Net 4 Framework Client Profile").
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39894957
* I don't think a single regular expression pattern will do what you want.

* I think that you would be able to solve your problem if you used the compiler to parse your code.
0
 
LVL 1

Author Comment

by:midfde
ID: 39895217
>> I don't think a single regular expression pattern will do what you want.
... because of what???

I thought I might save some effort by consulting an expert who is fluent in the RE language. My expectations have not been met so far, so I'll try to solve my puzzle myself again. Later. I believe it is possible. I'll let this respectable forum know my results.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 39895276
You have a fundamental flaw in your thinking of what regex is intended to be used for. You're not alone. Many people have tried to apply regex in scenarios where it does not make sense. As I mentioned above, it *may* be possible, and only because the regex implementations that we work with in today's programming languages have extra features that are not defined by theoretical regular expressions. If you would like an example of how something simple to understand becomes complicated when defining it in regex, take a look at a pattern in my article under the section, "Tokenizer on Steroids." What you are now trying to do is much more complicated than what I was doing in that part of the article.

The simple fact is:  You are not hearing the answer you want to hear, and so you think no one is helping you. You need to come to terms with the idea that sometimes "it won't work" or "it's not a good idea" is the right answer.
0
 
LVL 1

Author Comment

by:midfde
ID: 39895364
>>You have a fundamental flaw in your thinking
Sorry, my question is not about my (or your for that matter) thinking.

>>...sometimes "it won't work" or "it's not a good idea" is the right answer.
Sometimes it is, but not in computing when it is not corroborated. It is not an answer at all unless it is followed by convincing "because..." consideration. (Remember Fermat's theorem?)

>>... it does not make sense
Some people deemed that such a simple "thingy" as a Turing machine might not make practical sense (because they just knew it was so).

Your reference is very good though. Thanks kaufmed.
0
 
LVL 45

Assisted Solution

by:aikimark
aikimark earned 250 total points
ID: 39895366
>>... because of what???
Because I've used Regexp, answered EE questions with regexp solutions, written articles that include regexp components, and given presentations at user groups and developer conferences.  Like kaufmed, I know its limitations and applicable problem contexts.

Have you had any formal training with grammars and lexical processing (or equivalent self-training/experience)?  If not, you can do a little reading on those subjects to confirm what kaufmed and I are asserting -- regexp is not the tool to use for the problem you have presented to us.

Since you are in the .Net environment (C#), I'm suggesting that the compiler interface might provide a solution path to your problem.  There are alternatives, but they are more complicated and would likely require more effort to implement.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39895385
and now...a little humor:
http://xkcd.com/1313/
http://xkcd.com/1171/

And Jeff Atwood's blog (homage/love letter) to regular expressions.
http://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/
Note #4:
Regular expressions are not Parsers.
0
 
LVL 1

Author Comment

by:midfde
ID: 39895444
Splendid!

>>Regular expressions are not Parsers.
Excellent observation. It is as good as "C-language is not a parser."
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 39895616
>>Excellent observation

I would have assumed that to be implied in my very first comment   : \

>>This is not a good problem to tackle with regex. You really need a parser...
0
 
LVL 1

Accepted Solution

by:midfde
midfde earned 0 total points
ID: 39895866
Excellent observations, various emotions, but not enough evidence (for me). Therefore I had to solve my problem with the following code:
using System;
using System.Text.RegularExpressions;
using System.Diagnostics;

namespace removeVBComments {
    class Program {
        const string VBA = @"
Option explicit 'comment
sub t
dim X 'comment with cont _
inuation _
end of comment
X = ""It's mine!"" ' inside "" is not a comment designation
end sub
";
        static void Main(string[] args) {
            string txt = VBA;
            Debug.Print("11)" + txt + Regex.Replace(txt, @"(?<!^[^""]*""(?:[^""]*""[^""]*"")*[^""]*)(?s)'.*?[^_](\r\n)", "$1"));
        }
    }
}


Its output is this:
11)
Option explicit 'comment
sub t
dim X 'comment with cont _
inuation _
end of comment
X = "It's mine!" ' inside " is not a comment designation
end sub

Option explicit 
sub t
dim X 
X = "It's mine!" 
end sub


Heed line (guess what?) 7 and line 13. The RE distinguishes the context of apostrophes.
I thank participants for a fruitful discussion that made me concentrate.  We all know that the easiest answer to any "Is it possible...?" question is "No!!!". Do we not?
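
To make the pattern easier to read, here is the same expression laid out with `RegexOptions.IgnorePatternWhitespace`. The annotations are one reading of it, and `AnnotatedPattern` and `Strip` are hypothetical names. Since the pattern is used without `RegexOptions.Multiline`, `^` anchors at the start of the whole input, so the lookbehind in effect asks whether an odd number of double quotes precedes the apostrophe.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnnotatedPattern
{
    const string Pattern = @"
        (?<!                        # the apostrophe must not be preceded by ...
            ^[^""]*""               #   one double quote, counted from the start of the input,
            (?:[^""]*""[^""]*"")*   #   plus any number of further pairs of quotes
            [^""]*                  #   so an odd count is read as inside a string literal
        )
        (?s)                        # from here on the dot also matches line breaks
        '.*?[^_]                    # apostrophe, then lazily up to a character other than _
        (\r\n)                      # followed by the line break, kept as group 1
    ";

    public static string Strip(string vb)
    {
        return Regex.Replace(vb, Pattern, "$1", RegexOptions.IgnorePatternWhitespace);
    }

    static void Main()
    {
        // The sample line from the thread: the apostrophe in It's is kept.
        Console.Write(Strip("X = \"It's mine!\" ' inside \" is not a comment designation\r\n"));

        // The quote count runs over the whole input, so the lone quote in the first
        // comment makes the second comment look as if it were inside a string.
        Console.Write(Strip("X = \"a\" ' one \" quote\r\nY = 1 'later\r\n"));
    }
}
```

The second call illustrates a limit of counting quotes from the start of the input: `'later` is left in place because three double quotes precede it.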
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39895975
very good.

does your pattern also remove the comments when they are the only thing on one or more lines or follow nothing but space/tab characters?
0
 
LVL 1

Author Comment

by:midfde
ID: 39896143
You better try it and let me know.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39896795
I see the following:
11)
Option explicit 'comment
sub t
' the purpose of the routine is
dim X 'comment with cont _
inuation _
end of comment
X = "It's mine!" ' inside " is not a comment designation
end sub

Option explicit 'comment
sub t
' the purpose of the routine is
dim X 'comment with cont _
inuation _
end of comment
X = "It's mine!" ' inside " is not a comment designation
end sub


0
 
LVL 1

Author Comment

by:midfde
ID: 39897214
Unbelievable! (You certainly copied and pasted the code. Didn't you?)
Sysinfo returns
OS Name:                   Microsoft Windows 8.1
OS Version:                6.3.9600 N/A Build 9600
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Standalone Workstation
OS Build Type:             Multiprocessor Free
on the computer I am running MSVS 2010 on.
Please see the attached image.
It works on my computer (W8, MSVS 2010)
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39897294
After copying and pasting your solution code snippet, I replaced Debug.Print with Console.WriteLine:
using System;
using System.Text.RegularExpressions;
using System.Diagnostics;

namespace removeVBComments {
    class Program {
        const string VBA = @"
Option explicit 'comment
sub t
' the purpose of the routine is
dim X 'comment with cont _
inuation _
end of comment
X = ""It's mine!"" ' inside "" is not a comment designation
end sub
";
        static void Main(string[] args) {
            string txt = VBA;
            Console.WriteLine("11)" + txt + Regex.Replace(txt, @"(?<!^[^""]*""(?:[^""]*""[^""]*"")*[^""]*)(?s)'.*?[^_](\r\n)", "$1"));
        }
    }
}


I did this run in a virtual environment (Mono 2.10.2.0).  Not sure of the sysinfo in that particular environment, but I'm pretty sure it isn't Win8.
http://www.compileonline.com/compile_csharp_online.php

I see that your environment does delete the single line comment.
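
The thread never establishes why the two environments disagree. One plausible cause, offered as an assumption only, is line endings: the pattern requires a literal `\r\n`, and a verbatim string compiled from a source file with bare `\n` line breaks (common outside Windows) contains no `\r` at all, so nothing would match. A minimal sketch that normalizes line endings before applying the accepted pattern; `LineEndingDemo`, `Raw`, and `Strip` are hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

static class LineEndingDemo
{
    // The accepted pattern, unchanged.
    const string Pattern =
        @"(?<!^[^""]*""(?:[^""]*""[^""]*"")*[^""]*)(?s)'.*?[^_](\r\n)";

    // Applies the pattern to the input exactly as given.
    public static string Raw(string vb)
    {
        return Regex.Replace(vb, Pattern, "$1");
    }

    // Converts every line break to \r\n first, then applies the pattern.
    public static string Strip(string vb)
    {
        string crlf = vb.Replace("\r\n", "\n").Replace("\n", "\r\n");
        return Regex.Replace(crlf, Pattern, "$1");
    }

    static void Main()
    {
        string lfOnly = "Option explicit 'comment\nsub t\n' the purpose of the routine is\nend sub\n";
        Console.WriteLine(Raw(lfOnly) == lfOnly);  // True: without \r\n nothing matches
        Console.Write(Strip(lfOnly));              // comments removed, output uses \r\n
    }
}
```

Note that the comment-only line leaves an empty line behind, because the pattern keeps the line break in group 1.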
0
 
LVL 1

Author Comment

by:midfde
ID: 39897305
I think it is the web page's test that fails here.
0
 
LVL 1

Author Closing Comment

by:midfde
ID: 39905899
Good for discussants, good for me!
0
