Regular expressions in .NET

mikha
Given an array or collection of words as below, I want to find each word in the text string and replace it with, say, an empty string.



var words = new string[] { "apple", "cat", "red" };

var text = "I have a red apple and a small cat";

foreach (var w in words) {
    // \b matches a word boundary, so only whole words are replaced
    var output = Regex.Replace(text, @"\b" + w + @"\b", " ");
    text = output;
}

Is there a better way to do this?

Also, suppose I have an array of special characters. If any of these characters is found, it should be replaced by an empty string.
In my example below I have $$$. Since it is the same character $ appearing multiple times, I would like to replace the whole run with a single replacement string rather than one per character.
How do I achieve this?

 Regex reg = new Regex("[*&#$^@{}]");
 var result = reg.Replace("I have * , as well as $$$" , " ");
Your first solution looks reasonable.

For your second question, this should do it:
Regex reg = new Regex("[*&#$^@{}]+");
var result = reg.Replace("I have * , as well as $$$" , " ");
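To illustrate what the added + changes, here is a minimal sketch. The class and method names (SpecialCharDemo, ReplaceSpecials) are made up for this example and are not from the thread. Note that + applies to the whole character class, so a mixed run such as $*$ is also collapsed into one replacement, not only repeats of the same character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpecialCharDemo
{
    // Character class from the question; the trailing + lets one match
    // cover a whole run of special characters such as "$$$".
    private static readonly Regex Specials = new Regex(@"[*&#$^@{}]+");

    // Hypothetical helper: replaces each run of special characters once.
    public static string ReplaceSpecials(string input, string replacement)
    {
        return Specials.Replace(input, replacement);
    }

    public static void Main()
    {
        // The run "$$$" becomes a single "-" instead of three.
        Console.WriteLine(ReplaceSpecials("a$$$b", "-"));
        // A mixed run is treated as one match as well.
        Console.WriteLine(ReplaceSpecials("a$*$b", "-"));
    }
}
```

If only repeats of the same character should be collapsed, a backreference pattern such as ([*&#$^@{}])\1* would be one alternative to consider.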


Shaun VermaakTechnical Specialist
Awarded 2017
Distinguished Expert 2018

Commented:
I would use
Regex reg = new Regex(@"[^a-zA-Z\d\s:]+");
var result = reg.Replace("I have * , as well as $$$" , " ");



This might as well be a string replace. Do you care about casing?
output = Regex.Replace(text, @"\b" + w + @"\b", " ");


Author

Commented:
@Shaun - is your first solution looking for all non-alphanumeric characters?

For the second case, case doesn't matter, but I want an exact word match.
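For an exact word match where case does not matter, one option is to keep the \b anchors and pass RegexOptions.IgnoreCase. The sketch below assumes that is the intent; the names WordRemover and RemoveWords are hypothetical. Regex.Escape is added as a precaution in case a word ever contains a regex metacharacter.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordRemover
{
    // Hypothetical helper: removes whole-word occurrences of each word,
    // ignoring case. "redder" is left alone because \b requires a word
    // boundary on both sides of "red".
    public static string RemoveWords(string text, IEnumerable<string> words)
    {
        foreach (var w in words)
        {
            var pattern = @"\b" + Regex.Escape(w) + @"\b";
            text = Regex.Replace(text, pattern, "", RegexOptions.IgnoreCase);
        }
        return text;
    }

    public static void Main()
    {
        var result = RemoveWords("Red redder RED", new[] { "red" });
        Console.WriteLine("[" + result + "]");
    }
}
```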

Author

Commented:
Also, out of curiosity: with a regular expression, can we figure out whether a word or a phrase is within double quotes or not?

Say the user inputs "my name is mikha" vs.

My name is mikha.
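One possible way to check this is sketched below. It assumes plain ASCII double quotes (not typographic quotes) and no escaped quotes inside the phrase; the names QuoteCheck, IsWhollyQuoted and ExtractQuoted are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuoteCheck
{
    // True when the whole input (ignoring surrounding whitespace)
    // is a single double-quoted phrase.
    public static bool IsWhollyQuoted(string input)
    {
        return Regex.IsMatch(input, @"^\s*""[^""]*""\s*$");
    }

    // Returns the text of every double-quoted phrase found in the input.
    public static List<string> ExtractQuoted(string input)
    {
        return Regex.Matches(input, @"""([^""]*)""")
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(IsWhollyQuoted("\"my name is mikha\""));
        Console.WriteLine(IsWhollyQuoted("My name is mikha ."));
        Console.WriteLine(string.Join(", ", ExtractQuoted("say \"hello\" and \"bye\"")));
    }
}
```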
Yes, Shaun's first solution matches every character that is not a letter, digit, whitespace, or colon, so it is close to "all non-alphanumeric characters".

I had a thought on how to make your code more efficient for the words. I don't know .NET well, so double-check the line that builds the pattern.
var words = new string[] { "apple", "cat", "red" };
var text = "I have a red apple and a small cat";
// verbatim strings (@"...") keep \b as a word boundary rather than a backspace
var rx = @"\b(" + string.Join("|", words) + @")\b";  // should end up with \b(apple|cat|red)\b
var output = Regex.Replace(text, rx, " ");



It should be significantly more efficient as the list of words gets longer (unless you have so many words it blows a buffer).

Author

Commented:
@wilcoxon - thanks again. I have only 15-20 reserved words and a few special characters that I am looking for.

So I think even a for loop is good enough.

The buffer problem you mention, I'm guessing, comes from how regex looks for a pattern: when using the | operator, it has to scan the input string multiple times.
I was more talking about blowing string length (or regex length) if there were a ton of words that you threw together as word1|word2|...|wordX.

With 15-20, the single regex should be more efficient but I would be surprised if it was wall-clock measurable compared to the loop (depends on .net internals though).  I'd probably test each against a large sample of text (if there is such a thing - it's unclear where text is coming from exactly) and see if it makes a difference.  However, if you are short on time, the for loop should be fine.
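A rough way to run that comparison is sketched below, using Stopwatch on a repeated sample sentence. The class and method names are hypothetical, the sample text is synthetic, and a single timed run like this is only indicative, so no timings are claimed here.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

public static class ReplaceBenchmark
{
    // One Regex.Replace call per word, as in the original question.
    public static string WithLoop(string text, string[] words)
    {
        foreach (var w in words)
            text = Regex.Replace(text, @"\b" + Regex.Escape(w) + @"\b", " ");
        return text;
    }

    // A single pass using one alternation: \b(?:apple|cat|red)\b
    public static string WithSingleRegex(string text, string[] words)
    {
        var pattern = @"\b(?:" + string.Join("|", words.Select(Regex.Escape)) + @")\b";
        return Regex.Replace(text, pattern, " ");
    }

    public static void Main()
    {
        var words = new[] { "apple", "cat", "red" };
        // Synthetic sample: the sentence from the question, repeated.
        var text = string.Concat(
            Enumerable.Repeat("I have a red apple and a small cat. ", 10000));

        var sw = Stopwatch.StartNew();
        var viaLoop = WithLoop(text, words);
        sw.Stop();
        Console.WriteLine("loop:         " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        var viaSingle = WithSingleRegex(text, words);
        sw.Stop();
        Console.WriteLine("single regex: " + sw.ElapsedMilliseconds + " ms");

        // Both approaches should produce the same text for this word list.
        Console.WriteLine("same output:  " + (viaLoop == viaSingle));
    }
}
```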

Author

Commented:
Thanks again.

I tested this as you suggested, joining the words with the | operator like this:
 word1|word2 ...

With about 20 words, it works fine.

The only thing I'm concerned about is that the text I'm parsing is user input. Right now there is no limit on the number of characters they can type in. Would it make sense to put a limit on it, say 500 characters or so?
If it's user input, it should be fine.  I would not expect there to be issues until text is at least a couple gigabytes in size.

If it is user input, just make sure it is sanitized if it is ever used as anything other than a string (e.g., used as a regular expression, used to update a database, etc.).
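For the regular expression case, a minimal sketch of that sanitizing step is to pass the user's text through Regex.Escape before building a pattern from it. The 500-character cap and the one-second match timeout below are illustrative assumptions, as are the names SafeReplace, MaxLength and RemoveUserTerm.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SafeReplace
{
    // Hypothetical limit, taken from the figure floated in the question.
    public const int MaxLength = 500;

    // Removes a user-supplied term from the text, treating the term as
    // literal characters rather than as a regular expression.
    public static string RemoveUserTerm(string text, string userTerm)
    {
        if (text.Length > MaxLength)
            throw new ArgumentException("Input exceeds " + MaxLength + " characters.");

        // Regex.Escape neutralizes metacharacters such as $ ( ) . *
        var pattern = Regex.Escape(userTerm);

        // The timeout guards against pathological patterns or inputs.
        return Regex.Replace(text, pattern, "", RegexOptions.None,
                             TimeSpan.FromSeconds(1));
    }

    public static void Main()
    {
        Console.WriteLine("[" + RemoveUserTerm("cost is $5 (approx)", "$5 (approx)") + "]");
        // Without escaping, "a.b" would also match "axb".
        Console.WriteLine("[" + RemoveUserTerm("a.b axb", "a.b") + "]");
    }
}
```

For database updates the equivalent precaution is parameterized queries rather than string concatenation.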

Author

Commented:
Thank you both for your insights

Author

Commented:
@wilcoxon - thanks again.
You're welcome.
