How to make C# Regex work with "new line" characters in textfile?

Hello,

NOTE: I am a C# programmer - if you give a solution with UNIX or PERL formatting I won't know what to do with it. Please observe the forum I am asking this question in (C#). Thanks!

 I am writing a parsing module using regexes and hit a snag: when my input contains newline characters, my regex stops working. I think the best explanation is with code:



        [Test]
        public void ApplyStartAndStopAnchors()
        {
            string urlContent = "abcdefghijklmnopqrstuvwxyz";
            string regexString = "mno";
            string regexStartString = "def";
            string regexStopString = "stu";

            RegexGenerator generator =
                new RegexGenerator(new List<string>(), regexString, regexStartString, regexStopString, urlContent);

            generator.ApplyStartAndStopAnchors();
            Assert.AreEqual("ghijklmnopqr", generator.UrlContent); // PASSES
        }

And this example wouldn't be complete without generator.ApplyStartAndStopAnchors() (this being where I need your help)

        public void ApplyStartAndStopAnchors()
        {
            string subRegex = @"(?<=" + _regexStartString + ")" + "(.+)" + "(?=" + _regexStopString + ")";

            Regex r = new Regex(subRegex);

            Match m = r.Match(UrlContent);

            if (m.Success)
                UrlContent = m.Groups[1].Value;
        }


Now here's the catch, if I use multiline input:

Input text file: (MultiLineRegexTest.txt)
<start>
abcdefghij
klmnop
qrstuvw
xyz
<stop>

Which is just the English alphabet spread across 4 lines.

Now, when I read this input text file it contains new line characters. Here is a test showing the result:

        [Test]
        public void ApplyStartAndStopAnchorsOnTestFileWithMultipleLines()
        {
            FileStream fs = FileSupport.OpenFile(@"../../../Testfiles/MultiLineRegexTest.txt");
            string content = FileSupport.GetTextFileAsString(fs);

            string regexString = "mno";
            string regexStartString = "def";
            string regexStopString = "stu";

            RegexGenerator generator =
                new RegexGenerator(new List<string>(), regexString, regexStartString, regexStopString, content);

            generator.ApplyStartAndStopAnchors();
            Assert.AreEqual("ghijklmnopqr", generator.UrlContent); // FAILS
        }

And the output of the failing Assert.AreEqual statement is:

<start>
RegexGeneratorTestFixture.ApplyStartAndStopAnchorsOnTestFileWithMultipleLines : Failed
NUnit.Framework.AssertionException:
String lengths differ.  Expected length=12, but was length=32.
Strings differ at index 0.

expected:<"ghijklmnopqr">
 but was:<"abcdefghij\r\nklmnop\r\nqrstuvw\r\nxyz">
-----------^
<stop>



Notice \r\n in the "but was" output? Those are the new line characters. They make it so my Regex in ApplyStartAndStopAnchors() doesn't work anymore.


Does anyone know how to fix this?

Much thanks,
sapbucket

b0lsc0tt (IT Manager) commented:
sapbucket,

I am not a C# expert but I believe you just need to add RegexOptions.Singleline when you use the regex object.  In the code above I believe you just want it at ...

            Match m = r.Match(UrlContent, RegexOptions.Singleline);

Let me know if you have any questions or need more information.

b0lsc0tt
b0lsc0tt (IT Manager) commented:
sapbucket,

Actually, the right place for the change is the Regex constructor, since the instance Match method does not take a RegexOptions argument:

            Regex r = new Regex(subRegex, RegexOptions.Singleline);

Let me know if you have a question.

b0lsc0tt
b0lsc0tt (IT Manager) commented:
sapbucket,

By the way, to make up for my uncertainty about how to apply it: the issue is basically the dot character.  Unless you specify the option as I suggest above, the dot does not match newlines.  It will match any other character, but not a newline without the "switch."  Another fix would be to change the expression so it doesn't use the dot, but that may not work with your data.  This limitation is the reason I use the dot character with caution; it should not be the first thing you turn to when you want to match "any" character. :)
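
The behaviour described above can be checked with a small sketch. The class and method names (DotNewlineDemo, Capture) are made up for illustration and are not part of the RegexGenerator class from the question; the input string is assumed to be the test file content with Windows line endings.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // Assumed content of MultiLineRegexTest.txt, with "\r\n" line endings.
    public const string Content = "abcdefghij\r\nklmnop\r\nqrstuvw\r\nxyz";

    // Hypothetical helper: returns group 1, or null when the pattern does not match.
    public static string Capture(string pattern, RegexOptions options)
    {
        Match m = Regex.Match(Content, pattern, options);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string dotPattern = @"(?<=def)(.+)(?=stu)";
        string classPattern = @"(?<=def)([\s\S]+)(?=stu)";

        // Default options: '.' stops at '\n', so "stu" is never reached.
        Console.WriteLine(Capture(dotPattern, RegexOptions.None) == null);     // True

        // Singleline: '.' also matches '\n', so the match spans lines.
        string single = Capture(dotPattern, RegexOptions.Singleline);
        Console.WriteLine(single == "ghij\r\nklmnop\r\nqr");                    // True

        // [\s\S] matches any character regardless of the options used.
        Console.WriteLine(Capture(classPattern, RegexOptions.None) == single); // True
    }
}
```

Note that in both matching cases the captured text still contains the "\r\n" pairs: the option changes what the dot matches, not the input itself.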

Let me know if you have a question.

b0lsc0tt
sapbucket (Author) commented:
Thanks for the quick reply!

I changed my code to this:

        public void ApplyStartAndStopAnchors()
        {
            string subRegex = @"(?<=" + _regexStartString + ")" + "(.+)" + "(?=" + _regexStopString + ")";

            Regex r = new Regex(subRegex, RegexOptions.Singleline);

            Match m = r.Match(UrlContent);

            if (m.Success)
                UrlContent = m.Groups[1].Value;
        }


And after rebuilding and running unit tests I see the following still fails:

NUnit.Framework.AssertionException:
String lengths differ.  Expected length=12, but was length=16.
Strings differ at index 4.

expected:<"ghijklmnopqr">
 but was:<"ghij\r\nklmnop\r\nqr">
---------------^



Which is the same problem...

It seems like RegexOptions.Singleline should have done the trick, but it didn't.
sapbucket (Author) commented:
Yes the '.' character is probably a bad choice. I'm not sure of a better option.

I am parsing HTML. There are many possible characters and I'm not sure how to write a character class that includes all of them, hence I use the '.' operator.

My primary goal is to grab a subsection of HTML. All I know about the subsection is the start and stop strings that define it. What the subsection contains I cannot be certain of, other than that it is a string of characters from the ANSI character set (which is what HTML is composed of).


What about writing code to strip \r\n from the input string? (I guess that's what I thought RegexOptions.Singleline was supposed to do.)
b0lsc0tt (IT Manager) commented:
Let me look over this more but try using [\s\S] instead of the dot character.  It also has the benefit of matching newlines.

Let me know how it works.

bol
b0lsc0tt (IT Manager) commented:
The Singleline option won't strip the newline characters.  If that is what you want, then you should just use C# code to do it.  It should be easier than doing it with an expression, but I don't know C# so I can't tell you the exact code.
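
One possible way to do the stripping in plain C# is sketched below. StripNewlines and Extract are hypothetical helper names, not part of the original code, and the sketch assumes the start and stop strings ("def" and "stu") are never themselves split across a line break in a way that matters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineStripDemo
{
    // Hypothetical helper: removes every CR and LF character from the text.
    public static string StripNewlines(string text)
    {
        return text.Replace("\r", "").Replace("\n", "");
    }

    // Flattens the input first, then applies the original dot-based pattern.
    // Returns null when the pattern does not match.
    public static string Extract(string content)
    {
        string flattened = StripNewlines(content);
        Match m = Regex.Match(flattened, @"(?<=def)(.+)(?=stu)");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string content = "abcdefghij\r\nklmnop\r\nqrstuvw\r\nxyz";
        Console.WriteLine(Extract(content)); // ghijklmnopqr
    }
}
```

For real HTML this is a trade-off: removing line breaks can join words that were separated only by a newline, so stripping the captured text rather than the whole input may be the safer choice.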

I am actually a little confused by the error you mentioned in your second to the last comment.  What you said about the string length difference along with the expected and result strings was a little confusing.  It may just be my unfamiliarity with C#.

Let me know how this and my last comment helped.

bol
sapbucket (Author) commented:
bol,

>> I am actually a little confused by the error you mentioned in your second to the last comment
It's actually output from NUnit, a unit testing framework that I use to build high quality code. The intent of showing the NUnit output was to show the "\r\n" characters in the string I want to apply a Regex to. For example, when I view the HTML source that I get input from (with my human eyes) I definitely do not see \r\n anywhere in the HTML (they are non-visible characters). However, when I import the HTML using a FileStream, newline characters ("\r\n") show up in the string. This was unexpected, to say the least, and I wasn't certain how to proceed.

Switching topics - I found that this regex worked awesome:

string subRegex = @"(?<=" + _regexStartString + ")" + @"([\s\S]+)" + "(?=" + _regexStopString + ")";

and satisfies my question. I will award the points soon since it was your recommendation to use [\s\S]. This subtle replacement sped up the regex considerably.

Maybe you have a further improvement? Hehe
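
One possible refinement, offered only as a sketch: if the start and stop strings are meant as literal text rather than regex patterns (an assumption, since the original fields are named as regex strings), Regex.Escape keeps characters such as '.' or '(' from being interpreted as regex syntax. The lazy quantifier +? is also a change in behaviour: it stops at the first stop string instead of the last. AnchorExtractor and ExtractBetween are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorExtractor
{
    // Hypothetical helper, not part of the original RegexGenerator class.
    // Treats start and stop as literal text and returns what lies between them.
    // Returns the content unchanged when there is no match, mirroring the
    // original method, which leaves UrlContent untouched on failure.
    public static string ExtractBetween(string content, string start, string stop)
    {
        string pattern = "(?<=" + Regex.Escape(start) + ")"
                       + @"([\s\S]+?)"   // lazy: stop at the first occurrence of stop
                       + "(?=" + Regex.Escape(stop) + ")";

        Match m = Regex.Match(content, pattern);
        return m.Success ? m.Groups[1].Value : content;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractBetween("<b>x</b><b>y</b>", "<b>", "</b>")); // x
    }
}
```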
b0lsc0tt (IT Manager) commented:
Thanks for the response.  That did clarify it some.  I knew those characters are not usually visible; I use a hex editor when I have to see them.

I am glad that last suggestion did it and especially to hear it sped it up.  I really try to avoid the dot character as much as possible.  I don't have any other suggestions right now.  Maybe you will be able to provide another interesting and fun question to get the "brain juices" flowing again. :)

bol
sapbucket (Author) commented:
Thanks bol! Worked great.
b0lsc0tt (IT Manager) commented:
You're welcome!  Thanks for the points and grade.

bol