
How to make C# Regex work with "new line" characters in textfile?

Hello,

NOTE: I am a C# programmer - if you give a solution with UNIX or PERL formatting I won't know what to do with it. Please observe the forum I am asking this question in (C#). Thanks!

I am writing a parsing module using regexes and ran across a snag: when my input contains newline characters, my regex stops working. I think the best explanation is with code:



        [Test]
        public void ApplyStartAndStopAnchors()
        {
            string urlContent = "abcdefghijklmnopqrstuvwxyz";
            string regexString = "mno";
            string regexStartString = "def";
            string regexStopString = "stu";

            RegexGenerator generator =
                new RegexGenerator(new List<string>(), regexString, regexStartString, regexStopString, urlContent);

            generator.ApplyStartAndStopAnchors();
            Assert.AreEqual("ghijklmnopqr", generator.UrlContent); // PASSES
        }

And this example wouldn't be complete without generator.ApplyStartAndStopAnchors() (this being where I need your help)

        public void ApplyStartAndStopAnchors()
        {
            string subRegex = @"(?<=" + _regexStartString + ")" + "(.+)" + "(?=" + _regexStopString + ")";

            Regex r = new Regex(subRegex);

            Match m = r.Match(UrlContent);

            if (m.Success)
                UrlContent = m.Groups[1].Value;
        }


Now here's the catch, if I use multiline input:

Input text file: (MultiLineRegexTest.txt)
<start>
abcdefghij
klmnop
qrstuvw
xyz
<stop>

Which is just the English alphabet spread across 4 lines.

Now, when I read this input text file it contains new line characters. Here is a test showing the result:

        [Test]
        public void ApplyStartAndStopAnchorsOnTestFileWithMultipleLines()
        {
            FileStream fs = FileSupport.OpenFile(@"../../../Testfiles/MultiLineRegexTest.txt");
            string content = FileSupport.GetTextFileAsString(fs);

            string regexString = "mno";
            string regexStartString = "def";
            string regexStopString = "stu";

            RegexGenerator generator =
                new RegexGenerator(new List<string>(), regexString, regexStartString, regexStopString, content);

            generator.ApplyStartAndStopAnchors();
            Assert.AreEqual("ghijklmnopqr", generator.UrlContent); // FAILS
        }

And the output of the failing Assert.AreEqual statement is:

<start>
RegexGeneratorTestFixture.ApplyStartAndStopAnchorsOnTestFileWithMultipleLines : Failed
NUnit.Framework.AssertionException:
String lengths differ.  Expected length=12, but was length=32.
Strings differ at index 0.

expected:<"ghijklmnopqr">
 but was:<"abcdefghij\r\nklmnop\r\nqrstuvw\r\nxyz">
-----------^
<stop>



Notice \r\n in the "but was" output? Those are the new line characters. They make it so my Regex in ApplyStartAndStopAnchors() doesn't work anymore.


Does anyone know how to fix this?

Much thanks,
sapbucket
 
b0lsc0tt (IT Manager) commented:
sapbucket,

I am not a C# expert but I believe you just need to add RegexOptions.Singleline when you use the regex object.  In the code above I believe you just want it at ...

            Match m = r.Match(UrlContent, RegexOptions.Singleline);

Let me know if you have any questions or need more information.

b0lsc0tt
 
b0lsc0tt (IT Manager) commented:
sapbucket,

Actually, the better place for the change may be ...

            Regex r = new Regex(subRegex, RegexOptions.Singleline);

Let me know if you have a question.

b0lsc0tt
 
b0lsc0tt (IT Manager) commented:
sapbucket,

By the way, to make up for my uncertainty about how to apply it: the issue is basically the dot character.  Unless you specify the option as I suggest above, the dot does not match newlines.  It will match most other characters, but not a newline without the "switch."  Another fix would be to change the expression so it doesn't use the dot, but that may not work with your data.  This limitation is the reason I use the dot character with caution; it should not be the first thing you turn to when you want to match "any" character. :)
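
The dot behavior described here can be checked in isolation. The following is a minimal sketch, not the poster's code: the class name DotNewlineDemo and the method Extract are hypothetical, and the pattern hard-codes the "def" and "stu" anchors from the question. It uses the static Regex.Match(input, pattern, options) overload and returns an empty string when nothing matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // Same lookbehind / capture / lookahead shape as ApplyStartAndStopAnchors(),
    // with the start and stop anchors from the question written in directly.
    private const string Pattern = "(?<=def)(.+)(?=stu)";

    // Returns the text captured between the anchors, or "" when there is no match.
    public static string Extract(string input, RegexOptions options)
    {
        Match m = Regex.Match(input, Pattern, options);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // The alphabet split over four lines with Windows line endings.
        string input = "abcdefghij\r\nklmnop\r\nqrstuvw\r\nxyz";

        // Default options: '.' does not match '\n', so the match cannot cross lines.
        Console.WriteLine(Extract(input, RegexOptions.None).Length);

        // Singleline: '.' matches '\n' too, so the anchors may sit on different lines.
        Console.WriteLine(Extract(input, RegexOptions.Singleline).Length);
    }
}
```

Note that with RegexOptions.Singleline the captured text still contains the line breaks themselves; the option changes what the dot matches, not the input.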

Let me know if you have a question.

b0lsc0tt
 
sapbucket (Author) commented:
Thanks for quick reply!

I changed my code to this:

        public void ApplyStartAndStopAnchors()
        {
            string subRegex = @"(?<=" + _regexStartString + ")" + "(.+)" + "(?=" + _regexStopString + ")";

            Regex r = new Regex(subRegex, RegexOptions.Singleline);

            Match m = r.Match(UrlContent);

            if (m.Success)
                UrlContent = m.Groups[1].Value;
        }


And after rebuilding and running unit tests I see the following still fails:

NUnit.Framework.AssertionException:
String lengths differ.  Expected length=12, but was length=16.
Strings differ at index 4.

expected:<"ghijklmnopqr">
 but was:<"ghij\r\nklmnop\r\nqr">
---------------^



Which is the same problem...

It seems like RegexOptions.Singleline should have done the trick, but it didn't.
 
sapbucket (Author) commented:
Yes the '.' character is probably a bad choice. I'm not sure of a better option.

I am parsing HTML. There are many characters and I'm not sure how to write a character class that includes all of them, hence I use the '.' operator.

My primary goal is to grab a subsection of HTML. All I know about the subsection is the start and stop strings that define it. What the subsection contains I cannot be certain of, other than that it is a string of characters from the ANSI character set (which is what HTML is composed of).


What about writing code to strip \r\n from the input string? (I guess that's what I thought RegexOptions.Singleline was supposed to do.)
 
b0lsc0tt (IT Manager) commented:
Let me look over this more but try using [\s\S] instead of the dot character.  It also has the benefit of matching newlines.

Let me know how it works.

bol
 
b0lsc0tt (IT Manager) commented:
The singleline option won't strip the newline characters.  If that is what you want then you should just use C# script to do it.  It should be easier than doing it with an expression but I don't know C# so can't tell you the exact script.
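
For reference, stripping the line breaks in plain C# could look like the sketch below. It is only an illustration: the class name LineBreakStripper and the method StripLineBreaks are hypothetical, and it assumes that simply deleting every carriage return and line feed is acceptable for the data. With real HTML that may glue together words that were separated only by a line break.

```csharp
using System;

public static class LineBreakStripper
{
    // Removes every carriage return and line feed; all other characters are kept.
    public static string StripLineBreaks(string text)
    {
        return text.Replace("\r", "").Replace("\n", "");
    }

    public static void Main()
    {
        // The capture reported by the failing test, line breaks included.
        string captured = "ghij\r\nklmnop\r\nqr";

        Console.WriteLine(StripLineBreaks(captured));
    }
}
```

The call could be applied either to the file content before matching or to the captured group afterwards, depending on whether the line breaks matter elsewhere.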

I am actually a little confused by the error you mentioned in your second to the last comment.  What you said about the string length difference along with the expected and result strings was a little confusing.  It may just be my unfamiliarity with C#.

Let me know how this and my last comment helped.

bol
 
sapbucket (Author) commented:
bol,

>> I am actually a little confused by the error you mentioned in your second to the last comment
It's actually output from NUnit, a unit testing framework that I use to build high quality code. The intent of showing the NUnit output was to show the "\r\n" characters in the string I want to apply a Regex to. For example, when I view the HTML source that I get input from (with my human eyes), I definitely do not see \r\n anywhere in the HTML (they are non-visible characters). However, when I import the HTML using a FileStream, all of a sudden newline characters appear in the string (these are "\r\n" characters). This was unexpected, to say the least, and I wasn't certain how to proceed.

Switching topics - I found that this regex worked awesome:

// The @ prefix is needed so \s and \S compile; the parentheses keep group 1 for m.Groups[1].
string subRegex = @"(?<=" + _regexStartString + ")" + @"([\s\S]+)" + "(?=" + _regexStopString + ")";

and satisfies my question. I will award the points soon since it was your recommendation to use [\s\S]. This subtle replacement sped up the regex considerably.
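
As a self-contained illustration of the [\s\S] approach, here is a minimal sketch under a few assumptions: the class name AnyCharClassDemo and the method ExtractBetween are hypothetical, the start and stop arguments are treated as regex fragments exactly as in the original method (Regex.Escape could be applied first if they are meant as literal text), and an empty string is returned when nothing matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnyCharClassDemo
{
    // Captures everything between the start and stop fragments, line breaks included.
    public static string ExtractBetween(string input, string start, string stop)
    {
        // @"..." is a verbatim literal, so \s and \S reach the regex engine unchanged.
        // [\s\S] means "whitespace or non-whitespace", i.e. any character at all,
        // which is why no RegexOptions are required here.
        string pattern = "(?<=" + start + ")" + @"([\s\S]+)" + "(?=" + stop + ")";

        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string input = "abcdefghij\r\nklmnop\r\nqrstuvw\r\nxyz";

        // Prints the captured text, which still contains the original line breaks.
        Console.WriteLine(ExtractBetween(input, "def", "stu"));
    }
}
```

The quantifier is greedy, so if the stop fragment occurs more than once the capture runs up to the last occurrence.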

Maybe you have a further improvement? Hehe
 
b0lsc0tt (IT Manager) commented:
Thanks for the response.  That did clarify it some.  I knew those characters are not usually visible; I use a hex editor when I have to see them.

I am glad that last suggestion did it and especially to hear it sped it up.  I really try to avoid the dot character as much as possible.  I don't have any other suggestions right now.  Maybe you will be able to provide another interesting and fun question to get the "brain juices" flowing again. :)

bol
 
sapbucket (Author) commented:
Thanks bol! Worked great.
 
b0lsc0tt (IT Manager) commented:
You're welcome!  Thanks for the points and grade.

bol