Regex Match Sentences?

I'm attempting to parse a text file into an SQL table using Regex.Match. I have successfully matched just the port, but I've never matched a complete sentence before. Each sentence follows the port number and is preceded by a whitespace character.

Q. How can I write a regex that matches just the individual sentences?


2 Death
20 Senna Spy FTP server
21 Back Construction, Blade Runner, Cattivik FTP Server, CC Invader, Dark FTP, Doly Trojan, Fore, Invisible FTP, Juggernaut 42, Larva, MotIv FTP, Net Administrator
22 Shaft
23 Fire HacKer, Tiny Telnet Server - TTS, Truva Atl
25 Ajan, Antigen, Barok, Email Password Sender - EPS, EPS II, Gip, Gris, Happy99, Hpteam mail, Hybris, I love you, Kuang2, Magic Horse, MBT (Mail Bombing Trojan)
31 Agent 31, Hackers Paradise, Masters Paradise
41 Deep Throat, Foreplay
48 DRAT
50 DRAT
58 DMSetup
59 DMSetup
79 CDK, Firehotcker
kvnsdr Asked:
 
vigrid Commented:
Oh, you can use the regex:

Regex rx = new Regex("(?<port>[0-9]+)\\s+(?<sentence>.*)");

Two things are changed here. The space character is replaced with "\s+", which tells regex to match one or more whitespace characters. And the dot is followed by an asterisk rather than a plus. A plus sign stands for "1 or more occurrences of the expression to the left", and an asterisk stands for "0 or more occurrences of the expression to the left".

Please note the double backslash. In a regular C# string literal a backslash starts an escape sequence, so "\\" is needed to put a single literal backslash into the pattern. A verbatim string literal (prefixed with @) lets you write the pattern with single backslashes instead.
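
To illustrate the escaping point, here is a minimal sketch comparing the two ways of writing the same pattern. The class name EscapingDemo and the sample line are only assumptions for illustration, not part of the original question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapingDemo
{
    // Regular string literal: every regex backslash has to be doubled.
    public const string Escaped = "(?<port>[0-9]+)\\s+(?<sentence>.*)";

    // Verbatim string literal: backslashes are taken as written.
    public const string Verbatim = @"(?<port>[0-9]+)\s+(?<sentence>.*)";

    public static void Main()
    {
        // Both literals produce the same pattern text.
        Console.WriteLine(Escaped == Verbatim);

        // A tab and two spaces separate the port from the sentence here;
        // \s+ consumes all of them.
        Match m = Regex.Match("21\t  Dark FTP, Doly Trojan", Verbatim);
        Console.WriteLine("[{0}] [{1}]", m.Groups["port"].Value, m.Groups["sentence"].Value);
    }
}
```

Because \s+ takes the whole run of whitespace, the "sentence" group starts at the first non-whitespace character after the port.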
 
vigrid Commented:
What language do you intend to use the regex in?
 
kvnsdr (Author) Commented:
I program in C# .NET... I'm looking for a regular expression, not code.
 
vigrid Commented:
Code snippet in C#:

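using System;
using System.IO;
using System.Text.RegularExpressions;

// One or more digits, a single space, then the rest of the line.
Regex rx = new Regex("(?<port>[0-9]+) (?<sentence>.+)");
// Read the whole input file into one string.
string file = File.ReadAllText("Input.txt");
// Each match corresponds to one "port sentence" line.
foreach (Match m in rx.Matches(file))
      Console.WriteLine("Port: {0}\tSentence: {1}", m.Groups["port"].Value, m.Groups["sentence"].Value);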

Regular expression: "(?<port>[0-9]+) (?<sentence>.+)".

Comments:

The (?<name>pattern) syntax creates a named capture group. You can name the groups and then access them by name. Here you create two groups, "port" and "sentence", and you access them via the Groups property of the Match object.
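
As a rough sketch of how the two named groups could be turned into rows ready for a table, something like the following might work. The PortParser class and its Parse method are hypothetical names, and the TrimEnd call is an assumption about input files with Windows line endings, where '.' would otherwise capture the trailing '\r'.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PortParser
{
    static readonly Regex Rx = new Regex("(?<port>[0-9]+) (?<sentence>.+)");

    // Returns one (port, sentence) pair per matching line of the input.
    public static List<KeyValuePair<int, string>> Parse(string text)
    {
        var rows = new List<KeyValuePair<int, string>>();
        foreach (Match m in Rx.Matches(text))
        {
            int port = int.Parse(m.Groups["port"].Value);
            // '.' matches '\r' but not '\n', so strip trailing whitespace.
            string sentence = m.Groups["sentence"].Value.TrimEnd();
            rows.Add(new KeyValuePair<int, string>(port, sentence));
        }
        return rows;
    }

    public static void Main()
    {
        foreach (var row in Parse("22 Shaft\r\n41 Deep Throat, Foreplay\r\n"))
            Console.WriteLine("{0} -> {1}", row.Key, row.Value);
    }
}
```

Each pair could then be handed to whatever code performs the SQL insert.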

HTH
 
kvnsdr (Author) Commented:
The regex for 'sentence' returns the port and the sentence. I need to exclude the port.

Regex:  (?<sentence>.+)
Return: 25 AntiGen, Email Password Attacks

Need:   AntiGen, Email Password Attacks
 
vigrid Commented:
Yes, you're absolutely right. You're just not using it right :). Match with the whole regular expression, but read only the "sentence" group from each match. The first group ("port") matches the digits, the literal space after it matches the separator, and the second group ("sentence") matches everything that is left until the end of the line. If you delete the first group, only the second group remains, so it matches the whole line of text. The dot character stands for "any character" (except a newline) in regex.

// Keep the full pattern so that the port is consumed by its own group.
Regex rx = new Regex("(?<port>[0-9]+) (?<sentence>.+)");
string file = File.ReadAllText("Input.txt");
foreach (Match m in rx.Matches(file))
     // Read only the "sentence" group; m.Value would be the whole line.
     AddToCollection(m.Groups["sentence"].Value);
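
The difference between the whole match and a single group can be seen in a small sketch like this one. The class and method names are made up for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupVersusMatch
{
    const string Pattern = "(?<port>[0-9]+) (?<sentence>.+)";

    // Text matched by the entire pattern, port included.
    public static string WholeMatch(string line)
    {
        return Regex.Match(line, Pattern).Value;
    }

    // Text captured by the "sentence" group only.
    public static string SentenceOnly(string line)
    {
        return Regex.Match(line, Pattern).Groups["sentence"].Value;
    }

    public static void Main()
    {
        string line = "25 Antigen, Email Password Sender";
        Console.WriteLine(WholeMatch(line));    // 25 Antigen, Email Password Sender
        Console.WriteLine(SentenceOnly(line));  // Antigen, Email Password Sender
    }
}
```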

Now does it make any more sense?

HTH :)
 
kvnsdr (Author) Commented:
My regex manual briefly mentions a "positive lookahead assertion" with the example below. As I understand it, the pattern before the parentheses is matched only when the pattern inside the parentheses follows it, and the part inside the parentheses is not included in the returned result. That's what I'm attempting to do.

"Positive Lookahead-Assertion" example:
...(?=...)

Current Regex that I think should work:
(?<sentence>.+)(?=\d{1,6})

Still something is wrong with this Regex
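
One possible reason the lookahead does not help: a lookahead tests what comes after the match, while the port comes before the sentence. A lookbehind, (?<=...), tests what precedes the match, and .NET accepts variable-length patterns inside it. The following is only a sketch under that reading; the class name, the [ \t]+ separator, and the sample lines are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // (?<=^[0-9]+[ \t]+) requires digits and blanks at the start of the line
    // before the match, without making them part of the match.
    // RegexOptions.Multiline lets ^ match at the start of every line.
    static readonly Regex Rx =
        new Regex(@"(?<=^[0-9]+[ \t]+)[^\r\n]+", RegexOptions.Multiline);

    public static List<string> Sentences(string text)
    {
        var result = new List<string>();
        foreach (Match m in Rx.Matches(text))
            result.Add(m.Value);
        return result;
    }

    public static void Main()
    {
        string text = "25 Antigen, Email Password Sender\n31 Agent 31, Hackers Paradise\n";
        foreach (string s in Sentences(text))
            Console.WriteLine(s);
    }
}
```

With this pattern the match itself is the sentence, so no group lookup is needed. Whether that is nicer than reading a named group is a matter of taste.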
 
aib_42 Commented:
This would be a Perl-style regex. Convert and escape it accordingly:

/^(\d{2})\s(.*)$/

Optionally, use
/^(\d{2})\s(.*)\s?$/
to get rid of extra whitespace at the end of "sentence".
 
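aib_42 Commented:
Sorry, replace the second regex with the one below. The (.*?) has to be lazy, otherwise the greedy (.*) swallows the trailing whitespace before \s* gets a chance to match it.
/^(\d{2})\s(.*?)\s*$/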
 
aib_42 Commented:
Also, (\d{2}) assumes exactly two digits. Use {min,} or {min,max} if you have a minimum or maximum number of digits ({0,max} if you only have a maximum). For min=1 and max=infinity, use (\d+).
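
A minimal C# sketch of a bounded digit count combined with trailing-whitespace trimming might look like this. The {1,5} bound assumes port numbers of at most five digits, and the TrimDemo class is a hypothetical name. The sentence group is lazy so that \s* can absorb the trailing whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrimDemo
{
    // \d{1,5}: between one and five digits (assumed upper bound for a port).
    // (.*?) is lazy, so the following \s* takes any whitespace before the line end.
    static readonly Regex Rx =
        new Regex(@"^(\d{1,5})\s(.*?)\s*$", RegexOptions.Multiline);

    // Returns the sentence part of a line, or null if the line does not match.
    public static string Sentence(string line)
    {
        Match m = Rx.Match(line);
        return m.Success ? m.Groups[2].Value : null;
    }

    public static void Main()
    {
        // The brackets make the trimmed trailing spaces visible.
        Console.WriteLine("[" + Sentence("79 CDK, Firehotcker   ") + "]");
    }
}
```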
 
kvnsdr (Author) Commented:
And the correct answer is:

//  (?<sentence>\D*)           = Sentences without numbers    -- Using * or + returns the same result --
//  (?<sentence>.+)            = Sentences with ANY characters
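
For anyone comparing the two variants, here is a small sketch of how they behave on a sentence that itself contains digits. The NonDigitDemo class and the two sample lines are assumptions for illustration. Note that \D also matches a newline, unlike the dot.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NonDigitDemo
{
    // Runs the port/sentence pattern with the given sub-pattern for the sentence.
    public static List<string> Sentences(string text, string sentencePattern)
    {
        var rx = new Regex("(?<port>[0-9]+) (?<sentence>" + sentencePattern + ")");
        var result = new List<string>();
        foreach (Match m in rx.Matches(text))
            result.Add(m.Groups["sentence"].Value);
        return result;
    }

    public static void Main()
    {
        string text = "22 Shaft\n31 Agent 31, Hackers Paradise\n";

        // \D* stops at the first digit and can run across a line break.
        foreach (string s in Sentences(text, @"\D*"))
            Console.WriteLine("[" + s.Replace("\n", "\\n") + "]");

        // .+ runs to the end of the line, digits included.
        foreach (string s in Sentences(text, ".+"))
            Console.WriteLine("[" + s + "]");
    }
}
```

On this input the \D* variant yields "Shaft" followed by a newline, then "Agent " cut off before the 31, while the .+ variant yields both complete sentences.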

I will award the 125 points to vigrid because of a good partial answer leading to the correct answer.
 
vigrid Commented:
Thank you! :)
Question has a verified solution.