Solved

Regex Match Sentences?

Posted on 2004-09-16
Last Modified: 2010-08-05
  I'm attempting to parse a text file into an SQL table using Regex.Match. I have successfully matched just the port, but I've never matched a complete sentence before. Each sentence is preceded by a whitespace character...

Q. How can I Regex Match just the individual sentences?  


2 Death
20 Senna Spy FTP server
21 Back Construction, Blade Runner, Cattivik FTP Server, CC Invader, Dark FTP, Doly Trojan, Fore, Invisible FTP, Juggernaut 42, Larva, MotIv FTP, Net Administrator
22 Shaft
23 Fire HacKer, Tiny Telnet Server - TTS, Truva Atl
25 Ajan, Antigen, Barok, Email Password Sender - EPS, EPS II, Gip, Gris, Happy99, Hpteam mail, Hybris, I love you, Kuang2, Magic Horse, MBT (Mail Bombing Trojan)
31 Agent 31, Hackers Paradise, Masters Paradise
41 Deep Throat, Foreplay
48 DRAT
50 DRAT
58 DMSetup
59 DMSetup
79 CDK, Firehotcker
Question by:kvnsdr
12 Comments
 
LVL 4

Expert Comment

by:vigrid
What language do you intend to use the regex in?
 
LVL 1

Author Comment

by:kvnsdr
I program in C# .NET... I'm looking for a regular expression, not code.
 
LVL 4

Expert Comment

by:vigrid
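Code snippet in C#:

// Requires: using System; using System.IO; using System.Text.RegularExpressions;
Regex rx = new Regex("(?<port>[0-9]+) (?<sentence>.+)");
string file = File.ReadAllText("Input.txt");   // reads and closes the file in one call
// '.' does not match a newline, so each input line produces its own match.
foreach (Match m in rx.Matches(file))
      // Trim() drops the trailing '\r' that '.+' captures when the file uses Windows line endings.
      Console.WriteLine("Port: {0}\tSentence: {1}", m.Groups["port"].Value, m.Groups["sentence"].Value.Trim());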

Regular expression: "(?<port>[0-9]+) (?<sentence>.+)".

Comments:

The (?<name>pattern) syntax creates a named capturing group. Here there are two groups, "port" and "sentence", and you read each one through the Groups property of the Match object, for example m.Groups["sentence"].Value.

HTH
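
To make the difference between the whole match and a single group concrete, here is a minimal sketch. The class name and the in-memory sample text are assumptions for illustration (the lines are copied from the question); a real run would read the file instead.

```csharp
using System;
using System.Text.RegularExpressions;

class NamedGroupsDemo
{
    static void Main()
    {
        // Sample lines copied from the question (assumed input, not the real file).
        string text = "2 Death\n20 Senna Spy FTP server\n22 Shaft";

        Regex rx = new Regex("(?<port>[0-9]+) (?<sentence>.+)");

        foreach (Match m in rx.Matches(text))
        {
            // m.Value is the entire matched line; each group holds only its own part.
            string port = m.Groups["port"].Value;
            string sentence = m.Groups["sentence"].Value;
            Console.WriteLine("whole=[{0}] port=[{1}] sentence=[{2}]", m.Value, port, sentence);
        }
    }
}
```

Reading m.Value gives the port and the sentence together, while m.Groups["sentence"].Value gives the sentence alone.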
 
LVL 1

Author Comment

by:kvnsdr
The Regex for 'sentence' returns the port and the sentence. I need to exclude the port.

Regex:  (?<sentence>.+)
Return: 25 AntiGen, Email Password Attacks

Need:   AntiGen, Email Password Attacks
 
LVL 4

Expert Comment

by:vigrid
Yes, you're right that it returns the whole line, but that is because you're only using half of the pattern :). Keep the whole regular expression and read just the "sentence" group from each match. The first group ("port") matches the digits, the literal space after it matches the separator, and the second group matches everything that is left until the end of the line. If you delete the first group, only the second group remains, so it matches the whole line of text, port included. The dot character stands for "any character" (except a newline) in regex.

Regex rx = new Regex("(?<port>[0-9]+) (?<sentence>.+)");
string file = File.ReadAllText("Input.txt");
foreach (Match m in rx.Matches(file))
     // AddToCollection is a placeholder for your own code; .Value passes the captured text as a string.
     AddToCollection(m.Groups["sentence"].Value);

Now does it make any more sense?

HTH :)
 
LVL 4

Accepted Solution

by:
vigrid earned 125 total points
Oh, you can use the regex:

Regex rx = new Regex("(?<port>[0-9]+)\\s+(?<sentence>.*)");

Two things are changed here. The space character is replaced with "\s+", which tells regex to match one or more whitespace characters. And the dot is followed by an asterisk rather than a plus. The plus sign stands for "1 or more occurrences of the expression to the left", and the asterisk stands for "0 or more occurrences of the expression to the left".

Please note the double backslash. In a regular C# string literal the backslash starts an escape sequence, so "\\" is needed to put a single literal backslash into the pattern. A verbatim string, @"(?<port>[0-9]+)\s+(?<sentence>.*)", avoids the doubling.
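
As a rough illustration of the points above, the sketch below compares the escaped and verbatim spellings of the pattern and runs it on two made-up lines: one separated by a tab, and one with a port but no description. The class name and the sample lines are assumptions, not data from the question.

```csharp
using System;
using System.Text.RegularExpressions;

class EscapingDemo
{
    static void Main()
    {
        // The same pattern written as a regular literal and as a verbatim literal.
        string escaped = "(?<port>[0-9]+)\\s+(?<sentence>.*)";
        string verbatim = @"(?<port>[0-9]+)\s+(?<sentence>.*)";
        Console.WriteLine(escaped == verbatim);

        // Hypothetical lines: a tab separator, and a port followed by nothing but a space.
        foreach (string line in new[] { "25\tAjan, Antigen", "48 " })
        {
            Match m = Regex.Match(line, verbatim);
            Console.WriteLine("matched={0} port=[{1}] sentence=[{2}]",
                m.Success, m.Groups["port"].Value, m.Groups["sentence"].Value);
        }
    }
}
```

With ".+" in place of ".*", the second line would not match at all, because the sentence group would need at least one character.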

 
LVL 1

Author Comment

by:kvnsdr
My Regex manual briefly references a "Positive Lookahead Assertion". As I read it, the pattern preceding the parentheses is searched for, and if the pattern within the parentheses is found, it is not part of the returned result. That's what I'm attempting to do.

"Positive Lookahead-Assertion" example:
...(?=...)

Current Regex that I think should work:
(?<sentence>.+)(?=\d{1,6})

Still something is wrong with this Regex
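
One likely reason the lookahead does not help: a lookahead checks the text that comes after the match, while the port sits before the sentence. A lookbehind, (?<=...), checks the text before the match instead, and .NET accepts variable-length patterns inside it. The sketch below contrasts the two; the class name, the sample text, and the use of [ \t]+ as the separator are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class LookaroundDemo
{
    static void Main()
    {
        // Two sample lines taken from the question's data.
        string text = "25 Ajan, Antigen\n31 Agent 31, Hackers Paradise";

        // Lookahead: requires digits AFTER the match, so the leading port is not skipped.
        Regex lookahead = new Regex(@"(?<sentence>.+)(?=\d{1,6})");

        // Lookbehind: requires a port and blanks BEFORE the match, without including them.
        // RegexOptions.Multiline lets ^ match at the start of every line.
        Regex lookbehind = new Regex(@"(?<=^\d+[ \t]+)(?<sentence>.+)", RegexOptions.Multiline);

        foreach (Match m in lookahead.Matches(text))
            Console.WriteLine("lookahead:  [{0}]", m.Groups["sentence"].Value);

        foreach (Match m in lookbehind.Matches(text))
            Console.WriteLine("lookbehind: [{0}]", m.Groups["sentence"].Value);
    }
}
```

The two-group pattern from the earlier comments gives the same sentences without any lookaround, so the lookbehind is mainly useful when the whole match, rather than a group, has to be the sentence.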
 
LVL 7

Expert Comment

by:aib_42
This would be a Perl-style regex. Convert and escape it accordingly:

/^(\d{2})\s(.*)$/

Optionally, use
/^(\d{2})\s(.*)\s?$/
to get rid of extra whitespace at the end of "sentence".
 
LVL 7

Expert Comment

by:aib_42
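Sorry, replace the second regex with:
/^(\d{2})\s(.*?)\s*$/
The lazy (.*?) leaves any trailing whitespace for \s* to consume, so it stays out of the captured group.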
 
LVL 7

Expert Comment

by:aib_42
Also, (\d{2}) assumes exactly two digits. Use {min,}, {0,max}, or {min,max} if you have a minimum or maximum number of digits. For min=1 and max=infinity, use (\d+).
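
For reference, here is one way the Perl-style pattern might be carried over to C#, as a sketch under stated assumptions: the class name and sample text are made up, \d+ replaces \d{2} so that one-digit ports such as "2" also match, and RegexOptions.Multiline is used so that ^ and $ apply to each line of a multi-line string.

```csharp
using System;
using System.Text.RegularExpressions;

class PerlStyleDemo
{
    static void Main()
    {
        // Assumed sample with Windows line endings and trailing blanks on the first line.
        string text = "2 Death  \r\n20 Senna Spy FTP server\r\n";

        // A verbatim string keeps the backslashes as written in the Perl-style pattern.
        // (.*?) is lazy so that the trailing \s* can absorb blanks and the '\r'.
        Regex rx = new Regex(@"^(\d+)\s+(.*?)\s*$", RegexOptions.Multiline);

        foreach (Match m in rx.Matches(text))
        {
            // Unnamed groups are read by index: 1 is the port, 2 is the sentence.
            Console.WriteLine("port=[{0}] sentence=[{1}]", m.Groups[1].Value, m.Groups[2].Value);
        }
    }
}
```

The Perl delimiters (the leading and trailing slash) are not part of the pattern and are dropped in the C# string.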
 
LVL 1

Author Comment

by:kvnsdr
And the correct answer is:

//  (?<sentence>\D*)           = Sentences without numbers    -- Using * or + return same result --
//  (?<sentence>.+)            = Sentences with ANY characters

I will award the 125 points to vigrid because of a good partial answer leading to the correct answer.
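
A caveat worth checking against the data: several names in the list contain digits ("Agent 31", "Juggernaut 42", "Happy99"), and \D* stops at the first digit. The sketch below compares the two group bodies on one line from the question. Combining them with a leading port group is an assumption made here for illustration, not necessarily how the patterns were used above.

```csharp
using System;
using System.Text.RegularExpressions;

class DigitCaveatDemo
{
    static void Main()
    {
        // One line from the question whose description contains a number.
        string line = "31 Agent 31, Hackers Paradise, Masters Paradise";

        // \D* matches non-digit characters only, so the capture ends before the second "31".
        Match noDigits = Regex.Match(line, @"^(?<port>\d+)\s+(?<sentence>\D*)");

        // .+ matches any characters up to the end of the line.
        Match anyChars = Regex.Match(line, @"^(?<port>\d+)\s+(?<sentence>.+)");

        Console.WriteLine("\\D*: [{0}]", noDigits.Groups["sentence"].Value);
        Console.WriteLine(".+ : [{0}]", anyChars.Groups["sentence"].Value);
    }
}
```

If the descriptions must be kept whole, the ".+" form is the safer of the two for this data.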
 
LVL 4

Expert Comment

by:vigrid
Thank you! :)