Solved

Regex Match Sentences?

Posted on 2004-09-16
Last Modified: 2010-08-05
I'm attempting to parse a text file into an SQL table using Regex Match. I have successfully matched just the port. However, I've never matched a complete sentence before. Each sentence follows the port number and is preceded by a single whitespace character.

Q. How can I Regex Match just the individual sentences?  


2 Death
20 Senna Spy FTP server
21 Back Construction, Blade Runner, Cattivik FTP Server, CC Invader, Dark FTP, Doly Trojan, Fore, Invisible FTP, Juggernaut 42, Larva, MotIv FTP, Net Administrator
22 Shaft
23 Fire HacKer, Tiny Telnet Server - TTS, Truva Atl
25 Ajan, Antigen, Barok, Email Password Sender - EPS, EPS II, Gip, Gris, Happy99, Hpteam mail, Hybris, I love you, Kuang2, Magic Horse, MBT (Mail Bombing Trojan)
31 Agent 31, Hackers Paradise, Masters Paradise
41 Deep Throat, Foreplay
48 DRAT
50 DRAT
58 DMSetup
59 DMSetup
79 CDK, Firehotcker
Question by:kvnsdr
12 Comments
 
LVL 4

Expert Comment

by:vigrid
ID: 12076548
What language do you intend to use the regex in?
0
 
LVL 1

Author Comment

by:kvnsdr
ID: 12076813
I program in C# .NET... I'm looking for a regular expression, not code.
0
 
LVL 4

Expert Comment

by:vigrid
ID: 12076910
Code snippet in C#:

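Regex rx = new Regex("(?<port>[0-9]+) (?<sentence>.+)");
string file;
// Read the whole file, then release the file handle.
using (StreamReader sr = new StreamReader("Input.txt"))
    file = sr.ReadToEnd();
MatchCollection mc = rx.Matches(file);
// Each match is one line: a port number, a space, then the sentence.
foreach (Match m in mc)
    Console.WriteLine("Port: {0}\tSentence: {1}", m.Groups["port"].Value, m.Groups["sentence"].Value);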

Regular expression: "(?<port>[0-9]+) (?<sentence>.+)".

Comments:

The (?<name>pattern) syntax creates a named regex group. You choose the names and then read each group by name. Here there are two groups, "port" and "sentence", and you access them via the Groups property of the Match object.

HTH
0
 
LVL 1

Author Comment

by:kvnsdr
ID: 12077316
The Regex for 'sentence' returns the port and the sentence. I need to exclude the port.

Regex:  (?<sentence>.+)
Return: 25 AntiGen, Email Password Attacks

Need:   AntiGen, Email Password Attacks
0
 
LVL 4

Expert Comment

by:vigrid
ID: 12077420
Yes, you're right that this pattern returns the port too, but that's because you're not using it right :). Keep the whole regular expression and read only the "sentence" group from each match. The first group ("port") matches the digits, the space after it consumes the separator, and the second group matches everything that is left until the end of the line. If you delete the first group, only the second group remains, so it matches the whole line of text, port included. The dot character stands for "any character except a newline" in regex.

Regex rx = new Regex("(?<port>[0-9]+) (?<sentence>.+)");
string file;
using (StreamReader sr = new StreamReader("Input.txt"))
    file = sr.ReadToEnd();
MatchCollection mc = rx.Matches(file);
foreach (Match m in mc)
    // AddToCollection is a placeholder for your own storage code.
    AddToCollection(m.Groups["sentence"].Value);

Now does it make any more sense?

HTH :)
0
 
LVL 4

Accepted Solution

by:
vigrid earned 125 total points
ID: 12077481
Oh, you can use the regex:

Regex rx = new Regex("(?<port>[0-9]+)\\s+(?<sentence>.*)");

Two things changed here. The space character is replaced with "\s+", which tells the regex to match one or more whitespace characters. And the dot is followed by an asterisk rather than a plus. The plus sign stands for "1 or more occurrences of the expression to the left", and the asterisk stands for "0 or more occurrences of the expression to the left".

Please note the double backslash. In a regular C# string literal, "\\" produces a single backslash, so the regex engine receives \s. A lone "\s" is not a valid C# escape sequence and will not compile.
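To make the escaping point concrete, here is a minimal sketch. It assumes the file content is already in a string; the class and method names (PortListParser, Parse) are made up for illustration. It shows that a verbatim string (@"...") holds the same pattern as the doubled-backslash form, and it trims the '\r' that the dot would otherwise capture from Windows line endings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PortListParser
{
    // The same pattern written two ways; both strings hold a single backslash before 's'.
    public const string Escaped  = "(?<port>[0-9]+)\\s+(?<sentence>.*)";
    public const string Verbatim = @"(?<port>[0-9]+)\s+(?<sentence>.*)";

    // Returns (port, sentence) pairs found in the text.
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var rows = new List<KeyValuePair<string, string>>();
        foreach (Match m in Regex.Matches(text, Verbatim))
        {
            // '.' matches '\r' in .NET, so strip it when the file uses Windows line endings.
            string sentence = m.Groups["sentence"].Value.TrimEnd('\r');
            rows.Add(new KeyValuePair<string, string>(m.Groups["port"].Value, sentence));
        }
        return rows;
    }

    public static void Main()
    {
        Console.WriteLine(Escaped == Verbatim); // True: identical pattern text
        string sample = "2 Death\r\n31 Agent 31, Hackers Paradise, Masters Paradise\r\n";
        foreach (var row in Parse(sample))
            Console.WriteLine("{0} | {1}", row.Key, row.Value);
    }
}
```

The verbatim form is usually easier to read once a pattern contains several backslashes.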
0
 
LVL 1

Author Comment

by:kvnsdr
ID: 12077837
My Regex manual briefly references a "Positive Lookahead-Assertion" with the following example. Meaning: the pattern preceding the parentheses is searched for, and if the pattern within the parentheses is found after it, that part is not included in the returned result. That's what I'm attempting to do.

"Positive Lookahead-Assertion" example:
...(?=...)

Current Regex that I think should work:
(?<sentence>.+)(?=\d{1,6})

Still, something is wrong with this Regex.
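A note on the assertion idea: a lookahead checks what comes after the match, whereas here the port comes before the sentence, so the assertion that fits is a lookbehind, (?<=...). The sketch below assumes .NET's Regex class, which allows variable-length patterns inside a lookbehind, and RegexOptions.Multiline so that ^ matches at the start of every line. LookbehindDemo and Sentences are illustrative names only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // (?<=^\d+[ \t]+) requires a port number and spacing immediately before the
    // match but keeps them out of the result.
    static readonly Regex SentenceOnly =
        new Regex(@"(?<=^\d+[ \t]+).+", RegexOptions.Multiline);

    public static List<string> Sentences(string text)
    {
        var result = new List<string>();
        foreach (Match m in SentenceOnly.Matches(text))
            result.Add(m.Value.TrimEnd('\r')); // drop '\r' from Windows line endings
        return result;
    }

    public static void Main()
    {
        string sample = "22 Shaft\n25 Ajan, Antigen, Barok\n31 Agent 31, Hackers Paradise";
        foreach (string s in Sentences(sample))
            Console.WriteLine(s);
    }
}
```

The named-group approach from the earlier comments gives the same sentences and also keeps the port available, which is probably more useful when filling an SQL table.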
0
 
LVL 7

Expert Comment

by:aib_42
ID: 12077858
This would be a Perl-style regex. Convert and escape it accordingly:

/^(\d{2})\s(.*)$/

Optionally, use
/^(\d{2})\s(.*)\s?$/
to get rid of extra whitespace at the end of "sentence".
0
 
LVL 7

Expert Comment

by:aib_42
ID: 12077893
Sorry, replace the second regex with the following. The lazy (.*?) lets the trailing \s* absorb any whitespace at the end of the line:
/^(\d{2})\s(.*?)\s*$/
0
 
LVL 7

Expert Comment

by:aib_42
ID: 12077945
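Also, (\d{2}) assumes exactly two digits. Use {min,} or {min,max} if you have a minimum or maximum number of digits. For an upper bound only, write {0,max}, since {,max} is not recognized as a quantifier by every engine. For min=1 and max=infinity, use (\d+).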
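The quantifier choice matters for the list in the question, which contains one-digit ports such as "2 Death". A small sketch, assuming each line is tested on its own with .NET's Regex.IsMatch; QuantifierDemo and StartsWithPort are hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // digitsPattern is the sub-pattern used for the port, e.g. @"\d{2}" or @"\d+".
    public static bool StartsWithPort(string line, string digitsPattern)
    {
        return Regex.IsMatch(line, "^(" + digitsPattern + @")\s(.*)$");
    }

    public static void Main()
    {
        Console.WriteLine(StartsWithPort("2 Death", @"\d{2}"));              // False: only one digit
        Console.WriteLine(StartsWithPort("2 Death", @"\d+"));                // True
        Console.WriteLine(StartsWithPort("2 Death", @"\d{1,5}"));            // True
        Console.WriteLine(StartsWithPort("21 Back Construction", @"\d{2}")); // True
    }
}
```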
0
 
LVL 1

Author Comment

by:kvnsdr
ID: 12078389
And the correct answer is:

//  (?<sentence>\D*)           = Sentences without numbers    -- Using * or + return same result --
//  (?<sentence>.+)            = Sentences with ANY characters

I will award the 125 points to vigrid because of a good partial answer leading to the correct answer.
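One caveat with the \D variant: \D stops at the first digit, so entries from the list that contain numbers (for example "Agent 31" or "Juggernaut 42") get cut short. The sketch below assumes the pattern is applied to one line at a time with Regex.Match and uses \D+ so that the first match is not empty; DigitCaveatDemo and Sentence are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

static class DigitCaveatDemo
{
    // Returns the "sentence" group of the first match, or null if nothing matched.
    public static string Sentence(string line, string pattern)
    {
        Match m = Regex.Match(line, pattern);
        return m.Success ? m.Groups["sentence"].Value : null;
    }

    public static void Main()
    {
        string line = "31 Agent 31, Hackers Paradise, Masters Paradise";

        // \D+ alone: starts after the port but stops at the next digit.
        Console.WriteLine("[{0}]", Sentence(line, @"(?<sentence>\D+)"));
        // prints "[ Agent ]"

        // Port group plus dot: the whole sentence, digits included.
        Console.WriteLine("[{0}]", Sentence(line, @"^\d+\s+(?<sentence>.+)"));
        // prints "[Agent 31, Hackers Paradise, Masters Paradise]"
    }
}
```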
0
 
LVL 4

Expert Comment

by:vigrid
ID: 12078698
Thank you! :)
0
