How can I extract this using Regex in Perl?

Posted on 2012-03-27
Last Modified: 2012-03-28
I have strings like the following:
1.2x...
super...

For both strings, I want to extract the part before the ... so the first one should be split into 1.2x and ...

The second one should be split into super and ...

How should I do this?  Thanks.
Question by:thomaszhwang
LVL 35

Assisted Solution

by:Terry Woods
Terry Woods earned 2000 total points
ID: 37774945
Something like this?

my $text = "super...";
# Only use $1 and $2 if the match actually succeeded.
if ($text =~ /(.*?)(\.{3})/s) {
    print "I want $1 and $2\n";
}

Output (after I fixed a bug from my initial post):
I want super and ...

Author Comment

by:thomaszhwang
ID: 37774958
Does this work for 1.2x...... as well?

I can't test right now, but will do soon.  Thanks.
LVL 35

Assisted Solution

by:Terry Woods
Terry Woods earned 2000 total points
ID: 37774998
Yes.

Author Comment

by:thomaszhwang
ID: 37777266
$text =~ /(.*?)(\.{3})/s;

What does the s at the end do?

I tried to do a match on sto... and the results are sto.. and .

This is my code.

($p1, $p2) = lc($x) =~ /^(.+)([.,:;]+)$/;

Any idea?  Thanks.
LVL 35

Assisted Solution

by:Terry Woods
Terry Woods earned 2000 total points
ID: 37778923
You've changed the pattern so that the greedy .+ grabs as much as it can, which leaves only one . character to be captured in the 2nd group. Try:

($p1, $p2) = lc($x) =~ /^(.+)([.,:;]{3})$/;

Because you've changed the \. to [.,:;], it will not just match ... but it will also match the following:
.,:
;;;
:,:
etc
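
A minimal sketch of that point, assuming a few made-up inputs (only the pattern comes from the post above). It shows that the character class accepts any mix of the four marks, and that {3} requires exactly three of them at the end:

```perl
use strict;
use warnings;

# Hypothetical sample inputs; only the pattern is taken from the thread.
my @samples = ('sto...', 'sto.,:', 'sto;;;', 'sto..');

my %split;
for my $x (@samples) {
    # List assignment in a condition is true only when the match succeeds.
    if (my ($p1, $p2) = lc($x) =~ /^(.+)([.,:;]{3})$/) {
        $split{$x} = [$p1, $p2];
        print "$x -> '$p1' and '$p2'\n";
    }
    else {
        print "$x -> no match\n";
    }
}
```

The last input has only two trailing marks, so it should fall through to the "no match" branch.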

The s is a modifier which means that the . wildcard will match newline characters. This means that if your text was this:

Some text over
multiple lines... and some more

The results would be:
Some text over
multiple lines

and
...
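
As a rough illustration of the s modifier, the sketch below runs the same pattern with and without it on the two-line text above. The variable names are made up for the example:

```perl
use strict;
use warnings;

my $text = "Some text over\nmultiple lines... and some more";

# With /s, the . wildcard also matches "\n", so group 1 can span both lines.
my ($with_s_1, $with_s_2) = $text =~ /(.*?)(\.{3})/s;

# Without /s, . cannot cross "\n"; since the pattern is unanchored,
# the match is found starting on the second line instead.
my ($no_s_1, $no_s_2) = $text =~ /(.*?)(\.{3})/;

print "with /s:    [$with_s_1] and [$with_s_2]\n";
print "without /s: [$no_s_1] and [$no_s_2]\n";
```

Without the modifier the first group should hold only the text from the second line.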

You've also removed the ? from my pattern; that ? is what makes the .+ (or .*) non-greedy. Without it, from the text:

Some text over
multiple lines... and some more::: and yet more

The results would be:
Some text over
multiple lines... and some more

and
:::
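
A small sketch of the greedy versus non-greedy difference on that text. It assumes the unanchored form of the pattern with the [.,:;]{3} class and the s modifier, which is one reading of the post; the variable names are illustrative only:

```perl
use strict;
use warnings;

my $text = "Some text over\nmultiple lines... and some more::: and yet more";

# Greedy: .* takes as much as possible, so the LAST run of three marks wins.
my ($greedy_1, $greedy_2) = $text =~ /(.*)([.,:;]{3})/s;

# Non-greedy: .*? takes as little as possible, so the FIRST run wins.
my ($lazy_1, $lazy_2) = $text =~ /(.*?)([.,:;]{3})/s;

print "greedy:     [$greedy_1] and [$greedy_2]\n";
print "non-greedy: [$lazy_1] and [$lazy_2]\n";
```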


Author Comment

by:thomaszhwang
ID: 37778996
Yes, that's actually what I want.  I don't know the exact number of trailing dots, and it would be nice to match things such as : and ; as well.

So in general, I want to include as many marks as possible that are at the end of the string.

abc.:;;;;;.......                 ->   abc and .:;;;;;.......
1.4323.........................   ->   1.4323 and .........................
abc.                              ->   abc and .
LVL 35

Assisted Solution

by:Terry Woods
Terry Woods earned 2000 total points
ID: 37779161
I think you need that ? to make the .+ non-greedy.

To pick up the extra punctuation characters you'll want one more slight adjustment:
($p1, $p2) = lc($x) =~ /^(.+?)([.,:;]{3,})$/;

However, your 3rd case creates a new problem. If you were to capture text prior to just 1 punctuation character, as in:
abc.                              ->   abc and .

Then, with a pattern that is not anchored to the end of the string with $, you would also get this result:
1.4323.........................   ->   1 and .

If you can think of some rule that would consistently differentiate between those 2 cases (such as always ignoring a single . if there's a number straight after it), then we could potentially resolve that. eg

($p1, $p2) = lc($x) =~ /^(.+?)((?!\.\d)[.,:;]+)/;

Note you've also got a $ at the end of the pattern, which forces the punctuation characters to be at the end of the line. Is that really what you want?
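
To make the lookahead idea concrete, here is a hedged sketch comparing the unanchored pattern with and without the (?!\.\d) guard. The helper split_at and the shortened input are hypothetical, added only for illustration:

```perl
use strict;
use warnings;

# split_at is a hypothetical helper, not something from the original posts.
sub split_at {
    my ($x, $re) = @_;
    return lc($x) =~ $re;    # captures in list context, empty list on no match
}

my $plain   = qr/^(.+?)([.,:;]+)/;             # no guard, no $ anchor
my $guarded = qr/^(.+?)((?!\.\d)[.,:;]+)/;     # refuse a . that is followed by a digit

my @plain   = split_at('1.4323.....', $plain);
my @guarded = split_at('1.4323.....', $guarded);
my @single  = split_at('abc.',        $guarded);

print "plain:   $plain[0] and $plain[1]\n";
print "guarded: $guarded[0] and $guarded[1]\n";
print "single:  $single[0] and $single[1]\n";
```

Note that the guard only checks the first punctuation character of the second group, so it is a narrow rule rather than a general fix.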

Author Comment

by:thomaszhwang
ID: 37779192
Do you think this is going to work? Since I do have a word boundary at the end, it should work fine as long as I make it non-greedy, right?

($p1, $p2) = lc($x) =~ /^(.+?)([.,:;]+)$/;

LVL 35

Accepted Solution

by:Terry Woods
Terry Woods earned 2000 total points
ID: 37779225
Yes, I think that will work. To be pedantic, a word boundary is a \b whereas the $ matches the end of the string (or line, if you use the m modifier).
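
A quick check of the anchored, non-greedy pattern against the three sample strings from the question, followed by a sketch of what the m modifier changes. The two-line input is a made-up example, not something from the thread:

```perl
use strict;
use warnings;

my @cases = ('abc.:;;;;;.......', '1.4323.........................', 'abc.');

my @results;
for my $x (@cases) {
    # $ forces the punctuation run to reach the end of the string,
    # so the non-greedy .+? is extended until that becomes possible.
    my ($p1, $p2) = lc($x) =~ /^(.+?)([.,:;]+)$/;
    push @results, [$p1, $p2];
    print "$x -> $p1 and $p2\n";
}

# Hypothetical two-line input to show the effect of the m modifier.
my $two_lines = "abc...\ndef";
my @no_m   = $two_lines =~ /^(.+?)([.,:;]+)$/;     # $ means end of string here
my @with_m = $two_lines =~ /^(.+?)([.,:;]+)$/m;    # $ may also match at end of a line

print "without m: ", scalar(@no_m), " captures\n";
print "with m:    @with_m\n";
```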

Author Comment

by:thomaszhwang
ID: 37779235
Oh ok, thanks.  The end of the string is what I want.

Author Closing Comment

by:thomaszhwang
ID: 37779240
Thanks.