• Status: Solved

I need help with a regular expression to match and capture single-term expressions inside ()s.

I am using the pattern:
on the below text to match only single terms inside a pair of () and capture them in group 1, unless a superscript number exists outside the closing ) or multiple ()s sit next to each other (e.g. (s)(2x)).

Everything seems to work the way I want except two items:
 The terms that are numbers only (4.4, 12 & -33) match, but nothing exists in groups 1-3, and groups 4 & 5 don't match.
 The last expression, (s)(2x), matches both terms inside the ()s, and I don't want it to match when multiple ()s are next to each other.

What am I missing?  Is there a way to make this less complicated?
1 Solution
Are you saying that in the case of
you want something to exist in groups 1-3?
Do you want groups 4 & 5 to match?
I don't get anything in 1-3 for any of them, and $4 matches only <sup>-2</sup> and <sup>-55</sup>
Surrano (System Engineer) commented:
It would help a lot if you told us which language's regex we are talking about. Ordinary egrep doesn't support "?!".
The * at the end of group 2 means match 0 or more times, as many times as possible.
The ? at the end of group 3 means match optionally.
The * at the end of group 4 means match 0 or more times.
This means that the empty string matches groups 3 and 4.
So when group 2 matches as many times as possible, the last possible match will be the empty string.
Only this last match will be stored.

Did you mean to say ? instead of * for group 2?
Or did you need the ? on group 3 given the * on group 2?
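
To make the "only this last match will be stored" point concrete, here is a minimal sketch. It uses Python's re module purely for illustration (the thread concerns VB.NET), and the patterns and variable names are made up for the demonstration, not taken from the original pattern. Whether a final empty iteration overwrites the stored text can depend on the regex engine, so that case is deliberately not asserted here.

```python
import re

# A capture group under * is overwritten on every iteration,
# so only the text of the final iteration is kept.
m = re.match(r"([a-z])*", "abc")
last_iteration = m.group(1)

# A group that never takes part in the match reports None.
m2 = re.match(r"(x)?abc", "abc")
unused_group = m2.group(1)

print(repr(last_iteration), repr(unused_group))
```

If every iteration needs to be kept, the usual approach is to capture the whole repeated run in one outer group and split it afterwards.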
NevSoFly (Author) commented:
Thanks for the responses.

@Surrano:  I am using VB.net (VS2012).

I am saying that in the cases of 4.4, 12, & -33 I want group 1 to match 4.4, 12, & -33.  I really don't care what any other group matches.  The reason I mentioned the other groups was that I was only trying to provide all the info that I had on my situation.  I'm sorry for the confusion.  

As for the breakdown of the pattern, I am attempting to break a term down into its parts.

([-]?[0-9]*\.?[0-9]*)?     is for coefficients/constants that may be +/-, may have decimal points, or may not be present at all.

([a-z]?(<sup>[-]?[0-9]*\.?[0-9]+</sup>)?)*     is for variables that may or may not have exponents, or may not be present at all.  I believe that I need the ? on group 3 because an exponent can exist at most once, and only if a variable exists at all.
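
As a rough check of the two fragments described above, the sketch below tests each one in isolation. It is written in Python's re module for illustration only, and the helper name accepts and the sample strings are assumptions made for the demonstration. It shows that both fragments accept the intended pieces, but also the empty string (and, for the coefficient, a lone "."), which is why they can succeed without consuming anything.

```python
import re

# The two fragments from the breakdown above, without the outer ()? and ()* wrappers.
COEFF = r"[-]?[0-9]*\.?[0-9]*"
VAR = r"[a-z]?(<sup>[-]?[0-9]*\.?[0-9]+</sup>)?"

def accepts(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

# Intended inputs plus two degenerate ones ("" and ".").
coeff_results = {s: accepts(COEFF, s) for s in ["4.4", "12", "-33", "", "."]}
var_results = {s: accepts(VAR, s) for s in ["a", "a<sup>2</sup>", "<sup>-2</sup>", ""]}

print(coeff_results)
print(var_results)
```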

I am most likely overcomplicating this.

The only reason I added the code to match the constants/coefficients, variables and exponents was that I was trying to differentiate between single and multiple term expressions within the ()s.

I know that the operations inside the ()s will only be addition, so for group 1 couldn't I just grab everything inside the ()s as long as a + isn't present?  Then I would only need to ensure that an exponent isn't outside the closing ).  I was thinking of something like \(([^+]+?)\). It seems to work by itself for identifying ()s with only single terms, but I can't get it to work with the negative lookahead for exponents.
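
One possible reason the lookahead version misbehaves, sketched in Python's re module for illustration only. The exact attempt is not shown in the thread, so the combined pattern LAZY below is a hypothetical reconstruction. Because [^+] also matches ), a failed lookahead lets the lazy quantifier grow across the first closing parenthesis until some later ) satisfies the lookahead.

```python
import re

# Hypothetical combination of the idea above with a negative lookahead.
LAZY = r"\(([^+]+?)\)(?![<(])"

# The lookahead rejects the first ')', so the lazy class expands past it.
lazy_capture = re.search(LAZY, "(s)(2x)").group(1)

# Excluding ')' from the class keeps the capture inside one pair of ()s.
STRICT = r"\(([^+)]+)\)(?![<(])"
strict_at_start = re.match(STRICT, "(s)(2x)")   # re.match only tries position 0
strict_single = re.match(STRICT, "(2a)").group(1)

print(repr(lazy_capture), strict_at_start, repr(strict_single))
```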
Please give some examples telling whether or not you want to match, and if it matches, what you would want to capture.
NevSoFly (Author) commented:
I hope this helps.

string                                      capture
(2)                                         2
(2.555)                                     2.555
(2a)                                        2a
(2.555a)                                    2.555a
(2.555a<sup>2</sup>)                        2a<sup>2</sup>
(2.555a<sup>2.555</sup>)                    2a<sup>2.555</sup>
(a)                                         a
(a<sup>2</sup>)                             a<sup>2</sup>
(ab)                                        ab
(a<sup>2</sup>b)                            a<sup>2</sup>b
(a<sup>2</sup>b<sup>2</sup>)                a<sup>2</sup>b<sup>2</sup>
(2)<sup>2</sup>                             nothing
(2.555)<sup>2</sup>                         nothing
(2a)<sup>2</sup>                            nothing
(2.555a)<sup>2</sup>                        nothing
(2.555a<sup>2</sup>)<sup>2</sup>            nothing
(2.555a<sup>2.555</sup>)<sup>2</sup>        nothing
(a)<sup>2</sup>                             nothing
(a<sup>2</sup>)<sup>2</sup>                 nothing
(ab)<sup>2</sup>                            nothing
(a<sup>2</sup>b)<sup>2</sup>                nothing
(a<sup>2</sup>b<sup>2</sup>)<sup>2</sup>    nothing
(any expression)(any expression)            nothing
Derek Jensen commented:
I am having no luck getting every single row to match using your test data with only one regex; I think you'll have to simply use multiple regexes, and either check each one on each line in a loop, or use an array of regexes if you're in PHP.
(2.555a<sup>2</sup>)                                                            2a<sup>2</sup>
How does 2a<sup>2</sup> come from (2.555a<sup>2</sup>)   ?
Are we to ignore \.\d+ in the case when it is followed by a<sup>?
What if it is followed by <sup> with no a?
Do we only take the first and last character of whatever precedes <sup>?
Assuming  2a<sup> was supposed to be 2.555a<sup>, this works:
  print $1 if /^\(([^)]+)\)(?![<(])/;
Otherwise, I'll need more examples to determine exactly what is to be captured.
NevSoFly (Author) commented:
(2.555a<sup>2</sup>)                                                            2a<sup>2</sup>
should have been
(2.555a<sup>2</sup>)                                                            2.555a<sup>2</sup>

Are we to ignore \.\d+ in the case when it is followed by a<sup>?

I'm guessing \.\d is from your code. If you're asking whether to ignore a decimal point and the digits after it when an exponent follows (e.g. (2.555a<sup>2</sup>)), the answer is no.

What if it is followed by <sup> with no a? no.

Do we only take the first and last character of whatever precedes <sup>?  No. If <sup> is within the ()s, take everything.  If <sup> is outside the ()s, take nothing.
If (2.555a<sup>2</sup>) should have been 2.555a<sup>2</sup>
then /^\(([^)]+)\)(?![<(])/ seems to do everything you want on the examples in http:#a39789108
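
The claim can be checked mechanically. The sketch below runs the same pattern over the examples from the table, using Python's re module for illustration only (the syntax used here behaves the same way in Perl). The names capture, EXPECTED and mismatches are illustrative. Two assumptions are baked in: the corrected capture 2.555a<sup>2</sup> is used, and the row for (2.555a<sup>2.555</sup>) is assumed to contain the same typo, so its wanted capture is taken to be 2.555a<sup>2.555</sup>.

```python
import re

# The pattern from the reply above, unchanged.
PATTERN = re.compile(r"^\(([^)]+)\)(?![<(])")

def capture(text):
    """Return group 1 if the pattern matches at the start of text, else None."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

# Input string -> wanted capture (None stands for "nothing").
EXPECTED = {
    "(2)": "2",
    "(2.555)": "2.555",
    "(2a)": "2a",
    "(2.555a)": "2.555a",
    "(2.555a<sup>2</sup>)": "2.555a<sup>2</sup>",
    "(2.555a<sup>2.555</sup>)": "2.555a<sup>2.555</sup>",  # assumed typo fix
    "(a)": "a",
    "(a<sup>2</sup>)": "a<sup>2</sup>",
    "(ab)": "ab",
    "(a<sup>2</sup>b)": "a<sup>2</sup>b",
    "(a<sup>2</sup>b<sup>2</sup>)": "a<sup>2</sup>b<sup>2</sup>",
    "(2)<sup>2</sup>": None,
    "(2.555)<sup>2</sup>": None,
    "(2a)<sup>2</sup>": None,
    "(2.555a)<sup>2</sup>": None,
    "(2.555a<sup>2</sup>)<sup>2</sup>": None,
    "(2.555a<sup>2.555</sup>)<sup>2</sup>": None,
    "(a)<sup>2</sup>": None,
    "(a<sup>2</sup>)<sup>2</sup>": None,
    "(ab)<sup>2</sup>": None,
    "(a<sup>2</sup>b)<sup>2</sup>": None,
    "(a<sup>2</sup>b<sup>2</sup>)<sup>2</sup>": None,
    "(any expression)(any expression)": None,
}

# Collect every row whose actual capture differs from the wanted one.
mismatches = {s: capture(s) for s, want in EXPECTED.items() if capture(s) != want}
print(mismatches or "all rows behave as requested")
```

Note that the pattern does not check for + at all, so it would also capture a multi-term expression such as (a+b); the examples in the table do not exercise that case.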
NevSoFly (Author) commented:
It answers all the examples that I gave, but could you please break it down and explain it to me? All I understand is the negative lookahead part.
perl -MYAPE::Regex::Explain -e 'print YAPE::Regex::Explain->new(qr/^\(([^)]+)\)(?![<(])/)->explain'
The regular expression:

(?-imsx:^\(([^)]+)\)(?![<(]))

matches as follows:
NODE                     EXPLANATION
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
  ^                        the beginning of the string
  \(                       '('
  (                        group and capture to \1:
    [^)]+                    any character except: ')' (1 or more
                             times (matching the most amount
                             possible))
  )                        end of \1
  \)                       ')'
  (?!                      look ahead to see if there is not:
    [<(]                     any character of: '<', '('
  )                        end of look-ahead
)                        end of grouping

So, everything in a set of parentheses at the start of the string, unless that set of parentheses is followed by < or (.
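
The "at the start of the string" part matters more than it may look. A small sketch in Python's re module (for illustration only, with made-up variable names) compares the pattern with and without the leading ^ on the (s)(2x) case: without the anchor, the engine simply retries further along the string and matches the second pair of parentheses.

```python
import re

UNANCHORED = r"\(([^)]+)\)(?![<(])"
ANCHORED = r"^" + UNANCHORED

text = "(s)(2x)"

# Without ^, the search skips the rejected "(s)" and matches "(2x)".
unanchored_hit = re.search(UNANCHORED, text)

# With ^, only position 0 is tried, so the string is rejected outright.
anchored_hit = re.search(ANCHORED, text)

print(unanchored_hit.group(1), anchored_hit)
```

This assumes each expression is tested as its own string, as in the examples given in the thread; matching expressions embedded in longer text would need a different guard against a preceding ).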
Question has a verified solution.