Solved

getting string through regex

Posted on 2008-02-12
Last Modified: 2010-03-05
I'm maintaining someone else's code, and for some reason it fails to match certain portions of the string.
The regex doesn't make much sense to me.
Here is a sample of the HTML:
<a href="/lt-H%C3%A9rr%C2%9C-amp-Pa%C3%9Fw%C3%B6rt%C2%B4s-gt-Tue-Feb-12-13-03-02-2008/lm/RSN9WU4RQ9QBM/ref=cm_lm_pdp_title_full"><'HérrÅ  &  PaÃxwört´s'> - Tue Feb 12 13:03:02 2008</a>

The code is:
use constant LM_FULLVIEW_DESC   => '/lm/';
my $rgx_lm_fullview = qr/(??{LM_FULLVIEW})|(??{LM_FULLVIEW_DESC})/;
my ($lm_id, $lm_title, $lm_date, @lm_items) = $rp_teaser_block =~
 m|
<a\ href=".*?$rgx_lm_fullview([A-Z0-9]+?)/.*?">        # match lm_id from fullview link
        (.+?)                                                    # match title
            </a>.*?
            <span\ id="lm_formattedData312".*?>
                &nbsp;\( (.*?) \)                        # match date
            </span>.*?

The lm_id and title captures are not working. Any help would be greatly appreciated.
Question by:angelblade27
3 Comments
 
LVL 85

Accepted Solution

by:
ozo earned 500 total points
ID: 20881143
The regular expression:

(?x-ims:
<a\ href=".*?([A-Z0-9]+?)/.*?">         # match lm_id from fullview link
        (.+?)                                                     # match title
            </a>.*?
            <span\ id="lm_formattedData312".*?>
                &nbsp;\( (.*?) \)                         # match date
            </span>.*?
    )

matches as follows:
 
NODE                     EXPLANATION
----------------------------------------------------------------------
(?x-ims:                 group, but do not capture (disregarding
                         whitespace and comments) (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n):
----------------------------------------------------------------------
  <a                       '<a'
----------------------------------------------------------------------
  \                        ' '
----------------------------------------------------------------------
  href="                   'href="'
----------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [A-Z0-9]+?               any character of: 'A' to 'Z', '0' to '9'
                             (1 or more times (matching the least
                             amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
----------------------------------------------------------------------
  ">                       '">'
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    .+?                      any character except \n (1 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  </a>                     '</a>'
----------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
----------------------------------------------------------------------
  <span                    '<span'
----------------------------------------------------------------------
  \                        ' '
----------------------------------------------------------------------
  id="lm_formattedData     'id="lm_formattedData312"'
  312"
----------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
  &nbsp;                   '&nbsp;'
----------------------------------------------------------------------
  \(                       '('
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  \)                       ')'
----------------------------------------------------------------------
  </span>                  '</span>'
----------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
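The breakdown above notes that `.` does not match `\n` under `(?x-ims:`. Here is a minimal sketch of how that can matter when the `</a>` and the `<span>` sit on different lines. The `<span>` markup, its date text, and the shortened ASCII title are assumptions made for illustration, and `/lm/` is written in literally in place of `$rgx_lm_fullview`. Whether the original code already uses `/s` is not known, since the end of the pattern was not posted.

```perl
use strict;
use warnings;

# Hypothetical sample: the link is modeled on the question's HTML (title
# shortened to ASCII); the <span> markup and its date text are assumed.
my $html = join "\n",
    '<a href="/lt-Herr-gt/lm/RSN9WU4RQ9QBM/ref=cm_lm_pdp_title_full">Herr - Tue Feb 12 13:03:02 2008</a>',
    '<span id="lm_formattedData312" class="tiny">&nbsp;(Feb 12, 2008)</span>';

# Without /s, '.' stops at the newline between the two tags.
my $crosses_without_s = $html =~ m{</a>.*?<span} ? 1 : 0;

# The explained pattern with '/lm/' written in literally.
# /x ignores the layout whitespace; /s is added here (it is NOT part of the
# explained pattern) so that '.' can cross the newline between the tags.
my ($lm_id, $lm_title, $lm_date) = $html =~ m{
    <a\ href=".*?/lm/([A-Z0-9]+?)/.*?">    # lm_id from the fullview link
    (.+?)                                  # title
    </a>.*?
    <span\ id="lm_formattedData312".*?>
    &nbsp;\( (.*?) \)                      # date
    </span>
}xs;

print "crosses newline without /s: ", ($crosses_without_s ? "yes" : "no"), "\n";
print "id=$lm_id\ntitle=$lm_title\ndate=$lm_date\n";
```

If the real page puts both tags on one line, `/s` makes no difference and the cause lies elsewhere.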
 

Author Comment

by:angelblade27
ID: 20881925
Thanks for the great explanation of each piece individually. When I look at the regex, it seems like it should extract the correct data, but it doesn't.

<a\ href=".*?$rgx_lm_fullview([A-Z0-9]+?)/.*?">        # match lm_id from fullview link
        (.+?)                                                    # match title

should capture
RSN9WU4RQ9QBM
and
'HérrÅ  &  PaÃxwört´s'> - Tue Feb 12 13:03:02 2008
from the link above, yet it doesn't match.
As for $rgx_lm_fullview, it looks like it means the prefix could be either of the two constants, but what is the ?? preceding the constant name used for?
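On the `(??{ ... })` question: in Perl this is a postponed regular subexpression. The code inside the braces runs at match time, and whatever it returns is matched as a sub-pattern. A constant created with `use constant` is a sub call, so it cannot be interpolated the way `$var` can, which is presumably why the original author reached for the code block. The sketch below assumes a made-up value for `LM_FULLVIEW` (its real definition was not posted) and a shortened ASCII version of the link. It also shows one alternative that avoids code blocks by copying the constants into variables first.

```perl
use strict;
use warnings;

use constant LM_FULLVIEW      => '/listmania/fullview/';  # hypothetical value, not in the post
use constant LM_FULLVIEW_DESC => '/lm/';                  # value from the question

# (??{ CODE }) evaluates CODE during the match and uses the result as a
# pattern, so this reads "either constant's value" at match time.
my $rgx_lm_fullview = qr/(??{LM_FULLVIEW})|(??{LM_FULLVIEW_DESC})/;

# Shortened, ASCII-only version of the href from the question.
my $href = '/lt-Herr-gt/lm/RSN9WU4RQ9QBM/ref=cm_lm_pdp_title_full';
my ($id_postponed) = $href =~ m|$rgx_lm_fullview([A-Z0-9]+?)/|;

# Alternative without code blocks: copy the constants into variables and
# interpolate them once, quoting metacharacters with \Q...\E.
my ($fullview, $desc) = (LM_FULLVIEW, LM_FULLVIEW_DESC);
my $rgx_plain = qr/\Q$fullview\E|\Q$desc\E/;
my ($id_plain) = $href =~ m|$rgx_plain([A-Z0-9]+?)/|;

print "postponed: $id_postponed\n";
print "plain:     $id_plain\n";
```

Because `qr//` wraps its pattern in a non-capturing group, the `|` inside `$rgx_lm_fullview` stays local to the prefix when it is interpolated into the larger pattern.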
 
LVL 27

Expert Comment

by:ddrudik
ID: 20907882
You don't seem to have given enough of the code: where is the assignment of LM_FULLVIEW, and where does the regex pattern close? It would seem there are at least a line or two more you should share.