Regular Exp Help! Matching where _NOT_ ?=lookahead

Posted on 2006-05-30
Last Modified: 2010-04-16
I need some help with a regexp.
I am doing screen scrapes and need to filter out the last match. Here is an example of the code and the hits:

    # I understand that the .+ is greedy, I need to find a way to make it not so much: lookaheads?
    match = re.findall("[0-9]+.html\">.+<\/a>",s)
    if match:
        for sMatch in match:
            print sMatch
#====================== OUTPUT =====
165810536.html">Free Wrought Iron Railings</a>
165808353.html">Free Printer</a>
165807325.html">2 Free Baby Seats!  (Get 'em now!)</a>
165806607.html">Free Brookstone Dustbuster- Needs Fixing</a>
100.html">next 100 postings</a>    #<-----  This line needs to _not_ be a match

For reference, the site I'm scraping is
Question by: EchoBinary

    Expert Comment

    Just do:

      # I understand that the .+ is greedy, I need to find a way to make it not so much: lookaheads?
        match = re.findall("[0-9]+.html\">.+<\/a>",s)
        if match:
            for sMatch in match[:-1]:
                print sMatch

    The match[:-1] slice takes the complete list and leaves the last element out.
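
    A minimal sketch of the slicing approach, written for Python 3 (the thread itself uses Python 2 print statements). The sample string s is a made-up stand-in for the scraped page, and the pattern is the one from the question rewritten as a raw string:

```python
import re

# Hypothetical sample; the real page markup may differ.
s = '''<a href="/165810536.html">Free Wrought Iron Railings</a>
<a href="/165808353.html">Free Printer</a>
<a href="index100.html">next 100 postings</a>'''

matches = re.findall(r'[0-9]+.html">.+</a>', s)

# matches[:-1] builds a new list without the last element.
# Slicing an empty list just gives an empty list, so no "if" guard is needed.
postings = matches[:-1]
for posting in postings:
    print(posting)
```

    This only works as long as the unwanted link really is the last match on every page.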

    Author Comment

    Still very new with Python; I didn't know you could slice lists. That's cool!
    I'm lucky that in this case the unneeded item is always the last one.

    Is it better to do it via a slice or a regexp lookahead?
    Like this? (I looked it up.)
    match = re.findall("[0-9]+.html\">(?!next\ 100).+<\/a>",s)
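
    To see what the negative lookahead does, here is a small sketch in Python 3. The sample string s is hypothetical, and the pattern is a raw-string variant of the one above with the dot escaped:

```python
import re

# Hypothetical sample; the real page markup may differ.
s = '''<a href="/165810536.html">Free Wrought Iron Railings</a>
<a href="/165808353.html">Free Printer</a>
<a href="index100.html">next 100 postings</a>'''

# (?!next 100) is a zero-width check: the match is rejected when the
# link text begins with "next 100", and nothing is consumed otherwise.
pattern = r'[0-9]+\.html">(?!next 100).+</a>'
postings = re.findall(pattern, s)

for posting in postings:
    print(posting)
```

    Unlike the slice, this does not care where the paging link appears, but it does depend on the exact wording of its text.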

    Expert Comment

    Depends on what you mean by "better". Anything that complicates a regexp complicates debugging.

    I like the slicing option, as the whole thing is easier to understand and probably more efficient.

    Accepted Solution

    It depends on how quick and dirty the solution should or may be, and on how big the resulting program is. As ramrom said, debugging may be a problem; however, it may not be that difficult if the program is short and the regular expression is used often (i.e. a bug would be very visible).

    You can also try to make the regexp less greedy by putting '?' just after '+':
    match = re.findall("[0-9]+.html\">.+?<\/a>",s)
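
    The difference between the greedy and the non-greedy form only shows up when several links sit on the same line. A small sketch (Python 3, made-up one-line sample):

```python
import re

# Hypothetical sample with two links on a single line.
line = '<a href="1.html">One</a> <a href="2.html">Two</a>'

# Greedy: .+ runs to the end of the line and backs up to the LAST </a>.
greedy = re.findall(r'[0-9]+\.html">.+</a>', line)

# Non-greedy: .+? stops at the FIRST </a> it can reach.
lazy = re.findall(r'[0-9]+\.html">.+?</a>', line)

print(greedy)
print(lazy)
```

    Note that in the question's output every link is on its own line, and the dot does not match a newline by default, so the non-greedy form alone would still return the "next 100 postings" entry there.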

    Also be careful when writing patterns. You should always use raw strings, i.e. put 'r' in front of the opening quote to suppress interpretation of backslashes when the string literal is compiled. When the pattern contains a double quote, use single quotes to delimit it:

    match = re.findall(r'[0-9]+.html">.+?</a>', s)
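
    Why raw strings matter is easiest to see with an escape that Python itself understands. This sketch (Python 3) uses \b, which is not part of the pattern in this thread and is chosen only for illustration:

```python
import re

# In a normal string literal "\b" is a single backspace character (0x08);
# in a raw string it stays backslash + "b", which re reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))

text = "a cat sat"
plain_hits = re.findall(plain, text)   # looks for backspace, "cat", backspace
raw_hits = re.findall(raw, text)       # looks for the whole word "cat"
print(plain_hits, raw_hits)
```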

    When using re, you should always think hard about whether you get what you want. In your case, the dot in front of html matches ANY character. It should be escaped:

    match = re.findall(r'[0-9]+\.html">.+?</a>', s)
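
    A quick way to convince yourself, as a Python 3 sketch with a deliberately wrong, made-up input:

```python
import re

# Hypothetical input with "x" where the dot of ".html" should be.
s = '<a href="123xhtml">Oops</a>'

# Unescaped dot: matches any character, so the bogus link is accepted.
loose = re.findall(r'[0-9]+.html">.+?</a>', s)

# Escaped dot: only a literal "." is accepted, so nothing matches.
strict = re.findall(r'[0-9]+\.html">.+?</a>', s)

print(loose)
print(strict)
```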

    Another question is whether the regular expression should be precompiled (re.compile(...)) if it is used intensively. Also, re.findall() may not be the best approach; it depends.
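
    One possible shape for that, as a sketch in Python 3. The names LINK_RE and extract_links are made up for this example, and finditer() with capture groups is just one alternative to findall(), not necessarily the one meant above:

```python
import re

# Compiled once at import time and reused for every page.
# Group 1 is the numeric id, group 2 is the link text.
LINK_RE = re.compile(r'([0-9]+)\.html">(.+?)</a>')

def extract_links(html):
    """Return (id, text) pairs for every matching link in html."""
    return [(m.group(1), m.group(2)) for m in LINK_RE.finditer(html)]

# Hypothetical sample input.
page = '<a href="/165808353.html">Free Printer</a>'
print(extract_links(page))
```

    Capturing the id and the text separately avoids having to pick the matched string apart again afterwards.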

    If you really want to parse HTML source, using an HTML parser would probably be better. But it would probably not be such a quick and dirty solution.
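
    For comparison, a rough sketch of the parser route using html.parser from the Python 3 standard library (in the Python 2 of this thread the module was called HTMLParser). The class name LinkCollector, the sample markup and the "next " filter are assumptions made for this example:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

# Hypothetical sample; the real page markup may differ.
s = '''<a href="/165810536.html">Free Wrought Iron Railings</a>
<a href="/165808353.html">Free Printer</a>
<a href="index100.html">next 100 postings</a>'''

parser = LinkCollector()
parser.feed(s)
parser.close()

# Drop the paging link by its text rather than by its position.
postings = [(href, text) for href, text in parser.links
            if not text.startswith("next ")]
print(postings)
```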

    Expert Comment

        match = re.findall("[0-9]{4,}.html\">.+<\/a>",s)
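
    This comment came without an explanation, so the following is only a reading of it: {4,} requires at least four digits in a row before .html, which happens to rule out the three-digit 100 of the paging link. A Python 3 sketch with a made-up sample and a raw-string, escaped-dot variant of the pattern:

```python
import re

# Hypothetical sample; the real page markup may differ.
s = '''<a href="/165810536.html">Free Wrought Iron Railings</a>
<a href="/165808353.html">Free Printer</a>
<a href="index100.html">next 100 postings</a>'''

# [0-9]{4,} means "four or more digits"; "100" has only three.
postings = re.findall(r'[0-9]{4,}\.html">.+</a>', s)

for posting in postings:
    print(posting)
```

    It relies on the assumption that posting ids always have at least four digits and that the paging link never does.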
