Regular Exp Help! Matching where _NOT_ ?=lookahead

Posted on 2006-05-30
Medium Priority
Last Modified: 2010-04-16
I need some help with a regexp.
I am doing screen scrapes and I need to filter out the last match. Here is an example of the code and the hits:

    # I understand that the .+ is greedy, I need to find a way to make it not so much: lookaheads?
    match = re.findall("[0-9]+.html\">.+<\/a>",s)
    if match:
        for sMatch in match:
            print sMatch
#====================== OUTPUT =====
165810536.html">Free Wrought Iron Railings</a>
165808353.html">Free Printer</a>
165807325.html">2 Free Baby Seats!  (Get 'em now!)</a>
165806607.html">Free Brookstone Dustbuster- Needs Fixing</a>
100.html">next 100 postings</a>    #<-----  This line needs to _not_ be a match

For reference, the site I'm scraping is Craigslist.org
Question by:EchoBinary

Expert Comment

ID: 16790603
Just do:

    # I understand that the .+ is greedy, I need to find a way to make it not so much: lookaheads?
    match = re.findall("[0-9]+.html\">.+<\/a>",s)
    if match:
        for sMatch in match[:-1]:
            print sMatch

The slice match[:-1] takes the complete list and leaves the last element out.
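
A minimal sketch of the slice in isolation. The sample string `s` is an assumption modeled on the output in the question, the pattern is written as a raw string with the dot escaped, and the code uses Python 3 syntax (`print` as a function) rather than the Python 2 form used in the thread:

```python
import re

# Assumed sample input, modeled on the hits listed in the question.
s = (
    '<a href="165810536.html">Free Wrought Iron Railings</a>\n'
    '<a href="165808353.html">Free Printer</a>\n'
    '<a href="100.html">next 100 postings</a>\n'
)

matches = re.findall(r'[0-9]+\.html">.+</a>', s)

# matches[:-1] is a new list with every element except the last one.
# On an empty list it simply yields another empty list.
wanted = matches[:-1]

for m in wanted:
    print(m)
```

Note that the slice only helps while the unwanted link is guaranteed to be the last match.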

Author Comment

ID: 16790676
Still very new with Python; I didn't know you could slice lists. That's cool!
I'm lucky in that the unneeded item is the last one in all cases.

Is it better to do it via a slice or a regexp lookahead?
Like this? (I looked it up.)
match = re.findall("[0-9]+.html\">(?!next\ 100).+<\/a>",s)
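
A small sketch of how the negative lookahead `(?!next 100)` behaves. The sample string is an assumption modeled on the output in the question, and the pattern is rewritten as a raw string with the dot escaped; the code is Python 3:

```python
import re

# Assumed sample input, modeled on the hits listed in the question.
s = (
    '<a href="165810536.html">Free Wrought Iron Railings</a>\n'
    '<a href="165808353.html">Free Printer</a>\n'
    '<a href="100.html">next 100 postings</a>\n'
)

# (?!next 100) succeeds only if the text at this position does NOT
# start with "next 100"; it consumes no characters itself.
pattern = r'[0-9]+\.html">(?!next 100).+</a>'
filtered = re.findall(pattern, s)

for m in filtered:
    print(m)
```

Unlike the slice, this does not depend on where the navigation link appears in the page, but it does depend on the exact link text.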
LVL 17

Expert Comment

ID: 16792346
Depends on what you mean by "better". Anything that complicates a regexp complicates debugging.

I like the slicing option, as the whole thing is easier to understand and probably more efficient.
LVL 29

Accepted Solution

pepr earned 2000 total points
ID: 16796003
It depends on how quick and dirty the solution should (or may) be, and on how big the resulting program is. As ramrom said, debugging may be a problem; however, it may not be that difficult if the program is short and the regular expression is used often (i.e. a bug would be very visible).

You can also try to make the regexp less greedy by putting '?' just after '+':
match = re.findall("[0-9]+.html\">.+?<\/a>",s)

Also be careful when writing patterns. You should always use raw strings, i.e. put 'r' in front of the opening quote so that Python does not interpret the backslashes when it processes the string literal. When the pattern contains a double quote, write the pattern in single quotes:

match = re.findall(r'[0-9]+.html">.+?</a>', s)

When using re, you should always think hard about whether you get what you want. In your case, the dot in front of html matches ANY character. It should be escaped:

match = re.findall(r'[0-9]+\.html">.+?</a>', s)
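
To make the two points above concrete, here is a short sketch comparing greedy and non-greedy matching, and an escaped and unescaped dot. The input strings are made up for illustration, and the code is Python 3:

```python
import re

# Two links on ONE line (assumed input) expose the greedy behaviour.
line = '<a href="1.html">A</a> <a href="2.html">B</a>'

# Greedy: .+ runs to the end of the line, then backs up to the LAST </a>.
greedy = re.findall(r'[0-9]+\.html">.+</a>', line)

# Non-greedy: .+? stops at the FIRST </a> it can reach.
lazy = re.findall(r'[0-9]+\.html">.+?</a>', line)

# An unescaped dot matches any character, so "12xhtml" slips through.
loose = re.findall(r'[0-9]+.html">', '12xhtml">')
strict = re.findall(r'[0-9]+\.html">', '12xhtml">')

print(greedy)
print(lazy)
print(loose, strict)
```

The non-greedy form fixes over-matching within a line; on its own it does not exclude the "next 100 postings" link.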

Another question is whether the regular expression should be precompiled (re.compile(...)) if it is used intensively. Also, re.findall() may not be the best approach -- it depends.
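
One possible shape for the precompiled variant, as a sketch only. The names `LINK_RE` and `extract_links` are hypothetical, the capture groups are an addition not present in the thread, and `finditer` is shown as one alternative to `findall`; the code is Python 3:

```python
import re

# Compiled once at import time and reused for every page.
# Group 1 is the numeric id, group 2 is the link text.
LINK_RE = re.compile(r'([0-9]+)\.html">(.+?)</a>')


def extract_links(page):
    """Return (id, title) pairs for every matching link in page."""
    # finditer yields match objects lazily instead of building
    # the full list of strings up front.
    return [(m.group(1), m.group(2)) for m in LINK_RE.finditer(page)]


if __name__ == "__main__":
    sample = (
        '<a href="165810536.html">Free Wrought Iron Railings</a>\n'
        '<a href="100.html">next 100 postings</a>\n'
    )
    for link_id, title in extract_links(sample):
        print(link_id, title)
```

The `re` module also caches recently used patterns internally, so the gain from precompiling is mostly readability unless many different patterns are in play.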

If you really want to parse HTML source, using an HTML parser would probably be better. But that would probably not be such a quick and dirty solution.
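
For comparison, a rough sketch of the parser route using the standard library's `html.parser` (Python 3; in Python 2 the module was named `HTMLParser`). The class name `PostingLinkParser`, the sample markup, and the rule of skipping links whose text starts with "next " are all assumptions made for illustration:

```python
from html.parser import HTMLParser


class PostingLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags whose href ends in .html."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.endswith(".html"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


# Assumed sample markup, modeled on the hits listed in the question.
page = (
    '<p><a href="165810536.html">Free Wrought Iron Railings</a></p>\n'
    '<p><a href="165808353.html">Free Printer</a></p>\n'
    '<p><a href="100.html">next 100 postings</a></p>\n'
)

parser = PostingLinkParser()
parser.feed(page)
parser.close()

# Drop the navigation link by its text (an assumed rule).
postings = [link for link in parser.links if not link[1].startswith("next ")]
for href, text in postings:
    print(href, text)
```

This is noticeably longer than the one-line regexp, which is the trade-off the answer describes.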
LVL 17

Expert Comment

ID: 16801600
LVL 15

Expert Comment

ID: 16886764
    match = re.findall("[0-9]{4,}.html\">.+<\/a>",s)
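
The `{4,}` quantifier in the pattern above requires at least four digits, so `100.html` no longer matches. A short sketch, assuming that real posting ids always have four or more digits (the thread does not confirm this); the sample string is made up, the pattern is written as a raw string with the dot escaped, and the code is Python 3:

```python
import re

# Assumed sample input, modeled on the hits listed in the question.
s = (
    '<a href="165810536.html">Free Wrought Iron Railings</a>\n'
    '<a href="165806607.html">Free Brookstone Dustbuster- Needs Fixing</a>\n'
    '<a href="100.html">next 100 postings</a>\n'
)

# {4,} means "four or more repetitions" of the preceding [0-9].
long_ids = re.findall(r'[0-9]{4,}\.html">.+</a>', s)

for m in long_ids:
    print(m)
```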

Question has a verified solution.