make re.finditer run faster...

Posted on 2004-11-29
Last Modified: 2010-04-16
Does anyone have any speed improvements to contribute on this code?
The input is the body of emails containing firewall logs, so there can be many matches in one email.

regexppattern = compiled regexp code
matchlist = regexppattern.finditer(body)
        (begin,end) = (0,0)
        for m in matchlist:
                (begin,end) = match2.span()
                match = regexppattern.match(body,begin,end)
                if match.group('ip') in foundstrings:
                        if len(foundstrings[match.group('ip')].log) > 5:
                                continue

The code simply takes too long to execute if I send 600+ firewall-log mails to it.
The code should allow for the possibility of several IPs in one body.
Question by:hegga

Expert Comment

by:RichieHindle
ID: 12707611
Can you supply the real code, and some example input?  There are a number of things that don't make sense:

 o Where does 'match2' come from?
 o Where does 'foundstrings' come from?
 o What does the regular expression actually look like?

Here are a couple of other comments.  I've added some line numbers to your code:

1. regexppattern = compiled regexp code
2. matchlist = regexppattern.finditer(body)
3.         (begin,end) = (0,0)
4.         for m in matchlist:
5.                 (begin,end) = match2.span()
6.                 match = regexppattern.match(body,begin,end)
7.                 if match.group('ip') in foundstrings:
8.                         if len(foundstrings[match.group('ip')].log) > 5:
9.                                 continue

Shouldn't line 5 say "(begin,end) = m.span()"?  There is no 'match2'.
If that's the case, line 6 is redundant - 'match' will be equivalent to 'm'.  You can just use 'm' in lines 7 and 8.

The code doesn't actually do anything - what is the intended output?
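
To illustrate the point about line 6, here is a minimal sketch. The pattern and the sample text are made up for illustration and are not taken from the original post. For a simple pattern like this one, each match object yielded by finditer() already carries the named groups, so calling match() again over the same span gives the same result at extra cost.

```python
import re

# Hypothetical pattern and input, for illustration only.
pattern = re.compile(r"SRC=(?P<ip>\d+\.\d+\.\d+\.\d+)")
body = (
    "IN=ppp0 SRC=83.108.146.184 DST=83.108.157.162\n"
    "IN=ppp0 SRC=81.33.102.158 DST=83.108.157.162\n"
)

def ips_direct(pattern, body):
    # Use the match objects from finditer() as they are.
    return [m.group("ip") for m in pattern.finditer(body)]

def ips_rematched(pattern, body):
    # Redundant variant: re-run match() over each span finditer() found.
    result = []
    for m in pattern.finditer(body):
        begin, end = m.span()
        again = pattern.match(body, begin, end)
        result.append(again.group("ip"))
    return result

direct = ips_direct(pattern, body)
rematched = ips_rematched(pattern, body)
print(direct == rematched)
```

Both functions return the same list here, which is why the second call can be dropped.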

Author Comment

by:hegga
ID: 12708359

Thanks for your reply!

The code below is how I run it now.
The output should be [IP]/[YEAR]-[MONTH]-[DATE] [TIME]
along with a maximum of 5 lines of firewall log.

foundstrings is a global dictionary that stores the string and the log.
But the format of the fwlog may vary; the one below is just an example.

def trypattern(regexppattern, body, tz):
        strPos = None
#       for match in regexppattern.finditer(body):#re.finditer(regexppattern, body): <-- I used this one first
        matchlist = regexppattern.finditer(body)
        (begin,end) = (0,0)
        for match2 in matchlist:
                (begin,end) = match2.span()
                match = regexppattern.match(body,begin,end)
                if match.group('ip') in foundstrings:
                        if len(foundstrings[match.group('ip')].log) > 5:
                                continue
                ip = match.group('ip')
                dom = match.group('date')
                month = match.group('month')
                strPos = match.span()
                year = '2004'
                time = match.group('time')
                if re.match('[a-zA-Z]+', month):
                        month = str(MONTHS[month])
                if ip in foundstrings:
                        if len(foundstrings[ip].log) < 5:
                                foundstrings[ip].log.append(match.string[strPos[0]:strPos[1]])
                else:
                        foundstrings[ip] = Case()
                        foundstrings[ip].ip = ip
                        foundstrings[ip].log.append(match.string[strPos[0]:strPos[1]])
                        foundstrings[ip].string = convert_time(year + '-' + month + '-' + dom + ' ' + time + ' ' + tz, '+1')
        print_foundstrings()

INPUT:
Nov 28 23:52:42 firewall kernel: IN=ppp0 OUT= MAC= SRC=83.108.146.184 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=127 ID=13372 DF PROTO=TCP SPT=2189 DPT=445 WINDOW=64800 RES=0x00 SYN URGP=0
Nov 28 23:52:45 firewall kernel: IN=ppp0 OUT= MAC= SRC=83.108.146.184 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=127 ID=13475 DF PROTO=TCP SPT=2189 DPT=445 WINDOW=64800 RES=0x00 SYN URGP=0
Nov 28 23:52:51 firewall kernel: IN=ppp0 OUT= MAC= SRC=83.108.146.184 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=127 ID=13664 DF PROTO=TCP SPT=2189 DPT=445 WINDOW=64800 RES=0x00 SYN URGP=0
Nov 28 23:52:56 firewall kernel: IN=ppp0 OUT= MAC= SRC=81.33.102.158 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=113 ID=19758 DF PROTO=TCP SPT=4246 DPT=445 WINDOW=16384 RES=0x00 SYN URGP=0
Nov 28 23:52:59 firewall kernel: IN=ppp0 OUT= MAC= SRC=81.33.102.158 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=113 ID=20166 DF PROTO=TCP SPT=4246 DPT=445 WINDOW=16384 RES=0x00 SYN URGP=0
Nov 28 23:53:04 firewall kernel: IN=ppp0 OUT= MAC= SRC=81.33.102.158 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=113 ID=21008 DF PROTO=TCP SPT=4246 DPT=445 WINDOW=16384 RES=0x00 SYN URGP=0
Nov 28 23:53:07 firewall kernel: IN=ppp0 OUT= MAC= SRC=68.206.15.147 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=105 ID=19219 DF PROTO=TCP SPT=3952 DPT=445 WINDOW=17520 RES=0x00 SYN URGP=0
Nov 28 23:53:10 firewall kernel: IN=ppp0 OUT= MAC= SRC=68.206.15.147 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=105 ID=21457 DF PROTO=TCP SPT=3952 DPT=445 WINDOW=17520 RES=0x00 SYN URGP=0
Nov 28 23:53:16 firewall kernel: IN=ppp0 OUT= MAC= SRC=68.206.15.147 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=105 ID=24917 DF PROTO=TCP SPT=3952 DPT=445 WINDOW=17520 RES=0x00 SYN URGP=0

REGEXP:
For the example fwlog the regexp looks like this:
(?P<month>\w+)\s+(?P<date>\d+)\s+(?P<time>\d\d:\d\d:\d\d)\s+(?P<ip>\d+\.\d+\.\d+\.\d+).*?(?P<flag>IN)

Accepted Solution

by:RichieHindle (earned 2000 total points)
ID: 12710803
That regular expression doesn't match any of your example input.  Changing things around, and implementing the missing pieces, I came up with this:

def trypattern(regexppattern, body, tz, foundstrings=foundstrings):
        for match in regexppattern.finditer(body):
                continue
                foundstring = foundstrings.get(match.group('ip'), None)
                if foundstring is not None and len(foundstring.log) >= 5:
                        continue
                ip = match.group('ip')
                dom = match.group('date')
                month = match.group('month')
                year = '2004'
                time = match.group('time')
                if re.match('[a-zA-Z]+', month):
                        pass #str(MONTHS[month])
                if ip in foundstrings:
                        if len(foundstrings[ip].log) < 5:
                                foundstrings[ip].log.append(body[match.start():match.end()])
                else:
                        foundstrings[ip] = Case()
                        foundstrings[ip].ip = ip
                        foundstrings[ip].log.append(body[match.start():match.end()])
                        foundstrings[ip].string = convert_time(year + '-' + month + '-' + dom + ' ' + time + ' ' + tz, '+1')
        print_foundstrings()

 o Added "foundstrings=foundstrings" to make foundstrings a local variable.
 o Remove unnecessary second use of regexppattern - the result was always the same as the first.
 o Replace "match.string[strPos[0]:strPos[1]]" with "body[match.start():match.end()]", which refers to the main match object.
 o Replace "> 5" with ">= 5", so that the condition can fire.
 o Use "foundstrings.get()" to avoid doing the lookup twice.
 o Remove unnecessary pre-definition of variables

That takes it from 8.7 seconds for 100,000 records to 6.3 seconds.  Not very dramatic, but then this:

    def trypattern(pattern, body):
        for match in pattern.finditer(body):
            pass
    trypattern(re.compile("(\w+)"), INPUT)  # INPUT is 100,000 log lines

takes 4.3 seconds, so you're not going to get much of an improvement.  Is 8 seconds for 100,000 records really too slow?  If that sounds too fast, maybe the slowdown is in convert_time(), Case() or print_foundstrings(), or not even in this part of the whole system at all?
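
One way to check where the time goes is to time the bare regexp loop against the loop with its per-match work, as in the comparison above. The following is only a sketch: PATTERN, scan_only, scan_and_process and timed are hypothetical names, and the dictionary update stands in for whatever convert_time(), Case() or print_foundstrings() really do.

```python
import re
import time

# Hypothetical pattern; a stand-in for the real compiled regexp.
PATTERN = re.compile(r"SRC=(?P<ip>\d+\.\d+\.\d+\.\d+)")

def scan_only(body):
    # Baseline: iterate over the matches and only count them.
    count = 0
    for _ in PATTERN.finditer(body):
        count += 1
    return count

def scan_and_process(body):
    # Same loop plus some per-match work (a stand-in for the real processing).
    seen = {}
    for match in PATTERN.finditer(body):
        ip = match.group("ip")
        seen[ip] = seen.get(ip, 0) + 1
    return seen

def timed(func, *args):
    # Return (result, elapsed seconds) for a single call.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

body = "IN=ppp0 SRC=83.108.146.184 DST=83.108.157.162\n" * 10000
count, baseline = timed(scan_only, body)
seen, full = timed(scan_and_process, body)
print("matches: %d" % count)
print("regexp only: %.4fs, with processing: %.4fs" % (baseline, full))
```

If the two timings are close, the regexp scan itself dominates and the per-match code is not the place to look.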

Author Comment

by:hegga
ID: 12714064
Thank you for your time and effort on this case; with your help I've gotten it a bit better.

Expert Comment

by:ramrom
ID: 12719758
The above code will run remarkably fast and generate nothing.
All that will be executed is:
        for match in regexppattern.finditer(body):
                continue
 No?

Expert Comment

by:RichieHindle
ID: 12719874
Oops!  You're quite right.  That first 'continue' should be removed - I'd thrown it in to test how long the loop took if there was no loop body.  Otherwise the code should be fine.  Embarrassing... 8-)
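
For reference, a self-contained sketch of the loop without the stray 'continue' might look like the following. It rests on several assumptions: LOG_LINE is a guessed pattern for the sample log format posted earlier and not the original poster's regexp, Case is a minimal stand-in for the real class, and convert_time() and print_foundstrings() are left out because their code was never posted.

```python
import re

# Assumed pattern for the sample log format; not the original regexp.
LOG_LINE = re.compile(
    r"(?P<month>\w+)\s+(?P<date>\d+)\s+(?P<time>\d\d:\d\d:\d\d)\s"
    r"[^\n]*?SRC=(?P<ip>\d+\.\d+\.\d+\.\d+)[^\n]*"
)
MAX_LINES = 5  # keep at most this many log lines per IP

class Case(object):
    # Minimal stand-in for the thread's Case class.
    def __init__(self, ip, stamp):
        self.ip = ip
        self.string = stamp
        self.log = []

def collect(pattern, body, foundstrings):
    for match in pattern.finditer(body):
        ip = match.group("ip")
        case = foundstrings.get(ip)  # single dictionary lookup per match
        if case is None:
            # First sighting of this IP: record its timestamp once.
            stamp = "%s %s %s" % (
                match.group("month"), match.group("date"), match.group("time"))
            case = foundstrings[ip] = Case(ip, stamp)
        if len(case.log) < MAX_LINES:
            case.log.append(match.group(0))
    return foundstrings

# Seven lines from one source address and one line from another.
sample = "".join(
    "Nov 28 23:52:%02d firewall kernel: IN=ppp0 OUT= MAC= "
    "SRC=83.108.146.184 DST=83.108.157.162 LEN=48\n" % second
    for second in range(7)
) + (
    "Nov 28 23:53:07 firewall kernel: IN=ppp0 OUT= MAC= "
    "SRC=68.206.15.147 DST=83.108.157.162 LEN=48\n"
)

found = collect(LOG_LINE, sample, {})
for ip in sorted(found):
    print("%s %s %d" % (ip, found[ip].string, len(found[ip].log)))
```

Passing the dictionary in as an argument keeps the function free of globals, and the cap of five lines per IP is applied in one place.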