make re.finditer run faster...

Does anyone have any speed improvements to suggest for this code?
The input is the body of emails containing firewall logs, so there can be many matches in one email.

regexppattern = compiled regexp code
matchlist = regexppattern.finditer(body)
        (begin,end) = (0,0)
        for m in matchlist:
                (begin,end) = match2.span()
                match = regexppattern.match(body,begin,end)
                if match.group('ip') in foundstrings:
                        if len(foundstrings[match.group('ip')].log) > 5:
                                continue

The code simply takes too long to execute if I send 600+ fwlog mails to it.
The code should allow for the possibility of several IPs in one body.
hegga Asked:

RichieHindle Commented:
Can you supply the real code, and some example input?  There are a number of things that don't make sense:

 o Where does 'match2' come from?
 o Where does 'foundstrings' come from?
 o What does the regular expression actually look like?

Here are a couple of other comments.  I've added some line numbers to your code:

1. regexppattern = compiled regexp code
2. matchlist = regexppattern.finditer(body)
3.         (begin,end) = (0,0)
4.         for m in matchlist:
5.                 (begin,end) = match2.span()
6.                 match = regexppattern.match(body,begin,end)
7.                 if match.group('ip') in foundstrings:
8.                         if len(foundstrings[match.group('ip')].log) > 5:
9.                                 continue

Shouldn't line 5 say "(begin,end) = m.span()"?  There is no 'match2'.
If that's the case, line 6 is redundant - 'match' will be equivalent to 'm'.  You can just use 'm' in lines 7 and 8.
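
To illustrate the point about line 6, here is a small sketch. The pattern and input are made up for the example and are not taken from the question. For a simple pattern like this one, re-matching over the span that finditer() reported gives the same groups as the match object you already have:

```python
import re

# Hypothetical pattern and input, for illustration only.
pattern = re.compile(r"SRC=(?P<ip>\d+\.\d+\.\d+\.\d+)")
body = (
    "IN=ppp0 SRC=83.108.146.184 DST=83.108.157.162\n"
    "IN=ppp0 SRC=81.33.102.158 DST=83.108.157.162\n"
)

ips_direct = []
ips_rematched = []
for m in pattern.finditer(body):
    begin, end = m.span()
    # The redundant second match over exactly the same span.
    again = pattern.match(body, begin, end)
    ips_direct.append(m.group("ip"))
    ips_rematched.append(again.group("ip"))

print(ips_direct == ips_rematched)
```

So the second match only costs time; 'm' already carries everything needed.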

The code doesn't actually do anything - what is the intended output?
hegga (Author) Commented:

Thanks for your reply!

The code below is how I run it now.
The output should be [IP]/[YEAR]-[MONTH]-[DATE] [TIME]
along with a maximum of 5 lines of firewall log.

foundstrings is a global dictionary that stores the string and the log.
The format of the fwlog may vary, though; the one below is just an example.

def trypattern(regexppattern, body, tz):
        strPos = None
#       for match in regexppattern.finditer(body):#re.finditer(regexppattern, body): <-- I used this one first
        matchlist = regexppattern.finditer(body)
        (begin,end) = (0,0)
        for match2 in matchlist:
                (begin,end) = match2.span()
                match = regexppattern.match(body,begin,end)
                if match.group('ip') in foundstrings:
                        if len(foundstrings[match.group('ip')].log) > 5:
                                continue
                ip = match.group('ip')
                dom = match.group('date')
                month = match.group('month')
                strPos = match.span()
                year = '2004'
                time = match.group('time')
                if re.match('[a-zA-Z]+', month):
                        month = str(MONTHS[month])
                if ip in foundstrings:
                        if len(foundstrings[ip].log) < 5:
                                foundstrings[ip].log.append(match.string[strPos[0]:strPos[1]])
                else:
                        foundstrings[ip] = Case()
                        foundstrings[ip].ip = ip
                        foundstrings[ip].log.append(match.string[strPos[0]:strPos[1]])
                        foundstrings[ip].string = convert_time(year + '-' + month + '-' + dom + ' ' + time + ' ' + tz, '+1')
        print_foundstrings()

INPUT:
Nov 28 23:52:42 firewall kernel: IN=ppp0 OUT= MAC= SRC=83.108.146.184 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=127 ID=13372 DF PROTO=TCP SPT=2189 DPT=445 WINDOW=64800 RES=0x00 SYN URGP=0
Nov 28 23:52:45 firewall kernel: IN=ppp0 OUT= MAC= SRC=83.108.146.184 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=127 ID=13475 DF PROTO=TCP SPT=2189 DPT=445 WINDOW=64800 RES=0x00 SYN URGP=0
Nov 28 23:52:51 firewall kernel: IN=ppp0 OUT= MAC= SRC=83.108.146.184 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=127 ID=13664 DF PROTO=TCP SPT=2189 DPT=445 WINDOW=64800 RES=0x00 SYN URGP=0
Nov 28 23:52:56 firewall kernel: IN=ppp0 OUT= MAC= SRC=81.33.102.158 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=113 ID=19758 DF PROTO=TCP SPT=4246 DPT=445 WINDOW=16384 RES=0x00 SYN URGP=0
Nov 28 23:52:59 firewall kernel: IN=ppp0 OUT= MAC= SRC=81.33.102.158 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=113 ID=20166 DF PROTO=TCP SPT=4246 DPT=445 WINDOW=16384 RES=0x00 SYN URGP=0
Nov 28 23:53:04 firewall kernel: IN=ppp0 OUT= MAC= SRC=81.33.102.158 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=113 ID=21008 DF PROTO=TCP SPT=4246 DPT=445 WINDOW=16384 RES=0x00 SYN URGP=0
Nov 28 23:53:07 firewall kernel: IN=ppp0 OUT= MAC= SRC=68.206.15.147 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=105 ID=19219 DF PROTO=TCP SPT=3952 DPT=445 WINDOW=17520 RES=0x00 SYN URGP=0
Nov 28 23:53:10 firewall kernel: IN=ppp0 OUT= MAC= SRC=68.206.15.147 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=105 ID=21457 DF PROTO=TCP SPT=3952 DPT=445 WINDOW=17520 RES=0x00 SYN URGP=0
Nov 28 23:53:16 firewall kernel: IN=ppp0 OUT= MAC= SRC=68.206.15.147 DST=83.108.157.162 LEN=48 TOS=0x00 PREC=0x00 TTL=105 ID=24917 DF PROTO=TCP SPT=3952 DPT=445 WINDOW=17520 RES=0x00 SYN URGP=0

REGEXP:
For the example fwlog the regexp looks like this:
(?P<month>\w+)\s+(?P<date>\d+)\s+(?P<time>\d\d:\d\d:\d\d)\s+(?P<ip>\d+\.\d+\.\d+\.\d+).*?(?P<flag>IN)
RichieHindle Commented:
That regular expression doesn't match any of your example input.  Changing things around, and implementing the missing pieces, I came up with this:

def trypattern(regexppattern, body, tz, foundstrings=foundstrings):
        for match in regexppattern.finditer(body):
                continue
                foundstring = foundstrings.get(match.group('ip'), None)
                if foundstring is not None and len(foundstring.log) >= 5:
                        continue
                ip = match.group('ip')
                dom = match.group('date')
                month = match.group('month')
                year = '2004'
                time = match.group('time')
                if re.match('[a-zA-Z]+', month):
                        pass #str(MONTHS[month])
                if ip in foundstrings:
                        if len(foundstrings[ip].log) < 5:
                                foundstrings[ip].log.append(body[match.start():match.end()])
                else:
                        foundstrings[ip] = Case()
                        foundstrings[ip].ip = ip
                        foundstrings[ip].log.append(body[match.start():match.end()])
                        foundstrings[ip].string = convert_time(year + '-' + month + '-' + dom + ' ' + time + ' ' + tz, '+1')
        print_foundstrings()

 o Added "foundstrings=foundstrings" to make foundstrings a local variable.
 o Remove unnecessary second use of regexppattern - the result was always the same as the first.
 o Replace "match.string[strPos[0]:strPos[1]]" with "body[match.start():match.end()]", which refers to the main match object.
 o Replace "> 5" with ">= 5", so that the condition can fire.
 o Use "foundstrings.get()" to avoid doing the lookup twice.
 o Remove unnecessary pre-definition of variables

That takes it from 8.7 seconds for 100,000 records to 6.3 seconds.  Not very dramatic, but then this:

    def trypattern(pattern, body):
        for match in pattern.finditer(body):
            pass
    trypattern(re.compile(r"(\w+)"), INPUT)  # INPUT is 100,000 log lines

takes 4.3 seconds, so you're not going to get much of an improvement.  Is 8 seconds for 100,000 records really too slow?  If that sounds much faster than what you're seeing, maybe the slowdown is in convert_time(), Case() or print_foundstrings(), or not even in this part of the whole system at all?
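
One way to check where the time actually goes is the standard library profiler. This is only a sketch: convert_time() here is a hypothetical stand-in for the real helper, and the pattern and input are made up, so substitute your own function and data.

```python
import cProfile
import io
import pstats
import re


def convert_time(stamp):
    # Hypothetical stand-in for the real convert_time() in the thread.
    return stamp.strip()


def trypattern(pattern, body):
    results = []
    for match in pattern.finditer(body):
        results.append(convert_time(match.group(0)))
    return results


# Made-up input: one abbreviated log line repeated 10,000 times.
body = "Nov 28 23:52:42 firewall kernel: IN=ppp0 SRC=83.108.146.184\n" * 10000
pattern = re.compile(r"\d\d:\d\d:\d\d")

profiler = cProfile.Profile()
profiler.enable()
results = trypattern(pattern, body)
profiler.disable()

# Print the ten entries with the highest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

If convert_time() or print_foundstrings() dominates the cumulative column, tuning the regular expression loop will not help much.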

hegga (Author) Commented:
Thank you for your time and effort on this case. With your help I've gotten it a bit better.
ramrom (consultant) Commented:
The above code will run remarkably fast and generate nothing.
All that will be executed is:
        for match in regexppattern.finditer(body):
                continue
 No?
RichieHindle Commented:
Oops!  You're quite right.  That first 'continue' should be removed - I'd thrown it in to test how fast the loop took if there was no loop body.  Otherwise the code should be fine.  Embarrassing... 8-)
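
For reference, here is a self-contained sketch of the loop with the stray 'continue' removed. Several pieces are assumptions rather than the original code: the Case class is a minimal stand-in, the pattern is one guess at a regular expression that matches the sample log lines (it takes the address after SRC=), and convert_time() is left out, so the raw timestamp is stored instead.

```python
import re


class Case:
    """Hypothetical stand-in for the Case class used in the thread."""

    def __init__(self):
        self.ip = None
        self.string = None
        self.log = []


MAX_LOG_LINES = 5
foundstrings = {}

# Assumed pattern: month, day, time, then the SRC= address later on the line.
LOG_LINE = re.compile(
    r"(?P<month>\w+)\s+(?P<date>\d+)\s+(?P<time>\d\d:\d\d:\d\d)\s+"
    r".*?SRC=(?P<ip>\d+\.\d+\.\d+\.\d+)"
)


def trypattern(pattern, body, foundstrings=foundstrings):
    for match in pattern.finditer(body):
        ip = match.group("ip")
        case = foundstrings.get(ip)  # single dictionary lookup
        if case is None:
            case = foundstrings[ip] = Case()
            case.ip = ip
            # convert_time() is omitted here; keep the raw timestamp.
            case.string = "%s %s %s" % (
                match.group("month"), match.group("date"), match.group("time"))
        if len(case.log) < MAX_LOG_LINES:
            case.log.append(match.group(0))


# Abbreviated sample lines: seven for one address, two for another.
body = (
    "Nov 28 23:52:42 firewall kernel: IN=ppp0 OUT= MAC= "
    "SRC=83.108.146.184 DST=83.108.157.162\n" * 7
    + "Nov 28 23:52:56 firewall kernel: IN=ppp0 OUT= MAC= "
    "SRC=81.33.102.158 DST=83.108.157.162\n" * 2
)
trypattern(LOG_LINE, body)
for ip, case in foundstrings.items():
    print(ip, case.string, len(case.log))
```

The cap of five log lines per address is applied in one place, so addresses that are already full cost only the dictionary lookup.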