Solved

Need to speed up slow executing scripts

Posted on 2011-02-15
Last Modified: 2012-05-11
The attached scripts work, but they take a long time to process.  They basically filter the records of a log file for a certain string and write the records that contain that string to a new file.  I know execution time is directly affected by the size of the original log file, but is there any way to speed this up?  I have been told that some methods of reading and/or writing to a file are faster than others.  Is the method that I am using the most efficient?  If not, how do I improve it?

The files that I am filtering are between 20 and 80 MB and take up to 50 seconds to process.
import os, time

from datetime import datetime, timedelta


def SpecErrLog(File, dt, err, Duration):
    source_file = open(File,"r")
    
    try:
        file1 = File + " " + dt.replace(":","_") + " [" +  err + "]" + str(Duration)
        #file1 = "temptry.txt"
        dest_file = open(file1, 'w')
        Mach, Dt = File.split()
        
        BeginDay = datetime.strptime("00:00:00.000", '%H:%M:%S.%f')
        EndDay = datetime.strptime("23:59:59.999", '%H:%M:%S.%f')
        dt = datetime.strptime(dt, '%H:%M:%S.%f')
        
        for line in source_file:
            arr=line.split(",")
            LineTimeStamp = datetime.strptime(arr[4].strip(), '%H:%M:%S.%f') #timeStamp of sourcefile.
            upperLimit= dt + timedelta(minutes=Duration)
            lowerLimit= dt - timedelta(minutes=Duration)
            if lowerLimit > BeginDay and upperLimit < EndDay: #if all records accurr within the same day.
                if lowerLimit < LineTimeStamp < upperLimit:
                    dest_file.write(line)
    finally:
        # close is a method; without the parentheses the files were never closed.
        source_file.close()
        dest_file.close()
        print("finished")
        
if __name__ == "__main__":

    dt = "09:52:15.710"
    err = "54300"
    SpecErrLog("H108 01-24-2011", dt, err, 30)

Question by:NevSoFly
7 Comments
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 34901855
Here's a page about Python compilers: http://effbot.org/zone/python-compile.htm  That might speed it up.
 
LVL 5

Accepted Solution

by:
-Richard- earned 500 total points
ID: 34903462
The link above describes how to package your Python application as an EXE file -- this is not the same as compiling it.  Python files (".py" extension), after passing through the interpreter, will be stored into a byte code representation with a ".pyc" extension.   The packages described are a way to bundle the .pyc files along with related libraries into a single executable with a .EXE extension, but this is just a convenience and does not affect execution speed.  

Standard optimization techniques that apply to any language apply here too.  It's always good to look for loop invariants - values that get calculated every time through the loop even though you get the same result every time.  You have two of them - the calculation for upperLimit and lowerLimit.  Take those two lines out of the loop and put them prior to it.  That should make some kind of difference - maybe not too dramatic, but noticeable.
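As a minimal sketch of that change, the limits can be computed once before the loop. The function name filter_window and the list-based input are hypothetical, chosen only so the example stands alone; the field position and time format follow the posted script, and the same-day check is left out for brevity.

```python
from datetime import datetime, timedelta

TIME_FORMAT = '%H:%M:%S.%f'


def filter_window(lines, dt_text, duration_minutes):
    """Return the lines whose timestamp (5th comma-separated field) is inside the window."""
    center = datetime.strptime(dt_text, TIME_FORMAT)

    # Loop invariants: these depend only on the arguments, so compute them once.
    lower_limit = center - timedelta(minutes=duration_minutes)
    upper_limit = center + timedelta(minutes=duration_minutes)

    kept = []
    for line in lines:
        fields = line.split(",")
        stamp = datetime.strptime(fields[4].strip(), TIME_FORMAT)
        if lower_limit < stamp < upper_limit:
            kept.append(line)
    return kept
```

In the real script the loop would iterate over source_file and write to dest_file instead of building a list; only the position of the two limit calculations changes.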

Another standard optimization technique is to put your most selective conditionals first.   First you check for lowerLimit and upperLimit being on the same day, then you check the timestamp.  But won't the limits be on the same day most of the time?  That means you'll almost always fall through to the second conditional.  If the timestamp check is satisfied much less frequently, then put it first.  Then you won't fall through to the second comparison so often.   That could be important because I think those comparisons of datetime objects are probably not very efficient.

Speaking of comparing datetime objects, you may be able to eliminate doing comparisons against them entirely.  It would be much quicker to simply compare numbers.  If you represented all times as the number of seconds since January 1st, 1970 (a common convention), then you could compare all dates and times as integers rather than datetime objects, which would be much quicker.  

I don't really think the file i/o is your problem.  The standard write statement is quite efficient and I've written plenty of programs that looked a lot like yours which processed files that big quite rapidly.

I hope you find these ideas helpful.  Enjoy!  


 
LVL 5

Expert Comment

by:-Richard-
ID: 34903540
In fact, on second thought, I'm not convinced your logic is exactly correct.  You're throwing out lines whenever the lower limit is less than the beginning of the day or the upper limit is greater than the end of the day.  I think what you want to do is change the lower limit calculation so that if the lower limit comes out as less than the start of the day, you make it the start of the day; and make an analogous change with the upper limit and the end of the day.  That would allow you to eliminate the BeginDay and EndDay comparison entirely, as well as eliminating a bug.  I don't think your way would work properly if the initial "dt" parameter is very close to the beginning or the end of the day.
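One possible way to express that clamping is with max and min, sketched below. The helper name clamped_limits is hypothetical; the day boundaries and time format are the ones from the posted script.

```python
from datetime import datetime, timedelta

TIME_FORMAT = '%H:%M:%S.%f'
BEGIN_DAY = datetime.strptime("00:00:00.000", TIME_FORMAT)
END_DAY = datetime.strptime("23:59:59.999", TIME_FORMAT)


def clamped_limits(dt_text, duration_minutes):
    """Return (lower, upper) limits that never leave the current day."""
    center = datetime.strptime(dt_text, TIME_FORMAT)
    window = timedelta(minutes=duration_minutes)
    lower = max(center - window, BEGIN_DAY)  # never earlier than the start of the day
    upper = min(center + window, END_DAY)    # never later than the end of the day
    return lower, upper
```

With clamped limits, the loop would probably want inclusive comparisons (lower <= LineTimeStamp <= upper) so that a record stamped exactly 00:00:00.000 or 23:59:59.999 is not dropped.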
 

Author Comment

by:NevSoFly
ID: 34919853
Richard,

I tried your first suggestions and saved about 7 seconds, but I am having a hard time trying to convert the timestamp to seconds.  I get an out of range error.  I have looked on the web but can't find another way to convert a datetime to seconds (integer).

I am trying to use your second suggestion, but I think I may have to rethink my approach.  The reason is that the log files that I am pulling this data from only contain info for 1 day.  If the LineTimeStamp is less than BeginDay, then I will need to open the previous day's logs and search them.

So for now I think I will only work with one day and not test for BeginDay or EndDay.
 

Author Comment

by:NevSoFly
ID: 34919865
If you do know of a way to convert lowerLimit and upperLimit to seconds, I'm all ears.
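One way this conversion might be done, offered as a sketch rather than a tested fix for the script: since the logs hold only a time of day, seconds since midnight can stand in for seconds since 1970. The out of range error may come from the fact that strptime with a time-only format yields a date in the year 1900, which epoch-based conversions can reject. The helper names below are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = '%H:%M:%S.%f'


def seconds_since_midnight(moment):
    """Convert the time-of-day part of a datetime to seconds as a float."""
    return (moment.hour * 3600 + moment.minute * 60
            + moment.second + moment.microsecond / 1e6)


def text_to_seconds(stamp_text):
    """Convert 'HH:MM:SS.fff' text to seconds directly, without strptime."""
    hours, minutes, seconds = stamp_text.strip().split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)
```

For example, with dt parsed as in the script and Duration in minutes, the limits become plain numbers: seconds_since_midnight(dt) - Duration * 60 and seconds_since_midnight(dt) + Duration * 60. Each line's timestamp can then be converted with text_to_seconds(arr[4]) and compared against them. Whether this is actually faster on these logs would need to be measured.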
 
LVL 5

Expert Comment

by:-Richard-
ID: 34921059
Working with only one day will again improve your efficiency because you will move one more slow conditional check from within to outside the loop.  That should gain you several more seconds.  

Additionally, I missed two more loop invariants!   The calculation of lowerLimit and upperLimit will give the same result every time through the loop too.  Those lines can be moved prior to the loop, which should gain you even more time.

My suggestion about using seconds was probably my worst idea.  Using seconds will make the comparison faster, but the additional computation involved in doing the conversion might destroy the benefit or even make it worse.  I think you can safely forget about it.

Once you do all the other things we discussed, you'll have a nice tight program and it will be running about as fast as it can.   80 megabytes is not a small file and it will take some time under the best of circumstances.   If it gets down to the 30-second range I'd say you were doing pretty well.
 

Author Closing Comment

by:NevSoFly
ID: 34921529
thank you