Python file open and parse

Could someone please give me a leg up on this in Python?

Would like to parse the following input file.
Line 1 is the current date and time
Line 2 is a wireless signal strength (field 5)
..repeated..

Essentially, I need to extract the date into an internal data format for subsequent formatting and, of course, the signal strength field (field 5) from the second line.

Many thanks
BT

Input file (obviously longer, but same format)
Sat Apr 23 03:00:01 UTC 2011
00:00:00:00:00:00   10   11   6M    6  -89  255    585  50208 ESs          0       25   Normal ATH
Sat Apr 23 03:30:01 UTC 2011
00:00:00:00:00:00   10   11  11M   17  -88  225    697  52480 ESs          0       25   Normal ATH

Required output
<date>,<signal>
2011-04-23 03:00:01,6
2011-04-23 03:30:01,17

Asked by: brothertom

HonorGod commented:
Something like this perhaps?
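import re
import sys
from datetime import datetime

# Read the whole file named on the command line into memory.
fh = open( sys.argv[ 1 ] )
data = fh.read()
fh.close()

linenum = 0
for line in data.splitlines() :
  linenum += 1
  if linenum % 2 :
    # Odd lines hold the timestamp; line[ 4: ] skips the leading weekday ("Sat ").
    dateobj = datetime.strptime( line[ 4: ], "%b %d %H:%M:%S UTC %Y" )
  else :
    # Even lines hold the signal data; field 5 is index 4 after splitting on spaces.
    print( '"%s"' % line )             # debug: the raw line
    print( re.split( ' +', line ) )    # debug: the split fields
    print( '%s,%s' % ( datetime.strftime( dateobj, '%Y-%m-%d %H:%M:%S' ), re.split( ' +', line )[ 4 ] ) )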

brothertom (author) commented:
Excellent - thanks for this, big help.

BT

pepr commented:
@HonorGod: Congratulations on YAEEC (Yet Another Expert Exchange Certificate) -- now for Python!  You are on the right track :)))

Some minor comments on the code above.  While I understand the story behind the 'fh' identifier, you should probably stick with the simpler 'f'.  A 'file handle' is a number (possibly a pointer) that identifies the structure processed by the file functions; the handle is passed to those functions.  The f here, on the other hand, is an object that contains the data structure and also "contains" the methods that work with the data.  This is a rather cosmetic issue, but we should often clean and sharpen our brains to use a less hazy mental picture. ;)

Newer versions of Python tend to use iterators in cases that were earlier often solved with temporary lists that were later traversed, say, by for loops.  Unless you have a very special reason to read everything into memory, it is usually better (more readable, less typing, more memory efficient, ...) to simply open the file, process the lines on the fly, and close the file.

If you need to count the lines of a file, or any other iterated elements, the enumerate() function is handy.  It also accepts a second argument: the start value of the counter.
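
A minimal sketch of the enumerate() start value, assuming a plain list as a stand-in for the open file (the list contents are only placeholders):

```python
# A list stands in for an open file here; iterating over a real file
# object yields lines in the same way.
lines = ["Sat Apr 23 03:00:01 UTC 2011", "signal line"]

# The second argument makes the counter start at 1 instead of the default 0.
numbered = list(enumerate(lines, 1))
for linenum, line in numbered:
    print(linenum, line)
```

With the counter starting at 1, odd numbers correspond to the timestamp lines of the input file.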

The style guide (PEP 8) says something about using spaces in the source code and recommends 4 spaces per indentation level.  It really is a cosmetic issue.  Still, the source is then more readable for MORE PEOPLE, as they are used to that style.

strptime() supports %a for abbreviated day names.  This way you do not need to slice the line (source line 13, the line[ 4: ] in the strptime() call).

str.split() can be used here instead of re.split(...).  There is no need for regular expressions here.

If the processing of a text file has to deal with more types of (regularly repeated) lines, it may be a good idea to use a finite automaton.  The above case is simple; however, such cases can easily get more complicated.

To summarize, the same code can look like this:

import sys
from datetime import datetime

f = open(sys.argv[1])
for linenum, line in enumerate(f, 1):
    line = line.rstrip()   # chop the \n and possibly some trailing white-spaces
    if linenum % 2:
        dateobj = datetime.strptime(line, "%a %b %d %H:%M:%S UTC %Y")
    else:
        print('"%s"' % line)    # debug: the raw line
        print(line.split())     # debug: the split fields
        print('%s,%s' % (datetime.strftime(dateobj, '%Y-%m-%d %H:%M:%S'), line.split()[4]))
f.close()
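
The finite automaton mentioned above could be sketched roughly as follows. This is only an illustration in Python 3: the function name parse_records, the two state names, and the 04:00:01 timestamp in the sample are made up for the example, and treating any line that contains 'UTC' as a timestamp is an assumption about the input.

```python
from datetime import datetime

DATE_FORMAT = "%a %b %d %H:%M:%S UTC %Y"

def parse_records(lines):
    """Yield (datetime, signal-or-None) pairs using a two-state automaton.

    'want_date'   : no timestamp is pending
    'want_signal' : a timestamp was read and its signal line is expected
    """
    state = "want_date"
    stamp = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip empty lines
        if "UTC" in line:                 # assumption: only timestamps contain 'UTC'
            if state == "want_signal":
                yield stamp, None         # previous timestamp had no signal line
            stamp = datetime.strptime(line, DATE_FORMAT)
            state = "want_signal"
        elif state == "want_signal":
            yield stamp, line.split()[4]  # field 5 is the signal strength
            state = "want_date"
        # a signal line with no pending timestamp is ignored
    if state == "want_signal":
        yield stamp, None                 # file ended right after a timestamp

sample = [
    "Sat Apr 23 03:00:01 UTC 2011\n",
    "00:00:00:00:00:00   10   11   6M    6  -89  255    585  50208 ESs          0       25   Normal ATH\n",
    "Sat Apr 23 03:30:01 UTC 2011\n",     # timestamp without a signal line
    "Sat Apr 23 04:00:01 UTC 2011\n",     # made-up timestamp for illustration
    "00:00:00:00:00:00   10   11  11M   17  -88  225    697  52480 ESs          0       25   Normal ATH\n",
]

for stamp, signal in parse_records(sample):
    print("%s,%s" % (stamp.strftime("%Y-%m-%d %H:%M:%S"), "" if signal is None else signal))
```

The explicit state makes it easy to add further line types later; for the simple two-line format it is probably more than is needed.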

pepr commented:
If you use Python 2.6+ (or Python 2.5 with the feature explicitly switched on via 'from __future__ import with_statement'), you can use the 'with' construct to close the file automatically.  It may be handy when the code that processes the lines is rather long and you want to be sure that you never forget to close the file:

import sys
from datetime import datetime

with open(sys.argv[1]) as f:
    for linenum, line in enumerate(f, 1):
        line = line.rstrip()   # chop the \n and possibly some trailing white-spaces
        if linenum % 2:
            dateobj = datetime.strptime(line, "%a %b %d %H:%M:%S UTC %Y")
        else:
            print('"%s"' % line)    # debug: the raw line
            print(line.split())     # debug: the split fields
            print('%s,%s' % (datetime.strftime(dateobj, '%Y-%m-%d %H:%M:%S'), line.split()[4]))

Note: this is something completely different from the 'with' in the Pascal language (to avoid confusion if you know Pascal).

brothertom (author) commented:
Thanks pepr for the additional code and thoughts.

Incidentally, what is the Python equivalent of the PHP preg_match statement that determines whether a string is contained within a string?  The actual file being processed often does not have a signal strength line after a timestamp, so I need to determine whether a line read is a timestamp or not, rather than rely on the signal line being an even line number.

so from line 7,

 if preg_match('/UTC/', line):
   ..process date..
 else:
  ..process_signal..

pepr commented:
In that case, I would probably choose the 'in' operator:

if 'UTC' in line:
    ..process date..
else:
    ..process signal..

The 'in' operator for strings tests whether something is a substring of the string.  It is equivalent to

if line.find('UTC') != -1:
    ..process date..
else:
    ..process signal..

However, the .find() variant is more general here and should be slower because of that (looking up the method by name, passing the argument, returning the position, and comparing it with -1), whereas the test that uses 'in' can be optimized during compilation.  On my computer, for this specific line, 'in' is about 4 times faster:

C:\tmp\___python\brothertom\Q_26975865>python -m timeit "'Sat Apr 23 03:00:01 UTC 2011'.find('UTC') != -1"
1000000 loops, best of 3: 0.313 usec per loop

C:\tmp\___python\brothertom\Q_26975865>python -m timeit "'Sat Apr 23 03:00:01 UTC 2011'.find('UTC') != -1"
1000000 loops, best of 3: 0.314 usec per loop

If you still want to use regular expressions (in some more complex cases), you should use the re module and its search() function:

import re
...
    if re.search('UTC', line):
        ..process date..
    else:
        ..process_signal..

When used in a loop (and possibly also in other cases), it is good practice to get a precompiled regular expression object and use that in the test:

import re
...
rexUTC = re.compile('UTC')
for line in f:
    line = line.rstrip()
    if rexUTC.search(line):
        ..process date..
    else:
        ..process_signal..

Regular expressions are also nice when you want to extract groups out of the line.
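
As a rough illustration of extracting groups (Python 3), here is a sketch with a pattern written just for this example. The name TIMESTAMP and the exact pattern are assumptions, not something required by the input format:

```python
import re

# Hypothetical pattern for the timestamp line:
# weekday, month, day, time, the literal "UTC", year.
TIMESTAMP = re.compile(r"^\w{3} (\w{3}) +(\d{1,2}) (\d{2}:\d{2}:\d{2}) UTC (\d{4})$")

m = TIMESTAMP.search("Sat Apr 23 03:00:01 UTC 2011")
if m:
    # Each parenthesized part of the pattern becomes one group.
    month, day, time_of_day, year = m.groups()
    print(month, day, time_of_day, year)
```

For this particular input, datetime.strptime() already does the same job, so the regular expression pays off mainly when the line format is less regular.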

As there is no need to count lines now, the enumerate() and linenum can be removed.  I still prefer using 'in' in this case.  (Notice the different place where line.rstrip() is done.)

import sys
from datetime import datetime


f = open(sys.argv[1])
for line in f:
    if 'UTC' in line:
        # Timestamp line: remember it until a signal line follows.
        dateobj = datetime.strptime(line.rstrip(), "%a %b %d %H:%M:%S UTC %Y")
    else:
        print('"%s"' % line)    # debug: the raw line (still ends with \n here)
        print(line.split())     # debug: the split fields
        print('%s,%s' % (datetime.strftime(dateobj, '%Y-%m-%d %H:%M:%S'), line.split()[4]))
f.close()

pepr commented:
Sorry, the example that shows 'in' vs. .find() should have been this one:

C:\tmp\___python\brothertom\Q_26975865>python -m timeit "'UTC' in 'Sat Apr 23 03:00:01 UTC 2011'"
10000000 loops, best of 3: 0.079 usec per loop

C:\tmp\___python\brothertom\Q_26975865>python -m timeit "'Sat Apr 23 03:00:01 UTC 2011'.find('UTC') != -1"
1000000 loops, best of 3: 0.313 usec per loop
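
The same comparison can also be run from a script with the timeit module. This is a Python 3 sketch; the loop count of 100000 is an arbitrary choice, and the absolute numbers will differ between machines and Python versions:

```python
import timeit

line = "Sat Apr 23 03:00:01 UTC 2011"

# Both tests give the same answer; only their cost may differ.
same_answer = ("UTC" in line) == (line.find("UTC") != -1)

# Time each expression; 'globals' makes the variable 'line' visible to the timed code.
t_in = timeit.timeit("'UTC' in line", globals={"line": line}, number=100000)
t_find = timeit.timeit("line.find('UTC') != -1", globals={"line": line}, number=100000)

print("in:   %.4f s for 100000 loops" % t_in)
print("find: %.4f s for 100000 loops" % t_find)
```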

brothertom (author) commented:
What a great answer - thank you very much.

HonorGod commented:
I just love reading pepr's answers!  I always learn from them.

Thank you pepr, yet again, for an excellent response.

HonorGod commented:
@pepr - How did you know that I got YAEEC? Do you get notified?

pepr commented:
@HonorGod:  http://www.experts-exchange.com/Programming/Languages/Scripting/Python/ -- the tab Recent Activity, 26/04/11 01:00 AM