Solved

Create single header dictionary/list for raw data file

Posted on 2008-10-13
Last Modified: 2012-08-14
I have a script that grabs lines from a raw data (csv) file and copies the lines associated with 1-minute intervals to one file, those associated with 60-minute intervals to another, and cleans up the headers (see attached code).  

In each of the 24 archive files I need to process, there are 39 headers / parameters followed by the raw data (see attached excerpt).  Right now each header parameter is listed sequentially, followed by data in timestamp sequence.  What I'm trying to do is put the headers at the top of the file in a single row, and line up the metrics into 39 columns.  The timestamps begin in December 07 and go to present, so there are a huge number of lines, 150,000 or more.  

I've tried to create a single dictionary list from the headers, but all I've managed to do is print a selected item from the original data by row (39 separate lists with 1 item each).  I will need to associate the raw data with the header array, presumably with name-value pairs, but didn't get that far.  Let me know what you can do to help, TIA.
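
Conceptually, what I'm after is something like the structure below (just a sketch of the target layout with made-up values; my current script is attached after it):

# Rough idea only: one entry per timestamp, one name-value pair per header parameter (39 in all).
# The parameter names and values shown here are placeholders.
desired = {
    ('12/07/2007', '13:33:00'): {'CCHRC.PV1.DEM:APPARENT PWR': 1.23,
                                 'CCHRC.PV1.DEM:CONSUMPTN LO': 4.56},
    ('12/07/2007', '13:34:00'): {'CCHRC.PV1.DEM:APPARENT PWR': 1.24,
                                 'CCHRC.PV1.DEM:CONSUMPTN LO': 4.57},
    # ... and so on for every timestamp ...
}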
#!/usr/bin/python

import csv

fin = open('rawData.csv', 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)

fout_hr = open('hr_sorted.csv', 'wb')
hr_writer = csv.writer(fout_hr, quoting=csv.QUOTE_NONNUMERIC)

fout_min = open('min_sorted.csv', 'wb')
min_writer = csv.writer(fout_min, quoting=csv.QUOTE_NONNUMERIC)

status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer

for row in source:
    if status == 0:   # wait for a section header row
        if row[0] == 'Point Name:':
            header = [ row[1:] ]    # the list with the info from the first header row
            status = 1
        else:
            continue

    elif status == 1:  # collecting some header rows
        if row[1] == '1 Minutes':
            header.append(row)
            writer = min_writer
        elif row[1] == '60 Minutes':
            header.append(row)
            writer = hr_writer
        elif row[1] == 'Date Range':
            pass                       # Ignore this header row.
        elif row[0] == 'Report Timings:':
            # Ignore this header row. Do output the collected header.
            for r in header:
                writer.writerow(r)
            status = 2                 # Get ready for the value rows.

    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            writer.writerow(row[0:3])  # output the first three columns only
        else:
            writer.writerow([''])      # separator
            status = 0                 # wait for another section

    else:
        print 'Unexpected status', status

fout_hr.close()
fout_min.close()


Rawdata.txt
Question by:sara_bellum
 

Author Comment by:sara_bellum
In a perfect world I'd have more time to research this and figure it out, but there we are... I should mention that the 60-minute/1-hour intervals in each archive vary depending on which hour in the day is being archived, so our focus is on the 1-minute intervals, where we always have 60 timestamps per metric regardless of which file is being processed.  (Of course it would be nice to have the hour archives working as well, but that's a lot less important.)
Expert Comment by:PaulKeating
  I'm having some trouble visualizing the output you want.

Am I correct in saying that you want (for the minute file) a 39 x 60 matrix of values, with the 39 column labels set to "CCHRC.PV1.DEM:APPARENT PWR", "CCHRC.PV1.DEM:CONSUMPTN LO", and so on, the 60 row labels being timestamps occupying 2 columns (1 for date and one for time), and the body of the matrix being the values in column 3 of the input file?

And for the hour file, similar but the matrix will be 39 x n with n depending on the data in the input file?

And do the timestamps have to be in ascending order in the output, the way they are in the input? It seems the natural way to do it, but if a program, not a person, will be consuming the .csv you are producing, then it probably won't matter, and allowing them to be unsorted would simplify the program a little.
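
To make sure we're talking about the same thing, the minute file would then start something like this (made-up values, and only two of the 39 value columns shown):

"Date","Time","CCHRC.PV1.DEM:APPARENT PWR","CCHRC.PV1.DEM:CONSUMPTN LO"
"12/07/2007","13:33:00",1.23,4.56
"12/07/2007","13:34:00",1.24,4.57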
Author Comment by:sara_bellum
That sounds right, but I'm attaching an excerpt of a processed file so you can better visualize the format (this one was processed in PHP, but that method ate a lot of memory and garbled the timestamps, so we're starting over in Python).  
The timestamp order we use is always oldest to most recent - the latest timestamps are appended to the end of each raw data file, so the processed files should follow the same order.  
archive-sample.txt
Expert Comment by:PaulKeating
Ok, here it is.

You can't get entirely away from the memory problem. If you want to sort something, you can't print the first line of output until you've read the last line of input, because the last line might sort to the top of the list. You want a crosstab, and crosstabs have the same problem. You have to read everything into memory before you write anything out.

I suspect the garbled timestamps in the PHP program may have been due to the fact that the timestamps are not consistent through all the measurements. Some of them start and end a minute later than the others. If the programmer assumed that this was not so then a lot of the observations would have been off by one.
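
Concretely, everything gets collected into one nested dictionary before anything is written out: frequency first, then timestamp, then point name. In miniature (made-up values; in the script below the timestamp key is the parsed time.struct_time, so the rows sort chronologically):

valuematrix = {
    '1 Minutes': {
        '12/07/2007 13:33:00': {'CCHRC.PV1.DEM:APPARENT PWR': 1.23,
                                'CCHRC.PV1.DEM:CONSUMPTN LO': 4.56},
        # ... one inner dictionary per minute timestamp ...
    },
    '60 Minutes': {},   # filled the same way from the hourly sections
}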

import csv
import time
import os.path

def convert_timestamp(date, hhmmss):
    return time.strptime(date + " " + hhmmss, "%m/%d/%Y %H:%M:%S")

fin = open(r'c:\temp\Q_23810592.txt', 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)

status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer

valuematrix = {'1 Minutes': {}, '60 Minutes': {}}

for row in source:
    if status == 0:   # wait for a section header row
        if row[0] == 'Point Name:':
            columnlabel = row[1]   # the point name from the first header row
            status = 1
        else:
            continue

    elif status == 1:  # collecting some header rows
        if row[1] == '1 Minutes':
            freq = row[1]
        elif row[1] == '60 Minutes':
            freq = row[1]
        elif row[1] == 'Date Range':
            pass                       # Ignore this header row.
        elif row[0] == 'Report Timings:':
            status = 2                 # Get ready for the value rows.

    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            date, hhmmss, value = row[0:3]
            rowlabel = convert_timestamp(date, hhmmss)
            valuematrix[freq].setdefault(rowlabel,{})[columnlabel] = value
        else:
            status = 0                 # wait for another section

    else:
        print 'Unexpected status', status

for freq in valuematrix.keys():  # freq = hour, minute
  fout = open(os.path.join(r'c:\temp',freq)+'.csv' , 'wb')
  try:
      writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
      timestamps = valuematrix[freq].keys()
      timestamps.sort()
      observations = valuematrix[freq][timestamps[len(timestamps) // 2]].keys()
      observations.sort()
      headers = ["Date", "Time"] + observations
      writer.writerow(headers)  # output the headers

      for row in timestamps:
          details = [time.strftime("%m/%d/%Y", row),
                     time.strftime("%H:%M:%S", row)]
          for column in observations:
              try:
                  details.append(valuematrix[freq][row][column])
              except KeyError:
                  details.append("")
                  print 'Missing value', freq, time.asctime(row), column
          writer.writerow(details)
  finally:
      fout.close()

 

 

Author Comment by:sara_bellum
Thanks for your quick response!

I ran the script against a daily data file (a copy of which is attached above) and against a much larger archive file (both raw data files are in the same format).  The outputs, '1 Minutes.csv' and '60 Minutes.csv', look good - I'm forwarding them to the operators to see if they detect any problem areas.
It was a great idea to introduce rows of timestamps for missing data - this was probably where the original script went wrong.

When executing this script at the Linux command line against the shorter data file, I got 39 lines of output like this:
Missing value 1 Minutes Tue Sep 23 06:01:00 2008 CCHRC.PV1.DEM:APPARENT PWR
(one for each header item); for the archive file, I got 112191 missing value lines, beginning with
Missing value 60 Minutes Fri Dec  7 13:33:08 2007 CCHRC.PV3.DEM:POWER FACTOR
and ending with
Missing value 1 Minutes Mon Sep 22 23:01:00 2008 CCHRC.PV3.DEM:DEMAND
(again, reporting a number of missing timestamps 39 times).

$ ./process.py > errors.txt could store the missing values in files that could successively be overwritten as needed; as annoying as it is to see a timestamp error repeated 39 times for each of 39 values, I will leave the error logs as is, because there are timestamps with some metrics showing values and others with none.  

You were right about how slow it was, and without a progress indicator at the command prompt I wasn't sure whether the script was processing the archive or just trying to. Until I'm confident enough of the processing power of this machine to put this in a crontab, I'd prefer to see some hash marks with a closing remark like "Done", if that can be accomplished without appreciably slowing down the script.

Finally, we are storing these files on a Linux server, so if there's a way to change the fout path to yield *nix-compatible file names, that would be better than the current file names.

Let me know if you need anything else. Thanks again!

Expert Comment by:PaulKeating
Ok, some real-world adjustments.

First, instead of hard-coding the file locations I'm taking them off the command line, and I'm sending the missing-values messages to stderr, so invocation will now be something like

$ ./process.py    /path/to/input.txt     /path/for/output.csv     2>     /path/for/error.log

Second, I've put in the progress messages you asked for. I'm printing a hash to the console for every line read. This may be too many: if so, put in a counter and report every 10th or 100th line.

counter += 1                             # with counter = 0 initialised before the loop
if counter % 100 == 1: print '#',

Third, I've substituted underscores for spaces in the names of the output files for the benefit of picky filesystems.



#!/usr/bin/python

import csv
import time
import os.path
import sys

def convert_timestamp(date, hhmmss):
    return time.strptime(date + " " + hhmmss, "%m/%d/%Y %H:%M:%S")

fin = open(sys.argv[1], 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)

status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer

valuematrix = {'1 Minutes': {}, '60 Minutes': {}}

for row in source:
    if status == 0:   # wait for a section header row
        if row[0] == 'Point Name:':
            columnlabel = row[1]   # the point name from the first header row
            status = 1
        else:
            continue

    elif status == 1:  # collecting some header rows
        if row[1] == '1 Minutes':
            freq = row[1]
        elif row[1] == '60 Minutes':
            freq = row[1]
        elif row[1] == 'Date Range':
            pass                       # Ignore this header row.
        elif row[0] == 'Report Timings:':
            status = 2                 # Get ready for the value rows.

    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            date, hhmmss, value = row[0:3]
            rowlabel = convert_timestamp(date, hhmmss)
            valuematrix[freq].setdefault(rowlabel,{})[columnlabel] = value
            print '#',
        else:
            status = 0                 # wait for another section
            print

    else:
        print >> sys.stderr, 'Unexpected status', status

print 'Input phase complete'

for freq in valuematrix.keys():  # freq = hour, minute
  fout = open(os.path.join(sys.argv[2],freq.replace(' ','_').lower())+'.csv' , 'wb')
  try:
      writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
      timestamps = valuematrix[freq].keys()
      timestamps.sort()
      observations = valuematrix[freq][timestamps[len(timestamps) // 2]].keys()
      observations.sort()
      headers = ["Date", "Time"] + observations
      writer.writerow(headers)  # output the headers

      for row in timestamps:
          details = [time.strftime("%m/%d/%Y", row),
                     time.strftime("%H:%M:%S", row)]
          for column in observations:
              try:
                  details.append(valuematrix[freq][row][column])
              except KeyError:
                  details.append("")
                  print >> sys.stderr, 'Missing value', freq, time.asctime(row), column
          writer.writerow(details)
  finally:
      fout.close()
      print 'Output phase complete'

 

 

Author Comment by:sara_bellum
Thanks very much!  I may be pushing my luck here, but I need one last thing: these outputs will be uploaded to a single directory on the server where users can download them from a web page, so each processed archive should have a unique filename.  I could set up a shutil copy to rename all the output files, but I'm such a novice that I'd have to define source and destination 24 times for the hourly archives and 24 times for the minute-interval archives, and then shutil.copy from each source to each destination (very tedious; if I can't reduce that I'll ask another question to resolve it).  The quickest thing would be to find some way to concatenate the archive number into the output filename before the freq, in order to get filenames like archive1_1_minutes.csv, archive1_60_minutes.csv etc. for the 24 archives.  Let me know if that can be done, thanks.
Expert Comment by:PaulKeating
Where is the number to come from? Is it in the name of the input file? Or do you want the program to work out the next number from looking at the files that already exist?
 
Expert Comment by:PaulKeating
... or maybe it would be a better idea to derive the filename from the first timestamp in the file: that way it wouldn't matter what order the input was processed in, and it would be obvious what was in a file without having to open it.
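
Something along these lines, say (rough sketch; the example timestamp is made up, and in the real script it would come from the first value row parsed by convert_timestamp):

import time

first_stamp = time.strptime("12/07/2007 13:33:08", "%m/%d/%Y %H:%M:%S")
prefix = time.strftime("%Y%m%d_%H%M", first_stamp)                # -> '20071207_1333'
freq = '1 Minutes'
fname = prefix + '_' + freq.replace(' ', '_').lower() + '.csv'    # -> '20071207_1333_1_minutes.csv'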
Author Comment by:sara_bellum
I can't use the starting timestamps for the archive files because they are historical and all begin with the same timestamp.  This does not apply to the daily data files (like the one I posted here) but we might as well use the same naming convention for the daily files as we do for the archives.

By way of explanation, I was prepared to run
$ ./process.py    /path-to-archives/archive1/archive1.csv     /path-to-archives/archive1/     2>     /path-to-archives/error.log
$ ./process.py    /path-to-archives/archive2/archive2.csv     /path-to-archives/archive2/     2>     /path-to-archives/error.log
etc 24 times, but that was because I needed to create a separate directory for each archive file in order for our PHP script to run at all.

I understand that your python script has no such limitation, but if I leave all the archive files in a single directory and run
$ ./process.py    /path-to-archives/archive1.csv     /path-to-archives/output1/    2>     /path-to-archives/error.log
$ ./process.py    /path-to-archives/archive2.csv     /path-to-archives/output2/    2>     /path-to-archives/error.log

then the filenames in the output directories still need to be unique, and the number should come from the name of the archive file (archive1, archive2 etc). Ideally, the processed files should have names like archive1_minutes.csv, archive1_hours.csv, archive2_minutes.csv, archive2_hours.csv etc.  If my boss still wants the daily files processed as well, the same numbering system would apply: data1.csv should yield data1_minutes.csv and data1_hours.csv etc (there are 24 daily data files along with 24 archive files; the daily files were processed correctly in PHP as far as we know, but we ran into difficulties with the archives).

As always, TIA :)
Accepted Solution by:PaulKeating (earned 500 total points)
I've changed it to extract the serial number from the name of the input file. (If it can't find one it will use 000 - take that as an indication something is wrong.)
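
For example (just to show what the pattern picks out; the file names are made up):

import re, os.path

seqre = re.compile(r'\D*(\d+).*\.')
for name in ('archive7.csv', 'data12.csv'):
    print name, '->', 'output' + seqre.match(os.path.basename(name)).group(1)
# prints: archive7.csv -> output7
#         data12.csv -> output12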

This works like the other program did: it creates a folder called outputN (N taken from the input file name) under the destination folder you specify and puts the crosstab files in that. If that isn't what you want, and you'd like all the output files to go to the same folder, then change the line that opens the output file from:

fout = open(os.path.join(outloc,freq.replace(' ','_').lower())+'.csv' , 'wb')

to

fout = open(outloc+'_'+freq.replace(' ','_').lower()+'.csv' , 'wb')

... doing that just substitutes _ for the / delimiter in the filename ...

and remove the two lines that create the subfolder (the os.path.exists / os.mkdir pair).

You really ought to get the shell to loop through the files for you instead of issuing 24 commands; you might check out the shell's for command. I won't offer a shell example because I'm no expert on that topic, but a small plain-Python driver that does the same job is sketched after the script below.

#!/usr/bin/python

import csv
import time
import os.path
import sys
import re

def convert_timestamp(date, hhmmss):
    return time.strptime(date + " " + hhmmss, "%m/%d/%Y %H:%M:%S")

seqre = re.compile(r'\D*(\d+).*\.')

fin = open(sys.argv[1], 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)

try:
    seq = 'output' + seqre.match(os.path.basename(fin.name)).group(1)
except:
    seq = 'output000'

status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer

outloc = os.path.join(sys.argv[2],seq)
if not os.path.exists(outloc):
    os.mkdir(outloc)

valuematrix = {'1 Minutes': {}, '60 Minutes': {}}

for row in source:
    if status == 0:   # wait for a section header row
        if row[0] == 'Point Name:':
            columnlabel = row[1]   # the point name from the first header row
            status = 1
        else:
            continue

    elif status == 1:  # collecting some header rows
        if row[1] == '1 Minutes':
            freq = row[1]
        elif row[1] == '60 Minutes':
            freq = row[1]
        elif row[1] == 'Date Range':
            pass                       # Ignore this header row.
        elif row[0] == 'Report Timings:':
            status = 2                 # Get ready for the value rows.

    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            date, hhmmss, value = row[0:3]
            rowlabel = convert_timestamp(date, hhmmss)
            valuematrix[freq].setdefault(rowlabel,{})[columnlabel] = value
            print '#',
        else:
            status = 0                 # wait for another section
            print

    else:
        print >> sys.stderr, 'Unexpected status', status

fin.close()
print 'Input phase complete', fin.name

for freq in valuematrix.keys():  # freq = hour, minute
  fout = open(os.path.join(outloc,freq.replace(' ','_').lower())+'.csv' , 'wb')
  #fout = open(outloc+'_'+freq.replace(' ','_').lower()+'.csv' , 'wb')
  try:
      writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
      timestamps = valuematrix[freq].keys()
      timestamps.sort()
      observations = valuematrix[freq][timestamps[len(timestamps) // 2]].keys()
      observations.sort()
      headers = ["Date", "Time"] + observations
      writer.writerow(headers)  # output the headers

      for row in timestamps:
          details = [time.strftime("%m/%d/%Y", row),
                     time.strftime("%H:%M:%S", row)]
          for column in observations:
              try:
                  details.append(valuematrix[freq][row][column])
              except KeyError:
                  details.append("")
                  print >> sys.stderr, 'Missing value', freq, time.asctime(row), column
          writer.writerow(details)
  finally:
      fout.close()
      print 'Output phase complete', fout.name
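
And if you'd rather stay in Python than learn the shell loop, a small driver along these lines could run the script over every archive for you (the paths and the glob pattern are placeholders; adjust them to your layout):

#!/usr/bin/python
# Rough driver sketch: run process.py once per archive file.
import glob
import subprocess
import sys

for archive in sorted(glob.glob('/path-to-archives/archive*.csv')):
    errlog = open('/path-to-archives/error.log', 'ab')   # append each run's messages
    try:
        rc = subprocess.call(['./process.py', archive, '/path-to-archives/'],
                             stderr=errlog)
        if rc != 0:
            print >> sys.stderr, 'process.py failed on', archive
    finally:
        errlog.close()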

 

 

Author Closing Comment by:sara_bellum
This is absolutely terrific!!  I need to figure out how to automate this process to avoid poking in commands manually each time, but figure that I've asked enough questions already ;-)