sara_bellum (United States) asked:
Create single header dictionary/list for raw data file

I have a script that grabs lines from a raw data (csv) file and copies the lines associated with 1-minute intervals to one file, those associated with 60-minute intervals to another, and cleans up the headers (see attached code).  

In each of the 24 archive files I need to process, there are 39 header parameters followed by the raw data (see attached excerpt).  Right now each header parameter is listed sequentially, followed by its data in timestamp sequence.  What I'm trying to do is put the headers at the top of the file in a single row and line up the metrics in 39 columns.  The timestamps begin in December 2007 and run to the present, so there are a huge number of lines: 150,000 or more.

I've tried to create a single dictionary list from the headers, but all I've managed to do is print a selected item from the original data by row (39 separate lists with 1 item each).  I will need to associate the raw data with the header array, presumably with name-value pairs, but didn't get that far.  Let me know what you can do to help, TIA.
#!/usr/bin/python
 
import csv
 
fin = open('rawData.csv', 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)
 
fout_hr = open('hr_sorted.csv', 'wb')
hr_writer = csv.writer(fout_hr, quoting=csv.QUOTE_NONNUMERIC)
 
fout_min = open('min_sorted.csv', 'wb')
min_writer = csv.writer(fout_min, quoting=csv.QUOTE_NONNUMERIC)
 
status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer
 
for row in source:
    if status == 0:   # wait for a section's header rows
        if row[0] == 'Point Name:':
            header = [ row[1:] ]    # the list with the info from the first header row
            status = 1
        else:
            continue
 
    elif status == 1:  # collecting some header rows
        if row[1] == '1 Minutes':
            header.append(row)
            writer = min_writer
        elif row[1] == '60 Minutes':
            header.append(row)
            writer = hr_writer
        elif row[1] == 'Date Range':
            pass                       # Ignore this header row.
        elif row[0] == 'Report Timings:':
            # Ignore this header row. Do output the collected header.
            for r in header:
                writer.writerow(r)
            status = 2                 # Get ready for the value rows.
 
    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            writer.writerow(row[0:3])  # output the first three columns only
        else:
            writer.writerow([''])      # separator
            status = 0                 # wait for another section
    
    else:
        print 'Unexpected status', status    
        
fout_hr.close()
fout_min.close()


Rawdata.txt
sara_bellum (ASKER):
In a perfect world I'd have more time to research this and figure this out, but there we are...I should mention that the 60-minute/1-hour intervals in each archive vary depending on which hour in the day is being archived, so our focus is on the 1-minute intervals where we always have 60 timestamps per metric, regardless of which file is being processed.  (Of course it would be nice to have the hour archives working as well, but that's a lot less important.)

I'm having some trouble visualizing the output you want.

Am I correct in saying that you want (for the minute file) a 39 x 60 matrix of values, with the 39 column labels set to "CCHRC.PV1.DEM:APPARENT   PWR", "   CCHRC.PV1.DEM:CONSUMPTN   LO", and so on, the 60 row labels being timestamps occupying two columns (one for the date and one for the time), and the body of the matrix being the values in column 3 of the input file?

And for the hour file, similar but the matrix will be 39 x n with n depending on the data in the input file?

And do the timestamps have to be in ascending order in the output, the way they are in the input? It seems a natural way to do it, but if a program not a person will be consuming the .csv you are producing, then it probably won't matter, and allowing them to be unsorted will simplify the program a little.
That sounds right, but I'm attaching an excerpt of a processed file so you can better visualize the format (this one was processed in PHP, but that method ate a lot of memory and garbled the timestamps, so we're starting over in Python).
The timestamp order we use is always oldest to most recent - the latest timestamps are appended to the end of each raw data file, so the processed files should follow the same order.
archive-sample.txt
Ok, here it is.

You can't get entirely away from the memory problem. If you want to sort something, you can't print the first line of output until you've read the last line of input, because the last line might sort to the top of the list. You want a crosstab, and crosstabs have the same problem. You have to read everything into memory before you write anything out.
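The crosstab idea can be sketched in miniature (the point names and values below are illustrative, not taken from the data file):

```python
# Pivot (timestamp, point, value) triples into one row per timestamp.
# Everything must sit in memory before the first output row can be written.
rows = [
    ("06:00", "POWER", 1.5),
    ("06:01", "POWER", 1.6),
    ("06:00", "DEMAND", 7.0),   # read last, but belongs in the first output row
]

matrix = {}
for ts, point, value in rows:
    matrix.setdefault(ts, {})[point] = value

columns = sorted(set(point for _, point, _ in rows))
table = [["Time"] + columns]
for ts in sorted(matrix):       # sorting is only possible after reading all input
    table.append([ts] + [matrix[ts].get(point, "") for point in columns])
```

`matrix.setdefault(ts, {})` is the same trick the script below uses to build `valuematrix`; a cell with no observation comes out as an empty string.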

I suspect the garbled timestamps in the PHP program may have been due to the fact that the timestamps are not consistent through all the measurements. Some of them start and end a minute later than the others. If the programmer assumed that this was not so then a lot of the observations would have been off by one.

import csv
import time
import os.path
 
def convert_timestamp(date, hhmmss):
    return time.strptime(date + " " + hhmmss, "%m/%d/%Y %H:%M:%S")
 
 
fin = open(r'c:\temp\Q_23810592.txt', 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)
 
status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer
 
valuematrix = {'1 Minutes': {}, '60 Minutes': {}}
 
for row in source:
    if status == 0:   # wait for a section's header rows
        if row[0] == 'Point Name:':
            columnlabel = row[1]   # the point name from the first header row
            status = 1
        else:
            continue
 
    elif status == 1:  # collecting some header rows
        if row[1] in ('1 Minutes', '60 Minutes'):
            freq = row[1]              # remember which output this section feeds
        elif row[1] == 'Date Range':
            pass                       # Ignore this header row.
        elif row[0] == 'Report Timings:':
            status = 2                 # Get ready for the value rows.
 
    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            date, hhmmss, value = row[0:3]
            rowlabel = convert_timestamp(date, hhmmss)
            valuematrix[freq].setdefault(rowlabel,{})[columnlabel] = value
        else:
            status = 0                 # wait for another section
    
    else:
        print 'Unexpected status', status    
 
 
 
for freq in valuematrix.keys():  # freq = '1 Minutes' or '60 Minutes'
  fout = open(os.path.join(r'c:\temp',freq)+'.csv' , 'wb')
  try:
      writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
      timestamps = valuematrix[freq].keys()
      timestamps.sort()
      observations = valuematrix[freq][timestamps[len(timestamps) // 2]].keys()  # column names from a mid-file timestamp
      observations.sort()
      headers = ["Date", "Time"] + observations
      writer.writerow(headers)  # output the headers
      
      for row in timestamps:
          details = [time.strftime("%m/%d/%Y", row),
                     time.strftime("%H:%M:%S", row)] 
          for column in observations:
              try:
                  details.append(valuematrix[freq][row][column])
              except KeyError:
                  details.append("")
                  print 'Missing value',freq, time.asctime(row), column
          writer.writerow(details)
  finally:
      fout.close()
 
 


Thanks for your quick response!

I ran the script against a daily data file (a copy of which is attached above) and against a much larger archive file (both raw data files are in the same format).  The outputs, '1 Minutes.csv' and '60 Minutes.csv' look good - I'm forwarding to the operators to see if they detect any problem areas.
It was a great idea to introduce rows of timestamps for missing data - this was probably where the original script went wrong.

When executing this script at the Linux command line against the shorter data file, I got 39 lines of output like this:
Missing value 1 Minutes Tue Sep 23 06:01:00 2008 CCHRC.PV1.DEM:APPARENT PWR
(one for each header item); for the archive file, I got 112191 missing value lines, beginning with
Missing value 60 Minutes Fri Dec  7 13:33:08 2007 CCHRC.PV3.DEM:POWER FACTOR
and ending with
Missing value 1 Minutes Mon Sep 22 23:01:00 2008 CCHRC.PV3.DEM:DEMAND
(again, reporting a number of missing timestamps 39 times).

Running $ ./process.py > errors.txt would store the missing-value messages in a file that could be overwritten as needed.  As annoying as it is to see a timestamp error repeated 39 times for each of 39 values, I will leave the error logs as is, because there are timestamps where some metrics show values and others show none.

You were right about how slow it was, and without a progress indicator at the command prompt I wasn't sure whether the script was processing the archive or just trying to. Until I'm confident enough of the processing power of this machine to put this in a crontab, I'd prefer to see some hash marks with a closing remark like "Done", if that can be accomplished without appreciably slowing down the script.

Finally, we are storing these files on a Linux server, so if there's a way to change the fout path to yield *nix-compatible file names, that would be better than the current file names.

Let me know if you need anything else. Thanks again!

Ok, some real-world adjustments.

First, instead of hard-coding the file locations I'm taking them off the command line, and I'm sending the missing-values messages to stderr, so invocation will now be something like

$ ./process.py    /path/to/input.txt     /path/for/output.csv     2>     /path/for/error.log

Second, I've put in the progress messages you asked for. I'm printing a hash to the console for every line read. This may be too many: if so, put in a counter and report every 10th or 100th line.

counter = 0        # before the loop
# inside the loop:
counter += 1
if counter % 100 == 1: print '#',

Third, I've substituted underscores for spaces in the names of the output files for the benefit of picky filesystems.



#!/usr/bin/python
 
import csv
import time
import os.path
import sys
 
def convert_timestamp(date, hhmmss):
    return time.strptime(date + " " + hhmmss, "%m/%d/%Y %H:%M:%S")
 
 
fin = open(sys.argv[1], 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)
 
status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer
 
valuematrix = {'1 Minutes': {}, '60 Minutes': {}}
 
for row in source:
    if status == 0:   # wait for a section's header rows
        if row[0] == 'Point Name:':
            columnlabel = row[1]   # the point name from the first header row
            status = 1
        else:
            continue
 
    elif status == 1:  # collecting some header rows
        if row[1] in ('1 Minutes', '60 Minutes'):
            freq = row[1]              # remember which output this section feeds
        elif row[1] == 'Date Range':
            pass                       # Ignore this header row.
        elif row[0] == 'Report Timings:':
            status = 2                 # Get ready for the value rows.
 
    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            date, hhmmss, value = row[0:3]
            rowlabel = convert_timestamp(date, hhmmss)
            valuematrix[freq].setdefault(rowlabel,{})[columnlabel] = value
            print '#',
        else:
            status = 0                 # wait for another section
            print
    
    else:
        print >> sys.stderr, 'Unexpected status', status
print 'Input phase complete'
 
 
for freq in valuematrix.keys():  # freq = '1 Minutes' or '60 Minutes'
  fout = open(os.path.join(sys.argv[2],freq.replace(' ','_').lower())+'.csv' , 'wb')
  try:
      writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
      timestamps = valuematrix[freq].keys()
      timestamps.sort()
      observations = valuematrix[freq][timestamps[len(timestamps) // 2]].keys()  # column names from a mid-file timestamp
      observations.sort()
      headers = ["Date", "Time"] + observations
      writer.writerow(headers)  # output the headers
      
      for row in timestamps:
          details = [time.strftime("%m/%d/%Y", row),
                     time.strftime("%H:%M:%S", row)] 
          for column in observations:
              try:
                  details.append(valuematrix[freq][row][column])
              except KeyError:
                  details.append("")
                  print >> sys.stderr, 'Missing value',freq, time.asctime(row), column
          writer.writerow(details)
  finally:
      fout.close()
print 'Output phase complete'
 
 


Thanks very much!  I may be pushing my luck here, but I need one last thing: these outputs will be uploaded to a single directory on the server where users can download them from a web page, so each processed archive should have a unique filename.  I could set up a shutil.copy to rename all the output files, but I'm such a novice that I would have to define source and destination 24 times for the hourly archives and 24 times for the minute-interval archives, then shutil.copy from each source to each destination (very tedious; if I can't reduce that, I'll ask another question to resolve it).  The quickest fix would be to concatenate the archive number into the output filename before the freq, to get filenames like archive1_1_minutes.csv, archive1_60_minutes.csv, etc. for the 24 archives.  Let me know if that can be done, thanks.
Where is the number to come from? Is it in the name of the input file? Or do you want the program to work out the next number from looking at the files that already exist?
 
... or maybe it would be a better idea to derive the filename from the first timestamp in the file: that way it wouldn't matter what order the input was processed in, and it would be obvious what was in a file without having to open it.
I can't use the starting timestamps for the archive files because they are historical and all begin with the same timestamp.  This does not apply to the daily data files (like the one I posted here) but we might as well use the same naming convention for the daily files as we do for the archives.

By way of explanation, I was prepared to run
$ ./process.py    /path-to-archives/archive1/archive1.csv     /path-to-archives/archive1/     2>     /path-to-archives/error.log
$ ./process.py    /path-to-archives/archive2/archive2.csv     /path-to-archives/archive2/     2>     /path-to-archives/error.log
etc 24 times, but that was because I needed to create a separate directory for each archive file in order for our PHP script to run at all.

I understand that your python script has no such limitation, but if I leave all the archive files in a single directory and run
$ ./process.py    /path-to-archives/archive1.csv     /path-to-archives/output1/    2>     /path-to-archives/error.log
$ ./process.py    /path-to-archives/archive2.csv     /path-to-archives/output2/    2>     /path-to-archives/error.log

then the filenames in the output directories still need to be unique, and the number should come from the name of the archive file (archive1, archive2 etc). Ideally, the processed files should have names like archive1_minutes.csv, archive1_hours.csv, archive2_minutes.csv, archive2_hours.csv etc.  If my boss still wants the daily files processed as well, the same numbering system would apply: data1.csv should yield data1_minutes.csv and data1_hours.csv etc (there are 24 daily data files along with 24 archive files; the daily files were processed correctly in PHP as far as we know, but we ran into difficulties with the archives).
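One way to build those names, assuming the archive number is carried in the input file's base name (the helper name `output_name` is mine, not from the accepted solution):

```python
import os

def output_name(input_path, freq):
    """Derive e.g. 'archive1_1_minutes.csv' from 'archive1.csv' and '1 Minutes'."""
    base = os.path.splitext(os.path.basename(input_path))[0]   # 'archive1', 'data3', ...
    return base + "_" + freq.replace(" ", "_").lower() + ".csv"
```

In the script's output loop, opening `os.path.join(sys.argv[2], output_name(sys.argv[1], freq))` instead of the bare `freq`-based name would make every processed file unique per archive.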

As always, TIA :)
ASKER CERTIFIED SOLUTION by PaulKeating (Netherlands)
This is absolutely terrific!!  I need to figure out how to automate this process to avoid poking in commands manually each time, but figure that I've asked enough questions already ;-)
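For the automation question left open here, a sketch of one approach: build the 24 invocations from the archive file names instead of typing them (the helper `build_commands` and its dry-run style are assumptions, not part of the accepted solution):

```python
import os

def build_commands(archive_paths, out_root, script="./process.py"):
    """One shell command per archive, each writing into its own output directory."""
    commands = []
    for path in sorted(archive_paths):
        base = os.path.splitext(os.path.basename(path))[0]   # e.g. 'archive1'
        outdir = os.path.join(out_root, base)
        errlog = os.path.join(out_root, "error.log")
        commands.append("%s %s %s/ 2>> %s" % (script, path, outdir, errlog))
    return commands
```

Feeding it `glob.glob('/path-to-archives/archive*.csv')` and printing the results into a shell script (or running them via `subprocess`) would let a single crontab entry process every archive.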