Solved

Create single header dictionary/list for raw data file

Posted on 2008-10-13
743 Views
Last Modified: 2012-08-14
I have a script that grabs lines from a raw data (csv) file and copies the lines associated with 1-minute intervals to one file, those associated with 60-minute intervals to another, and cleans up the headers (see attached code).  

In each of the 24 archive files I need to process, there are 39 header parameters followed by the raw data (see attached excerpt).  Right now each header parameter is listed sequentially, followed by its data in timestamp sequence.  What I'm trying to do is put the headers in a single row at the top of the file and line the metrics up in 39 columns.  The timestamps begin in December 2007 and run to the present, so there are a huge number of lines, 150,000 or more.  

I've tried to create a single dictionary list from the headers, but all I've managed to do is print a selected item from the original data by row (39 separate lists with 1 item each).  I will need to associate the raw data with the header array, presumably with name-value pairs, but didn't get that far.  Let me know what you can do to help, TIA.
#!/usr/bin/python
 
import csv
 
fin = open('rawData.csv', 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)
 
fout_hr = open('hr_sorted.csv', 'wb')
hr_writer = csv.writer(fout_hr, quoting=csv.QUOTE_NONNUMERIC)
 
fout_min = open('min_sorted.csv', 'wb')
min_writer = csv.writer(fout_min, quoting=csv.QUOTE_NONNUMERIC)
 
status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer
 
for row in source:
    if status == 0:   # wait for a section header row
        if row[0] == 'Point Name:':
            header = [ row[1:] ]    # the list with the info from the first header row
            status = 1
        else:
            continue
 
    elif status == 1:  # collecting some header rows
        if row[1] == '1 Minutes':
            header.append(row)
            writer = min_writer
        elif row[1] == '60 Minutes':
            header.append(row)
            writer = hr_writer
        elif row[1] == 'Date Range':
            pass                       # Ignore this header row.
        elif row[0] == 'Report Timings:':
            # Ignore this header row. Do output the collected header.
            for r in header:
                writer.writerow(r)
            status = 2                 # Get ready for the value rows.
 
    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            writer.writerow(row[0:3])  # output the first three columns only
        else:
            writer.writerow([''])      # separator
            status = 0                 # wait for another section
    
    else:
        print 'Unexpected status', status    
        
fout_hr.close()
fout_min.close()


Rawdata.txt
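The reshaping being asked for (one column per point name, one row per timestamp) can be sketched in miniature with a dictionary of dictionaries; the sketch below uses modern Python, and the point names and values are invented:

```python
import csv
import io

# Toy version of the raw file: each section is one point name
# followed by (timestamp, value) pairs.  Names and values are invented.
sections = {
    'DEMAND':  [('00:01:00', '1.5'), ('00:02:00', '1.6')],
    'VOLTAGE': [('00:01:00', '240'), ('00:02:00', '241')],
}

# Pivot into a dict of dicts: outer key = timestamp, inner key = point name.
matrix = {}
for name, rows in sections.items():
    for ts, value in rows:
        matrix.setdefault(ts, {})[name] = value

# Emit one header row, then one row per timestamp, one column per point.
out = io.StringIO()
writer = csv.writer(out)
columns = sorted(sections)
writer.writerow(['Time'] + columns)
for ts in sorted(matrix):
    writer.writerow([ts] + [matrix[ts].get(name, '') for name in columns])

print(out.getvalue())
```

The `.get(name, '')` leaves a blank cell when a point has no reading for a timestamp, which matters once the real sections turn out not to share identical timestamp ranges.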
Question by:sara_bellum
12 Comments
 

Author Comment

by:sara_bellum
ID: 22705543
In a perfect world I'd have more time to research and figure this out, but there we are.  I should mention that the 60-minute/1-hour intervals in each archive vary depending on which hour of the day is being archived, so our focus is on the 1-minute intervals, where we always have 60 timestamps per metric regardless of which file is being processed.  (Of course it would be nice to have the hour archives working as well, but that's much less important.)

Expert Comment

by:PaulKeating
ID: 22706682
I'm having some trouble visualizing the output you want.

Am I correct in saying that you want (for the minute file) a 39 x 60 matrix of values, with the 39 column labels set to "CCHRC.PV1.DEM:APPARENT PWR", "CCHRC.PV1.DEM:CONSUMPTN LO", and so on, the 60 row labels being timestamps occupying two columns (one for date, one for time), and the body of the matrix being the values in column 3 of the input file?

And for the hour file, similar but the matrix will be 39 x n with n depending on the data in the input file?

And do the timestamps have to be in ascending order in the output, the way they are in the input? It seems a natural way to do it, but if a program not a person will be consuming the .csv you are producing, then it probably won't matter, and allowing them to be unsorted will simplify the program a little.
 

Author Comment

by:sara_bellum
ID: 22707166
That sounds right, but I'm attaching an excerpt of a processed file so you can better visualize the format (this one was processed in PHP, but that method ate a lot of memory and garbled the timestamps, so we're starting over in python).  
The timestamp order we use is always oldest to most recent - the latest timestamps are appended to the end of each raw data file, so the processed files should follow the same order.  
archive-sample.txt

Expert Comment

by:PaulKeating
ID: 22707623
Ok, here it is.

You can't get entirely away from the memory problem. If you want to sort something, you can't print the first line of output until you've read the last line of input, because the last line might sort to the top of the list. You want a crosstab, and crosstabs have the same problem. You have to read everything into memory before you write anything out.

I suspect the garbled timestamps in the PHP program may have been due to the fact that the timestamps are not consistent through all the measurements. Some of them start and end a minute later than the others. If the programmer assumed that this was not so then a lot of the observations would have been off by one.
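One way to see why `time.strptime` keys help here: `struct_time` values compare field by field (year, month, day, hour, minute, second), so readings from sections that start a minute apart still interleave correctly when sorted.  A small sketch with invented timestamps:

```python
import time

def convert_timestamp(date, hhmmss):
    # Same conversion the script uses: parse into a sortable struct_time.
    return time.strptime(date + " " + hhmmss, "%m/%d/%Y %H:%M:%S")

# Two sections whose readings are offset by one minute.
a = [convert_timestamp("12/07/2007", t) for t in ("13:33:00", "13:35:00")]
b = [convert_timestamp("12/07/2007", t) for t in ("13:34:00", "13:36:00")]

# Sorting the merged keys interleaves them chronologically.
merged = sorted(a + b)
print([time.strftime("%H:%M:%S", ts) for ts in merged])
# ['13:33:00', '13:34:00', '13:35:00', '13:36:00']
```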

import csv
import time
import os.path
 
def convert_timestamp(date, hhmmss):
    return time.strptime(date + " " + hhmmss, "%m/%d/%Y %H:%M:%S")
 
 
fin = open(r'c:\temp\Q_23810592.txt', 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)
 
status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer
 
valuematrix = {'1 Minutes': {}, '60 Minutes': {}}
 
for row in source:
    if status == 0:   # wait for a section header row
        if row[0] == 'Point Name:':
            columnlabel = row[1]   # the point name from this section's first header row
            status = 1
        else:
            continue
 
    elif status == 1:  # collecting some header rows
        if row[1] == '1 Minutes':
            freq = row[1]
        elif row[1] == '60 Minutes':
            freq = row[1]
        elif row[1] == 'Date Range':
            pass                       # Ignore this header row.
        elif row[0] == 'Report Timings:':
            status = 2                 # Get ready for the value rows.
 
    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            date, hhmmss, value = row[0:3]
            rowlabel = convert_timestamp(date, hhmmss)
            valuematrix[freq].setdefault(rowlabel,{})[columnlabel] = value
        else:
            status = 0                 # wait for another section
    
    else:
        print 'Unexpected status', status    
 
 
 
for freq in valuematrix.keys():  # freq = hour, minute
  fout = open(os.path.join(r'c:\temp',freq)+'.csv' , 'wb')
  try:
      writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
      timestamps = valuematrix[freq].keys()
      timestamps.sort()
      observations = valuematrix[freq][timestamps[len(timestamps) // 2]].keys()  # take the column set from a mid-file timestamp
      observations.sort()
      headers = ["Date", "Time"] + observations
      writer.writerow(headers)  # output the headers
      
      for row in timestamps:
          details = [time.strftime("%m/%d/%Y", row),
                     time.strftime("%H:%M:%S", row)] 
          for column in observations:
              try:
                  details.append(valuematrix[freq][row][column])
              except KeyError:
                  details.append("")
                  print 'Missing value',freq, time.asctime(row), column
          writer.writerow(details)
  finally:
      fout.close()
 
 


 

Author Comment

by:sara_bellum
ID: 22708304
Thanks for your quick response!

I ran the script against a daily data file (a copy of which is attached above) and against a much larger archive file (both raw data files are in the same format).  The outputs, '1 Minutes.csv' and '60 Minutes.csv' look good - I'm forwarding to the operators to see if they detect any problem areas.
It was a great idea to introduce rows of timestamps for missing data - this was probably where the original script went wrong.

When executing this script at the Linux command line against the shorter data file, I got 39 lines of output like this:
Missing value 1 Minutes Tue Sep 23 06:01:00 2008 CCHRC.PV1.DEM:APPARENT PWR
(one for each header item); for the archive file, I got 112191 missing value lines, beginning with
Missing value 60 Minutes Fri Dec  7 13:33:08 2007 CCHRC.PV3.DEM:POWER FACTOR
and ending with
Missing value 1 Minutes Mon Sep 22 23:01:00 2008 CCHRC.PV3.DEM:DEMAND
(again, reporting a number of missing timestamps 39 times).

$ ./process.py > errors.txt could store the missing values in a file that can be overwritten as needed.  As annoying as it is to see a timestamp error repeated 39 times, once for each of the 39 values, I will leave the error logs as is, because there are timestamps where some metrics show values and others show none.  

You were right about how slow it was, and without a progress indicator at the command prompt I wasn't sure whether the script was processing the archive or just trying to. Until I'm confident enough of the processing power of this machine to put this in a crontab, I'd prefer to see some hash marks with a closing remark like "Done", if that can be accomplished without appreciably slowing down the script.

Finally, we are storing these files on a Linux server, so if there's a way to change the fout path to yield *nix-compatible file names, that would be better than the current file names.

Let me know if you need anything else. Thanks again!


Expert Comment

by:PaulKeating
ID: 22712892
Ok, some real-world adjustments.

First, instead of hard-coding the file locations I'm taking them off the command line, and I'm sending the missing-values messages to stderr, so invocation will now be something like

$ ./process.py    /path/to/input.txt     /path/for/output.csv     2>     /path/for/error.log

Second, I've put in the progress messages you asked for. I'm printing a hash to the console for every line read. This may be too many: if so, put in a counter and report every 10th or 100th line.

counter += 1
if counter % 100 == 1: print '#',

Third, I've substituted underscores for spaces in the names of the output file for the benefit of picky filesystems.
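The substitution is just a string replace applied before building the filename, e.g.:

```python
# Turn a section frequency label like '60 Minutes' into a
# filesystem-friendly output name.
freq = '60 Minutes'
filename = freq.replace(' ', '_').lower() + '.csv'
print(filename)  # 60_minutes.csv
```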



#!/usr/bin/python
 
import csv
import time
import os.path
import sys
 
def convert_timestamp(date, hhmmss):
    return time.strptime(date + " " + hhmmss, "%m/%d/%Y %H:%M:%S")
 
 
fin = open(sys.argv[1], 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)
 
status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer
 
valuematrix = {'1 Minutes': {}, '60 Minutes': {}}
 
for row in source:
    if status == 0:   # wait for a section header row
        if row[0] == 'Point Name:':
            columnlabel = row[1]   # the point name from this section's first header row
            status = 1
        else:
            continue
 
    elif status == 1:  # collecting some header rows
        if row[1] == '1 Minutes':
            freq = row[1]
        elif row[1] == '60 Minutes':
            freq = row[1]
        elif row[1] == 'Date Range':
            pass                       # Ignore this header row.
        elif row[0] == 'Report Timings:':
            status = 2                 # Get ready for the value rows.
 
    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            date, hhmmss, value = row[0:3]
            rowlabel = convert_timestamp(date, hhmmss)
            valuematrix[freq].setdefault(rowlabel,{})[columnlabel] = value
            print '#',
        else:
            status = 0                 # wait for another section
            print
    
    else:
        print >> sys.stderr, 'Unexpected status', status
print 'Input phase complete'
 
 
for freq in valuematrix.keys():  # freq = hour, minute
  fout = open(os.path.join(sys.argv[2],freq.replace(' ','_').lower())+'.csv' , 'wb')
  try:
      writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
      timestamps = valuematrix[freq].keys()
      timestamps.sort()
      observations = valuematrix[freq][timestamps[len(timestamps) // 2]].keys()  # take the column set from a mid-file timestamp
      observations.sort()
      headers = ["Date", "Time"] + observations
      writer.writerow(headers)  # output the headers
      
      for row in timestamps:
          details = [time.strftime("%m/%d/%Y", row),
                     time.strftime("%H:%M:%S", row)] 
          for column in observations:
              try:
                  details.append(valuematrix[freq][row][column])
              except KeyError:
                  details.append("")
                  print >> sys.stderr, 'Missing value',freq, time.asctime(row), column
          writer.writerow(details)
  finally:
      fout.close()
      print 'Output phase complete'
 
 


 

Author Comment

by:sara_bellum
ID: 22716947
Thanks very much!  I may be pushing my luck here, but I need one last thing: these outputs will be uploaded to a single directory on the server, where users can download them from a web page, so each processed archive needs a unique filename.  I could set up a shutil.copy to rename all the output files, but I'm such a novice that I'd have to define source and destination 24 times for the hourly archives and 24 times for the minute-interval archives, then shutil.copy from each source to each destination (very tedious; if I can't reduce that, I'll ask another question to resolve it).  The quickest thing would be to concatenate the archive number into the output filename before the freq, to get filenames like archive1_1_minutes.csv, archive1_60_minutes.csv, etc. for the 24 archives.  Let me know if that can be done, thanks.  

Expert Comment

by:PaulKeating
ID: 22722012
Where is the number to come from? Is it in the name of the input file? Or do you want the program to work out the next number from looking at the files that already exist?
 

Expert Comment

by:PaulKeating
ID: 22723254
... or maybe it would be a better idea to derive the filename from the first timestamp in the file: that way it wouldn't matter what order the input was processed in, and it would be obvious what was in a file without having to open it.
 

Author Comment

by:sara_bellum
ID: 22727111
I can't use the starting timestamps for the archive files because they are historical and all begin with the same timestamp.  This does not apply to the daily data files (like the one I posted here) but we might as well use the same naming convention for the daily files as we do for the archives.

By way of explanation, I was prepared to run
$ ./process.py    /path-to-archives/archive1/archive1.csv     /path-to-archives/archive1/     2>     /path-to-archives/error.log
$ ./process.py    /path-to-archives/archive2/archive2.csv     /path-to-archives/archive2/     2>     /path-to-archives/error.log
etc 24 times, but that was because I needed to create a separate directory for each archive file in order for our PHP script to run at all.

I understand that your python script has no such limitation, but if I leave all the archive files in a single directory and run
$ ./process.py    /path-to-archives/archive1.csv     /path-to-archives/output1/    2>     /path-to-archives/error.log
$ ./process.py    /path-to-archives/archive2.csv     /path-to-archives/output2/    2>     /path-to-archives/error.log

then the filenames in the output directories still need to be unique, and the number should come from the name of the archive file (archive1, archive2 etc). Ideally, the processed files should have names like archive1_minutes.csv, archive1_hours.csv, archive2_minutes.csv, archive2_hours.csv etc.  If my boss still wants the daily files processed as well, the same numbering system would apply: data1.csv should yield data1_minutes.csv and data1_hours.csv etc (there are 24 daily data files along with 24 archive files; the daily files were processed correctly in PHP as far as we know, but we ran into difficulties with the archives).

As always, TIA :)

Accepted Solution

by:
PaulKeating earned 500 total points
ID: 22734012
I've changed it to extract the serial number from the name of the input file. (If it can't find one it will use 000 - take that as an indication something is wrong.)
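The serial-number extraction can be tried in isolation; the `serial` wrapper below is my own, and the filenames are invented:

```python
import os.path
import re

# Skip leading non-digits, capture the first run of digits, and
# require a dot (the extension) somewhere after it.
seqre = re.compile(r'\D*(\d+).*\.')

def serial(path):
    # Fall back to 'output000' when the filename has no digits,
    # mirroring the script's behaviour.
    m = seqre.match(os.path.basename(path))
    return 'output' + m.group(1) if m else 'output000'

print(serial('/path-to-archives/archive7.csv'))   # output7
print(serial('/path-to-archives/nodigits.csv'))   # output000
```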

This works like the other program did: it creates a folder called outputn under the destination folder you specify and puts the crosstab files in that. If that isn't what you want, and you'd like all the output files to go to the same folder, then change line 64 from:

fout = open(os.path.join(outloc,freq.replace(' ','_').lower())+'.csv' , 'wb')

to

fout = open(outloc+'_'+freq.replace(' ','_').lower()+'.csv' , 'wb')

... doing that just substitutes _ for the / delimiter in the filename ...

and remove lines 24-25, which create the subfolders.
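The only difference between the two lines is where the frequency name lands relative to the output location (the paths below are invented):

```python
import os.path

outloc = '/path-to-archives/output1'          # invented output location
name = '60 Minutes'.replace(' ', '_').lower()

# os.path.join puts the file inside the outputN subfolder...
print(os.path.join(outloc, name) + '.csv')    # .../output1/60_minutes.csv
# ...while concatenation folds the folder name into the filename.
print(outloc + '_' + name + '.csv')           # .../output1_60_minutes.csv
```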

You really ought to get the shell to loop through the files for you, instead of issuing 24 commands. You might check out the for command. I won't offer you an example because I'm no expert on that topic.
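Alternatively, since the script already takes its paths from the command line, a short Python driver can do the looping; `process_all` below is my sketch, not part of the thread's scripts:

```python
import glob
import os.path
import subprocess

def process_all(archive_dir, script='./process.py'):
    """Run the processor once per archive file, appending all stderr
    messages to a single error.log in the archive directory."""
    paths = sorted(glob.glob(os.path.join(archive_dir, 'archive*.csv')))
    with open(os.path.join(archive_dir, 'error.log'), 'ab') as errlog:
        for path in paths:
            subprocess.call([script, path, archive_dir], stderr=errlog)
    return paths

# e.g. process_all('/path-to-archives')
```

Note that a plain lexicographic sort puts archive10 before archive2; zero-padded filenames avoid that.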

#!/usr/bin/python
 
import csv
import time
import os.path
import sys
import re
 
def convert_timestamp(date, hhmmss):
    return time.strptime(date + " " + hhmmss, "%m/%d/%Y %H:%M:%S")
 
seqre = re.compile(r'\D*(\d+).*\.')
fin = open(sys.argv[1], 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)
try:
    seq = 'output' + seqre.match(os.path.basename(fin.name)).group(1)
except AttributeError:   # no digit sequence found in the filename
    seq = 'output000'
 
status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer
outloc = os.path.join(sys.argv[2],seq)
if not os.path.exists(outloc):
    os.mkdir(outloc)
 
valuematrix = {'1 Minutes': {}, '60 Minutes': {}}
 
for row in source:
    if status == 0:   # wait for a section header row
        if row[0] == 'Point Name:':
            columnlabel = row[1]   # the point name from this section's first header row
            status = 1
        else:
            continue
 
    elif status == 1:  # collecting some header rows
        if row[1] == '1 Minutes':
            freq = row[1]
        elif row[1] == '60 Minutes':
            freq = row[1]
        elif row[1] == 'Date Range':
            pass                       # Ignore this header row.
        elif row[0] == 'Report Timings:':
            status = 2                 # Get ready for the value rows.
 
    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            date, hhmmss, value = row[0:3]
            rowlabel = convert_timestamp(date, hhmmss)
            valuematrix[freq].setdefault(rowlabel,{})[columnlabel] = value
            print '#',
        else:
            status = 0                 # wait for another section
            print
    
    else:
        print >> sys.stderr, 'Unexpected status', status
fin.close()
print 'Input phase complete', fin.name
 
 
for freq in valuematrix.keys():  # freq = hour, minute
  fout = open(os.path.join(outloc,freq.replace(' ','_').lower())+'.csv' , 'wb')
  #fout = open(outloc+'_'+freq.replace(' ','_').lower()+'.csv' , 'wb')
  try:
      writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
      timestamps = valuematrix[freq].keys()
      timestamps.sort()
      observations = valuematrix[freq][timestamps[len(timestamps) // 2]].keys()  # take the column set from a mid-file timestamp
      observations.sort()
      headers = ["Date", "Time"] + observations
      writer.writerow(headers)  # output the headers
      
      for row in timestamps:
          details = [time.strftime("%m/%d/%Y", row),
                     time.strftime("%H:%M:%S", row)] 
          for column in observations:
              try:
                  details.append(valuematrix[freq][row][column])
              except KeyError:
                  details.append("")
                  print >> sys.stderr, 'Missing value',freq, time.asctime(row), column
          writer.writerow(details)
  finally:
      fout.close()
      print 'Output phase complete', fout.name
 
 


 

Author Closing Comment

by:sara_bellum
ID: 31505637
This is absolutely terrific!!  I need to figure out how to automate this process to avoid poking in commands manually each time, but figure that I've asked enough questions already ;-)
