Solved

Copying rows from csv file with python csv reader fails (indexing problem)

Posted on 2008-10-06
17
1,689 Views
Last Modified: 2011-10-19
Working with the python csv reader, I'm trying to grab a specific number of rows from a csv file by setting the index to 0 once a header is found - for timestamps at one-minute intervals, I need to grab the next 60 lines (all starting with a timestamp) and copy them to a file;  for timestamps at one-hour intervals, I need to grab the next 10 lines (also timestamps) and copy them to another file.  Annotated code attached.
header_len = 3
hours_len = 10
minutes_len = 60
 
for item, row in enumerate(source):
    if 'Point Name:' in row:
        index = 0
        metrics = []
        if item >= index:
            metrics.append(row)
            print metrics  //this works
    if '1 Minutes' in row:
        index = 0
        rhm = []
        if item <= (index + minutes_len):
            rhm.append(item)
            print rhm  //this doesn't work
        rhm_slice = []
        if item <= (index + minutes_len):
            rhm_slice.append(row)
            print rhm_slice  // and this doesn't either

Open in new window

0
Comment
Question by:sara_bellum
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 7
  • 7
  • 3
17 Comments
 
LVL 29

Expert Comment

by:pepr
ID: 22657234
Firstly, what is the "index" variable good for? It is never changed. Secondly, should not the '1 Minutes' be '1 Minute' or could it even fit with '21 Minutes'...

The only loop is "for" in the above code. The body of the loop is alwas put under "if". The lines where the strings 'Point Name:' or '1 Minutes' are not found are never processed.

You sould probably attach the source csv to make the things clearer.

Petr
0
 
LVL 17

Expert Comment

by:ramrom
ID: 22659974
Also show us the traceback so we can see where the error occurs.
0
 
LVL 17

Expert Comment

by:ramrom
ID: 22659997
What does "this doesn't work" mean?
0
Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

 

Author Comment

by:sara_bellum
ID: 22664619
Obviously my request was not clear.  What I am trying to do is grab n number of lines from a raw data (csv) file beginning at certain points (defined by point name and time interval) and copy those lines to another file.
So if I have a line that contains "60 Minutes" or "1 Minutes" (yes, it's plural), I want to copy that line and n lines after that to a new file; if it is '60 Minutes' I need 10 lines, and for one-minute intervals I need to copy 60 lines.  Both of these time intervals are repeated multiple times throughout the raw data file, for each point name.  

Creating a consolidated header will require a dictionary, which is a separate question.  I attach an excerpt from the raw data file.
rawData.txt
0
 
LVL 17

Expert Comment

by:ramrom
ID: 22664711
I am handicapped in that I don't know what source is. I will assume that you did something like:
source = open(csvfile)

If something else please inform us, as the proposed solution won't work.
for row in source:
  if "1 Minutes" in row:
    noLines = 60
  elif "60 Minutes" in row:
    noLines = 10
  else:
    continue
  result = [row]
  for l in range(noLines):
    result.append(source.next())
  # at this point you will have a list of the desired rows to 
  # dispose of as desired

Open in new window

0
 

Author Comment

by:sara_bellum
ID: 22666006
Thank you very much for your quick response!  The problem with the result from the above suggestion is that the entire header is prepended to each raw data entry, and successive timestamps are appended to each row, so that instead of the 4 columns of data I started with, I now have 35 or more columns, depending on whether I use 'for row in source' or 'for item, row in enumerate(source)'.  

Until I learn how to build a header, I need to maintain the same number of rows and columns, and
- define which items in each row should be appended to the array (I tried rows[0:2] for the raw data entries but that failed - either the whole row prints or nothing does)
- and delete all rows containing strings 'Date Range' or 'Report Timings' (either all rows print or none do)

Finally, I worked with 3 separate scripts to get any of this to execute - one for 60 minute intervals and one for 1 minute intervals because elif showed up as a syntax error - and another to delete the excess header lines with the unneeded strings.  I'd like to work it all in one, if it doesn't exceed the scope of my question.  TIA!

Script headers:
 
#!/usr/bin/python
 
import csv
import os, string, sys, fileinput   //wasn't sure whether I need these
 
fin = open('hr_data.csv', 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)
 
fout = open('hr_sorted.csv', 'wb')
dest = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
 
# Script 1: Delete unneeded lines:
 
for item, row in enumerate(source):
    rhr = [row]
    rhr.append(row)
    if 'Date Range:' or 'Report Timings:' in row:
        del row[0:]
            else: 
                continue
            print rhr
 
# Script 2 - for one-hour intervals
 
if '60 Minutes' in row:
        noLines = 10
    else: 
        continue
    rhr = [row][0:2]       //same result with [row]
    for line in range(noLines):
        rhr.append(source.next())
        print rhr
 
# Script 3 - for 1 minute intervals
 
if '1 Minutes' in row:
        noLines = 60
    else: 
        continue
    rhm = [row][0:2]  //same result as [row]   
    for line in range(noLines):
        rhm.append(source.next())
        print rhm
 
# Script footers
 
dest.writerow(row)
 
fin.close()
fout.close()

Open in new window

0
 

Author Comment

by:sara_bellum
ID: 22666020
Correction to the above code: I used two different names for the one-hour-interval arrays and the one-minute-interval arrays, which isn't reflected above (copy / paste error).
0
 
LVL 29

Expert Comment

by:pepr
ID: 22668261
Just a minor, related comment. If row is the list returned from csv reader, then it may look like [1, 2, 3, 4]. When you write [row], you are making the list of the list -- like [ [1, 2, 3, 4] ]. When you are slicing such a list of lists using [0:2] then you are getting the list of first three elements. But there is a single element -- the [1, 2, 3, 4] list. Then, you are right to say that

[ [1, 2, 3, 4] ][0:2] is equal to [ [1, 2, 3, 4] ]

When you want to get the values for a subset of adjacent columt, you probably wanted to do something like

row[0:2] and not the [row][0:2]
0
 
LVL 29

Expert Comment

by:pepr
ID: 22668657
If I understand your needs well, the you want to consume rawData.csv and split the sequence of minute/hour sections into minute/hour output files. If possible, you want to preserve the original headers. If this is so, try the snippet below. It opens one csv reader for the rawData.csv and two output csv writers -- one for 1 minute info, the other for 60 min info. When processing the rawData, the writer is switched based on the processed header of the section.

If the above makes sense, then I recommend to use so called "finite state automaton" http://en.wikipedia.org/wiki/Deterministic_finite_automaton which is very usefull for parsing in such simple cases. It sounds more difficult than it really is. Simply said, when processing the input data, we can think about three states: waiting for a section header, collecting the section header (plus decision made based on its content), collecting the value rows to the destination specified in the header.

Here, the status is numbered inside the variable of the name. The header list is used for collecting the header info until we decide to which output it should be written. When using a finite automaton, the input elements (here the rows) are processed in a single loop. Based on the current status, the row is processed and possibly the next status is deterined (for the next loop).

Instead of counting the value lines--which may be unreliable--they are recognized by their properties (4 values in the row).

The finite automaton may look too complex for some simple tasks. However, it often happens that you find some peculiarities only during the development. Unless the finite automaton is no powerfull enough (theoretically), your parser may be easily modified, enhanced. Even though it is much simpler (theoretically) than other category of parsers (it is of the same power as regular expressions are), I have personally found it very useful for majority of simple tasks. It is worth to learn it (but learn also when it is not powerful enough).

Petr
#!/usr/bin/python
 
import csv
 
fin = open('rawData.csv', 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)
 
fout_hr = open('hr_sorted.csv', 'wb')
hr_writer = csv.writer(fout_hr, quoting=csv.QUOTE_NONNUMERIC)
 
fout_min = open('min_sorted.csv', 'wb')
min_writer = csv.writer(fout_min, quoting=csv.QUOTE_NONNUMERIC)
 
status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer
 
for item, row in enumerate(source):
    if status == 0:   # wait a section header rows
        header = []
        if row[0] == 'Point Name:':
            header.append(row)
            status = 1
        else:
            continue
 
    elif status == 1:  # collecting header rows
        header.append(row)
        if row[1] == '1 Minutes':
            writer = min_writer
        elif row[1] == '60 Minutes':
            writer = hr_writer
        elif row[0] == 'Report Timings:':
            # Do output the header and wait for the values.
            for r in header:
                writer.writerow(r)
            status = 2
 
    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            writer.writerow(row[0:2])  # output the first two columns only
        else:
            writer.writerow([''])      # separator
            status = 0                 # wait for another section
    
    else:
        print 'Unexpected status', status    
        
fout_hr.close()
fout_min.close()

Open in new window

0
 
LVL 29

Expert Comment

by:pepr
ID: 22668719
Actually, the item and enumerate() is not used in the above code. The loop

for item, row in enumerate(source):

... can be simplified to...

for row in source:

... unless you want to use the line number inside the rawData.csv for reporting of some error conditions recognized in the input file.
0
 
LVL 29

Expert Comment

by:pepr
ID: 22668762
The code arround the

    if status == 0:   # wait a section header rows

could be a bit enhanced. See the full udpated snippet below.
#!/usr/bin/python
 
import csv
 
fin = open('rawData.csv', 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)
 
fout_hr = open('hr_sorted.csv', 'wb')
hr_writer = csv.writer(fout_hr, quoting=csv.QUOTE_NONNUMERIC)
 
fout_min = open('min_sorted.csv', 'wb')
min_writer = csv.writer(fout_min, quoting=csv.QUOTE_NONNUMERIC)
 
status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer
 
for row in source:
    if status == 0:   # wait a section header rows
        if row[0] == 'Point Name:':
            header = [ row ]           # the list with the first header row
            status = 1
        else:
            continue
 
    elif status == 1:  # collecting header rows
        header.append(row)
        if row[1] == '1 Minutes':
            writer = min_writer
        elif row[1] == '60 Minutes':
            writer = hr_writer
        elif row[0] == 'Report Timings:':
            # Do output the header and wait for the values.
            for r in header:
                writer.writerow(r)
            status = 2
 
    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            writer.writerow(row[0:2])  # output the first two columns only
        else:
            writer.writerow([''])      # separator
            status = 0                 # wait for another section
    
    else:
        print 'Unexpected status', status    
        
fout_hr.close()
fout_min.close()

Open in new window

0
 

Author Comment

by:sara_bellum
ID: 22674084
Thanks very much, will study deterministic finite automation next - the terminology is a bit daunting, but with time many things become less so :) I wrote a script to delete the two unwanted header lines and it works, but it would be better to delete those lines up front rather than run this script + a delete script on each data file.  So I tried several things that I thought should work, but got index out of range errors or no change at all to the output.  My delete script is attached - let me know if something can be done to delete these lines earlier, thanks.
!/usr/bin/python
 
import csv
import os, string, sys, fileinput
 
fin1 = open('hr_sorted.csv', 'rb')
source1 = csv.reader(fin1, quoting=csv.QUOTE_NONNUMERIC)
 
fin2 = open('min_sorted.csv', 'rb')
source2 = csv.reader(fin2, quoting=csv.QUOTE_NONNUMERIC)
 
fout1 = open('hr1_sorted.csv', 'wb')
dest1 = csv.writer(fout1, quoting=csv.QUOTE_NONNUMERIC)
 
fout2 = open('min1_sorted.csv', 'wb')
dest2 = csv.writer(fout2, quoting=csv.QUOTE_NONNUMERIC)
 
for row in source1:
    if row[0] == 'Date Range:':            
        del row
    elif row[0] == 'Report Timings:':
        del row
    else: 
        dest1.writerow(row)
 
for row in source2:
    if row[0] == 'Date Range:':            
        del row
    elif row[0] == 'Report Timings:':
        del row
    else: 
        dest2.writerow(row)
 
fin1.close()
fin2.close()
fout1.close()
fout2.close()

Open in new window

0
 

Author Comment

by:sara_bellum
ID: 22674107
Correction, again to my copy/paste: I've removed 'import os, string, sys, fileinput' from my script based on your feedback.
0
 

Author Comment

by:sara_bellum
ID: 22675504
I'm fully prepared to open up another question but must be reading the wrong python docs to figure out how to do dictionaries.  Your script looks well suited to adapting it to do this:
for row in source:
        if row[0] == 'Point Name:':
                header = [] or header = [ row ]
                header.append(row[1])        
                dest.writerow(row[1:])
but this maintains the row structure so each row is still its own array/list.  If I try to consolidate the rows into a single array with:
for item, row in enumerate(source):
...
header.append([item][1])
I get syntax errors or no output at all
If you know of a good beginner reference on dictionary lists let me know thanks.
0
 
LVL 29

Accepted Solution

by:
pepr earned 500 total points
ID: 22677749
To delete unwanted lines from the header should mean that you do not copy them to the output (i.e. ignoring them). If I understand you well, you consumed rawData.csv (attached earlier as rawData.txt) and extracted the data to hr_sorted.csv and min_sorted.csv. Then you want to process the later files again and remove some lines. However, It is better to do it in one pass.

By the way, what really are your original input files, how many? How many output files are expected? What files should have what the headers contain?

I am still not sure what is the input and what should be the output. The snippet below is the modified script that consumes rawData.csv and produces directly the hr1_sorted.csv and min1_sorted.csv without first producing hr_sorted.csv and min_sorted.csv. The header lines that you want to ignore are ignored (not put to the output), some header lines are modified (the 'Point Name:' removed, only the name value preserved).

Now for the other questions. I do not know what dictionaries do you want to use. We were using the lists so far.

The header.append([item][1]) will always fail, because ['whatever'] creates a list with a single element represented by 'whatever'. If the item is a number, you get something like [ 10 ]. The single element can be accessed via index value 0 (zero). There is no element at index 1 (stands for the second element), hence the error. This is probably not a syntax error, but the runtime error.

For the beginner reference on dictionaries and lists and other built-in data structures, have a look at the official tutorial http://docs.python.org/tutorial/datastructures.html.
#!/usr/bin/python
 
import csv
 
fin = open('rawData.csv', 'rb')
source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)
 
fout_hr = open('hr1_sorted.csv', 'wb')
hr_writer = csv.writer(fout_hr, quoting=csv.QUOTE_NONNUMERIC)
 
fout_min = open('min1_sorted.csv', 'wb')
min_writer = csv.writer(fout_min, quoting=csv.QUOTE_NONNUMERIC)
 
status = 0     # wait for the header
header = None  # collecting header lines of the section
writer = None  # reference to the wanted output writer
 
for row in source:
    if status == 0:   # wait a section header rows
        if row[0] == 'Point Name:':
            header = [ row[1:] ]    # the list with the info from the first header row
            status = 1
        else:
            continue
 
    elif status == 1:  # collecting some header rows
        if row[1] == '1 Minutes':
            header.append(row)
            writer = min_writer
        elif row[1] == '60 Minutes':
            header.append(row)
            writer = hr_writer
        elif row[1] == 'Date Range':
            pass                       # Ignore this header row.
        elif row[0] == 'Report Timings:':
            # Ignore this header row. Do output the collected header.
            for r in header:
                writer.writerow(r)
            status = 2                 # Get ready for for the value rows.
 
    elif status == 2:  # collecting the values
        if len(row) == 4:              # detect the row with values
            writer.writerow(row[0:2])  # output the first two columns only
        else:
            writer.writerow([''])      # separator
            status = 0                 # wait for another section
    
    else:
        print 'Unexpected status', status    
        
fout_hr.close()
fout_min.close()

Open in new window

0
 

Author Closing Comment

by:sara_bellum
ID: 31503692
Thanks very much, this works :)  I've been reading up on lists and dictionaries and there's abundant documentation on how to create them by entering the array items manually, but nothing for anyone at my level on file processing.  I'll poke around some more to find out how  to append an item from a row to a list or dictionary to see if I get lucky; then I'll probably open another question.
0
 
LVL 29

Expert Comment

by:pepr
ID: 22684590
For the file objects see http://docs.python.org/library/stdtypes.html#bltin-file-objects. Basically, after opening a file, you can iterate through it via

    for line in f:
        doSomething with the line

The iterator gets text lines as strings with the line terminator (from inside the file). The lines are not stored in a list or whatever.

Use lines = f.readlines() when you want to get all the lines into memory in the form of the 'lines' list.

Use s = f.read() to get the whole content of the file into the string variable. If you use the argument, you get at most that given number of bytes -- like   c = f.read(1)

The row's are the lists that were extracted from the file lines by the csv.reader() intermediate objects. You can think about csv.reader() as about an assistant for helping you with reading the opened file and choping the lines into elements.

Simi
0

Featured Post

Free Tool: Subnet Calculator

The subnet calculator helps you design networks by taking an IP address and network mask and returning information such as network, broadcast address, and host range.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

The purpose of this article is to demonstrate how we can use conditional statements using Python.
When we want to run, execute or repeat a statement multiple times, a loop is necessary. This article covers the two types of loops in Python: the while loop and the for loop.
Learn the basics of strings in Python: declaration, operations, indices, and slicing. Strings are declared with quotations; for example: s = "string": Strings are immutable.: Strings may be concatenated or multiplied using the addition and multiplic…
Learn the basics of modules and packages in Python. Every Python file is a module, ending in the suffix: .py: Modules are a collection of functions and variables.: Packages are a collection of modules.: Module functions and variables are accessed us…
Suggested Courses

734 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question