Solved

Copying rows from csv file with python csv reader fails (indexing problem)

Posted on 2008-10-06
17
1,647 Views
Last Modified: 2011-10-19
Working with the python csv reader, I'm trying to grab a specific number of rows from a csv file by setting the index to 0 once a header is found - for timestamps at one-minute intervals, I need to grab the next 60 lines (all starting with a timestamp) and copy them to a file;  for timestamps at one-hour intervals, I need to grab the next 10 lines (also timestamps) and copy them to another file.  Annotated code attached.
header_len = 3

hours_len = 10

minutes_len = 60
 

for item, row in enumerate(source):

    if 'Point Name:' in row:

        index = 0

        metrics = []

        if item >= index:

            metrics.append(row)

            print metrics  //this works

    if '1 Minutes' in row:

        index = 0

        rhm = []

        if item <= (index + minutes_len):

            rhm.append(item)

            print rhm  //this doesn't work

        rhm_slice = []

        if item <= (index + minutes_len):

            rhm_slice.append(row)

            print rhm_slice  // and this doesn't either

Open in new window

0
Comment
Question by:sara_bellum
  • 7
  • 7
  • 3
17 Comments
 
LVL 28

Expert Comment

by:pepr
ID: 22657234
Firstly, what is the "index" variable good for? It is never changed. Secondly, should not the '1 Minutes' be '1 Minute' or could it even fit with '21 Minutes'...

The only loop is "for" in the above code. The body of the loop is alwas put under "if". The lines where the strings 'Point Name:' or '1 Minutes' are not found are never processed.

You sould probably attach the source csv to make the things clearer.

Petr
0
 
LVL 17

Expert Comment

by:ramrom
ID: 22659974
Also show us the traceback so we can see where the error occurs.
0
 
LVL 17

Expert Comment

by:ramrom
ID: 22659997
What does "this doesn't work" mean?
0
 

Author Comment

by:sara_bellum
ID: 22664619
Obviously my request was not clear.  What I am trying to do is grab n number of lines from a raw data (csv) file beginning at certain points (defined by point name and time interval) and copy those lines to another file.
So if I have a line that contains "60 Minutes" or "1 Minutes" (yes, it's plural), I want to copy that line and n lines after that to a new file; if it is '60 Minutes' I need 10 lines, and for one-minute intervals I need to copy 60 lines.  Both of these time intervals are repeated multiple times throughout the raw data file, for each point name.  

Creating a consolidated header will require a dictionary, which is a separate question.  I attach an excerpt from the raw data file.
rawData.txt
0
 
LVL 17

Expert Comment

by:ramrom
ID: 22664711
I am handicapped in that I don't know what source is. I will assume that you did something like:
source = open(csvfile)

If something else please inform us, as the proposed solution won't work.
for row in source:

  if "1 Minutes" in row:

    noLines = 60

  elif "60 Minutes" in row:

    noLines = 10

  else:

    continue

  result = [row]

  for l in range(noLines):

    result.append(source.next())

  # at this point you will have a list of the desired rows to 

  # dispose of as desired

Open in new window

0
 

Author Comment

by:sara_bellum
ID: 22666006
Thank you very much for your quick response!  The problem with the result from the above suggestion is that the entire header is prepended to each raw data entry, and successive timestamps are appended to each row, so that instead of the 4 columns of data I started with, I now have 35 or more columns, depending on whether I use 'for row in source' or 'for item, row in enumerate(source)'.  

Until I learn how to build a header, I need to maintain the same number of rows and columns, and
- define which items in each row should be appended to the array (I tried rows[0:2] for the raw data entries but that failed - either the whole row prints or nothing does)
- and delete all rows containing strings 'Date Range' or 'Report Timings' (either all rows print or none do)

Finally, I worked with 3 separate scripts to get any of this to execute - one for 60 minute intervals and one for 1 minute intervals because elif showed up as a syntax error - and another to delete the excess header lines with the unneeded strings.  I'd like to work it all in one, if it doesn't exceed the scope of my question.  TIA!

Script headers:
 

#!/usr/bin/python
 

import csv

import os, string, sys, fileinput   //wasn't sure whether I need these
 

fin = open('hr_data.csv', 'rb')

source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)
 

fout = open('hr_sorted.csv', 'wb')

dest = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
 

# Script 1: Delete unneeded lines:
 

for item, row in enumerate(source):

    rhr = [row]

    rhr.append(row)

    if 'Date Range:' or 'Report Timings:' in row:

        del row[0:]

            else: 

                continue

            print rhr
 

# Script 2 - for one-hour intervals
 

if '60 Minutes' in row:

        noLines = 10

    else: 

        continue

    rhr = [row][0:2]       //same result with [row]

    for line in range(noLines):

        rhr.append(source.next())

        print rhr
 

# Script 3 - for 1 minute intervals
 

if '1 Minutes' in row:

        noLines = 60

    else: 

        continue

    rhm = [row][0:2]  //same result as [row]   

    for line in range(noLines):

        rhm.append(source.next())

        print rhm
 

# Script footers
 

dest.writerow(row)
 

fin.close()

fout.close()

Open in new window

0
 

Author Comment

by:sara_bellum
ID: 22666020
Correction to the above code: I used two different names for the one-hour-interval arrays and the one-minute-interval arrays, which isn't reflected above (copy / paste error).
0
 
LVL 28

Expert Comment

by:pepr
ID: 22668261
Just a minor, related comment. If row is the list returned from csv reader, then it may look like [1, 2, 3, 4]. When you write [row], you are making the list of the list -- like [ [1, 2, 3, 4] ]. When you are slicing such a list of lists using [0:2] then you are getting the list of first three elements. But there is a single element -- the [1, 2, 3, 4] list. Then, you are right to say that

[ [1, 2, 3, 4] ][0:2] is equal to [ [1, 2, 3, 4] ]

When you want to get the values for a subset of adjacent columt, you probably wanted to do something like

row[0:2] and not the [row][0:2]
0
Find Ransomware Secrets With All-Source Analysis

Ransomware has become a major concern for organizations; its prevalence has grown due to past successes achieved by threat actors. While each ransomware variant is different, we’ve seen some common tactics and trends used among the authors of the malware.

 
LVL 28

Expert Comment

by:pepr
ID: 22668657
If I understand your needs well, the you want to consume rawData.csv and split the sequence of minute/hour sections into minute/hour output files. If possible, you want to preserve the original headers. If this is so, try the snippet below. It opens one csv reader for the rawData.csv and two output csv writers -- one for 1 minute info, the other for 60 min info. When processing the rawData, the writer is switched based on the processed header of the section.

If the above makes sense, then I recommend to use so called "finite state automaton" http://en.wikipedia.org/wiki/Deterministic_finite_automaton which is very usefull for parsing in such simple cases. It sounds more difficult than it really is. Simply said, when processing the input data, we can think about three states: waiting for a section header, collecting the section header (plus decision made based on its content), collecting the value rows to the destination specified in the header.

Here, the status is numbered inside the variable of the name. The header list is used for collecting the header info until we decide to which output it should be written. When using a finite automaton, the input elements (here the rows) are processed in a single loop. Based on the current status, the row is processed and possibly the next status is deterined (for the next loop).

Instead of counting the value lines--which may be unreliable--they are recognized by their properties (4 values in the row).

The finite automaton may look too complex for some simple tasks. However, it often happens that you find some peculiarities only during the development. Unless the finite automaton is no powerfull enough (theoretically), your parser may be easily modified, enhanced. Even though it is much simpler (theoretically) than other category of parsers (it is of the same power as regular expressions are), I have personally found it very useful for majority of simple tasks. It is worth to learn it (but learn also when it is not powerful enough).

Petr
#!/usr/bin/python

 

import csv

 

fin = open('rawData.csv', 'rb')

source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)

 

fout_hr = open('hr_sorted.csv', 'wb')

hr_writer = csv.writer(fout_hr, quoting=csv.QUOTE_NONNUMERIC)

 

fout_min = open('min_sorted.csv', 'wb')

min_writer = csv.writer(fout_min, quoting=csv.QUOTE_NONNUMERIC)

 

status = 0     # wait for the header

header = None  # collecting header lines of the section

writer = None  # reference to the wanted output writer
 

for item, row in enumerate(source):

    if status == 0:   # wait a section header rows

        header = []

        if row[0] == 'Point Name:':

            header.append(row)

            status = 1

        else:

            continue
 

    elif status == 1:  # collecting header rows

        header.append(row)

        if row[1] == '1 Minutes':

            writer = min_writer

        elif row[1] == '60 Minutes':

            writer = hr_writer

        elif row[0] == 'Report Timings:':

            # Do output the header and wait for the values.

            for r in header:

                writer.writerow(r)

            status = 2
 

    elif status == 2:  # collecting the values

        if len(row) == 4:              # detect the row with values

            writer.writerow(row[0:2])  # output the first two columns only

        else:

            writer.writerow([''])      # separator

            status = 0                 # wait for another section

    

    else:

        print 'Unexpected status', status    

        

fout_hr.close()

fout_min.close()

Open in new window

0
 
LVL 28

Expert Comment

by:pepr
ID: 22668719
Actually, the item and enumerate() is not used in the above code. The loop

for item, row in enumerate(source):

... can be simplified to...

for row in source:

... unless you want to use the line number inside the rawData.csv for reporting of some error conditions recognized in the input file.
0
 
LVL 28

Expert Comment

by:pepr
ID: 22668762
The code arround the

    if status == 0:   # wait a section header rows

could be a bit enhanced. See the full udpated snippet below.
#!/usr/bin/python

 

import csv

 

fin = open('rawData.csv', 'rb')

source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)

 

fout_hr = open('hr_sorted.csv', 'wb')

hr_writer = csv.writer(fout_hr, quoting=csv.QUOTE_NONNUMERIC)

 

fout_min = open('min_sorted.csv', 'wb')

min_writer = csv.writer(fout_min, quoting=csv.QUOTE_NONNUMERIC)

 

status = 0     # wait for the header

header = None  # collecting header lines of the section

writer = None  # reference to the wanted output writer

 

for row in source:

    if status == 0:   # wait a section header rows

        if row[0] == 'Point Name:':

            header = [ row ]           # the list with the first header row

            status = 1

        else:

            continue

 

    elif status == 1:  # collecting header rows

        header.append(row)

        if row[1] == '1 Minutes':

            writer = min_writer

        elif row[1] == '60 Minutes':

            writer = hr_writer

        elif row[0] == 'Report Timings:':

            # Do output the header and wait for the values.

            for r in header:

                writer.writerow(r)

            status = 2

 

    elif status == 2:  # collecting the values

        if len(row) == 4:              # detect the row with values

            writer.writerow(row[0:2])  # output the first two columns only

        else:

            writer.writerow([''])      # separator

            status = 0                 # wait for another section

    

    else:

        print 'Unexpected status', status    

        

fout_hr.close()

fout_min.close()

Open in new window

0
 

Author Comment

by:sara_bellum
ID: 22674084
Thanks very much, will study deterministic finite automation next - the terminology is a bit daunting, but with time many things become less so :) I wrote a script to delete the two unwanted header lines and it works, but it would be better to delete those lines up front rather than run this script + a delete script on each data file.  So I tried several things that I thought should work, but got index out of range errors or no change at all to the output.  My delete script is attached - let me know if something can be done to delete these lines earlier, thanks.
!/usr/bin/python
 

import csv

import os, string, sys, fileinput
 

fin1 = open('hr_sorted.csv', 'rb')

source1 = csv.reader(fin1, quoting=csv.QUOTE_NONNUMERIC)
 

fin2 = open('min_sorted.csv', 'rb')

source2 = csv.reader(fin2, quoting=csv.QUOTE_NONNUMERIC)
 

fout1 = open('hr1_sorted.csv', 'wb')

dest1 = csv.writer(fout1, quoting=csv.QUOTE_NONNUMERIC)
 

fout2 = open('min1_sorted.csv', 'wb')

dest2 = csv.writer(fout2, quoting=csv.QUOTE_NONNUMERIC)
 

for row in source1:

    if row[0] == 'Date Range:':            

        del row

    elif row[0] == 'Report Timings:':

        del row

    else: 

        dest1.writerow(row)
 

for row in source2:

    if row[0] == 'Date Range:':            

        del row

    elif row[0] == 'Report Timings:':

        del row

    else: 

        dest2.writerow(row)
 

fin1.close()

fin2.close()

fout1.close()

fout2.close()

Open in new window

0
 

Author Comment

by:sara_bellum
ID: 22674107
Correction, again to my copy/paste: I've removed 'import os, string, sys, fileinput' from my script based on your feedback.
0
 

Author Comment

by:sara_bellum
ID: 22675504
I'm fully prepared to open up another question but must be reading the wrong python docs to figure out how to do dictionaries.  Your script looks well suited to adapting it to do this:
for row in source:
        if row[0] == 'Point Name:':
                header = [] or header = [ row ]
                header.append(row[1])        
                dest.writerow(row[1:])
but this maintains the row structure so each row is still its own array/list.  If I try to consolidate the rows into a single array with:
for item, row in enumerate(source):
...
header.append([item][1])
I get syntax errors or no output at all
If you know of a good beginner reference on dictionary lists let me know thanks.
0
 
LVL 28

Accepted Solution

by:
pepr earned 500 total points
ID: 22677749
To delete unwanted lines from the header should mean that you do not copy them to the output (i.e. ignoring them). If I understand you well, you consumed rawData.csv (attached earlier as rawData.txt) and extracted the data to hr_sorted.csv and min_sorted.csv. Then you want to process the later files again and remove some lines. However, It is better to do it in one pass.

By the way, what really are your original input files, how many? How many output files are expected? What files should have what the headers contain?

I am still not sure what is the input and what should be the output. The snippet below is the modified script that consumes rawData.csv and produces directly the hr1_sorted.csv and min1_sorted.csv without first producing hr_sorted.csv and min_sorted.csv. The header lines that you want to ignore are ignored (not put to the output), some header lines are modified (the 'Point Name:' removed, only the name value preserved).

Now for the other questions. I do not know what dictionaries do you want to use. We were using the lists so far.

The header.append([item][1]) will always fail, because ['whatever'] creates a list with a single element represented by 'whatever'. If the item is a number, you get something like [ 10 ]. The single element can be accessed via index value 0 (zero). There is no element at index 1 (stands for the second element), hence the error. This is probably not a syntax error, but the runtime error.

For the beginner reference on dictionaries and lists and other built-in data structures, have a look at the official tutorial http://docs.python.org/tutorial/datastructures.html.
#!/usr/bin/python

 

import csv

 

fin = open('rawData.csv', 'rb')

source = csv.reader(fin, quoting=csv.QUOTE_NONNUMERIC)

 

fout_hr = open('hr1_sorted.csv', 'wb')

hr_writer = csv.writer(fout_hr, quoting=csv.QUOTE_NONNUMERIC)

 

fout_min = open('min1_sorted.csv', 'wb')

min_writer = csv.writer(fout_min, quoting=csv.QUOTE_NONNUMERIC)

 

status = 0     # wait for the header

header = None  # collecting header lines of the section

writer = None  # reference to the wanted output writer

 

for row in source:

    if status == 0:   # wait a section header rows

        if row[0] == 'Point Name:':

            header = [ row[1:] ]    # the list with the info from the first header row

            status = 1

        else:

            continue

 

    elif status == 1:  # collecting some header rows

        if row[1] == '1 Minutes':

            header.append(row)

            writer = min_writer

        elif row[1] == '60 Minutes':

            header.append(row)

            writer = hr_writer

        elif row[1] == 'Date Range':

            pass                       # Ignore this header row.

        elif row[0] == 'Report Timings:':

            # Ignore this header row. Do output the collected header.

            for r in header:

                writer.writerow(r)

            status = 2                 # Get ready for for the value rows.

 

    elif status == 2:  # collecting the values

        if len(row) == 4:              # detect the row with values

            writer.writerow(row[0:2])  # output the first two columns only

        else:

            writer.writerow([''])      # separator

            status = 0                 # wait for another section

    

    else:

        print 'Unexpected status', status    

        

fout_hr.close()

fout_min.close()

Open in new window

0
 

Author Closing Comment

by:sara_bellum
ID: 31503692
Thanks very much, this works :)  I've been reading up on lists and dictionaries and there's abundant documentation on how to create them by entering the array items manually, but nothing for anyone at my level on file processing.  I'll poke around some more to find out how  to append an item from a row to a list or dictionary to see if I get lucky; then I'll probably open another question.
0
 
LVL 28

Expert Comment

by:pepr
ID: 22684590
For the file objects see http://docs.python.org/library/stdtypes.html#bltin-file-objects. Basically, after opening a file, you can iterate through it via

    for line in f:
        doSomething with the line

The iterator gets text lines as strings with the line terminator (from inside the file). The lines are not stored in a list or whatever.

Use lines = f.readlines() when you want to get all the lines into memory in the form of the 'lines' list.

Use s = f.read() to get the whole content of the file into the string variable. If you use the argument, you get at most that given number of bytes -- like   c = f.read(1)

The row's are the lists that were extracted from the file lines by the csv.reader() intermediate objects. You can think about csv.reader() as about an assistant for helping you with reading the opened file and choping the lines into elements.

Simi
0

Featured Post

Free Trending Threat Insights Every Day

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

Join & Write a Comment

Suggested Solutions

Installing Python 2.7.3 version on Windows operating system For installing Python first we need to download Python's latest version from URL" www.python.org " You can also get information on Python scripting language from the above mentioned we…
Sequence is something that used to store data in it in very simple words. Let us just create a list first. To create a list first of all we need to give a name to our list which I have taken as “COURSE” followed by equals sign and finally enclosed …
Learn the basics of strings in Python: declaration, operations, indices, and slicing. Strings are declared with quotations; for example: s = "string": Strings are immutable.: Strings may be concatenated or multiplied using the addition and multiplic…
Learn the basics of while and for loops in Python.  while loops are used for testing while, or until, a condition is met: The structure of a while loop is as follows:     while <condition>:         do something         repeate: The break statement m…

706 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now