

Unzip csv and read in same block of code

Hi, I was just wondering whether it is possible to use the zipfile module in Python to read a zipped csv file and process it in order to create another csv with the csv writer. In other words, I'd like to skip extracting the file to disk and process it while it's in memory. See the code for an example.

Thanks.

Paul
z = zipfile.ZipFile(rawDir + dailyzip, 'r')
 
for info in z.infolist():
    fname = info.filename
    data = z.read(fname)
    for r in data: #iterate through each row in the csv data in memory
             #do processing here before writing to new csv file.

 
pepr Commented:
If I understand you correctly, you want to feed the extracted data to csv.reader() without extracting the content to a physical .csv file. csv.reader() accepts either an open file object or a list of lines, so you can convert the data you read into a list of lines and pass that to csv.reader(). See the snippet. (You can then process each row with your csv.writer().)
import csv
import zipfile
 
zipfname = 'test.zip'
z = zipfile.ZipFile(zipfname, 'r')
 
for info in z.infolist():
    fname = info.filename
    if fname.endswith('.csv'):       # detect the .csv file
        content = z.read(fname)      # read the content 
        data = content.split('\n')   # split by lines to the list
        reader = csv.reader(data)    # and consume by the reader
        for row in reader:           # iterate through the rows
            print repr(row)
           
z.close()
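
For readers on Python 3, where z.read() returns bytes rather than str, the same idea might be sketched as below. This is only a sketch under assumptions: the archive members are taken to be UTF-8 text, and the function name rows_from_zip and the small in-memory demo archive are made up for illustration.

```python
import csv
import io
import zipfile


def rows_from_zip(zip_source):
    """Yield (member name, row) for every .csv member, without extracting to disk."""
    with zipfile.ZipFile(zip_source, 'r') as z:
        for info in z.infolist():
            if not info.filename.endswith('.csv'):
                continue
            # z.open() gives a binary file object; wrap it so csv.reader sees text.
            with z.open(info) as raw:
                text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
                for row in csv.reader(text):
                    yield info.filename, row


# Demo: build a small archive in memory (hypothetical data), then read it back.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('test.csv', 'D,DUNIT,1\nC,END,2\n')
buf.seek(0)

# Write the first two columns of each row to a new csv, also held in memory.
out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
for name, row in rows_from_zip(buf):
    writer.writerow(row[:2])
result = out.getvalue()
```

Passing a path such as 'test.zip' instead of the BytesIO object should work the same way, and replacing the StringIO with a file opened using newline='' would write the new csv to disk.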

 
paulkramer (Author) Commented:
Yes, you understood me correctly. Apologies for the vague description.
 
paulkramer (Author) Commented:
I now have another annoying problem. I'm attempting to store values from a csv file under dictionary keys but am receiving the error "IndexError: list index out of range". I've attached the csv file and the associated code.

Note: I had to convert the csv to txt in order to upload to EE.
    z = zipfile.ZipFile(rawDir + dailyzip, 'r')
 
    for info in z.infolist():
        fname = info.filename
        content = z.read(fname)
        data = content.split('\n')
        reader = csv.reader(data)
        
        for i in reader:
            if i[0] == 'D' and i[1] == 'DUNIT':
                    date,time = i[4].split()
                    m['Date'] = date
                    m['Time'] = time
                    m[i[6]+'-'+date+time] = i[11]
        print m['LYA1-'+date+' '+time]
            

PUBLIC-DAILY-200904150000-200904.zip
 
paulkramer (Author) Commented:
Sorry, I forgot to add the m = {} to the above snippet of code.
 
pepr Commented:
Now I do not understand. Did you solve everything, or is there still an error? Anyway, the "IndexError: list index out of range" most likely comes from indexing i[x]. The code seems a bit fragile. You should detect the case where a row does not contain at least 12 elements for ['D', 'DUNIT', ...]. Also, make sure that i[6] is a string (line 14).

Line 15 assumes that the data was in the csv. Is that always true? You can also use m.get(key, default) in cases where the key may not be present (http://docs.python.org/library/stdtypes.html#dict.get).

As a side note, you should keep the indentation consistent at 4 spaces (lines 11-14).
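
The two checks suggested above (row length and m.get) could be sketched roughly as follows. The column positions 4, 6 and 11 come from the posted snippet; the function name collect_units, the min_fields parameter, the sample rows and the key format (with a space between date and time) are assumptions made only for this illustration, written for Python 3.

```python
def collect_units(rows, min_fields=12):
    """Build {unit-date time: value} from rows shaped like ['D', 'DUNIT', ...]."""
    m = {}
    for row in rows:
        # Skip rows too short to index safely (blank lines, headers, footers).
        if len(row) < min_fields:
            continue
        if row[0] == 'D' and row[1] == 'DUNIT':
            date, time = row[4].split()
            # Assumed key format: the same string is used for storing and lookup.
            m[row[6] + '-' + date + ' ' + time] = row[11]
    return m


# Hypothetical rows: one well-formed record, one empty row, one short row.
good = ['D', 'DUNIT', '', '', '2009/04/15 00:05:00', '', 'LYA1', '', '', '', '', '42.5']
m = collect_units([good, [], ['C', 'END']])

value = m.get('LYA1-2009/04/15 00:05:00', 'n/a')    # key present
missing = m.get('XYZ9-2009/04/15 00:05:00', 'n/a')  # key absent, default returned
```

Whatever key format is chosen, the string used when storing and the string used when looking up need to be built the same way, otherwise m[...] raises KeyError while m.get(...) quietly returns the default.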
 
paulkramer (Author) Commented:
The original problem of binding the data to memory instead of extracting to a file is solved. However, I'm not sure why I'm receiving the dictionary key error.

The test for 'D' and 'DUNIT' will always match within a CSV. I have a feeling the problem has to do with how the content is being converted to a list, which is hindering the search for the two strings. When I print i just after starting the for loop, the whole file is printed and then the other processing takes place (i.e. the key error). For some reason, 'for i in reader' isn't processing each line in the reader.
 
pepr Commented:
Well, the "IndexError: list index out of range" is not related to the dictionary. If you do not observe another error, then this one is probably related to a single row from the reader (the row is a list; it should probably be given a better name than i, which is usually used for an array index value).

Try wrapping the problematic code in a try/except construct to capture the specific exception (http://docs.python.org/tutorial/errors.html#handling-exceptions). Then you can print the context of the problematic data; see the snippet below. Notice also the use of the enumerate() function (http://docs.python.org/library/functions.html#enumerate) to get the number of the line in the .csv where the error was observed.

The truth is that the content of the file was split into lines quite simply using

  data = content.split('\n')

The real file may contain unexpected newlines inside the values of the csv row elements. In that case the splitting would be incorrect. Anyway, you should be able to observe more details using the added code below.

    z = zipfile.ZipFile(rawDir + dailyzip, 'r')
 
    for info in z.infolist():
        fname = info.filename
        content = z.read(fname)
        data = content.split('\n')
        reader = csv.reader(data)
        
        for n, i in enumerate(reader):
            try:
                if i[0] == 'D' and i[1] == 'DUNIT':
                    date,time = i[4].split()
                    m['Date'] = date
                    m['Time'] = time
                    m[i[6]+'-'+date+time] = i[11]
            except IndexError:
                print n, repr(i)
        print m['LYA1-'+date+' '+time]
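
To illustrate the concern about newlines inside values: splitting on '\n' removes the line break before the csv module ever sees it, so a quoted value that spans two lines cannot be recovered faithfully, whereas a file-like object opened with newline='' lets the reader handle the quoting itself. A small sketch with made-up data, assuming Python 3 (where io.StringIO holds text):

```python
import csv
import io

# Hypothetical content: the second field of the first record contains a newline.
content = 'D,"first line\nsecond line",1\nD,plain,2\n'

# Naive approach: split on '\n' first, then hand the pieces to the reader.
naive_rows = list(csv.reader(content.split('\n')))

# Safer approach: give csv.reader a file-like object created with newline=''.
proper_rows = list(csv.reader(io.StringIO(content, newline='')))

# The two results differ; only the second keeps the embedded newline.
same = (naive_rows == proper_rows)
embedded = proper_rows[0][1]
```

If the files never contain quoted multi-line values, the simple split is adequate; the file-like approach is just the more defensive choice.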
            

 
paulkramer (Author) Commented:
Many thanks for those pointers. The try/except construct is detecting a problem at the end of the CSV (line 241260). It reports an empty list [] as the problem, but I have no idea how the empty list is being generated. After the error is handled, the data corresponding to the key is located perfectly, so I suppose the error handling could be a workaround for now. However, it might be a different story once there's more than one zip file to handle.

Any ideas as to why the empty list is being generated?
 
pepr Commented:
There is simply an empty line (one extra newline) at the end of the csv file, i.e. the file is wrongly constructed, or it may be a result of how the content of the zipped .csv file is read. You can manually unzip the zip file and check the content of the csv to see which of the two situations holds.

I suggest not fighting the situation and simply accepting it. Once you know the details, you can solve it in various ways. For example...

If the content (returned via "content = z.read(fname)") always contains the same sequence that leads to the empty line after content.split('\n'), then you may want to split the content without that sequence (like "data = content[:-1].split('\n')" or so).

Or you can decide to process only the rows that have enough elements:
        for i in reader:
            if len(i) > 0:            # or similar test related to the number of elements
                if i[0] == 'D' and i[1] == 'DUNIT':
                    date,time = i[4].split()
                    m['Date'] = date
                    m['Time'] = time
                    m[i[6]+'-'+date+time] = i[11]
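
As for where the empty list comes from: when the content ends with a newline, split('\n') produces one extra empty string at the end, and csv.reader turns an empty line into an empty list. A minimal sketch with made-up content (Python 3 syntax) that shows the effect, together with two possible ways around it:

```python
import csv

# Hypothetical file content that ends with a newline, as most text files do.
content = 'D,DUNIT,1\nD,DUNIT,2\n'

pieces = content.split('\n')            # the trailing newline leaves a final '' piece
rows_split = list(csv.reader(pieces))   # so the last row comes back as []

# Option 1: splitlines() does not produce the trailing empty piece.
rows_lines = list(csv.reader(content.splitlines()))

# Option 2: keep split('\n') and drop empty rows afterwards.
rows_filtered = [row for row in rows_split if row]
```

Option 2 also copes with blank lines in the middle of a file, which splitlines() alone would not remove.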
