Solved

Identify CSV format with Python script?

Posted on 2014-04-21
705 Views
Last Modified: 2014-04-22
Hello Everyone,

I'm very new to Python, so please bear with me.

I currently receive a CSV file every day that (unfortunately for me) has an unpredictable structure. I never know exactly how many columns of data I have until I open the file; luckily, I do know that it will always be one of two formats. The CSV files are either 6 or 10 columns wide, standard comma-delimited with CRLF line endings. So, that being said, I'm trying to put together a Python script that will dynamically identify the file format (presumably by the header row) and then process the data appropriately:

col1,col2,col3,col4,col5,col6
a,b,c,d,e,f

or

col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
a,b,c,d,e,f,1,2,3,4



I've done some reading on the csv module on the Python website and googled around for similar topics. I've made some progress, but now I'm a bit stuck. Here's the meager code I have so far:

import csv

# Use a raw string (or doubled backslashes) so "\t" is not treated as a tab character
file_name = r"C:\temp\file1.txt"

f = open(file_name, 'r')

csv_reader = csv.reader(f, delimiter=',')
ncol = len(next(csv_reader))  # read the header row and count its columns
print(ncol)



So this presumably gets me the number of columns, which I can use in an if statement to fork the logic. The next step is to load the CSV data into a SQL Server table, and I'm unsure what the appropriate way to do that is. Do I load each line of the CSV into a list and then insert the list into the SQL table? Or is there a way to get a two-dimensional array going so this can work in a bulk-load fashion? One thing that makes this easier is that the CSV column names and the SQL table column names are the same, so I just have to match them for each line of data.
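
For reference, here is roughly what I mean by a two-dimensional structure (just a sketch on my side; nothing is settled yet):

import csv

# Sketch: read the header and then all data rows into a list of lists,
# which could then be handed to some kind of bulk insert in one go.
with open(r"C:\temp\file1.txt", 'r', newline='') as f:
    csv_reader = csv.reader(f, delimiter=',')
    header = next(csv_reader)            # first line: the column names
    rows = [row for row in csv_reader]   # remaining lines: the data

print(len(header), 'columns,', len(rows), 'data rows')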

Thanks in advance,
Glen
Question by:jisoo411
3 Comments
 
Accepted Solution by: pepr (earned 500 total points)
Firstly, the CSV processing. Since you are using Python 3, you should open the file as shown below (this differs from how the csv module is used in Python 2):
#!python3

import csv

def process(fname):
    # The file should be opened with newline='' and an explicit encoding.
    f = open(fname, 'r', encoding='utf_8', newline='')
    reader = csv.reader(f, delimiter=',')
    ncol = len(next(reader))  # read the first line and count its columns
    if ncol == 6:
        process6col(reader)
    elif ncol == 10:
        process10col(reader)
    f.close()  # you should always close the open file
               # or better -- use the "with" construct


def process6col(reader):
    print('processing of the 6 column version')
    for row in reader:
        print(row)
    print('--------------------')

def process10col(reader):
    print('processing of the 10 column version')
    for row in reader:
        print(row)
        for element in row:
            print(element)
    print('--------------------')


if __name__ == '__main__':
    process('file1.txt')
    process('file2.txt')


I detect the variant and then call the related processing function. Of course, you can do it all inside one function using an if construct, but if you get another variant later, it may be more readable to simply add another dedicated function. I copied your example data into file1.txt and file2.txt. Then I can observe:
c:\_Python\jisoo411\Q_28417073>a.py
processing of the 6 column version
['a', 'b', 'c', 'd', 'e', 'f']
--------------------
processing of the 10 column version
['a', 'b', 'c', 'd', 'e', 'f', '1', '2', '3', '4']
a
b
c
d
e
f
1
2
3
4
--------------------
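
By the way, if yet another variant appears later, one possible variation (just a sketch, reusing the functions above) is to dispatch through a dictionary keyed by the column count instead of the if/elif:

# Sketch only: map the column count to the handler function, so a new
# variant is just one more dictionary entry.
handlers = {6: process6col, 10: process10col}

def process(fname):
    with open(fname, 'r', encoding='utf_8', newline='') as f:
        reader = csv.reader(f, delimiter=',')
        ncol = len(next(reader))      # the header row gives the column count
        handler = handlers.get(ncol)
        if handler is None:
            raise ValueError('unexpected number of columns: {}'.format(ncol))
        handler(reader)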


For the insertion into MSSQL, you should probably install the pymssql package (http://pymssql.org/). See the examples of usage here http://pymssql.org/pymssql_examples.html, or here http://code.google.com/p/pymssql/wiki/PymssqlExamples

The latter is probably more straightforward for you. You need to form the INSERT command as a template and fill in the values either via the % operator or via the newer .format() method of the string:
insert6template = "INSERT INTO tablename VALUES('{}', '{}', '{}', '{}', '{}', '{}')"
insert10template = insert6template[:-1] + ", {}, {}, {}, {})"
...
cur.execute(insert6template.format(*row))



Notice that the string values need to be quoted (single quotes) when inserted into the SQL statement. The variant with 10 elements contains integers, but you get them from the reader as strings. That is fine, because you only put them into the INSERT template; they are simply not quoted there. The *row expands the list of elements as if you had passed them to the .format() method as separate arguments.
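
An alternative sketch, in case you prefer to let the driver do the quoting: pymssql (like other DB-API drivers) accepts %s placeholders plus a tuple of parameters, so you do not have to build the quoted literals yourself. The connection details and the table name below are made up:

import csv
import pymssql  # assuming the pymssql package is installed

# Sketch only: server, credentials, database and table name are placeholders.
conn = pymssql.connect(server='myserver', user='me', password='secret', database='mydb')
cur = conn.cursor()

insert6sql = 'INSERT INTO tablename VALUES (%s, %s, %s, %s, %s, %s)'

with open('file1.txt', 'r', encoding='utf_8', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader)                             # skip the header row
    for row in reader:
        cur.execute(insert6sql, tuple(row))  # the driver quotes the values itself

conn.commit()
conn.close()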
 

Author Comment by:jisoo411
Hi Pepr,

Thanks so much for replying. The structure you posted is very much what I had in mind; the 3rd party that supplies the files can be unreliable, so I have to take the unexpected into account. Currently I'm forced to use ceODBC for MSSQL insertion (according to my DBA, other packages had problems with bulk inserting?). But I imagine it's very similar in style and substance to pymssql, so I'll use that as my guide.

I'm hoping to utilize some sort of bulk loading, as the files I'm consuming can have over 100k rows of data, which would be terrible to load one row at a time. Based on what I see in the documentation, and in the context of the code example above, would a bulk insert look something like this?

cur.executemany(insert6template.format(reader))


Or is the reader object not the appropriate type to pass as an argument?
 

Expert Comment by:pepr
You cannot do that; the reader cannot be passed to .format() like that. Firstly, the template is just a string, and .format() requires the exact number of arguments to fill the placeholders. Also, there is no way to convert the reader object into a string; the reader acts as the source of a sequence (of rows).

When used in a for loop, the loop construct iterates through all items of the sequence (where one item is one row).

You can do lst = list(reader), because list() also iterates through the passed object. However, the list grows in memory, whilst iteration via the for loop does not (if you throw away each processed row).

The .executemany(), on the other hand, seems to be designed for exactly this kind of iteration: one template and many rows. I have never used the package, but if I were the designer, I would allow the reader to be passed to .executemany(). However, you have to use another kind of placeholder in the template string. It is the older style from earlier versions of Python, and it is still possible to use it instead of .format(): the % operator, used as result = template % values, where result is the resulting string and values is the tuple whose elements replace the placeholders. The % operator is apparently used inside the implementation of .executemany(), and the reader can be the source of the tuples (actually of lists, but that does not matter here, as both tuples and lists can be iterated). However, the reader (or a list of rows) must be passed as the second argument of .executemany(). Warning: this is just a guess; I did not try it.
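
In other words, something along these lines might work (an untested sketch following the usual DB-API pattern; connection details and the table name are made up, and ceODBC may expect a different placeholder style than %s, so check its documentation):

import csv
import pymssql  # placeholder choice of driver; ceODBC would be used analogously

# Untested sketch: one parameterized template, many rows in a single executemany() call.
conn = pymssql.connect(server='myserver', user='me', password='secret', database='mydb')
cur = conn.cursor()

insert6sql = 'INSERT INTO tablename VALUES (%s, %s, %s, %s, %s, %s)'

with open('file1.txt', 'r', encoding='utf_8', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader)                               # skip the header row
    cur.executemany(insert6sql, list(reader))  # list() materializes all rows;
                                               # passing the reader itself may also work
                                               # if the driver accepts any iterable

conn.commit()
conn.close()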
