Solved

Identify CSV format with Python script?

Posted on 2014-04-21
3
730 Views
Last Modified: 2014-04-22
Hello Everyone,

I'm very new to Python so please bear with me.  

I currently receive a CSV file every day that (unfortunately for me) has an unpredictable structure.  I never know exactly how many columns of data I have until I open up the file, luckily I do know that it will always be one of two formats.  The CSV files can be either 6 or 10 columns long, standard comma-delimited with CRLF new line indicators.  So that being said, I'm trying to put together a python script that will dynamically identify the file format (presumably by the header row) and then process the data appropriately.  

col1,col2,col3,col4,col5,col6
a,b,c,d,e,f

or

col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
a,b,c,d,e,f,1,2,3,4

Open in new window


I've done some reading on the csv module off the Python website and googled around for similar topics, I've made some progress but now I'm a bit stuck.  Here's the meager code I have so far:

import csv

file_name = "C:\temp\file1.txt"

f = open(file_name,'r')

csv_reader = csv.reader(f,delimiter=',')
ncol = len(next(csv_reader)) #read first line and count columns
print(ncol)

Open in new window


So this presumably gets me the number of columns which I can use for an if statement to fork logic.  The next step I'm trying to get at is to load the CSV data into a SQL Server table.  Now I'm unsure what the appropriate way to do this is.  Do I load each line of the CSV into a list and then insert the list into the SQL table?  Or is there a way to get a two dimensional array going to make this work in a bulk load fashion?  One thing that makes this easier is that the CSV column names and the SQL table column names are the same, so I just have to match them for each line of data.

Thanks in advance,
Glen
0
Comment
Question by:jisoo411
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
3 Comments
 
LVL 29

Accepted Solution

by:
pepr earned 500 total points
ID: 40014188
Firstly for the CSV processing. As you use Python 3, you should open it as shown below (different way when compared with the same for csv in Python 2):
#!python3

import csv

def process(fname):
    # Should be opened with newline='' and the encoding.
    f = open(fname,'r', encoding='utf_8', newline='')
    reader = csv.reader(f, delimiter=',')
    ncol = len(next(reader)) #read first line and count columns
    if ncol == 6:
        process6col(reader)
    if ncol == 10:
        process10col(reader)
    f.close() # you should always close the open file        
              # or better -- use the "with" construct


def process6col(reader):
    print('processing of the 6 column version')
    for row in reader:
        print(row)
    print('--------------------')    

def process10col(reader):
    print('processing of the 10 column version')
    for row in reader:
        print(row)
        for element in row:
            print(element)
    print('--------------------')    


if __name__ == '__main__':
    process('file1.txt')
    process('file2.txt') 

Open in new window

I have used the detection of the variant and calling the related processing function. Of course, you can do it inside one function using the if construct. But if you get another version later, it may be more readable to simply add another special function. I did copy your example data to the file1.txt and file2.txt. Then I can observe:
c:\_Python\jisoo411\Q_28417073>a.py
processing of the 6 column version
['a', 'b', 'c', 'd', 'e', 'f']
--------------------
processing of the 10 column version
['a', 'b', 'c', 'd', 'e', 'f', '1', '2', '3', '4']
a
b
c
d
e
f
1
2
3
4
--------------------

Open in new window

For the insertion into MSSQL, you should probably install the pymssql package (http://pymssql.org/). See the examples of usage here http://pymssql.org/pymssql_examples.html, or here http://code.google.com/p/pymssql/wiki/PymssqlExamples

The later is probably more straightforward for you. You need to form the INSERT command as a template and insert the values or via % operator, or via newer .format() method of the string.
   insert6template = 'INSERT INTO tablename VALUES("{}", "{}", "{}", "{}", "{}", "{}")'  
   insert10template = insert6template[:-1] + ',{}, {}, {}, {})'
   ...
    cur.execute(insert6template.format(*row))

Open in new window


Notice that the string values when inserted need double quoting for the SQL. You have the integers in the variant with 10 elements as integers, but you get them as strings. This is fine because you need to put them into the INSERT template. Only, they are not quoted in the template. The *row expands the list of elements as if you passed them to the format() method as more agruments.
0
 

Author Comment

by:jisoo411
ID: 40015853
Hi Pepr,

Thanks so much for replying.  The structure you posted is very much what I had in mind, the 3rd party that supplies the files to me can be unreliable so I have to take into account the unexpected.  Currently I'm forced to use ceODBC for MSSQL insertion (according to my DBA, other packages had problems with bulk inserting?).  But I imagine it's very similar in style and substance to pymssql so I'll use that as my guide.  

I'm hoping to utilize any sort of bulk loading as the files I'm consuming can have over 100k rows of data, which would be terrible if I had to load them one at a time.  Based on what I see on documentation and in the context of code example above, would a bulk insert look something like this?

cur.executemany(insert6template.format(reader))

Open in new window

Or is the reader object not the appropriate type to pass as an argument?
0
 
LVL 29

Expert Comment

by:pepr
ID: 40016347
You cannot do that with reader and the .execute(). Firstly, the template is just a string and the .format() requires the exact number of arguments to be inserted instead of the placeholders. Also, there is no way how to convert the reader object into a string. The reader acts as the source of a sequence (of rows).

When used in the for loop, the loop contstruct iterates through all of the points of the sequence (where one point is one row).

You can do:  lst = list(reader), because the list() also iterates through the passed object. However, the list grows in memory, whilst iteration via the for loop does not (if you throw away the processed row).

The .executemany(), on the other hand, seems to be designed for iteration: one template, and many rows in the list. I have never used the package but if I were the designer, I woul allow to pass the reader to the .executemany. However, you have to use another kind of placeholders in the template string. It is the older way used in earlier versions of Python. It is still possible to use it instead of .format(). The operator % used like this: result = template % tuple where the result will be the string, and the tuple is the source of data that replace placeholders. The operator % is apparently used inside the implementation of the .executemany(), and the reader can be the source of the tuples (actually of lists, but it does not matter here as both tuples and lists can be iterated). However, the reader must be passed as the second argument of the .executemany(). Warning: it is just a guess. I did not try.
0

Featured Post

Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.
Load balancing is the method of dividing the total amount of work performed by one computer between two or more computers. Its aim is to get more work done in the same amount of time, ensuring that all the users get served faster.
Learn the basics of if, else, and elif statements in Python 2.7. Use "if" statements to test a specified condition.: The structure of an if statement is as follows: (CODE) Use "else" statements to allow the execution of an alternative, if the …
Via a live example, show how to setup several different housekeeping processes for a SQL Server.

738 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question