• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 786

Identify CSV format with Python script?

Hello Everyone,

I'm very new to Python so please bear with me.  

I currently receive a CSV file every day that (unfortunately for me) has an unpredictable structure.  I never know exactly how many columns of data I have until I open the file; luckily, I do know it will always be one of two formats.  The CSV files are either 6 or 10 columns wide, standard comma-delimited with CRLF line endings.  So that being said, I'm trying to put together a Python script that will dynamically identify the file format (presumably by the header row) and then process the data appropriately.  


I've done some reading on the csv module on the Python website and googled around for similar topics.  I've made some progress, but now I'm a bit stuck.  Here's the meager code I have so far:

import csv

file_name = r"C:\temp\file1.txt"  # raw string: otherwise "\t" is read as a tab character

f = open(file_name,'r')

csv_reader = csv.reader(f,delimiter=',')
ncol = len(next(csv_reader)) #read first line and count columns


So this presumably gets me the number of columns which I can use for an if statement to fork logic.  The next step I'm trying to get at is to load the CSV data into a SQL Server table.  Now I'm unsure what the appropriate way to do this is.  Do I load each line of the CSV into a list and then insert the list into the SQL table?  Or is there a way to get a two dimensional array going to make this work in a bulk load fashion?  One thing that makes this easier is that the CSV column names and the SQL table column names are the same, so I just have to match them for each line of data.
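For what it's worth, list(reader) does give the two-dimensional structure described above: a list of row lists. A minimal sketch (the sample data and column names are invented for illustration; io.StringIO stands in for an open file):

```python
import csv
import io

# sample data standing in for the daily file
data = io.StringIO("colA,colB,colC\n1,2,3\n4,5,6\n")
reader = csv.reader(data, delimiter=',')

header = next(reader)   # header row: ['colA', 'colB', 'colC']
rows = list(reader)     # two-dimensional: a list of row lists
print(rows)             # [['1', '2', '3'], ['4', '5', '6']]
```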

Thanks in advance,
1 Solution
Firstly, for the CSV processing: since you are using Python 3, you should open the file as shown below (this differs from how the csv module is used in Python 2):

import csv

def process(fname):
    # In Python 3, CSV files should be opened with newline='' and an explicit encoding.
    # The "with" construct closes the file automatically.
    with open(fname, 'r', encoding='utf_8', newline='') as f:
        reader = csv.reader(f, delimiter=',')
        ncol = len(next(reader))  # read first line and count columns
        if ncol == 6:
            process6col(reader)
        elif ncol == 10:
            process10col(reader)

def process6col(reader):
    print('processing of the 6 column version')
    for row in reader:
        print(row)

def process10col(reader):
    print('processing of the 10 column version')
    for row in reader:
        print(row)

if __name__ == '__main__':
    process('file1.txt')
    process('file2.txt')


I used detection of the variant followed by a call to the related processing function. Of course, you could do it all inside one function using the if construct. But if you get yet another variant later, it may be more readable to simply add another dedicated function. I copied your example data to file1.txt and file2.txt, and then I observe:
processing of the 6 column version
['a', 'b', 'c', 'd', 'e', 'f']
processing of the 10 column version
['a', 'b', 'c', 'd', 'e', 'f', '1', '2', '3', '4']


For the insertion into MSSQL, you should probably install the pymssql package (http://pymssql.org/). See the usage examples at http://pymssql.org/pymssql_examples.html or at http://code.google.com/p/pymssql/wiki/PymssqlExamples

The latter is probably more straightforward for you. You need to form the INSERT command as a template and insert the values either via the % operator or via the newer .format() method of the string:
   insert6template = "INSERT INTO tablename VALUES ('{}', '{}', '{}', '{}', '{}', '{}')"
   insert10template = insert6template[:-1] + ", {}, {}, {}, {})"


Notice that the string values need to be quoted for SQL when inserted, which the template handles. You have integers in the 10-element variant, but the reader gives them to you as strings. This is fine because you only need to substitute them into the INSERT template; they are simply left unquoted there. The *row expands the list of elements as if you had passed them to the .format() method as separate arguments.
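A minimal sketch of that *row expansion ("tablename" and the row values here are placeholders):

```python
# hypothetical template for the 6-column case
insert6template = "INSERT INTO tablename VALUES ('{}', '{}', '{}', '{}', '{}', '{}')"

row = ['a', 'b', 'c', 'd', 'e', 'f']  # one row as produced by csv.reader
sql = insert6template.format(*row)    # *row unpacks the list into six arguments
print(sql)  # INSERT INTO tablename VALUES ('a', 'b', 'c', 'd', 'e', 'f')
```

Keep in mind that building SQL by string formatting is fragile (quoting and injection problems); DB-API parameter passing, where the driver substitutes the values itself, is generally safer.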
jisoo411Author Commented:
Hi Pepr,

Thanks so much for replying.  The structure you posted is very much what I had in mind; the 3rd party that supplies the files can be unreliable, so I have to account for the unexpected.  Currently I'm forced to use ceODBC for MSSQL insertion (according to my DBA, other packages had problems with bulk inserting?).  But I imagine it's very similar in style and substance to pymssql, so I'll use that as my guide.  

I'm hoping to use some sort of bulk loading, as the files I'm consuming can have over 100k rows of data, and it would be terrible to load them one at a time.  Based on what I see in the documentation and in the context of the code example above, would a bulk insert look something like this?

Or is the reader object not the appropriate type to pass as an argument?
You cannot do that with reader and .execute(). Firstly, the template is just a string, and the .format() method requires exactly as many arguments as there are placeholders. Also, there is no way to convert the reader object into a string. The reader acts as the source of a sequence (of rows).

When used in a for loop, the loop construct iterates through all the items of that sequence (where one item is one row).

You can do lst = list(reader), because list() also iterates through the passed object. However, the list grows in memory, whereas iteration via the for loop does not (provided you throw away each processed row).

The .executemany(), on the other hand, seems designed for exactly this kind of iteration: one template, many rows. I have never used the package, but if I were the designer, I would allow the reader to be passed to .executemany(). However, you have to use a different kind of placeholder in the template string: the older %-style formatting, which predates .format() but is still available. The % operator is used like result = template % values, where result is the formatted string and values is a tuple whose items replace the placeholders. That % substitution is apparently what .executemany() uses internally, and the reader can be the source of the tuples (actually of lists, but that does not matter here, as both tuples and lists can be iterated). The reader would then be passed as the second argument of .executemany(). Warning: this is just a guess; I did not try it.
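Since I cannot test against MSSQL here, a sketch of that .executemany() pattern using the standard-library sqlite3 module as a stand-in (sqlite3 uses ? placeholders where pymssql uses %s, but the cursor interface is the same DB-API shape; the table and column names are invented):

```python
import csv
import io
import sqlite3

# sample CSV data standing in for the daily 6-column file
data = io.StringIO("a,b,c,d,e,f\n1,2,3,4,5,6\n7,8,9,10,11,12\n")
reader = csv.reader(data, delimiter=',')
header = next(reader)  # consume the header row

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE t (a TEXT, b TEXT, c TEXT, d TEXT, e TEXT, f TEXT)')

# executemany() accepts any iterable of rows, so the reader can be
# passed directly -- the rows are never all held in memory at once
cur.executemany('INSERT INTO t VALUES (?, ?, ?, ?, ?, ?)', reader)
conn.commit()

cur.execute('SELECT COUNT(*) FROM t')
print(cur.fetchone()[0])  # 2 data rows inserted
```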
Question has a verified solution.
