Identify CSV format with Python script?

Posted on 2014-04-21
Last Modified: 2014-04-22
Hello Everyone,

I'm very new to Python so please bear with me.  

I currently receive a CSV file every day that (unfortunately for me) has an unpredictable structure.  I never know exactly how many columns of data I have until I open the file, but luckily I do know it will always be one of two formats: either 6 or 10 columns, standard comma-delimited with CRLF line endings.  So, that being said, I'm trying to put together a Python script that will dynamically identify the file format (presumably by the header row) and then process the data appropriately.

I've done some reading on the csv module on the Python website and googled around for similar topics.  I've made some progress, but now I'm a bit stuck.  Here's the meager code I have so far:

import csv

# Use a raw string (or forward slashes) for the path -- in a normal string
# literal the "\t" in "C:\temp" would be interpreted as a tab character.
file_name = r"C:\temp\file1.txt"

f = open(file_name, 'r')

csv_reader = csv.reader(f, delimiter=',')
ncol = len(next(csv_reader))  # read first line and count columns


So this presumably gets me the number of columns, which I can use in an if statement to fork the logic.  The next step I'm trying to get at is loading the CSV data into a SQL Server table, and I'm unsure of the appropriate way to do this.  Do I load each line of the CSV into a list and then insert the list into the SQL table?  Or is there a way to build a two-dimensional array so this works in a bulk-load fashion?  One thing that makes this easier is that the CSV column names and the SQL table column names are the same, so I just have to match them for each line of data.

Thanks in advance,
Question by:jisoo411

Accepted Solution

pepr earned 500 total points
ID: 40014188
First, the CSV processing. Since you use Python 3, you should open the file as shown below (this differs from how the csv module is used in Python 2):

import csv

def process(fname):
    # The Python 3 csv module wants the file opened with newline=''
    # and an explicit encoding; the "with" construct guarantees the
    # file is closed even if an exception is raised.
    with open(fname, 'r', encoding='utf_8', newline='') as f:
        reader = csv.reader(f, delimiter=',')
        ncol = len(next(reader))  # read the first line and count columns
        if ncol == 6:
            process6col(reader)
        elif ncol == 10:
            process10col(reader)

def process6col(reader):
    print('processing of the 6 column version')
    for row in reader:
        print(row)

def process10col(reader):
    print('processing of the 10 column version')
    for row in reader:
        print(row)

if __name__ == '__main__':
    process('file1.txt')
    process('file2.txt')


I have used detection of the variant followed by a call to the related processing function. Of course, you can do it all inside one function using an if construct, but if you get another version later, it may be more readable to simply add another specialized function. I copied your example data to file1.txt and file2.txt, and then I observe:
processing of the 6 column version
['a', 'b', 'c', 'd', 'e', 'f']
processing of the 10 column version
['a', 'b', 'c', 'd', 'e', 'f', '1', '2', '3', '4']


For the insertion into MSSQL, you should probably install the pymssql package and look at its usage examples.

The latter is probably more straightforward for you. You need to form the INSERT command as a template and insert the values either via the % operator or via the newer .format() method of the string:
insert6template = 'INSERT INTO tablename VALUES("{}", "{}", "{}", "{}", "{}", "{}")'
insert10template = insert6template[:-1] + ', {}, {}, {}, {})'


Notice that string values need double quoting when inserted into SQL. The last four values in the 10-column variant are integers, but the csv reader gives them to you as strings. This is fine, because you need to put them into the INSERT template anyway; they are simply left unquoted in the template. The *row syntax expands the list of elements as if you passed each of them to the .format() method as a separate argument.
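For example, a minimal sketch of expanding one csv row into the 6-column template (the table name and row values are illustrative):

```python
insert6template = 'INSERT INTO tablename VALUES("{}", "{}", "{}", "{}", "{}", "{}")'

row = ['a', 'b', 'c', 'd', 'e', 'f']  # one row as returned by csv.reader
sql = insert6template.format(*row)    # *row supplies one argument per placeholder
print(sql)  # -> INSERT INTO tablename VALUES("a", "b", "c", "d", "e", "f")
```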

Author Comment

ID: 40015853
Hi Pepr,

Thanks so much for replying.  The structure you posted is very much what I had in mind; the 3rd party that supplies the files to me can be unreliable, so I have to account for the unexpected.  Currently I'm forced to use ceODBC for MSSQL insertion (according to my DBA, other packages had problems with bulk inserting).  But I imagine it's very similar in style and substance to pymssql, so I'll use that as my guide.

I'm hoping to utilize some sort of bulk loading, as the files I'm consuming can have over 100k rows of data, which would be terrible to load one at a time.  Based on what I see in the documentation and in the context of the code example above, would a bulk insert look something like this?



Or is the reader object not the appropriate type to pass as an argument?

Expert Comment

ID: 40016347
You cannot do that with the reader and .execute(). Firstly, the template is just a string, and .format() requires exactly as many arguments as there are placeholders. Also, there is no way to convert the reader object into a string: the reader acts as the source of a sequence (of rows).

When used in a for loop, the loop construct iterates through all items of the sequence (where one item is one row).

You can do lst = list(reader), because list() also iterates through the passed object. However, the list grows in memory, whereas iteration via the for loop does not (provided you throw away each processed row).
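A small sketch of the materializing behaviour, using io.StringIO as a stand-in for an opened file (the sample data is illustrative):

```python
import csv
import io

data = 'a,b\r\n1,2\r\n3,4\r\n'  # illustrative CRLF-terminated csv content
reader = csv.reader(io.StringIO(data, newline=''), delimiter=',')

lst = list(reader)  # pulls every row into memory at once
print(lst)          # -> [['a', 'b'], ['1', '2'], ['3', '4']]
```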

The .executemany(), on the other hand, seems to be designed for exactly this kind of iteration: one template, and many rows. I have never used the package, but if I were the designer, I would allow the reader to be passed to .executemany(). However, you have to use another kind of placeholder in the template string: the older style from earlier versions of Python, which is still usable instead of .format(). The % operator is used like this: result = template % data_tuple, where result is the final string and the tuple is the source of the values that replace the placeholders. The % operator is apparently used inside the implementation of .executemany(), and the reader can act as the source of the tuples (actually of lists, but that does not matter here, as both tuples and lists can be iterated). The reader must be passed as the second argument of .executemany(). Warning: this is just a guess; I did not try it.
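As a rough sketch of that idea, using the standard-library sqlite3 module as a stand-in DB-API driver (ceODBC and pymssql follow the same DB-API pattern, though their placeholder style may be %s rather than ?; the table name and file contents here are illustrative):

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE tablename (a TEXT, b TEXT, c TEXT, '
            'd TEXT, e TEXT, f TEXT)')

# Stand-in for the opened csv file; a real file would come from
# open(fname, 'r', encoding='utf_8', newline='').
data = 'a,b,c,d,e,f\r\n1,2,3,4,5,6\r\n7,8,9,10,11,12\r\n'
reader = csv.reader(io.StringIO(data, newline=''), delimiter=',')

next(reader)  # skip the header row
# The reader itself is the second argument: executemany() consumes it
# row by row, so the whole file is never held in memory at once.
cur.executemany('INSERT INTO tablename VALUES (?, ?, ?, ?, ?, ?)', reader)
conn.commit()

print(cur.execute('SELECT COUNT(*) FROM tablename').fetchone()[0])  # -> 2
```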
