Solved

Memory Errors: Large CSV files into SQL Server

Posted on 2014-04-24
Medium Priority
502 Views
Last Modified: 2014-06-13
Hello All,

So this Python neophyte is finally (hopefully) nearing the end of his project, thanks to the advice of some kind and knowledgeable programmers here (as well as Google). But I'm running into what is hopefully my last roadblock. I have a large CSV file, typically around 100k rows and 200+ columns wide, which makes for some pretty large files (100-200 MB).

I'm attempting to use the executemany() function to load these rows in batches, since inserting them one at a time would take far longer than time permits. But I'm running into memory errors. Here's what I have so far:

import csv
import ceODBC

fname = r'c:\temp\myfile.csv'  # raw string: \t would otherwise be read as a tab character

f = open(fname,'r',encoding='utf_8',newline='')
reader = csv.reader(f,delimiter=',')

conn = ceODBC.connect('***connection info***;Trusted_Connection=yes',autocommit=False)
cursor = conn.cursor()

header = next(reader)

#target columns explicitly defined
query = 'insert into mytable (col1,col2,col3...) values ({0})'

query = query.format(','.join('?' * len(header)))

columns = []

#Create a list of lists (aka 2D array)
for row in reader:
   columns.append(row)

#Load all rows in a batch
cursor.executemany(query,columns)

conn.commit()

cursor.close

conn.close

f.close()



Can anyone suggest a modification, or a better way to get this into my table?

Thanks in advance,
Glen
Question by:jisoo411
2 Comments
 

Author Comment

by:jisoo411
ID: 40021184
Rearranging the for loop to commit more frequently seems to help somewhat:

import csv
import ceODBC

fname = r'c:\temp\myfile.csv'  # raw string: \t would otherwise be read as a tab character

f = open(fname,'r',encoding='utf_8',newline='')
reader = csv.reader(f,delimiter=',')

conn = ceODBC.connect('***connection info***;Trusted_Connection=yes',autocommit=False)
cursor = conn.cursor()

header = next(reader)

#target columns explicitly defined
query = 'insert into mytable (col1,col2,col3...) values ({0})'

query = query.format(','.join('?' * len(header)))

columns = []

#Create a list of lists (aka 2D array)
for row in reader:
   columns.append(row)

   if len(columns) > 1000:
       #Load all rows in a batch
       cursor.executemany(query,columns)
       conn.commit()
       columns = []

#Pick up the scraps
cursor.executemany(query,columns)
conn.commit()

cursor.close

conn.close

f.close()



Accepted Solution

by: pepr (earned 2000 total points)
ID: 40021465
Warning: cursor.close and conn.close are never actually called; without the parentheses, those lines only reference the method objects and do nothing. You have to append the parentheses: cursor.close() and conn.close().
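
As a side note, one way to make the cleanup automatic is contextlib.closing from the standard library, which calls close() for you even when an exception interrupts the load. A minimal sketch (the connection string is a placeholder, as in your post):

import contextlib
import ceODBC

# Sketch only: closing() guarantees close() is called on exit,
# even if an exception is raised inside the block.
with contextlib.closing(ceODBC.connect('***connection info***')) as conn:
    with contextlib.closing(conn.cursor()) as cursor:
        cursor.execute('select 1')   # placeholder for the real work
    conn.commit()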

I had a look inside the cursor.executemany implementation. It is written in C and it really requires being passed a list of rows (it does not accept a generator).

To make the code more readable and understandable, I suggest defining your own chunking reader that wraps the normal reader and yields lists of rows of a predefined length.

Have a look at the following example:
#!python3

import csv

def chunking_reader(reader, chunk_size=10):
    try:
        while True:
            # Initialize the list of rows as empty, and then append
            # chunk_size rows taken from the reader.
            result = []
            for counter in range(chunk_size):
                row = next(reader)
                result.append(row)

            # Hand the full chunk to the caller and continue the loop
            # on the next request. (The yield makes this a generator
            # instead of a plain function.)
            yield result

    except StopIteration:
        # When the reader has no more rows, next(reader) raises
        # StopIteration and we get here. Yield whatever was collected
        # so far (fewer than chunk_size rows), unless the last full
        # chunk exhausted the reader exactly. Then return to end the
        # generator cleanly; re-raising StopIteration from inside a
        # generator is an error since Python 3.7 (PEP 479).
        if result:
            yield result
        return


# Simulated content of the csv file.
fname = 'data.csv'
with open(fname, 'w', encoding='utf_8', newline='') as f:
    writer = csv.writer(f)
    for x in range(95):            # here 95 rows hardwired -- for demonstration only
        writer.writerow([x] * 3)   # the row has 3 elements all with the value x

# Now access the csv file via csv.reader and loop via the chunking reader.
with open(fname, 'r', encoding='utf_8', newline='') as f:
    reader = csv.reader(f)
    for list_of_rows in chunking_reader(reader):
        print(list_of_rows)
        print('-----------------------------------------')


It prints:
[['0', '0', '0'], ['1', '1', '1'], ['2', '2', '2'], ['3', '3', '3'], ['4', '4',
'4'], ['5', '5', '5'], ['6', '6', '6'], ['7', '7', '7'], ['8', '8', '8'], ['9',
'9', '9']]
-----------------------------------------
[['10', '10', '10'], ['11', '11', '11'], ['12', '12', '12'], ['13', '13', '13'],
 ['14', '14', '14'], ['15', '15', '15'], ['16', '16', '16'], ['17', '17', '17'],
 ['18', '18', '18'], ['19', '19', '19']]
-----------------------------------------
[['20', '20', '20'], ['21', '21', '21'], ['22', '22', '22'], ['23', '23', '23'],
 ['24', '24', '24'], ['25', '25', '25'], ['26', '26', '26'], ['27', '27', '27'],
 ['28', '28', '28'], ['29', '29', '29']]
-----------------------------------------
[['30', '30', '30'], ['31', '31', '31'], ['32', '32', '32'], ['33', '33', '33'],
 ['34', '34', '34'], ['35', '35', '35'], ['36', '36', '36'], ['37', '37', '37'],
 ['38', '38', '38'], ['39', '39', '39']]
-----------------------------------------
[['40', '40', '40'], ['41', '41', '41'], ['42', '42', '42'], ['43', '43', '43'],
 ['44', '44', '44'], ['45', '45', '45'], ['46', '46', '46'], ['47', '47', '47'],
 ['48', '48', '48'], ['49', '49', '49']]
-----------------------------------------
[['50', '50', '50'], ['51', '51', '51'], ['52', '52', '52'], ['53', '53', '53'],
 ['54', '54', '54'], ['55', '55', '55'], ['56', '56', '56'], ['57', '57', '57'],
 ['58', '58', '58'], ['59', '59', '59']]
-----------------------------------------
[['60', '60', '60'], ['61', '61', '61'], ['62', '62', '62'], ['63', '63', '63'],
 ['64', '64', '64'], ['65', '65', '65'], ['66', '66', '66'], ['67', '67', '67'],
 ['68', '68', '68'], ['69', '69', '69']]
-----------------------------------------
[['70', '70', '70'], ['71', '71', '71'], ['72', '72', '72'], ['73', '73', '73'],
 ['74', '74', '74'], ['75', '75', '75'], ['76', '76', '76'], ['77', '77', '77'],
 ['78', '78', '78'], ['79', '79', '79']]
-----------------------------------------
[['80', '80', '80'], ['81', '81', '81'], ['82', '82', '82'], ['83', '83', '83'],
 ['84', '84', '84'], ['85', '85', '85'], ['86', '86', '86'], ['87', '87', '87'],
 ['88', '88', '88'], ['89', '89', '89']]
-----------------------------------------
[['90', '90', '90'], ['91', '91', '91'], ['92', '92', '92'], ['93', '93', '93'],
 ['94', '94', '94']]
-----------------------------------------


Instead of printing, you will call executemany and commit. Set the default value of the chunk size to fit your needs, or pass it explicitly.
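
For instance, wired into your loading script it could look like the following sketch (the connection string, table name, and column list are the placeholders from your own post; the chunk size of 1000 is only an example):

import csv
import contextlib
import ceODBC

fname = r'c:\temp\myfile.csv'   # raw string so \t is not read as a tab

with open(fname, 'r', encoding='utf_8', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)

    # Placeholders: the real column list and connection info go here.
    query = 'insert into mytable (col1,col2,col3...) values ({0})'
    query = query.format(','.join('?' * len(header)))

    with contextlib.closing(ceODBC.connect('***connection info***;Trusted_Connection=yes',
                                           autocommit=False)) as conn:
        with contextlib.closing(conn.cursor()) as cursor:
            for list_of_rows in chunking_reader(reader, chunk_size=1000):
                cursor.executemany(query, list_of_rows)
                conn.commit()   # commit per chunk keeps memory use bounded

Only one chunk of rows is held in memory at a time, and committing per chunk keeps the transaction from growing without bound.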