Advertisement

06.06.2008 at 04:25PM PDT, ID: 23465365 | Points: 500
[x]
Attachment Details

program cannot handle big file

Asked by onyourmark in Python Scripting Language

Tags:

Hi. I have a python file (below) that reads an sqlite database and outputs a cvs file or a tab delimited file. It works if the input database is small (say 5 mb) but not if it is bigger (say 100 mb). Does anyone know how I can get it to work on a larger file?
Here is the code. The input file is called test.db and the output is written to
"C:/his/testdata.csv"
Thank you.Start Free Trial
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:
26:
27:
28:
29:
30:
31:
32:
33:
34:
35:
36:
37:
38:
39:
40:
41:
42:
43:
44:
45:
46:
47:
48:
49:
50:
51:
52:
53:
54:
55:
56:
57:
58:
59:
60:
61:
62:
63:
64:
65:
66:
67:
68:
69:
70:
71:
72:
73:
74:
75:
76:
77:
78:
79:
80:
81:
82:
83:
84:
85:
86:
87:
88:
89:
90:
91:
92:
93:
94:
95:
96:
97:
98:
99:
100:
101:
102:
103:
104:
105:
106:
107:
108:
109:
110:
111:
112:
113:
114:
115:
116:
117:
118:
119:
120:
121:
122:
123:
124:
125:
126:
127:
128:
129:
130:
131:
132:
133:
134:
135:
from sqlite3 import dbapi2 as sqlite
 
#DEFINE  ########################################################
#Do not change these 
class URL_TITLE_OUTPUT_TYPE():
    URL_ONLY = 1
    TITLE_ONLY = 2
    BOTH = 3
 
class CSV_DELIMIT_TYPE():
    TAB = "\t"
    COMMA = "," 
# END DEFINE  ########################################################    
 
#USE the switch var below to config
 
SWITCH_URL_TITLE_OUTPUT = URL_TITLE_OUTPUT_TYPE.TITLE_ONLY  # or URL_TITLE_OUTPUT_TYPE.TITLE_ONLY or URL_TITLE_OUTPUT_TYPE.BOTH
SWITCH_CSV_DELIMIT = CSV_DELIMIT_TYPE.COMMA   # or CSV_DELIMIT_TYPE.COMMA 
 
########################################################        
 
CSV_FILE = "C:/his/testdata.csv"
DATABASE_FILE = "C:/his/test.db"
index_dict ={}
index_dict_for_title = {}
 
def write_to_csv(content_list):
 
   f = open(CSV_FILE,"a")
   f.write(SWITCH_CSV_DELIMIT.join(content_list) + "\n")
   f.flush()
   f.close()
 
def get_csv_output(matrix,words,urls):
 
   title = []
   judge_url_title_output(title,"WebAddress","Titles")
   title.extend(words)
   write_to_csv(title)
 
   for url in urls:
       line = []
       judge_url_title_output(line,url,get_title(matrix,url))
       for word in words:
           line.append(get_amount(matrix,word,url));
       print "finish make url : " + url
       write_to_csv(line)
 
   print "have done..."
 
def judge_url_title_output(line_list,url,title):
    if SWITCH_URL_TITLE_OUTPUT == URL_TITLE_OUTPUT_TYPE.URL_ONLY:
        line_list.append(url)
    elif SWITCH_URL_TITLE_OUTPUT == URL_TITLE_OUTPUT_TYPE.TITLE_ONLY:
        line_list.append(title)
    elif SWITCH_URL_TITLE_OUTPUT == URL_TITLE_OUTPUT_TYPE.BOTH:    
        line_list.append(url)
        line_list.append(title)
    else:
        print "SWITCH_URL_TITLE_OUTPUT var error,can not lookup value ,check it pls"    
        line_list.append(url)
 
 
def get_amount(matrix,word,url):
   key = word + "&" + url
   if index_dict.has_key(key):
       return index_dict[key]
   else:
       return "0"
 
def get_title(matrix,url):
   key =  url
   if index_dict_for_title.has_key(key):
       return index_dict_for_title[key]
   else:
       return ""
 
def make_index_dict(rows):
   key =""
   value = ""
   for row in rows:
       key = row[1]+"&"+row[2]
       index_dict[key] = str(row[0])
       index_dict_for_title[row[2]] = str(row[3])
   print "finish init index_dict..."
 
 
 
 
def start():
 
   conn = sqlite.connect(DATABASE_FILE)
   # I  add titleName column in the sql query
   sql = '''SELECT count( wordlocation.wordid ) AS amount ,
                                   wordlist.word ,urllist.url , titles.titleName
                                       FROM wordlocation , wordlist , urllist, titles
                                       WHERE wordlist.rowid = wordlocation.wordid
                                       AND urllist.rowid = wordlocation.urlid
                                       AND urllist.rowid = titles.urlid
                                       GROUP BY wordlocation.urlid , wordid'''
 
   cu = conn.cursor()
   cu.execute(sql)
   rs = cu.fetchall()
 
   if not rs  or len(rs)==0 :
       print "no data found..."
 
   urls = []
   words = []
   moreIgnoreWords = ['googl', 'blog', 'search', 't', 'link', 'activ',  'background', 'k', 'h', 'font', 'i', 'z', 'displai', 'none', 'div', 'n', 'margin', 'top',  'bold', 'q', 'b', 'ch', 'cursor', 'pointer', 'e', 'p', 'pr', 'ul',   'li', 'list', 'style',  'j', 'line', 'height', 'br', 'sbb', 'td', 'lrr', 'sop', 'gbar', 'float', 'gbh', 'border', 'solid', 'posit', 'absolut', 'gbi', 'fff', 'index', 'guser', 'import', 'media', 'all', 'right', 'vertic', 'align', 'block', 'text', 'decor', 'hover', 'bodi', 'famili', 'arial', 'san', 'serif', 'm', 'tpb', 'ttb', 'white', 'space', 'nowrap', 'rsb', 'btb', 'bt', 'ln', 'ccc', 'hd', 'eee', 's', 'f', 'fl', 'w', 'green', 'img', 'l', 'g', 'tr', 'asb', 'as', 'window', 'function', 'd', 'c', 'var', 'on', 'if', 'addeventlisten', 'fals', 'els', 'attachev', 'appli', 'thi', 'argument', 'return', 'undefin', 'o', 'firstchild', 'tagnam', 'px', 'tg', 'navextra', 'document', 'getelementbyid', 'getelementsbytagnam', 'span', 'event', 'cancelbubbl', 'createel', 'arrai', 'everi', 'createpopup', 'ifram', 'framebord', 'scroll', 'no', 'src', 'javascript', 'parentnod', 'appendchild', 'id', 'for', 'insertbefor', 'classnam', 'click', 'close', 'while', 'do', 'offsetleft', 'offsetpar', 'ss', 'statu', 'TRUE', 'cs', 'ga', 'substr', 'r', 'target', 'srcelement', 'locat', 'href', 'clk', 'url', 'ct', 'cd', 'cad', 'sg', 'rwurl', 'escap', 'replac', 'rdh', 'rdp', 'length', 'host', 'hostnam', 'pathnam', 'field', 'split', 'break', 'new', 'imag', 'blogsearch', 'sa', 'ei', 'rl', 'nln', 'continu', 'sig', 'colonpo', 'indexof', 'posidarrai', 'pb', 'number', 'nan', 'onmousedown', 're', 'web', 'map', 'shop', 'gmail', 'more', 'video', 'group', 'book', 'scholar', 'financ', 'youtub', 'calendar', 'photo', 'reader', 'even', 'raquo', 'sign', 'nbsp', 'advanc', 'prefer', 'result', 'about', 'second', 'toggletoadvanc', 'noadvanc', 'togglefromadvanc', 'publish', 'last', 'hour', 'dai', 'past', 'week', 'month', 'anytim', 'write', 'choos', 'date', 'hide', 'form', 'action', 'http', 'com', 'daterang', 'name', 'onsubmit', 'formatdateinput', 'input', 'as_q', 'type', 'hidden', 'valu', 'num', 'hl', 'en', 'as_epq', 'as_eq', 'lr', 'safe', 'ie', 'iso', 'as_mind', 'as_minm', 'as_mini', 'as_maxd', 'as_maxm', 'as_maxi', 'as_drrb', 'ctz', 'start', 'autocomplet', 'off', 'maxlength', 'onblur', 'checkandclos', 'onfocu', 'open', 'end', 'option', 'btnd', 'submit', 'go', 'gettimezoneoffset', 'montharrai', 'jan', 'feb', 'mar', 'apr', 'mai', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec', 'dowarrai', 'su', 'tu', 'th', 'datedelim', 'ymdpermut', 'mdy', '_calendarpopup', 'setinitialcalendarset', 'class', 'ddcalp_dai', 'curr_dai', 'subscrib', 'alert', 'atom', 'rss', 'sort', 'by', 'relev', 'relat', 'care', 'center', 'inform', 'you', 'can', 'us']
   #moreIgnoreWords = ['flibberish']
   
 
   rows = rs
   print "finish load data..."
 
   filtered_rows = []
   for row in rows:
        if row[1] not in moreIgnoreWords:        
            filtered_rows.append(row)
   rows =  filtered_rows
 
   for row in rows:
        if row[2] not in urls:
            urls.append(row[2])
        if row[1] not in words:
            words.append(row[1])
 
   make_index_dict(rows)
   
   get_csv_output(rows,words,urls)
 
if __name__ =="__main__":
    start()
[+][-]06.06.2008 at 04:40PM PDT, ID: 21733472

At Experts Exchange, members can ask their questions to thousands of technology professionals, also known as Experts. Experts compete and collaborate to answer those questions by leaving comments like this one.

Start your 7-day free trial to view this Expert Comment or ask the Experts your question.

 
[+][-]06.06.2008 at 04:48PM PDT, ID: 21733494

At Experts Exchange, members can ask their questions to thousands of technology professionals, also known as Experts. Experts compete and collaborate to answer those questions by leaving comments like this one.

Start your 7-day free trial to view this Expert Comment or ask the Experts your question.

 
[+][-]06.06.2008 at 04:51PM PDT, ID: 21733509

Often, when Experts are collaborating with members who have asked questions, they will request additional information about the problem. Askers respond with an author comment like this one.

Start your 7-day free trial to view this Author Comment or ask the Experts your question.

 
[+][-]06.06.2008 at 05:06PM PDT, ID: 21733571

At Experts Exchange, members can ask their questions to thousands of technology professionals, also known as Experts. Experts compete and collaborate to answer those questions by leaving comments like this one.

Start your 7-day free trial to view this Expert Comment or ask the Experts your question.

 
[+][-]06.06.2008 at 05:07PM PDT, ID: 21733577

At Experts Exchange, members can ask their questions to thousands of technology professionals, also known as Experts. Experts compete and collaborate to answer those questions by leaving comments like this one.

Start your 7-day free trial to view this Expert Comment or ask the Experts your question.

 
[+][-]06.06.2008 at 05:46PM PDT, ID: 21733695

Often, when Experts are collaborating with members who have asked questions, they will request additional information about the problem. Askers respond with an author comment like this one.

Start your 7-day free trial to view this Author Comment or ask the Experts your question.

 
[+][-]06.06.2008 at 05:49PM PDT, ID: 21733704

Often, when Experts are collaborating with members who have asked questions, they will request additional information about the problem. Askers respond with an author comment like this one.

Start your 7-day free trial to view this Author Comment or ask the Experts your question.

 
[+][-]06.06.2008 at 06:00PM PDT, ID: 21733735

At Experts Exchange, members can ask their questions to thousands of technology professionals, also known as Experts. Experts compete and collaborate to answer those questions by leaving comments like this one.

Start your 7-day free trial to view this Expert Comment or ask the Experts your question.

 
[+][-]06.06.2008 at 06:08PM PDT, ID: 21733754

At Experts Exchange, members can ask their questions to thousands of technology professionals, also known as Experts. Experts compete and collaborate to answer those questions by leaving comments like this one.

Start your 7-day free trial to view this Expert Comment or ask the Experts your question.

 
[+][-]06.07.2008 at 02:58AM PDT, ID: 21734900

Often, when Experts are collaborating with members who have asked questions, they will request additional information about the problem. Askers respond with an author comment like this one.

Start your 7-day free trial to view this Author Comment or ask the Experts your question.

 
[+][-]06.07.2008 at 10:04AM PDT, ID: 21736149

At Experts Exchange, members can ask their questions to thousands of technology professionals, also known as Experts. Experts compete and collaborate to answer those questions by leaving comments like this one.

Start your 7-day free trial to view this Expert Comment or ask the Experts your question.

 
 
Loading Advertisement...
20080716-EE-VQP-32 / EE_QW_2_20070628