Link to home
Start Free TrialLog in
Avatar of pfc42355
pfc42355Flag for United States of America

asked on

VBSCRIPT Large File Performance

I have a vbscript that compares two text files and outputs only the non-duplicates.  This script performs well for all but large files (over 100k records).  To process two 400k record files takes 8 - 10 hours.  Is there a faster way?  Please advise.
Dim objFSO: Set objFSO = CreateObject("Scripting.FileSystemObject")
strFile1 = "C:\PROVIDERA.txt"
strfile2 = "C:\PROVIDERB.txt"
strOutput = "C:\PROVIDERC.txt"

Dim File1: Set File1 = objFSO.OpenTextFile(strFile1)
file2 = objFSO.OpenTextFile(strfile2).ReadAll
Dim objOutput: Set objOutput = objFSO.CreateTextFile(strOutput)

Dim WshShell, BtnCode
Set WshShell = WScript.CreateObject("WScript.Shell")
timestart = Now

Do While Not File1.AtEndOfStream
  strRead = File1.ReadLine
  If InStr(file2, strRead) = 0 Then objOutput.WriteLine strRead & " - " & count & " - " & now
  count = count + 1
Loop

timeend = Now

File1.Close

objOutput.Close

MsgBox "All Done - " & count & " Records Read" & timestart & " - " & timeend

Set objOutput = Nothing
Set file2 = Nothing
Set File1 = Nothing
Set objFSO = Nothing

Open in new window

Avatar of Shift-3
Shift-3
Flag of United States of America image

You may get somewhat better performance by using the OpenAsTextStream method for File1.

Also consider increasing the amount of physical RAM on the workstation.  I ran your script against a couple of 400,000 row text files on a machine with 3 GB of RAM and it completed in about five minutes.
Avatar of Bill Prew
Bill Prew

In the two files, id every line unique, or are there duplicates, and if so what percentage?

Have you considered using a dictionary object for the matching rather than instr?

~bp
Avatar of pfc42355

ASKER

I have 4g ram (Vista Business), so I don't understand why its running so long. Should I do something to my Windows config?

Most of the records are duplicates (70%), I've never used dictionary object, why would that be better and could you include a rough example?
>> pfc42355

Since you have a lot of duplicates, you are searching file1 many times to find the same record when you hit duplicates on file2.  So if we can come up with a way to know that we already found (or didn't find) that record quickly and skip the INSTR() search of the file that could save some time.

Can you zip and post a sample of some decent sized test files?  If so I could take a shot at a different approach and see how it performs.

~bp
Also, as I understand the purpose of this script, you are processing file1, and only outputing records that didn't exist in file2.  A couple of questions:

(1) If their are duplicate records in file1 (compared to itself) do you want them to appear multiple times in the output?

(2) For the records in file2 that are not found in file1 you don't want to see those anyplace, right?

~bp
Basically, I receive a current monthly file that contains the same data as the prior month plus a few more transactions.  Therefore because I have already processed the prior months data I want to exclude those records from the current file.

If this can truly run in less than 30 minutes with some Windows tweaking I'm content with the current code as it is functional.  However, I am always open to suggestions.

The data is proprietary so I can't include it for you, but I'll try to create some de-identified data by tomorrow that you can use.
Try this using BinaryCompare
Dim objFSO: Set objFSO = CreateObject("Scripting.FileSystemObject") 
strFile1 = "C:\PROVIDERA.txt" 
strfile2 = "C:\PROVIDERB.txt" 
strOutput = "C:\PROVIDERC.txt" 
 
Dim File1: Set File1 = objFSO.OpenTextFile(strFile1) 
file2 = objFSO.OpenTextFile(strfile2).ReadAll 
Dim objOutput: Set objOutput = objFSO.CreateTextFile(strOutput) 
 
Dim WshShell, BtnCode 
Set WshShell = WScript.CreateObject("WScript.Shell") 
timestart = Now 
 
Do While Not File1.AtEndOfStream 
  strRead = File1.ReadLine 
  If InStr(1, file2, strRead, vbBinaryCompare) = 0 Then objOutput.WriteLine strRead & " - " & count & " - " & now 
  count = count + 1 
Loop 
 
timeend = Now 
 
File1.Close 
 
objOutput.Close 
 
MsgBox "All Done - " & count & " Records Read" & timestart & " - " & timeend 
 
Set objOutput = Nothing 
Set file2 = Nothing 
Set File1 = Nothing 
Set objFSO = Nothing

Open in new window

ASKER CERTIFIED SOLUTION
Avatar of Bill Prew
Bill Prew

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Great Job Big Difference in Time