• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1723
  • Last Modified:

VBSCRIPT Large File Performance

I have a vbscript that compares two text files and outputs only the non-duplicates.  This script performs well for all but large files (over 100k records).  To process two 400k record files takes 8 - 10 hours.  Is there a faster way?  Please advise.
Dim objFSO: Set objFSO = CreateObject("Scripting.FileSystemObject")
strFile1 = "C:\PROVIDERA.txt"
strfile2 = "C:\PROVIDERB.txt"
strOutput = "C:\PROVIDERC.txt"

Dim File1: Set File1 = objFSO.OpenTextFile(strFile1)
file2 = objFSO.OpenTextFile(strfile2).ReadAll
Dim objOutput: Set objOutput = objFSO.CreateTextFile(strOutput)

Dim WshShell, BtnCode
Set WshShell = WScript.CreateObject("WScript.Shell")
timestart = Now

Do While Not File1.AtEndOfStream
  strRead = File1.ReadLine
  If InStr(file2, strRead) = 0 Then objOutput.WriteLine strRead & " - " & count & " - " & now
  count = count + 1
Loop

timeend = Now

File1.Close

objOutput.Close

MsgBox "All Done - " & count & " Records Read" & timestart & " - " & timeend

Set objOutput = Nothing
Set file2 = Nothing
Set File1 = Nothing
Set objFSO = Nothing

Open in new window

0
pfc42355
Asked:
pfc42355
1 Solution
 
Shift-3Commented:
You may get somewhat better performance by using the OpenAsTextStream method for File1.

Also consider increasing the amount of physical RAM on the workstation.  I ran your script against a couple of 400,000 row text files on a machine with 3 GB of RAM and it completed in about five minutes.
0
 
Bill PrewCommented:
In the two files, id every line unique, or are there duplicates, and if so what percentage?

Have you considered using a dictionary object for the matching rather than instr?

~bp
0
 
pfc42355Author Commented:
I have 4g ram (Vista Business), so I don't understand why its running so long. Should I do something to my Windows config?

Most of the records are duplicates (70%), I've never used dictionary object, why would that be better and could you include a rough example?
0
 [eBook] Windows Nano Server

Download this FREE eBook and learn all you need to get started with Windows Nano Server, including deployment options, remote management
and troubleshooting tips and tricks

 
Bill PrewCommented:
>> pfc42355

Since you have a lot of duplicates, you are searching file1 many times to find the same record when you hit duplicates on file2.  So if we can come up with a way to know that we already found (or didn't find) that record quickly and skip the INSTR() search of the file that could save some time.

Can you zip and post a sample of some decent sized test files?  If so I could take a shot at a different approach and see how it performs.

~bp
0
 
Bill PrewCommented:
Also, as I understand the purpose of this script, you are processing file1, and only outputing records that didn't exist in file2.  A couple of questions:

(1) If their are duplicate records in file1 (compared to itself) do you want them to appear multiple times in the output?

(2) For the records in file2 that are not found in file1 you don't want to see those anyplace, right?

~bp
0
 
pfc42355Author Commented:
Basically, I receive a current monthly file that contains the same data as the prior month plus a few more transactions.  Therefore because I have already processed the prior months data I want to exclude those records from the current file.

If this can truly run in less than 30 minutes with some Windows tweaking I'm content with the current code as it is functional.  However, I am always open to suggestions.

The data is proprietary so I can't include it for you, but I'll try to create some de-identified data by tomorrow that you can use.
0
 
fsouzabrasilCommented:
Try this using BinaryCompare
Dim objFSO: Set objFSO = CreateObject("Scripting.FileSystemObject") 
strFile1 = "C:\PROVIDERA.txt" 
strfile2 = "C:\PROVIDERB.txt" 
strOutput = "C:\PROVIDERC.txt" 
 
Dim File1: Set File1 = objFSO.OpenTextFile(strFile1) 
file2 = objFSO.OpenTextFile(strfile2).ReadAll 
Dim objOutput: Set objOutput = objFSO.CreateTextFile(strOutput) 
 
Dim WshShell, BtnCode 
Set WshShell = WScript.CreateObject("WScript.Shell") 
timestart = Now 
 
Do While Not File1.AtEndOfStream 
  strRead = File1.ReadLine 
  If InStr(1, file2, strRead, vbBinaryCompare) = 0 Then objOutput.WriteLine strRead & " - " & count & " - " & now 
  count = count + 1 
Loop 
 
timeend = Now 
 
File1.Close 
 
objOutput.Close 
 
MsgBox "All Done - " & count & " Records Read" & timestart & " - " & timeend 
 
Set objOutput = Nothing 
Set file2 = Nothing 
Set File1 = Nothing 
Set objFSO = Nothing

Open in new window

0
 
Bill PrewCommented:
Give this script a try, on my system with a test of about 10000 records it was WAY faster than your original code.

~bp
Const strFile1 = "C:\PROVIDERA.txt"
Const strfile2 = "C:\PROVIDERB.txt"
Const strOutput = "C:\PROVIDERC.txt"
 
Dim objFSO
Dim objDict1
Dim objDict2
Dim objFile1
Dim objFile2
Dim objOutput
Dim objShell
Dim BtnCode
Dim intCount1
Dim intCount2
Dim intUnique1
Dim intUnique2
Dim intOutput
Dim timStart
Dim timEnd
 
Set objShell = WScript.CreateObject("WScript.Shell")
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objDict1 = CreateObject("Scripting.Dictionary")
Set objDict2 = CreateObject("Scripting.Dictionary")
 
Set objFile1 = objFSO.OpenTextFile(strFile1)
Set objFile2 = objFSO.OpenTextFile(strFile2)
Set objOutput = objFSO.CreateTextFile(strOutput)
 
timStart = Now
 
intCount2 = 0
intUnique2 = 0
Do While Not objFile2.AtEndOfStream
  strRead = objFile2.ReadLine
  intCount2 = intCount2 + 1
  If Not objDict2.Exists(strRead) Then
    intUnique2 = intUnique2 + 1
    objDict2.Add strRead, False
  End If
Loop
 
intCount1 = 0
intUnique1 = 0
Do While Not objFile1.AtEndOfStream
  strRead = objFile1.ReadLine
  intCount1 = intCount1 + 1
  If objDict1.Exists(strRead) Then
    If objDict1.Item(strRead) Then
      'exists in file2
    Else
      'doesn't exist in file2
      objOutput.WriteLine strRead
      intOutput = intOutput + 1
    End If
  Else
    intUnique1 = intUnique1 + 1
    If objDict2.Exists(strRead) Then
      'exists in file2
      objDict1.Add strRead, True
    Else
      'doesn't exist in file2
      objDict1.Add strRead, False
      objOutput.WriteLine strRead
      intOutput = intOutput + 1
    End If
  End If
 
Loop
 
timEnd = Now
 
objFile1.Close
objFile2.Close
objOutput.Close
 
Wscript.Echo "All Done."
Wscript.Echo "Time Start   :" & timStart
Wscript.Echo "Time End     :" & timEnd
Wscript.Echo "File1 Count  :" & intCount1
Wscript.Echo "File2 Count  :" & intCount2
Wscript.Echo "File1 Unique :" & intUnique1
Wscript.Echo "File2 Unique :" & intUnique2
Wscript.Echo "Output Count :" & intOutput
 
Set objOutput = Nothing
Set objFile2 = Nothing
Set objFile1 = Nothing
Set objFSO = Nothing
Set objDict1 = Nothing
Set objDict2 = Nothing
Set objShell = Nothing

Open in new window

0
 
pfc42355Author Commented:
Great Job Big Difference in Time
0

Featured Post

A Cyber Security RX to Protect Your Organization

Join us on December 13th for a webinar to learn how medical providers can defend against malware with a cyber security "Rx" that supports a healthy technology adoption plan for every healthcare organization.

Tackle projects and never again get stuck behind a technical roadblock.
Join Now