Avatar of elwayisgod
elwayisgod
Flag for United States of America asked on

VB Script to mimic this DOS Batch Script

I have a DOS script working great.  Problem is it takes 45 min to churn through files.  Some are pretty big.  Was wondering if there could be and equivalent VB script that could possibly churn through this faster?

@echo off
setlocal EnableDelayedExpansion

REM get a file name and the YEAR column
for /F "tokens=1,2 delims=|" %%A in (filelist.txt) do call :Extract %%A %%B

exit /b

:Extract FileName YearColNo
set filename=%1
set colno=%2

@echo Processing file %filename%

REM Get the header line
set head=
for /F "usebackq delims=" %%F in ("%filename%") do if not defined head set head=%%F

REM Construct RegEx for matching the correct year column
set pat=
for /L %%L in (2,1,%colno%) do set "pat=!pat![^^|]*|"
set "pat=^^!pat!2004^|"

> 2004_%filename%     echo,!head!
>>2004_%filename%     findstr /R "%pat%" %filename%
REM Head will be added automatically
REM > Non_2004_%filename% findstr /V /R "%pat%" %filename%
VB ScriptWindows BatchVisual Basic Classic

Avatar of undefined
Last Comment
Robert Schutt

8/22/2022 - Mon
Robert Schutt

I'm not sure but reading the files once instead of twice should make it a lot faster. I found an earlier version and changed that to read the files line by line. Disk caching should make that fast enough versus reading the file in memory. Splitting a very large string (when reading the whole file at once) into an array may also not be very fast.

Here's my version:
Option Explicit

' note the year is a string
Const C_YEAR = "2004" ' "2012"

Const C_DATA_DIR = ".\" ' "C:\Data\"

Const C_FILELIST = "filelist.txt"

Const C_SEP = "|"


Dim objFSO, objFilelist, arrFields, strFilename, intYearColumnNumber, objFile, blnFirst, strLine, objFileOut1, objFileOut2

Set objFSO = CreateObject("Scripting.FileSystemObject")

' get file name and year column number for each file to be processed
Set objFilelist = objFSO.OpenTextFile(C_FILELIST, 1, False)
While Not objFilelist.AtEndOfStream
	arrFields = Split(objFilelist.ReadLine, C_SEP)
	strFilename = arrFields(0)
	intYearColumnNumber = CInt(arrFields(1))
	Set objFile = objFSO.OpenTextFile(C_DATA_DIR & strFilename, 1, False)
	blnFirst = True
	While Not objFile.AtEndOfStream
		strLine = objFile.ReadLine
		If blnFirst Then
			Set objFileOut1 = objFSO.CreateTextFile(C_DATA_DIR & C_YEAR & "_" & strFilename, True)
			Set objFileOut2 = objFSO.CreateTextFile(C_DATA_DIR & "Non_" & C_YEAR & "_" & strFilename, True)
			objFileOut1.WriteLine strLine
			objFileOut2.WriteLine strLine
			blnFirst = False
		Else
			arrFields = Split(strLine, C_SEP)
			If UBound(arrFields) >= intYearColumnNumber-1 Then
				If arrFields(intYearColumnNumber-1) = C_YEAR Then
					objFileOut1.WriteLine strLine
				Else
					objFileOut2.WriteLine strLine
				End If
			End If
		End If
	Wend
	If Not blnFirst Then
		objFileOut1.Close
		Set objFileOut1 = Nothing
		objFileOut2.Close
		Set objFileOut2 = Nothing
	End If
	objFile.Close
	Set objFile = Nothing
Wend

objFilelist.Close
Set objFilelist = Nothing

Set objFSO = Nothing

Open in new window

Note: I tested with files found also in an earlier question ("2012" files). Also I saw in another question that you wanted to use a separate data directory so prepared for that.
elwayisgod

ASKER
OK.  So instead of writing the rows to the 'Non_' filename how about it's just skipped saving time.  It only then just writes the lines it finds thus saving time?  If file exists already it overwrites it too.  Can the script Query the years column and automatically loop through all the years that exist in source file?  That way I don't have to create 17 of these and run them back to back to back etc....  Just trying to think of things to tighten it up...

Or how about it 'removes' the lines from original file as it goes.  Thus then next year runs, there are less and less rows as it goes through each year?

Thanks for help.... Performance is such a key with this application.
elwayisgod

ASKER
So it starts with a large file but as each year gets stripped out it gets smaller and smaller thus each subsequent year should process faster?
I started with Experts Exchange in 2004 and it's been a mainstay of my professional computing life since. It helped me launch a career as a programmer / Oracle data analyst
William Peck
elwayisgod

ASKER
Or can is just read the line and put that line in the appropriate new file.  Then every single line gets read once and is placed into new file once.  When a row finds another year it just appends to target file as it goes?   I don't know what to suggest.  Just trying to throw out ideas that might speed it up is all
elwayisgod

ASKER
Just to clarify.  The 'filelist.txt' file contains the filename and the column number of the Year.  It can be delimited anyway.  Currently it's comma delimited.  Goal is to create separate files by Year for each file in 'Filelist.txt' and include the header row from original file.  It can be appended at the very end.  Doesn't matter to me whatever is fastest.  So at end of day I would have 'x' files that all together would have same rows as original file.  Just need a faster way to process this according to client :(
Robert Schutt

Yes I think the idea to create and append files as unique years are found is the best. I'll have a look at that. Because writing back to the file all the data from other years would take very long (with big files) until one or more years with a lot of data have been taken out.
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
Robert Schutt

Here's a new version:
Option Explicit

Const C_DATA_DIR = ".\" ' "C:\Data\"

Const C_FILELIST = "filelist.txt"

Const C_SEP = "|"
Const C_SEP_FILELIST = ","


Dim objFSO, objFilelist, arrFields, strFilename, intYearColumnNumber, objFile, blnFirst, strLine, strHeader, strYear, objFileOut1, objFileOut2, colFiles

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set colFiles = CreateObject("Scripting.Dictionary")

' get file name and year column number for each file to be processed
Set objFilelist = objFSO.OpenTextFile(C_FILELIST, 1, False)
While Not objFilelist.AtEndOfStream
	arrFields = Split(objFilelist.ReadLine, C_SEP_FILELIST)
	strFilename = arrFields(0)
	intYearColumnNumber = CInt(arrFields(1))
	Set objFile = objFSO.OpenTextFile(C_DATA_DIR & strFilename, 1, False)
	blnFirst = True
	While Not objFile.AtEndOfStream
		strLine = objFile.ReadLine
		If blnFirst Then
			strHeader = strLine
			blnFirst = False
		Else
			arrFields = Split(strLine, C_SEP)
			strYear = "Non"
			If UBound(arrFields) >= intYearColumnNumber-1 Then
				If IsNumeric(arrFields(intYearColumnNumber-1)) Then
					strYear = arrFields(intYearColumnNumber-1)
				End If
			End If
			If Not colFiles.Exists(strYear) Then
				colFiles.Add strYear, objFSO.CreateTextFile(C_DATA_DIR & strYear & "_" & strFilename, True)
				colFiles(strYear).WriteLine strHeader
			End If
			colFiles(strYear).WriteLine strLine
		End If
	Wend
	objFile.Close
	Set objFile = Nothing
	For Each strYear In colFiles
		colFiles(strYear).Close
		Set colFiles(strYear) = Nothing
	Next
	colFiles.RemoveAll
Wend

objFilelist.Close
Set objFilelist = Nothing

Set objFSO = Nothing
Set colFiles = Nothing

Open in new window

elwayisgod

ASKER
Robert,

I get:

D:\AppFiles\SDS\SDS_Retail_R3\Data\Working\DataFiles\Spliced\VB\splice3.vbs(21,
2) Microsoft VBScript runtime error: Subscript out of range: '[number: 1]'
elwayisgod

ASKER
Nevermind.  My filelist is pipe delimted in this directory.
All of life is about relationships, and EE has made a viirtual community a real community. It lifts everyone's boat
William Peck
elwayisgod

ASKER
OK.  So this seems to be working.  There is a year actually named 'Other Yr'.  Which is ok as these rows end up in the 'Non' file by themselves.  

So, if I only wanted to extract  2011, 2012 and 2013 rows, can that be done?  Maybe have a line in script where I can define which years?  In 2017 they will need to add 2016, 2017 and 2018.  But not yet.
Robert Schutt

How's the speed? Any use testing with the original code, reading the file into memory at once and then splitting it? Probably mainly depends on the size of your biggest file.

Yes, you could add a check on line 34:
If CheckYear(CInt(arrFields(intYearColumnNumber-1))) Then strYear = arrFields(intYearColumnNumber-1)

Open in new window

and for example at the bottom of the script add:
Function CheckYear(intYear)
	Select Case intYear
		Case 2011, 2012, 2013:
			CheckYear = True
		Case Else
			CheckYear = False
	End Select
End Function

Open in new window

That last function is easy to extend when needed or even based on the current year (if that's the logic you're after).
elwayisgod

ASKER
I was just getting to the speed thing.  I'm running original dos code now against a set of data files.  I put a start time and end time in the log.  How do I do this for this script?  I want to know how long it takes for every year first.  If it takes 10 min vs 40 min, I'm good.  I will run against same set of data files.  I'm calling it like:

cscript splice3.vbs > splice3.log
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
Robert Schutt

If you call it like that it should be possible to just use a line like this a number of times, obviously at least at the start and at the end:
WScript.Echo Now()

Open in new window

You could test one specific year easily by adjusting the CheckYear function now. If you want to take out the 'non' year processing that would be a bit more work...
elwayisgod

ASKER
OK. I ran it but it's not returning all the years for each file.  For example on file 'file.txt' it returns 2008_file.txt.  However it's not picking up 2004 which exists in the source file.
Robert Schutt

Earlier it ran fine, so is this after adding the CheckYear function? If so, what does it look like now? The version I posted was for 2011/12/13 so maybe a typo there? If not using this new functionality, can you post the file because like I said, it was running fine before?
This is the best money I have ever spent. I cannot not tell you how many times these folks have saved my bacon. I learn so much from the contributors.
rwheeler23
elwayisgod

ASKER
The years in each column aren't in order necessarily.  Not sure if that matters or not.  The lines it did pick up for 2012 are correct though.  do you want a sample file?
Robert Schutt

No I meant the code file. Order doesn't matter. New files are created as needed and kept open until the end of the input file.
elwayisgod

ASKER
OK.  So here's the issue.  The dos batch returns just the header row even if it doesn't find a year.  Which I actually need.  So in 2004, no rows were found but it created a file with just the header row.  Reason I need this is because the script that uses these files is hardcoded and every year for each file must be there or it will fail.   Is that possible?

Speed is phenomenal.  Blows through everything in 5 minutes.

Sorry I was looking at the wrong directory earlier.
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
elwayisgod

ASKER
Non Year is ok.  I just need it to be 'OtherYr_' prefix instead.
elwayisgod

ASKER
This is really encouraging.  This smokes by the .bat file.. wow.....  just a few more tweaks and I think we are there
Robert Schutt

Ah, I see. Yeah, this behaviour changed when the split changed (at the time I thought for the better) to make files for each year as necessary. So then I guess it would be needed to define a range of years you want to have and create those files (with header) up front even when no data is found, that part of the code should be easy to adjust. We could use an array so the later addition of years is still easy. I'll be back...
Experts Exchange has (a) saved my job multiple times, (b) saved me hours, days, and even weeks of work, and often (c) makes me look like a superhero! This place is MAGIC!
Walt Forbes
elwayisgod

ASKER
OK. Sorry.  That one slipped by me.
elwayisgod

ASKER
Also,

The script will reside in  C:\Automation\Scripts

The data files will reside in C:\Automation\Data

The resulting files need to reside in C:\Automation\Data also.

The resulting files should overwrite any existing files that exist.  Very important.  Maybe delete all files in C:\Automation\Data that begin with 2004 to 2017 or OtherYr first?
elwayisgod

ASKER
I can make these separate questions so more points can be rewarded.  This is getting crazy.
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
ASKER CERTIFIED SOLUTION
Robert Schutt

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
GET A PERSONALIZED SOLUTION
Ask your own question & get feedback from real experts
Find out why thousands trust the EE community with their toughest problems.
elwayisgod

ASKER
OK. So I'm trying to run for all years in source files.  All years are:

2004 to 2017  and  "Other Year".

Array line is:  

arrYears = Array("Other Yr", 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017 )

I'm still getting 'OtherYr' files.  That shouldn't be should it?
elwayisgod

ASKER
One of my file results is empty and is named 'Other Yr_Filename.txt'.  There is another file named 'OtherYr_Filename.txt' that has all the 'Other Yr' rows in it.  Not sure why.....
elwayisgod

ASKER
I just saw the OtherYr part you put in.  I uncommentd the 5 lines and am running again.
Experts Exchange is like having an extremely knowledgeable team sitting and waiting for your call. Couldn't do my job half as well as I do without it!
James Murphy
elwayisgod

ASKER
OK. The resulting files are PERFECT!!!   So to fire off within another .vbs file wiill this work:

Print Time() & " - Splitting data files into individual years for load"
intErr = objShell.Run("\Scripts\splice_years.vbs", 0, False)
elwayisgod

ASKER
This won't run either:

      Print Time() & " - Splitting data files into individual years for load"
         objShell.Run "D:\Scripts\splice_years.vbs"
Robert Schutt

I was out for a bit, reading your comments just now.

I see you found the commented out bit. And assuming you took out that first entry from the Array you posted earlier, as those should all be numbers.

I'm not sure what that last post means. You don't need another .vbs file, or are you running all scripts from one other script for testing purposes? In that case post the whole code because I don't know what objShell is (I can guess but what if that is actually the problem itself)...
⚡ FREE TRIAL OFFER
Try out a week of full access for free.
Find out why thousands trust the EE community with their toughest problems.
elwayisgod

ASKER
line 254, it won't execute.  Also can the script 'splice_years.vbs' reside in a different directory than the data files it needs to splice up?  I need that to work too.  Data directory is:

D:\AppFiles\SDS\SDS_Retail_R3\Data
update.vbs
elwayisgod

ASKER
Script directory is:

D:\AppFiles\SDS\SDS_Retail_R3\Scripts
Robert Schutt

Oh wow, that was a bit more than I expected... ;-)

Well at least it's clear now that the Print function is defined, because when you posted the bit of code earlier I couldn't tell what that did.

It's a common problem that when you call a Sub (or a function without using the return value) you have to leave out the parentheses. So either
objShell.Run """C:\Windows\System32\cscript.exe"" ""D:\AppFiles\SDS\SDS_Retail_R3\Scripts\splice_years.vbs"""

Open in new window

or
intErr = objShell.Run("""C:\Windows\System32\cscript.exe"" ""D:\AppFiles\SDS\SDS_Retail_R3\Scripts\splice_years.vbs""")

Open in new window

It will usually work when there is only 1 parameter because then it's just as if you're surrounding that value with parentheses instead of using them to 'bind' the parameter(s) to the function call.

The reason I go into this is that you probably need to add the parameters
, 0, True

Open in new window

back in unless the process can run in the background (same applies to line 259). The 'True' means the calling process waits for the called process to end. This is obviously necessary if another process (or the calling process) needs to work with the output generated by the previous call. Have a look here for more info: http://technet.microsoft.com/en-us/library/ee156605.aspx 

The data directory can be configured in "splice_years.vbs":
Const C_DATA_DIR = "D:\AppFiles\SDS\SDS_Retail_R3\Data\"

Open in new window

Your help has saved me hundreds of hours of internet surfing.
fblack61
elwayisgod

ASKER
I think it can be false as this takes around 5 minutes.  The other process that runs takes 20 min or so to complete before it uses this output.  That is why i needed it faster so it wouldn't slow down the process as a whole.

So this maybe:

         objShell.Run("""C:\Windows\System32\cscript.exe"" ""D:\AppFiles\SDS\SDS_Retail_R3\Scripts\splice_years.vbs""", 0, False)
Robert Schutt

In that case don't use parentheses.