Solved

# Windows batch file split text file based on record count

Posted on 2014-01-30
8,322 Views
I have a very large text file in with over a million records.
I want to be able to split the text file up into multiple smaller files based on a set number of records using a Windows batch file.   For instance if the text file has 2.5 million records, I would like to split this up into 3 different output text files, with each file having no more than 1 million records.   Any ideas on how to do this with a batch file?  Thanks.
0
Question by:fjkaykr11
• 14
• 13
• 4

LVL 53

Expert Comment

ID: 39822187
This is going to be VERY slow in a pure DOS batch script, and will be sensitive to special characters in the file (and may fail).  I would strongly recommend picking up a freeware utility that can do this, there are a lot of them, and using that approach.  It will work better and be a lot faster.

~bp
0

LVL 3

Author Comment

ID: 39822217
thanks for the info, i would prefer to see if I can do this with VBScript or bath rather than a utility.  Time is not an issue and there are no special characters.
0

LVL 53

Expert Comment

ID: 39822261
A freeware utility that could likely get the job done would be a Windows port of the unix "split" util, available here:

http://gnuwin32.sourceforge.net/packages/coreutils.htm

~bp
0

LVL 53

Expert Comment

ID: 39822268
So how do you envision this working? Would you split at say every 1,000,000 lines?  And you would specify an output folder?  How would you want the new files named?  Etc...

Could there ever be blank lines in the file?

~bp
0

LVL 53

Assisted Solution

Bill Prew earned 250 total points
ID: 39822358
Here's the basic idea in a BAT file, assuming there are no blank lines in the file.  Adjust the SET lines near the top as needed.

@echo off
setlocal EnableDelayedExpansion

set InFile=c:\ee\EE28352646\in.txt
set OutDir=c:\ee\EE28352646
REM Can not be larger than 2147483648 !!!
set MaxLines=1000000

if not exist "%InFile%" (
echo *ERROR* Input file does not exist!
exit /b
)

if not exist "%OutDir%\" (
echo *ERROR* Output folder does not exist!
exit /b
)

for %%A in ("%InFile%") do (
set Name=%%~nA
set Ext=%%~xA
)

set /a Line=MaxLines+1
set File=0
for /f "usebackq tokens=*" %%A in ("%InFile%") do (
set /a Line+=1
if !Line! GTR %MaxLines% (
set /a File+=1
set OutFile=%OutDir%\%Name%_!File!%Ext%
if exist "!OutFile!" del "!OutFile!"
set Line=1
)
echo.%%A>>"!OutFile!"
)

~bp
0

LVL 3

Author Comment

ID: 39822363
It looks like you added in some code before I got a chance to answer your questions.
Yes to split at every 1,000,000 lines.  The new file names doesn't matter as long as the output gets redirected into files with different names.   There are no blank lines in the file.
0

LVL 3

Author Comment

ID: 39822400
I ran the batch file but no output files get created.  Below is the message that is displyed when I run the batch file:

C:\ee\EE28352646>setlocal EnableDelayedExpansion

C:\ee\EE28352646>set InFile=c:\ee\EE28352646\in.txt

C:\ee\EE28352646>set OutDir=c:\ee\EE28352646

C:\ee\EE28352646>REM Can not be larger than 2147483648 !!!

C:\ee\EE28352646>set MaxLines=1000000

C:\ee\EE28352646>if not exist "c:\ee\EE28352646\in.txt" (
echo *ERROR* Input file does not exist!
exit /b
)

C:\ee\EE28352646>if not exist "c:\ee\EE28352646\" (
echo *ERROR* Output folder does not exist!
exit /b
)

C:\ee\EE28352646>for %A in ("c:\ee\EE28352646\in.txt") do (
set Name=%~nA
set Ext=%~xA
)

C:\ee\EE28352646>(
set Name=in
set Ext=.txt
)
0

LVL 53

Expert Comment

ID: 39822406
Those answers match the assumptions I made in that code.  The output files will have the same name and extension as the input file, with a number, like this:

Input file:
myfile.txt

Output files:
myfile_1.txt
myfile_2.txt
myfile_3.txt
. . .

You can specify what folder to place the output files into via the SET lines.  It can be the same one as the input file, since the names are different, but if there were any other files in that same folder that already existed with the name of the new output file they will be deleted.  For that reason it's probably safest to send it to a different folder just for the split output.

~bp
0

LVL 53

Expert Comment

ID: 39822408
How did you run it, did you remove the first line?  If you did not, then you shouldn't have seen any of that output.

~bp
0

LVL 3

Author Comment

ID: 39822429
I ran the batch file the first time with the first line and didn't see any output files.  So I removed the first line and ran the batch file again, so I would be able to see the batch file running and have the info to post here.
0

LVL 53

Expert Comment

ID: 39822457
Ran it here, works fine.

Did you create the folder c:\ee\EE28352646"

And the file c:\ee\EE28352646\in.txt" with records in it?

~bp
0

LVL 3

Author Comment

ID: 39822503
yes
0

LVL 43

Accepted Solution

Steve Knight earned 250 total points
ID: 39822716
I haven't tried this on more than a couple of hundred lines but this VBScript should immune to any special characters etc.... though I don't see anything wrong with Bills script off hand.

REM Script to split file up every x lines
REM Usage: cscript //nologo split.vbs "filename.txt" x
REM Stephen Knight. http://scripts.dragon-it.co.uk/

dim objFSO, objFile, objFileOut
dim args,strFileName,intLines,LineNum, ShowNext

set args = wscript.Arguments
if args.count <1 then
wscript.echo "There were not enough command line entries given. Usage: cscript //nologo split.vbs filename.txt x"
wscript.quit
end if

strFileName=args(0)
if args.count=2 then
intLines=args(1)+0
else
intLines=1000000
end if

LineNum=0
FileNum=1

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile(strFileName)
Set objFileOut = objFSO.CreateTextFile(FileNum & "_" & strFileName,true)

Do Until objFile.AtEndOfStream
LineNum=LineNum+1
if LineNum >=IntLines then
objFileOut.Close
LineNum=0
FileNum=FileNum+1
Set objFileOut = objFSO.CreateTextFile(FileNum & "_" & strFileName,true)
end if
loop

wscript.echo "There are " & FileNum & " file(s) with " & LineNum & " entries in last file."

objFile.Close
objFileOut.Close


cscript //nologo "in.txt" 500

would give you files of 500 lines each called 1_in.txt etc.

Steve

[Edit - line 9 was incorrectly "if args.count <2 then"- fixed it]
0

LVL 3

Author Comment

ID: 39822852
@dragon-it, thanks for posting this.  The Vbscript runs fine with no errors.  However, I am getting a new file with all the records from the original file. I specified 1 Million records when I ran this hoping it would create 3 output files from the original file that contains 2.5 Million records.  Any ideas?  Thanks again.
0

LVL 43

Expert Comment

ID: 39822939
I tried it with less, suggest try smaller file first and split at 1000 :

Latenow, will look in morning.

Steve
0

LVL 3

Author Comment

ID: 39822966
That's not what I am looking to do.    I appreciate the effort.
0

LVL 53

Expert Comment

ID: 39823062
Where does this file come from?  I imagine it's too large to ZIP and post here? But I'm wondering if there is something odd about the file that is causing my script to fail on it when it worked on a local test file here.

~bp
0

LVL 3

Author Comment

ID: 39823067
it is too large to post. Its not the end of the world if I don't have this, but it would be helpful.  I will leave my post up for a while, might be other solutions to test.  I checked out the link for the utility. I saw different files that look like it might need to be compiled (not sure). I have limited access to download and test things in my environment. I guess at this point I have to be open to using a utility. I often prefer batch files because I find its easier to add other functions to perform in the future.    If you can find the one for Windows that doesn't need to be compiled, I would appreciate you sending the file info to download.  Thanks again.
0

LVL 53

Expert Comment

ID: 39823076
0

LVL 53

Expert Comment

ID: 39823115
I am running two tests here based on a large log file I found that has just under 2.4M lines in it.

Steve's VBS solution is understandably much faster than the BAT script I fashioned, and has finished.  It worked as desired, creating two files with 1,000,000 lines, and a third with the remaining 400,000 or so lines.  It worked properly.

The BAT is still chugging away, but is only 50% through the writing of the first file, so will take a while.

I also ran the SPLIT.EXE util and it was quite fast, the best performance.  It produced the same results as the VBS.  TO get it to work you will also need the two DLL files from the following link:

I used the following command, and one slight weakness is that split.exe doesn't add the extension back on the output files.  But it would pretty easy to add that back witj a RENAME command afterwards.

split --lines=1000000 --numeric-suffixes in.txt in_

~bp
0

LVL 3

Author Comment

ID: 39823121
all 3 options worked with log file?  I tried the batch and vbs and had issues.
i will try the utility when i get a chance and let you know how it goes. I have to assume its an issue with my text file if the vbs isn't working.
0

LVL 53

Expert Comment

ID: 39823135
Yes, the VBS worked, and the utility worked.  The BAT is still running and honestly due to it performance I would say its not viable.  But it's still chugging away.

Have you ever used powershell, that's another option that would likely perform close to the utility.

~bp
0

LVL 3

Author Comment

ID: 39823188
i don't think I can run that on my desktop, I have very limited access to anything.
0

LVL 3

Author Comment

ID: 39823191
Thanks for all the help.  I will have to try to download the utility and test, if I able to access from network.
0

LVL 43

Expert Comment

ID: 39823275
it may well be your log file isnt a proper text file then... sorry I couldnt respond, was 1!am and grabbing few hours kip between boys waking me up.

glad to hear the vbs worked for you too Bill.

wonder if is a unix end of lne file or something.

Perhaps you could post part of the file, thoughh you have already closed the q.
0

LVL 53

Expert Comment

ID: 39823751
Just to follow up, the BAT script approach did ultimately finish, and did the job, although it took well over an hour.

Keep us posted on how you proceed and make out, and if you need additional help.

~bp
0

LVL 3

Author Comment

ID: 39824787
@dragon-it, it looks like I owe you some points and apology to both of you.   Thanks for sticking this out, I realized when testing the VBScript again this morning that I had accidently added an extra zero (specifying 10 million instead of 1 Million) when first executing.  Dragon-IT how can I request you get additional points?
0

LVL 3

Author Closing Comment

ID: 39824875
Thanks to both of you for the help.
0

LVL 53

Expert Comment

ID: 39824899
Welcome.

~bp
0

LVL 43

Expert Comment

ID: 39825285
No problem, there was no need to change the points as long as it worked...  there is a default of 1m lines if you don't specify a number except line #9 was wrong in the script, have changed that now.

Steve
0

LVL 3

Author Comment

ID: 39825869
@dragon-it, thanks again and thanks for updating the script.
0

## Featured Post

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question