Solved

Windows batch file split text file based on record count

Posted on 2014-01-30
32
8,110 Views
Last Modified: 2014-01-31
I have a very large text file in with over a million records.  
I want to be able to split the text file up into multiple smaller files based on a set number of records using a Windows batch file.   For instance if the text file has 2.5 million records, I would like to split this up into 3 different output text files, with each file having no more than 1 million records.   Any ideas on how to do this with a batch file?  Thanks.
0
Comment
Question by:fjkaykr11
  • 14
  • 13
  • 4
32 Comments
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39822187
This is going to be VERY slow in a pure DOS batch script, and will be sensitive to special characters in the file (and may fail).  I would strongly recommend picking up a freeware utility that can do this, there are a lot of them, and using that approach.  It will work better and be a lot faster.

~bp
0
 
LVL 3

Author Comment

by:fjkaykr11
ID: 39822217
thanks for the info, i would prefer to see if I can do this with VBScript or bath rather than a utility.  Time is not an issue and there are no special characters.
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39822261
A freeware utility that could likely get the job done would be a Windows port of the unix "split" util, available here:

http://gnuwin32.sourceforge.net/packages/coreutils.htm

~bp
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39822268
So how do you envision this working? Would you split at say every 1,000,000 lines?  And you would specify an output folder?  How would you want the new files named?  Etc...

Could there ever be blank lines in the file?

~bp
0
 
LVL 51

Assisted Solution

by:Bill Prew
Bill Prew earned 250 total points
ID: 39822358
Here's the basic idea in a BAT file, assuming there are no blank lines in the file.  Adjust the SET lines near the top as needed.

@echo off
setlocal EnableDelayedExpansion

set InFile=c:\ee\EE28352646\in.txt
set OutDir=c:\ee\EE28352646
REM Can not be larger than 2147483648 !!!
set MaxLines=1000000

if not exist "%InFile%" (
  echo *ERROR* Input file does not exist!
  exit /b
)

if not exist "%OutDir%\" (
  echo *ERROR* Output folder does not exist!
  exit /b
)

for %%A in ("%InFile%") do (
  set Name=%%~nA
  set Ext=%%~xA
)

set /a Line=MaxLines+1
set File=0
for /f "usebackq tokens=*" %%A in ("%InFile%") do (
  set /a Line+=1
  if !Line! GTR %MaxLines% (
    set /a File+=1
    set OutFile=%OutDir%\%Name%_!File!%Ext%
    if exist "!OutFile!" del "!OutFile!"
    set Line=1
  )
  echo.%%A>>"!OutFile!"
)

Open in new window

~bp
0
 
LVL 3

Author Comment

by:fjkaykr11
ID: 39822363
It looks like you added in some code before I got a chance to answer your questions.
Yes to split at every 1,000,000 lines.  The new file names doesn't matter as long as the output gets redirected into files with different names.   There are no blank lines in the file.
Will my answers impact the code you already posted?   Thanks.
0
 
LVL 3

Author Comment

by:fjkaykr11
ID: 39822400
I ran the batch file but no output files get created.  Below is the message that is displyed when I run the batch file:


C:\ee\EE28352646>setlocal EnableDelayedExpansion


C:\ee\EE28352646>set InFile=c:\ee\EE28352646\in.txt

C:\ee\EE28352646>set OutDir=c:\ee\EE28352646

C:\ee\EE28352646>REM Can not be larger than 2147483648 !!!

C:\ee\EE28352646>set MaxLines=1000000

C:\ee\EE28352646>if not exist "c:\ee\EE28352646\in.txt" (
echo *ERROR* Input file does not exist!
 exit /b
)

C:\ee\EE28352646>if not exist "c:\ee\EE28352646\" (
echo *ERROR* Output folder does not exist!
 exit /b
)

C:\ee\EE28352646>for %A in ("c:\ee\EE28352646\in.txt") do (
set Name=%~nA
 set Ext=%~xA
)

C:\ee\EE28352646>(
set Name=in
 set Ext=.txt
)
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39822406
Those answers match the assumptions I made in that code.  The output files will have the same name and extension as the input file, with a number, like this:

Input file:
  myfile.txt

Output files:
  myfile_1.txt
  myfile_2.txt
  myfile_3.txt
   . . .

You can specify what folder to place the output files into via the SET lines.  It can be the same one as the input file, since the names are different, but if there were any other files in that same folder that already existed with the name of the new output file they will be deleted.  For that reason it's probably safest to send it to a different folder just for the split output.

~bp
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39822408
How did you run it, did you remove the first line?  If you did not, then you shouldn't have seen any of that output.

~bp
0
 
LVL 3

Author Comment

by:fjkaykr11
ID: 39822429
I ran the batch file the first time with the first line and didn't see any output files.  So I removed the first line and ran the batch file again, so I would be able to see the batch file running and have the info to post here.
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39822457
Ran it here, works fine.

Did you create the folder c:\ee\EE28352646"

And the file c:\ee\EE28352646\in.txt" with records in it?

~bp
0
 
LVL 3

Author Comment

by:fjkaykr11
ID: 39822503
yes
0
 
LVL 43

Accepted Solution

by:
Steve Knight earned 250 total points
ID: 39822716
I haven't tried this on more than a couple of hundred lines but this VBScript should immune to any special characters etc.... though I don't see anything wrong with Bills script off hand.

REM Script to split file up every x lines
REM Usage: cscript //nologo split.vbs "filename.txt" x
REM Stephen Knight. http://scripts.dragon-it.co.uk/

dim objFSO, objFile, objFileOut
dim args,strFileName,intLines,LineNum, ShowNext

set args = wscript.Arguments
if args.count <1 then
wscript.echo "There were not enough command line entries given. Usage: cscript //nologo split.vbs filename.txt x"
wscript.quit
end if

strFileName=args(0)
if args.count=2 then
  intLines=args(1)+0
else
  intLines=1000000
end if


LineNum=0
FileNum=1

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile(strFileName)
Set objFileOut = objFSO.CreateTextFile(FileNum & "_" & strFileName,true)

Do Until objFile.AtEndOfStream
  objFileOut.WriteLine objFile.ReadLine
  LineNum=LineNum+1
  if LineNum >=IntLines then
    objFileOut.Close
    LineNum=0
    FileNum=FileNum+1
    Set objFileOut = objFSO.CreateTextFile(FileNum & "_" & strFileName,true)
  end if
loop

wscript.echo "There are " & FileNum & " file(s) with " & LineNum & " entries in last file."

objFile.Close
objFileOut.Close

Open in new window


cscript //nologo "in.txt" 500

would give you files of 500 lines each called 1_in.txt etc.

Steve

[Edit - line 9 was incorrectly "if args.count <2 then"- fixed it]
0
 
LVL 3

Author Comment

by:fjkaykr11
ID: 39822852
@dragon-it, thanks for posting this.  The Vbscript runs fine with no errors.  However, I am getting a new file with all the records from the original file. I specified 1 Million records when I ran this hoping it would create 3 output files from the original file that contains 2.5 Million records.  Any ideas?  Thanks again.
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39822939
I tried it with less, suggest try smaller file first and split at 1000 :

Latenow, will look in morning.

Steve
0
Highfive + Dolby Voice = No More Audio Complaints!

Poor audio quality is one of the top reasons people don’t use video conferencing. Get the crispest, clearest audio powered by Dolby Voice in every meeting. Highfive and Dolby Voice deliver the best video conferencing and audio experience for every meeting and every room.

 
LVL 3

Author Comment

by:fjkaykr11
ID: 39822966
That's not what I am looking to do.    I appreciate the effort.
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39823062
Where does this file come from?  I imagine it's too large to ZIP and post here? But I'm wondering if there is something odd about the file that is causing my script to fail on it when it worked on a local test file here.

~bp
0
 
LVL 3

Author Comment

by:fjkaykr11
ID: 39823067
it is too large to post. Its not the end of the world if I don't have this, but it would be helpful.  I will leave my post up for a while, might be other solutions to test.  I checked out the link for the utility. I saw different files that look like it might need to be compiled (not sure). I have limited access to download and test things in my environment. I guess at this point I have to be open to using a utility. I often prefer batch files because I find its easier to add other functions to perform in the future.    If you can find the one for Windows that doesn't need to be compiled, I would appreciate you sending the file info to download.  Thanks again.
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39823076
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39823115
I am running two tests here based on a large log file I found that has just under 2.4M lines in it.

Steve's VBS solution is understandably much faster than the BAT script I fashioned, and has finished.  It worked as desired, creating two files with 1,000,000 lines, and a third with the remaining 400,000 or so lines.  It worked properly.

The BAT is still chugging away, but is only 50% through the writing of the first file, so will take a while.

I also ran the SPLIT.EXE util and it was quite fast, the best performance.  It produced the same results as the VBS.  TO get it to work you will also need the two DLL files from the following link:

http://sourceforge.net/projects/gnuwin32/files/coreutils/5.3.0/coreutils-5.3.0-dep.zip/download?use_mirror=softlayer-dal&download=

I used the following command, and one slight weakness is that split.exe doesn't add the extension back on the output files.  But it would pretty easy to add that back witj a RENAME command afterwards.

split --lines=1000000 --numeric-suffixes in.txt in_

~bp
0
 
LVL 3

Author Comment

by:fjkaykr11
ID: 39823121
all 3 options worked with log file?  I tried the batch and vbs and had issues.
i will try the utility when i get a chance and let you know how it goes. I have to assume its an issue with my text file if the vbs isn't working.
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39823135
Yes, the VBS worked, and the utility worked.  The BAT is still running and honestly due to it performance I would say its not viable.  But it's still chugging away.

Have you ever used powershell, that's another option that would likely perform close to the utility.

~bp
0
 
LVL 3

Author Comment

by:fjkaykr11
ID: 39823188
i don't think I can run that on my desktop, I have very limited access to anything.
0
 
LVL 3

Author Comment

by:fjkaykr11
ID: 39823191
Thanks for all the help.  I will have to try to download the utility and test, if I able to access from network.
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39823275
it may well be your log file isnt a proper text file then... sorry I couldnt respond, was 1!am and grabbing few hours kip between boys waking me up.

glad to hear the vbs worked for you too Bill.

wonder if is a unix end of lne file or something.

Perhaps you could post part of the file, thoughh you have already closed the q.
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39823751
Just to follow up, the BAT script approach did ultimately finish, and did the job, although it took well over an hour.

Keep us posted on how you proceed and make out, and if you need additional help.

~bp
0
 
LVL 3

Author Comment

by:fjkaykr11
ID: 39824787
@dragon-it, it looks like I owe you some points and apology to both of you.   Thanks for sticking this out, I realized when testing the VBScript again this morning that I had accidently added an extra zero (specifying 10 million instead of 1 Million) when first executing.  Dragon-IT how can I request you get additional points?
0
 
LVL 3

Author Closing Comment

by:fjkaykr11
ID: 39824875
Thanks to both of you for the help.
0
 
LVL 51

Expert Comment

by:Bill Prew
ID: 39824899
Welcome.

~bp
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39825285
No problem, there was no need to change the points as long as it worked...  there is a default of 1m lines if you don't specify a number except line #9 was wrong in the script, have changed that now.

Steve
0
 
LVL 3

Author Comment

by:fjkaykr11
ID: 39825869
@dragon-it, thanks again and thanks for updating the script.
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

The following is a collection of cases for strange behaviour when using advanced techniques in DOS batch files. You should have some basic experience in batch "programming", as I'm assuming some knowledge and not further explain the basics. For some…
How to remove superseded packages in windows w60 or w61 installation media (.wim) or online system to prevent unnecessary space. w60 means Windows Vista or Windows Server 2008. w61 means Windows 7 or Windows Server 2008 R2. There are various …
This video gives you a great overview about bandwidth monitoring with SNMP and WMI with our network monitoring solution PRTG Network Monitor (https://www.paessler.com/prtg). If you're looking for how to monitor bandwidth using netflow or packet s…
When you create an app prototype with Adobe XD, you can insert system screens -- sharing or Control Center, for example -- with just a few clicks. This video shows you how. You can take the full course on Experts Exchange at http://bit.ly/XDcourse.

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now