Link to home
Start Free TrialLog in
Avatar of Star Gazr1
Star Gazr1Flag for United States of America

asked on

Windows batch file split text file based on record count

I have a very large text file in with over a million records.  
I want to be able to split the text file up into multiple smaller files based on a set number of records using a Windows batch file.   For instance if the text file has 2.5 million records, I would like to split this up into 3 different output text files, with each file having no more than 1 million records.   Any ideas on how to do this with a batch file?  Thanks.
Avatar of Bill Prew
Bill Prew

This is going to be VERY slow in a pure DOS batch script, and will be sensitive to special characters in the file (and may fail).  I would strongly recommend picking up a freeware utility that can do this, there are a lot of them, and using that approach.  It will work better and be a lot faster.

~bp
Avatar of Star Gazr1

ASKER

thanks for the info, i would prefer to see if I can do this with VBScript or bath rather than a utility.  Time is not an issue and there are no special characters.
A freeware utility that could likely get the job done would be a Windows port of the unix "split" util, available here:

http://gnuwin32.sourceforge.net/packages/coreutils.htm

~bp
So how do you envision this working? Would you split at say every 1,000,000 lines?  And you would specify an output folder?  How would you want the new files named?  Etc...

Could there ever be blank lines in the file?

~bp
SOLUTION
Avatar of Bill Prew
Bill Prew

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
It looks like you added in some code before I got a chance to answer your questions.
Yes to split at every 1,000,000 lines.  The new file names doesn't matter as long as the output gets redirected into files with different names.   There are no blank lines in the file.
Will my answers impact the code you already posted?   Thanks.
I ran the batch file but no output files get created.  Below is the message that is displyed when I run the batch file:


C:\ee\EE28352646>setlocal EnableDelayedExpansion


C:\ee\EE28352646>set InFile=c:\ee\EE28352646\in.txt

C:\ee\EE28352646>set OutDir=c:\ee\EE28352646

C:\ee\EE28352646>REM Can not be larger than 2147483648 !!!

C:\ee\EE28352646>set MaxLines=1000000

C:\ee\EE28352646>if not exist "c:\ee\EE28352646\in.txt" (
echo *ERROR* Input file does not exist!
 exit /b
)

C:\ee\EE28352646>if not exist "c:\ee\EE28352646\" (
echo *ERROR* Output folder does not exist!
 exit /b
)

C:\ee\EE28352646>for %A in ("c:\ee\EE28352646\in.txt") do (
set Name=%~nA
 set Ext=%~xA
)

C:\ee\EE28352646>(
set Name=in
 set Ext=.txt
)
Those answers match the assumptions I made in that code.  The output files will have the same name and extension as the input file, with a number, like this:

Input file:
  myfile.txt

Output files:
  myfile_1.txt
  myfile_2.txt
  myfile_3.txt
   . . .

You can specify what folder to place the output files into via the SET lines.  It can be the same one as the input file, since the names are different, but if there were any other files in that same folder that already existed with the name of the new output file they will be deleted.  For that reason it's probably safest to send it to a different folder just for the split output.

~bp
How did you run it, did you remove the first line?  If you did not, then you shouldn't have seen any of that output.

~bp
I ran the batch file the first time with the first line and didn't see any output files.  So I removed the first line and ran the batch file again, so I would be able to see the batch file running and have the info to post here.
Ran it here, works fine.

Did you create the folder c:\ee\EE28352646"

And the file c:\ee\EE28352646\in.txt" with records in it?

~bp
yes
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
@dragon-it, thanks for posting this.  The Vbscript runs fine with no errors.  However, I am getting a new file with all the records from the original file. I specified 1 Million records when I ran this hoping it would create 3 output files from the original file that contains 2.5 Million records.  Any ideas?  Thanks again.
I tried it with less, suggest try smaller file first and split at 1000 :

Latenow, will look in morning.

Steve
That's not what I am looking to do.    I appreciate the effort.
Where does this file come from?  I imagine it's too large to ZIP and post here? But I'm wondering if there is something odd about the file that is causing my script to fail on it when it worked on a local test file here.

~bp
it is too large to post. Its not the end of the world if I don't have this, but it would be helpful.  I will leave my post up for a while, might be other solutions to test.  I checked out the link for the utility. I saw different files that look like it might need to be compiled (not sure). I have limited access to download and test things in my environment. I guess at this point I have to be open to using a utility. I often prefer batch files because I find its easier to add other functions to perform in the future.    If you can find the one for Windows that doesn't need to be compiled, I would appreciate you sending the file info to download.  Thanks again.
I am running two tests here based on a large log file I found that has just under 2.4M lines in it.

Steve's VBS solution is understandably much faster than the BAT script I fashioned, and has finished.  It worked as desired, creating two files with 1,000,000 lines, and a third with the remaining 400,000 or so lines.  It worked properly.

The BAT is still chugging away, but is only 50% through the writing of the first file, so will take a while.

I also ran the SPLIT.EXE util and it was quite fast, the best performance.  It produced the same results as the VBS.  TO get it to work you will also need the two DLL files from the following link:

http://sourceforge.net/projects/gnuwin32/files/coreutils/5.3.0/coreutils-5.3.0-dep.zip/download?use_mirror=softlayer-dal&download=

I used the following command, and one slight weakness is that split.exe doesn't add the extension back on the output files.  But it would pretty easy to add that back witj a RENAME command afterwards.

split --lines=1000000 --numeric-suffixes in.txt in_

~bp
all 3 options worked with log file?  I tried the batch and vbs and had issues.
i will try the utility when i get a chance and let you know how it goes. I have to assume its an issue with my text file if the vbs isn't working.
Yes, the VBS worked, and the utility worked.  The BAT is still running and honestly due to it performance I would say its not viable.  But it's still chugging away.

Have you ever used powershell, that's another option that would likely perform close to the utility.

~bp
i don't think I can run that on my desktop, I have very limited access to anything.
Thanks for all the help.  I will have to try to download the utility and test, if I able to access from network.
it may well be your log file isnt a proper text file then... sorry I couldnt respond, was 1!am and grabbing few hours kip between boys waking me up.

glad to hear the vbs worked for you too Bill.

wonder if is a unix end of lne file or something.

Perhaps you could post part of the file, thoughh you have already closed the q.
Just to follow up, the BAT script approach did ultimately finish, and did the job, although it took well over an hour.

Keep us posted on how you proceed and make out, and if you need additional help.

~bp
@dragon-it, it looks like I owe you some points and apology to both of you.   Thanks for sticking this out, I realized when testing the VBScript again this morning that I had accidently added an extra zero (specifying 10 million instead of 1 Million) when first executing.  Dragon-IT how can I request you get additional points?
Thanks to both of you for the help.
Welcome.

~bp
No problem, there was no need to change the points as long as it worked...  there is a default of 1m lines if you don't specify a number except line #9 was wrong in the script, have changed that now.

Steve
@dragon-it, thanks again and thanks for updating the script.