Another splice up of a bunch of files

DOS runs out of memory, so I was going to see if there is an easy VB way.  The source files are 2GB each.

I have 10 source files that are ! delimited.  The first row is a header row.  The first column in each file is the Year column.  It appears as:  "2012"!"January"!

I need a script that will read every .txt file in the directory and create new files based on the column 1 value.  Each new file will contain only the rows for one year.  For example, all rows that begin with "2012" will go into a new file named the same as the original but prefixed with '2012'.  So if the original file name was 'file.txt', the new file name would be '2012file.txt', and so on.  I need the header row in each new file too.
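The split described above can be sketched as follows. This is only a sketch in Python, which nobody in the thread proposed and which is not installed on Windows Server 2003 by default; the function names `year_of` and `split_by_year` are made up for illustration, and the `latin-1` encoding is an assumption chosen because it passes every byte through unchanged. It reads one line at a time, so a 2GB source never has to fit in memory.

```python
import os

def year_of(line):
    """Return the first '!'-delimited field with its surrounding quotes removed."""
    return line.split("!", 1)[0].strip().strip('"')

def split_by_year(directory):
    """For each .txt file in directory, write <year><name> holding the header
    plus only the rows whose first column equals that year."""
    # os.listdir returns a snapshot, so files created below are not re-read.
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".txt"):
            continue
        outputs = {}  # year -> open output file for the current source file
        try:
            with open(os.path.join(directory, name), "r",
                      encoding="latin-1", newline="") as src:
                header = src.readline()
                for line in src:
                    year = year_of(line)
                    if not year:
                        continue  # skip blank lines
                    if year not in outputs:
                        out_path = os.path.join(directory, year + name)
                        outputs[year] = open(out_path, "w",
                                             encoding="latin-1", newline="")
                        outputs[year].write(header)
                    outputs[year].write(line)
        finally:
            for out in outputs.values():
                out.close()
```

A call such as `split_by_year(r"C:\data")` (hypothetical path) would process every .txt file there. Running it a second time in the same directory would also split the year files it created the first time, so the outputs should be moved away first.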

aikimark Commented:
What DOS command are you using?

If you run the following command, substituting the name of one of the big source files for "Q_27976301_Data.txt", how long does it take?
findstr /b   """2012""" Q_27976301_Data.txt >> 2012_Data.txt


I/O is almost always going to be the biggest performance obstacle.  However, we might take advantage of device buffering to provide a sequential read of the file for several extract tasks.

In theory, you can use the START command to launch several of these FindStr commands with separate CPU affinity.  Assuming you have a multi-core processor, you might be able to extract four different years with a single pass.  I would also try this on a single CPU (do not use the affinity switch on the START commands), since you are actually I/O bound and not CPU bound.
Perhaps the information at one of the below links will help you.

vbscript to split very large text files

Function Creates Multidimensional Arrays from Delimited Text Files
 This VBScript user-defined function can help streamline many text-based processes

Split text fiel searching for specific string of text and saving in mutltiple directories

Working With Arrays in VBScript

Scripts to manage Text Files

Topics for Writing or Appending to a File with VBScript

SciTE: A free source code editing component for Win32 and GTK+, including VB & VBScript.
SciTE is a SCIntilla based Text Editor
Text editing in SciTE works similarly to most Macintosh or Windows editors with the added feature of automatic syntax styling.

HxD - Freeware Hex Editor and Disk Editor
•Available as a portable and installable edition
•RAM-Editor
   - To edit the main memory
   - Memory sections are tagged with data-folds
•Disk-Editor (Hard disks, floppy disks, ZIP-disks, USB flash drives, CDs, ...)
   - RAW reading and writing of disks and drives
   - for Win9x, WinNT and higher
•Instant opening regardless of file-size
   - Up to 8GB, opening and editing is very fast
•Liberal but safe file sharing with other programs
•Flexible and fast searching/replacing for several data types
   - Data types: text (including Unicode), hex-values, integers and floats
   - Search direction: Forward, Backwards, All (starting from the beginning)
•File compare (simple)
•View data in Ansi, DOS, EBCDIC and Macintosh character sets
•Checksum-Generator: Checksum, CRCs, Custom CRC, SHA-1, SHA-512, MD5, ...
•Exporting of data to several formats
   - Source code (Pascal, C, Java, C#, VB.NET)
   - Formatted output (plain text, HTML, Richtext, TeX)
   - Hex files (Intel HEX, Motorola S-record)
•Insertion of byte patterns
•File tools
   - File shredder for safe file deletion
   - Splitting or concatenating of files
•Basic data analysis (statistics)
   - Graphical representation of the byte/character distribution
   - Helps to identify the data type of a selection
•Byte grouping
   - 1, 2, 4, 8 or 16 bytes packed together into one column
•"Hex only" or "text only"-modes
•Progress-window for lengthy operations
   - Shows the remaining time
   - Button to cancel
•Modified data is highlighted
•Unlimited undo
•"Find updates..."-function
•Easy to use and modern interface
•Goto address
•Overwrite or insert mode
•Cut, copy, paste insert, paste write
•Clipboard support for other hex editors
   - Visual Studio/Visual C++, WinHex, HexWorkshop, ...
•Bookmarks
   - Ctrl+Shift+Number (0-9) sets a bookmark
   - Ctrl+Number (0-9) goes to a bookmark
•Navigating to nibbles with Ctrl+Left or Ctrl+Right
•Flicker free display and fast drawing

Does year data appear in more than one of these files?  That is, might we find 2012 lines in more than one file?

elwayisgod (Author) Commented:
Yes.  There is no telling where the 2012 rows are.
So, solutions need to check for an existing target (year) file and only append the line data, rather than including the header.
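A minimal sketch of that append-only rule, written in Python as an assumption (the thread itself only discusses DOS, VBScript and PowerShell). The name `append_rows`, the `_Data.txt` suffix (borrowed from the findstr example earlier in the thread) and the `latin-1` encoding are illustrative choices, not anything the participants specified. The header is written only when the year file does not exist yet or is empty; otherwise only data rows are appended.

```python
import os

def append_rows(source_path, target_dir, suffix="_Data.txt"):
    """Append each data row of source_path to <year><suffix> in target_dir.
    The header row is written only when a year file is first created."""
    outputs = {}  # year -> open target file
    try:
        with open(source_path, "r", encoding="latin-1", newline="") as src:
            header = src.readline()
            for line in src:
                year = line.split("!", 1)[0].strip().strip('"')
                if not year:
                    continue  # skip blank lines
                if not line.endswith("\n"):
                    line += "\n"  # keep a final unterminated row from merging with later appends
                if year not in outputs:
                    path = os.path.join(target_dir, year + suffix)
                    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
                    outputs[year] = open(path, "a", encoding="latin-1", newline="")
                    if is_new:
                        outputs[year].write(header)
                outputs[year].write(line)
    finally:
        for out in outputs.values():
            out.close()
```

Calling `append_rows` once per source file builds up one file per year across all sources, with a single header at the top of each. The target directory should be different from the source directory if the sources are picked up by a wildcard.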

Will the target files ever exceed 2GB?
elwayisgod (Author) Commented:
I doubt it but hard to say.
If you can work it into your process, you would simplify all this processing by eliminating the header row from the front of each data file.  You would only need to append the data to the appropriate year file.  If you needed the header at a later time, you could use a simple copy command or convert the header file into a schema.ini file.
Qlemo (Batchelor, Developer and EE Topic Advisor) Commented:
Is PowerShell an option here? The average speed I got with my test data is about 1MB per second. That is 35 minutes per 2GB file.
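As a rough check of that estimate, here is the arithmetic as a small Python sketch (Python is used only for illustration; `minutes_to_process` is a made-up helper). It assumes 1 GB = 1024 MB and a steady 1 MB per second, and uses the ten 2GB source files mentioned in the question.

```python
def minutes_to_process(size_gb, mb_per_second):
    """Minutes needed to stream size_gb gigabytes at mb_per_second."""
    return size_gb * 1024 / mb_per_second / 60

per_file = minutes_to_process(2, 1.0)
print(round(per_file))               # 34 minutes for one 2GB file
print(round(per_file * 10 / 60, 1))  # 5.7 hours for ten such files
```

So at that rate the full set would take most of a working day if the files are processed one after another.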
elwayisgod (Author) Commented:
It's Windows Server 2003.  Is PowerShell built into the OS?
Qlemo (Batchelor, Developer and EE Topic Advisor) Commented:
No, you will have to download it. All necessary info can be found at .
What extra software do you have on your server?  For instance, if you have Syncsort, you should get much better performance due to its I/O optimization.
elwayisgod (Author) Commented:
Nothing that I know of.  We may try Cygwin, but we're not sure.  I'm thinking Linux could do this file manipulation faster than a DOS batch file?
elwayisgod (Author) Commented:
It runs in two seconds but produces a blank file.  No idea what's happening.
Did you run this in the same directory as the big data file?
Did you substitute your big file name in place of Q_27976301_Data.txt ?

Note: if the big file name contains spaces, you need to put the file name in quotes.
It would be helpful if you posted the first three lines in the file.
elwayisgod (Author) Commented:
Back tomorrow to try and resolve.
Qlemo (Batchelor, Developer and EE Topic Advisor) Commented:
Did that (http:#a38722808) really work for you? It did not for me.

Did that command extract the 2012 rows?
elwayisgod (Author) Commented:
It did, but I got more than I bargained for.  My question wasn't fully qualified with all the requirements, so I just want to move on.
Question has a verified solution.
