Solved

Another splice up of a bunch of files

Posted on 2012-12-22
Last Modified: 2013-01-03
DOS runs out of memory, so I wanted to see whether there is an easy VB way to do this. The source files are 2GB each.

I have 10 source files that are !-delimited. The first row is a header row. The first column in each file is the Year column. It appears as:  "2012"!"January"!

I need a script that reads every .txt file in the directory and creates new files based on the value in column 1. Each new file should contain only the rows for one year. For example, all rows that begin with "2012" go into a new file named the same as the original but prefixed with '2012'. So if the original file name was 'file.txt', the new file name would be '2012file.txt', and so on. I need the header row in each new file too.
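
For readers who are not tied to VB or batch, the requirement above can be sketched in Python. This is only an illustration under stated assumptions, not a solution posted in the thread: the function name `split_by_year` and the separate output folder `out_dir` are hypothetical, the first field is assumed to always be a quoted year, and the file is read one line at a time so that a 2GB source never has to fit in memory.

```python
from pathlib import Path


def split_by_year(source: Path, out_dir: Path, delimiter: str = "!") -> list[Path]:
    """Split one delimited file into per-year files, streaming line by line.

    Output files are named <year><original name> (for example 2012file.txt)
    and each one starts with a copy of the source header row.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}  # year -> open output file handle
    try:
        # latin-1 maps every byte to a character, and newline="" keeps the
        # original line endings, so rows are copied through unchanged.
        with source.open("r", encoding="latin-1", newline="") as src:
            header = src.readline()
            for line in src:
                # The first field looks like "2012"; drop the surrounding quotes.
                year = line.split(delimiter, 1)[0].strip().strip('"')
                if not year:
                    continue  # skip blank lines
                out = outputs.get(year)
                if out is None:
                    target = out_dir / (year + source.name)
                    out = target.open("w", encoding="latin-1", newline="")
                    out.write(header)  # every new file gets the header row
                    outputs[year] = out
                out.write(line)
    finally:
        for out in outputs.values():
            out.close()
    return sorted(out_dir / (year + source.name) for year in outputs)
```

To cover every file in a directory, call it in a loop such as `for path in sorted(Path(".").glob("*.txt")): split_by_year(path, Path("split"))`. Writing to a separate folder is a design choice made here so that the generated files are not picked up as sources on a second run.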

Thanks
Question by:elwayisgod
19 Comments
 
LVL 8

Expert Comment

by:-Mystique-
ID: 38716611
Perhaps the information at one of the below links will help you.

vbscript to split very large text files
http://stackoverflow.com/questions/4606367/vbscript-to-split-very-large-text-files

Function Creates Multidimensional Arrays from Delimited Text Files
 This VBScript user-defined function can help streamline many text-based processes
http://www.windowsitpro.com/article/user-defined-function-udf/function-creates-multidimensional-arrays-from-delimited-text-files

Split text file searching for specific string of text and saving in multiple directories
http://stackoverflow.com/questions/12255167/split-text-fiel-searching-for-specific-string-of-text-and-saving-in-mutltiple-di

Working With Arrays in VBScript
http://www.aspfree.com/c/a/windows-scripting/working-with-arrays-in-vbscript/

Scripts to manage Text Files
http://www.activexperts.com/activmonitor/windowsmanagement/adminscripts/other/textfiles/

Topics for Writing or Appending to a File with VBScript
http://www.computerperformance.co.uk/vbscript/vbscript_file_opentextfile.htm

SciTE: a free text editor built on Scintilla, a source code editing component for Win32 and GTK+. It handles VB and VBScript, among other languages.
SciTE is a SCIntilla based Text Editor
http://www.scintilla.org/index.html
http://www.scintilla.org/ScintillaDownload.html
Text editing in SciTE works similarly to most Macintosh or Windows editors with the added feature of automatic syntax styling.
http://www.scintilla.org/SciTEDoc.html

HxD - Freeware Hex Editor and Disk Editor
http://mh-nexus.de/en/hxd/
Features
•Available as a portable and installable edition
•RAM-Editor
  ◦To edit the main memory
  ◦Memory sections are tagged with data-folds
•Disk-Editor (Hard disks, floppy disks, ZIP-disks, USB flash drives, CDs, ...)
  ◦RAW reading and writing of disks and drives
  ◦for Win9x, WinNT and higher
•Instant opening regardless of file-size
  ◦Up to 8GB, opening and editing is very fast
•Liberal but safe file sharing with other programs
•Flexible and fast searching/replacing for several data types
  ◦Data types: text (including Unicode), hex-values, integers and floats
  ◦Search direction: Forward, Backwards, All (starting from the beginning)
•File compare (simple)
•View data in Ansi, DOS, EBCDIC and Macintosh character sets
•Checksum-Generator: Checksum, CRCs, Custom CRC, SHA-1, SHA-512, MD5, ...
•Exporting of data to several formats
  ◦Source code (Pascal, C, Java, C#, VB.NET)
  ◦Formatted output (plain text, HTML, Richtext, TeX)
  ◦Hex files (Intel HEX, Motorola S-record)
•Insertion of byte patterns
•File tools
  ◦File shredder for safe file deletion
  ◦Splitting or concatenating of files
•Basic data analysis (statistics)
  ◦Graphical representation of the byte/character distribution
  ◦Helps to identify the data type of a selection
•Byte grouping
  ◦1, 2, 4, 8 or 16 bytes packed together into one column
•"Hex only" or "text only"-modes
•Progress-window for lengthy operations
  ◦Shows the remaining time
  ◦Button to cancel
•Modified data is highlighted
•Unlimited undo
•"Find updates..."-function
•Easy to use and modern interface
•Goto address
•Printing
•Overwrite or insert mode
•Cut, copy, paste insert, paste write
•Clipboard support for other hex editors
  ◦Visual Studio/Visual C++, WinHex, HexWorkshop, ...
•Bookmarks
  ◦Ctrl+Shift+Number (0-9) sets a bookmark
  ◦Ctrl+Number (0-9) goes to a bookmark
•Navigating to nibbles with Ctrl+Left or Ctrl+Right
•Flicker free display and fast drawing
 
LVL 45

Expert Comment

by:aikimark
ID: 38716975
Does year data appear in more than one of these files?  That is, might we find 2012 lines in more than one file?
 

Author Comment

by:elwayisgod
ID: 38716976
Yes. There is no telling which files the 2012 rows are in.
 
LVL 45

Expert Comment

by:aikimark
ID: 38717114
So, solutions need to check for an existing target (year) file and only append the line data, rather than including the header.

Will the target files ever exceed 2GB?
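
The "check for an existing target file and only append" rule can be sketched as follows. This is a minimal Python illustration, not code from the thread; the helper name `append_rows` is hypothetical, and it assumes one combined file per year that may receive rows from several source files.

```python
from pathlib import Path


def append_rows(target: Path, header: str, rows) -> None:
    """Append rows to target, writing the header only if the file is new or empty."""
    is_new = not target.exists() or target.stat().st_size == 0
    # Append mode creates the file when it is missing and never truncates it.
    with target.open("a", encoding="latin-1", newline="") as out:
        if is_new:
            out.write(header)
        out.writelines(rows)
```

Calling it once per source file for the same year yields a single header followed by all of that year's rows, in the order the sources were processed.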
 

Author Comment

by:elwayisgod
ID: 38720136
I doubt it but hard to say.
 
LVL 45

Expert Comment

by:aikimark
ID: 38720233
If you can work it into your process, you would simplify all this processing by eliminating the header row from the front of each data file.  You would only need to append the data to the appropriate year file.  If you needed the header at a later time, you could use a simple copy command or convert the header file into a schema.ini file.
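
As a rough sketch of that idea in Python (the function and file names are hypothetical, and this is not something posted in the thread): the header is split off into its own small file once, the data file then holds rows only, and the two can be joined again later in the same way a binary `copy` of header plus data would do it.

```python
import shutil
from pathlib import Path


def detach_header(source: Path, header_file: Path, data_file: Path) -> None:
    """Write the first line of source to header_file and the rest to data_file."""
    with source.open("rb") as src:
        header_file.write_bytes(src.readline())
        with data_file.open("wb") as out:
            shutil.copyfileobj(src, out)  # streams the remainder in chunks


def attach_header(header_file: Path, data_file: Path, combined: Path) -> None:
    """Concatenate header_file and data_file into combined."""
    with combined.open("wb") as out:
        for part in (header_file, data_file):
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
```

Both functions copy in fixed-size chunks, so the size of the data file does not affect memory use.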
 
LVL 68

Expert Comment

by:Qlemo
ID: 38720439
Is PowerShell an option here? The average speed I got with my test data is about 1MB per second. That is 35 minutes per 2GB file.
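
The arithmetic behind that estimate, as a small Python sketch. It assumes 1GB = 1024MB and uses the 10 source files mentioned in the question; the function name is made up for the example.

```python
def minutes_to_process(size_gb: float, mb_per_second: float) -> float:
    """Estimated minutes to stream a file of size_gb at mb_per_second."""
    return size_gb * 1024 / mb_per_second / 60


per_file = minutes_to_process(2, 1.0)  # roughly 34 minutes for one 2GB file
all_files_hours = per_file * 10 / 60   # all 10 source files, run one after another
```

At that rate the full set would take a little under six hours, which is why throughput matters more here than the choice of scripting language.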
 

Author Comment

by:elwayisgod
ID: 38721097
It's Windows Server 2003. Is PowerShell built into the OS?
 
LVL 68

Expert Comment

by:Qlemo
ID: 38721209
No, you will have to download it. All necessary info can be found at http://support.microsoft.com/kb/968929 .
 
LVL 45

Expert Comment

by:aikimark
ID: 38722657
What extra software do you have on your server?  For instance, if you have Syncsort, you should get much better performance due to its I/O optimization.
 

Author Comment

by:elwayisgod
ID: 38722666
Nothing that I know of. We are possibly going to try Cygwin, but I'm not sure. I'm thinking Linux tools could do this file manipulation faster than a DOS batch file?
 
LVL 45

Accepted Solution

by:
aikimark earned 500 total points
ID: 38722808
What DOS command are you using?

If you run the following command, substituting the name of one of your big source files for "Q_27976301_Data.txt", how long does it take?
findstr /b   """2012""" Q_27976301_Data.txt >> 2012_Data.txt


I/O is almost always going to be the biggest performance obstacle.  However, we might take advantage of device buffering to provide a sequential read of the file for several extract tasks.

In theory, you can use the START command to launch several of these FindStr commands with separate CPU affinity.  Assuming you have a multi-core processor, you might be able to extract four different years with a single pass.  I would also try this on a single CPU (do not use the affinity switch on the START commands), since you are actually I/O bound and not CPU bound.
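
The single-pass idea can also be sketched without START or FindStr. The following Python illustration is a swapped-in technique, not the command above: it reads the source once and, for each requested year, appends the rows that begin with the quoted year to `<year>_Data.txt`. The function name `extract_years` and the `out_dir` parameter are hypothetical. Like the `findstr /b` redirect, it copies no header and appends to any existing output.

```python
from pathlib import Path


def extract_years(source: Path, years, out_dir: Path) -> dict:
    """Read source once; append rows starting with "<year>" to <year>_Data.txt.

    Returns the number of rows written per year.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # The match includes the double quotes, so a line starting with 2012
    # (unquoted) or containing "2012" further along is not selected.
    prefixes = {f'"{year}"'.encode("ascii"): str(year) for year in years}
    counts = {year: 0 for year in prefixes.values()}
    handles = {year: (out_dir / f"{year}_Data.txt").open("ab") for year in counts}
    try:
        with source.open("rb") as src:
            for line in src:  # binary mode: no decoding cost, lines copied as-is
                for prefix, year in prefixes.items():
                    if line.startswith(prefix):
                        handles[year].write(line)
                        counts[year] += 1
                        break
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

For example, `extract_years(Path("Q_27976301_Data.txt"), [2009, 2010, 2011, 2012], Path("."))` would pull four years out of one read of the file. Whether this beats several parallel FindStr runs would have to be measured, since the job is I/O bound either way.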
 

Author Comment

by:elwayisgod
ID: 38723841
Aikimark,

It runs in two seconds but produces a blank file.... No idea what's happening.
 
LVL 45

Expert Comment

by:aikimark
ID: 38724115
Did you run this in the same directory as the big data file?
Did you substitute your big file name in place of Q_27976301_Data.txt ?

Note: if the big file name contains spaces, you need to put the file name in quotes.
 
LVL 45

Expert Comment

by:aikimark
ID: 38726872
It would be helpful if you posted the first three lines in the file.
 

Author Comment

by:elwayisgod
ID: 38730887
Back tomorrow to try and resolve.
 
LVL 68

Expert Comment

by:Qlemo
ID: 38741241
Did that (comment ID: 38722808) really work for you? It did not work for me.
 
LVL 45

Expert Comment

by:aikimark
ID: 38741591
@elwayisgod

Did that command extract the 2012 rows?
 

Author Comment

by:elwayisgod
ID: 38742062
It did, but I got more than I bargained for. My question wasn't fully qualified with all the requirements, so I just want to move on.