Link to home
Start Free TrialLog in
Avatar of elwayisgod
elwayisgodFlag for United States of America

asked on

Script to Chomp a huge .txt file

Hi,

I have a file located in C:\Temp named 'giantfile.txt'.  It's about 2 million rows.  The very first column has about 50 different values in it.  I need to create smaller files from this file based on the value in that file.  So if there are 50 different values I would end up with 50 text files etc...

Every row that contains the same value in that first column gets added to the new file.  The new file will contain the name of the value in it.  So if 'giantfile.txt' is:

Houston, Sam, Thompson, 303456789
Dallas, May, Jones, 123456789
Denver, Felix, Martinez, 8472346521

It would create 3 new files named: Houston.txt, Dallas.txt and Denver.txt and each file would contain the corresponding row.  However, in real example each newly created file would be thousands of rows.  The total rows in the new files added up would total the number of rows in 'giantfile.txt'.  The destination of the new files would be C:\Temp\Completed

Prefer DOS or VB as I don't know Perl etc....

This is jus the first step.  I will have other steps and will provide points necessary to achieve end goal.  Hopefully we can work through this from beginning to end to ease the clients needs.  They are stumbling around itn MS Access right now and I figured all this would be simple in scripting.....
Avatar of elwayisgod
elwayisgod
Flag of United States of America image

ASKER

It's a pipe delimted file
ASKER CERTIFIED SOLUTION
Avatar of Mike Tomlinson
Mike Tomlinson
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I run it but in the 'Completed' folder there is only 1 file and it's identical to the input file... There should be 10 different files?
It's splitting the input lines using the pipe symbol "|" and using the first column value as the basis for building the output file name.  Are you sure that the values are separate by pipes?  You originally gave an example using commas...
Actually I think I know the issue.  The file has a '|' as first character.  Let me run after removing that.
If you need to leave that pipe there, for the bigger file, then change "values(0)" to "values(1)" at line #27.
OK. Not sure yet as haven't seen real file.  Another question.  The file is huge thus it sit's out on a share drive on the SAN.

\\share05\transfer\folder\giantfile.csv

When I put that in for the path and file name it says it can't find path.  I tried mapping it to Z:\ drive and it still won't work.  Trying to avoid copying this thing to my laptop to run...
Did you change lines #4 and #5?

inputFolder = "\\share05\transfer\folder\"
inputFile = "giantfile.txt"

I'm not sure how FSO deals with shares like that.  I'm a VB.Net developer and don't deal with VBScript normally.
Yes... I changed those to reflect real paths.   I can run from laptop, just takes 2 hrs to copy file from share drive.... I would think there's a way...
I think this works... So now this begins a 15 step process.  So basically on the source file I will need to do around 15 things to get it to my final format and data in order to load to my database.  Was thinking I could post these one by one on site or if your interested for a nominal fee (real $$) we could communicate on the side about it?
You can just post the requirements here, one by one, or all at once and I'll see what I can do.
OK.  They will be 500pts each.
I will award points on this one as soon as I can attempt it on real file.
Ok.  Let me know of any issues.  You can ask the rest of the questions using the "Related Question" feature so other people can look back and see what was done to get the ball rolling.
Great.  Never knew about related question thing...
So the problem is every pipe is surrounded in double quotes.... like   "|".    If I run the file through Access and then export it the script runs fine.  If I just try and run script on the original file without running through Access I get:

C:\temp\Scripts\scripts>cscript 01_Create_Files.vbs
Microsoft (R) Windows Script Host Version 5.7
Copyright (C) Microsoft Corporation. All rights reserved.

C:\temp\Scripts\scripts\01_Create_Files.vbs(29, 4) Microsoft VBScript runtime er
ror: Bad file name or number


C:\temp\Scripts\scripts>
Hmmm....try this:

            values = Split(line, Chr(34) & "|" & Chr(34))
Well it looks like script will successfully process the very first line.  However it chokes up on the 2nd line.  Here's a very small sample (3 lines) of my file.
file.txt
What possibly could running it through Access fix?  The actual file is directly from our ETL guys.  Could it be some type of Unicode or something?  This is making no sense :)
This file named 'File3.txt' it runs flawlessly on.  Why on this one but not the original file is beyond me.
file3.txt
Here's code that worked perfectly against File3.txt but wont run when I change it to real file or the File2.txt.


Option Explicit

Dim inputFolder, inputFile, inputFileName, outputFolder, outputFilename
inputFolder = "C:\temp\Scripts\Original\"
inputFile = "file3.txt"
inputFileName = inputFolder & inputFile
outputFolder = "C:\temp\Scripts\Original\Completed\"

Dim outputFiles
Set outputFiles = CreateObject("Scripting.Dictionary")

Dim fso, fIn, fOut
Const ForReading = 1, ForWriting = 2

Set fso = CreateObject("Scripting.FileSystemObject")
If fso.FileExists(inputFileName) Then
      If Not fso.FolderExists(outputFolder) Then
            fso.CreateFolder(outputFolder)
      End If

      Dim line, values      
      Set fIn = fso.OpenTextFile(inputFileName, ForReading, False)
      While Not fIn.AtEndOfStream
            line = fIn.ReadLine
            values = Split(line, "|") ' <-- Pipe delimited file

            outputFileName = outputFolder & values(0) & ".txt"         
            If Not outputFiles.Exists(outputFileName) Then
                  Set fOut = fso.OpenTextFile(outputFileName, ForWriting, True)
                  outputFiles.Add outputFileName, fOut
            Else
                  Set fOut = outputFiles.Item(outputFileName)
            End If
            fOut.WriteLine line
      Wend
   
      MsgBox inputFile, vbOKOnly, "Done Processing File"
Else
      MsgBox inputFile, vbOKOnly, "Input File Not Found"
End If
OK.  It chokes on the " quotes in the file.  So if I remove all double quotes the script works.  So is there a way to add this to script so it does that first and then runs it?
*I haven't looked at the files.

You can remove the quotes on each line as it is read:

    line = fIn.ReadLine
    line = Replace(line, Chr(34), "") ' <-- Remove quotes
    values = Split(line, "|")
Perfect