Solved

PowerShell: How to transform a very large text file?

Posted on 2009-07-14
Last Modified: 2012-05-07
I have a very large XML document. I want to load it into memory, do a bunch of regex replacements on it, extract some details from it and write the new version back out.

I did this in C#, reading in and passing along the entire file's contents before writing it back out.

In PowerShell I'm getting an Out of Memory error. Admittedly, I don't know if I ever processed a file this large with the old application, so I can't be 100% sure the old approach would have worked either.

In any case, I've tried a number of things w/o luck. There are no CRs in the document, so it's one long string. I tried the -ReadCount and -Encoding arguments, w/o any improvement.

(Get-Content "C:\Test\Input.xml" -Encoding Byte -ReadCount 10kb)  | 
		Set-Content -Encoding Byte "C:\Test\output.xml"

Question by:ToddBeaulieu

Expert Comment

by:Chris Dent
ID: 24857616

How big is very large?

ReadCount is a line count, so that 10kb is converted to 10240 lines rather than being treated as a size in bytes. That doesn't help much with a file that has no line breaks.
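
To make the point about -ReadCount concrete, here is a small sketch. The scratch file name and its three lines are made up for illustration; only Get-Content, Set-Content and the kb suffix are standard PowerShell.

```powershell
# 10kb is just a numeric literal with a multiplier suffix
$count = 10kb          # 10240

# Hypothetical scratch file containing three short lines
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'readcount-demo.txt'
Set-Content -Path $path -Value 'one', 'two', 'three'

# -ReadCount 2 sends LINES down the pipeline in batches of two;
# it says nothing about how many bytes are read at a time
$batches = @(Get-Content -Path $path -ReadCount 2 |
    ForEach-Object { @($_).Count })

$batches               # batch sizes, in lines

Remove-Item -Path $path
```

A file with no line breaks counts as a single line, so any -ReadCount value still pulls the whole thing into one string.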

You might consider using System.IO.StreamReader and the ReadBlock method. Although chances are you can use whatever you used in your C# code. What did you use there?

Chris

Author Comment

by:ToddBeaulieu
ID: 24859519
Actually, in the original version I unzipped the file directly to a string using a 3rd party library. I then did my replacements on that string and wrote it out in a single operation. When I tried that approach with PS it was unbearable, gobbling up memory and resources before finally failing with out of memory. Again, I haven't tried this large file with the old system to see how it would respond, but I figure it doesn't matter, because I want the new system to be guaranteed to work.

Using your suggestion, I was able to chunk the file up and copy it to a new file (haven't started replacing yet). I'm surprised that the output file is a different size than the input, since I'm passing everything through. Do you know why this would be, given the code below? Encoding issue?

Because the files are so large and have no CR/LF, I've been unable to open either yet to try to compare them. Even my trusty TextPad is brought to its knees.

Input size: 240,141,602 bytes
Output size: 240,996,352 bytes

$if = new-object System.IO.StreamReader "C:\Test\input.xml"  
$of = new-object System.IO.StreamWriter "C:\Test\output.xml"  
 
[Char[]]$buffer = new-object char[] 1000000  
[int]$bytesRead
 
$bytesRead = $if.ReadBlock($buffer, $index, $buffer.Length)
 
while ($bytesRead -gt 0)
{
	[string]$Chunk = New-Object string(,$buffer)
	$of.Write($Chunk)
	
	$bytesRead = $if.ReadBlock($buffer, $index, $buffer.Length)
}


Accepted Solution

by:Chris Dent (earned 500 total points)
ID: 24860003

Very odd. StreamWriter defaults to UTF-8, but even if the writer is forced to use Unicode it still gives a smaller file. I'm not sure why it's showing a difference.

Compare-Object shows the only difference in content is white space at the end (my test file is rather small :)). So you might want to watch for white space at the end as well. Perhaps .Trim() the $Chunk before writing it. Of course, that won't help with the file being smaller; I cannot explain that.

Chris

Author Closing Comment

by:ToddBeaulieu
ID: 31603425
Your comment of trimming made me realize something. The very last buffer was being written out, even if it were only a partial buffer. Therefore, even if it read just 1 byte in the final chunk, it would still write out one full buffer.
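
For anyone who wants to see the partial-buffer effect on a tiny input, here is a minimal sketch. It uses System.IO.StringReader in place of the file and a 5-character buffer; both are stand-ins chosen for illustration, not the setup from this thread.

```powershell
# Seven characters read through a five-character buffer
$reader = New-Object System.IO.StringReader 'abcdefg'
$buffer = New-Object char[] 5

# First call fills the whole buffer: a b c d e
$first  = $reader.ReadBlock($buffer, 0, $buffer.Length)

# Second call only has "fg" left, so it overwrites slots 0 and 1
# and leaves c d e from the previous read in place
$second = $reader.ReadBlock($buffer, 0, $buffer.Length)

# Building a string from the whole buffer drags the stale characters along
$wholeBuffer = New-Object string (,$buffer)

# Building it from only the characters just read gives the real tail
$validOnly   = New-Object string ($buffer, 0, $second)

$reader.Dispose()

"$first / $second / $wholeBuffer / $validOnly"
```

The value returned by ReadBlock is the only reliable indication of how much of the buffer is fresh.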

The final code is shown below.

Thanks!


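$if = New-Object System.IO.StreamReader "C:\Test\input.xml"
$of = New-Object System.IO.StreamWriter "C:\Test\output.xml"

# Reusable buffer of 5,000,000 characters
[char[]]$buffer = New-Object char[] 5000000

# ReadBlock returns the number of characters actually placed in the buffer
[int]$bytesRead = $if.ReadBlock($buffer, 0, $buffer.Length)

while ($bytesRead -gt 0)
{
    # Build the chunk from only the characters just read, so a partial
    # final buffer is not padded with leftovers from the previous read
    [string]$Chunk = New-Object string ($buffer, 0, $bytesRead)

    $of.Write($Chunk)

    $bytesRead = $if.ReadBlock($buffer, 0, $buffer.Length)
}

# Close flushes any pending output and releases the file handles
$if.Close()
$of.Close()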
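
The thread stops at a straight copy, so here is a rough sketch of how the regex replacement from the original question might slot into the same loop. The function name Convert-LargeTextFile and its parameters are hypothetical, and the default 1,000,000-character buffer is just a guess at a reasonable size.

```powershell
# Hypothetical helper: copy a text file chunk by chunk, applying one
# regex replacement to each chunk on the way through
function Convert-LargeTextFile {
    param(
        [string]$InputPath,      # full path; .NET ignores the PowerShell location
        [string]$OutputPath,     # full path
        [string]$Pattern,        # regular expression to look for
        [string]$Replacement,    # replacement text
        [int]$BufferSize = 1000000
    )

    $reader = New-Object System.IO.StreamReader $InputPath
    $writer = New-Object System.IO.StreamWriter $OutputPath
    try {
        $buffer = New-Object char[] $BufferSize

        # ReadBlock returns 0 once the end of the file is reached
        while (($count = $reader.ReadBlock($buffer, 0, $buffer.Length)) -gt 0) {
            # Only the first $count characters are valid for this pass
            $chunk = New-Object string ($buffer, 0, $count)
            $writer.Write([regex]::Replace($chunk, $Pattern, $Replacement))
        }
    }
    finally {
        # Release the file handles even if a read or write throws
        $reader.Dispose()
        $writer.Dispose()
    }
}
```

Usage would look something like `Convert-LargeTextFile -InputPath 'C:\Test\input.xml' -OutputPath 'C:\Test\output.xml' -Pattern 'foo' -Replacement 'bar'`. One caveat with this approach: a match that straddles two chunks will be missed, so it only suits patterns that are short relative to the buffer and where an occasional miss at a boundary is acceptable, unless some overlap handling is added.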