PowerShell: How to transform a very large text file?

Posted on 2009-07-14
Last Modified: 2012-05-07
I have a very large xml document. I wanted to load it into memory, do a bunch of regex replacements on it, extract some details from it and write the new version back out.

I did this in C#, reading in and passing along the entire file's contents before writing it back out.

In PowerShell I'm getting an Out of Memory error. Admittedly, I don't know if I ever processed as large a file with the old application, so I can't be 100% sure the old approach would have handled it either.

In any case, I've tried a number of things w/o luck. There are no CRs in the document, so it's one long string. I tried the -ReadCount and -Encoding arguments, w/o any improvement.

(Get-Content "C:\Test\Input.xml" -Encoding Byte -ReadCount 10kb)  | 
		Set-Content -Encoding Byte "C:\Test\output.xml"

Question by:ToddBeaulieu
4 Comments
 
LVL 71

Expert Comment

by:Chris Dent
ID: 24857616

How big is very large?

ReadCount is a line count, not a size in bytes: the 10kb is simply converted to the number 10240, so it doesn't help much with a file that is one long line.
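As a quick check of that point (an illustrative snippet, not part of the original reply; the variable name $readCount is made up), the kb suffix can be inspected at the prompt:

```powershell
# 10kb is a numeric literal: PowerShell multiplies 10 by 1024.
$readCount = 10kb
$readCount                   # 10240
$readCount.GetType().Name    # Int32
```

So -ReadCount 10kb asks Get-Content to send 10240 items down the pipeline at a time; it does not set a 10 KB read size.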

You might consider using System.IO.StreamReader and the ReadBlock method. Although chances are you can use whatever you used in your C# code. What did you use there?

Chris
 
LVL 16

Author Comment

by:ToddBeaulieu
ID: 24859519
Actually, in the original version I unzipped the file directly to a string using a 3rd party library. I then did my replacements on that string and wrote it out in a single operation. When I tried that approach with PS it was unbearable, gobbling up memory and resources before finally failing with out of memory. Again, I haven't tried this large file with the old system to see how it would respond, but I figure it doesn't matter, because I want the new system to be guaranteed to work.

Using your suggestion, I was able to chunk the file up and copy it to a new file (haven't started replacing yet). I'm surprised that the output file is a different size than the input, since I'm passing everything through. Do you know why this would be, given the code below? Encoding issue?

Because the files are so large and have no CR/LF, I've been unable to open either yet to try to compare them. Even my trusty TextPad is brought to its knees.

Input size: 240,141,602 bytes
Output size: 240,996,352 bytes

$if = new-object System.IO.StreamReader "C:\Test\input.xml"  
$of = new-object System.IO.StreamWriter "C:\Test\output.xml"  
 
[Char[]]$buffer = new-object char[] 1000000  
[int]$bytesRead
 
$bytesRead = $if.ReadBlock($buffer, $index, $buffer.Length)
 
while ($bytesRead -gt 0)
{
	[string]$Chunk = New-Object string(,$buffer)
	$of.Write($Chunk)
	
	$bytesRead = $if.ReadBlock($buffer, $index, $buffer.Length)
}

 
LVL 71

Accepted Solution

by:
Chris Dent earned 2000 total points
ID: 24860003

Very odd. It defaults to UTF8, but even if the writer is forced to use Unicode it still gives a smaller file. I'm not sure why it's showing a difference.

Compare-Object shows the only difference in content is white space at the end (my test file is rather small :)). So you might want to watch for white space at the end as well. Perhaps .Trim() the $Chunk before writing it. Of course, that won't help with the file being smaller; I cannot explain that.

Chris
 
LVL 16

Author Closing Comment

by:ToddBeaulieu
ID: 31603425
Your comment of trimming made me realize something. The very last buffer was being written out, even if it were only a partial buffer. Therefore, even if it read just 1 byte in the final chunk, it would still write out one full buffer.

The final code is shown below.

Thanks!


$if = new-object System.IO.StreamReader "C:\Test\input.xml"
$of = new-object System.IO.StreamWriter "C:\Test\output.xml"  

[Char[]]$buffer = new-object char[] 5000000  

[int]$bytesRead = $if.ReadBlock($buffer, 0, $buffer.Length)

while ($bytesRead -gt 0)
{
    # ReadBlock returns the number of characters read, so use that count as-is;
    # subtracting 1 would drop the last character of every chunk.
    [string]$Chunk = New-Object string($buffer, 0, $bytesRead)
   
    $of.Write($Chunk)
   
    [int]$bytesRead = $if.ReadBlock($buffer, 0, $buffer.Length)
}

$if.Close()
$if.Dispose()

$of.Close()
$of.Dispose()
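
For the regex replacements mentioned in the question, the same chunked loop could be wrapped up roughly as follows. This is only a sketch under assumptions: the function name Convert-FileInChunks and its parameters are hypothetical (they do not appear in the thread), full paths are assumed because the .NET classes do not follow PowerShell's current location, and the output is written with StreamWriter's default UTF-8 encoding.

```powershell
# Hypothetical helper (not from the thread): copies a text file in fixed-size
# character chunks, applying one regex replacement to each chunk on the way.
function Convert-FileInChunks {
    param(
        [string]$InputPath,      # full path to the source file
        [string]$OutputPath,     # full path to the destination file
        [string]$Pattern,        # regex to search for
        [string]$Replacement,    # replacement text
        [int]$ChunkSize = 5000000
    )

    $reader = $null
    $writer = $null
    try {
        $reader = New-Object System.IO.StreamReader $InputPath
        $writer = New-Object System.IO.StreamWriter $OutputPath
        $buffer = New-Object char[] $ChunkSize

        # ReadBlock returns the number of characters actually read (0 at end of file).
        $charsRead = $reader.ReadBlock($buffer, 0, $buffer.Length)
        while ($charsRead -gt 0) {
            # Build the string from only the characters read, not the whole buffer.
            $chunk = New-Object string($buffer, 0, $charsRead)
            $writer.Write([regex]::Replace($chunk, $Pattern, $Replacement))
            $charsRead = $reader.ReadBlock($buffer, 0, $buffer.Length)
        }
    }
    finally {
        # Dispose also closes the underlying files, even if an error occurred.
        if ($reader) { $reader.Dispose() }
        if ($writer) { $writer.Dispose() }
    }
}
```

A call might look like Convert-FileInChunks -InputPath "C:\Test\input.xml" -OutputPath "C:\Test\output.xml" -Pattern "foo" -Replacement "bar", where the pattern and replacement are placeholders. One limitation to keep in mind: a match that straddles the boundary between two chunks will not be replaced, so this only suits patterns that are short relative to the chunk size, or it needs extra handling to carry an overlap from one chunk into the next.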