Want to win a PS4? Go Premium and enter to win our High-Tech Treats giveaway. Enter to Win

x
?
Solved

Powershell: How to transform a very large  text file?

Posted on 2009-07-14
4
Medium Priority
?
1,833 Views
Last Modified: 2012-05-07
I have a very large xml document. I wanted to load it into memory, do a bunch of regex replacements on it, extract some details from it and write the new version back out.

I did this in c#, reading in and passing along the entire file's contents before writing it back out.

In Powershell I'm getting an Out of Memory error. Admittedly, I don't know if I ever processed as large a file with the old application, so I can't be 100% sure this would work.

In any case, I've tried a number of things w/o luck.  There are no CRs in the document, so it's one long string. I tried the -ReadSize and -Encoding arguments, w/o any improvements.

(Get-Content "C:\Test\Input.xml" -Encoding Byte -ReadCount 10kb)  | 
		Set-Content -Encoding Byte "C:\Test\output.xml"

Open in new window

0
Comment
Question by:ToddBeaulieu
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
  • 2
4 Comments
 
LVL 71

Expert Comment

by:Chris Dent
ID: 24857616

How big is very large?

ReadCount is a line count, that 10Kb will be converted to 10240 rather than being treated as a size, doesn't help much.

You might consider using System.IO.StreamReader and the ReadBlock method. Although chances are you can use whatever you used in your C# code. What did you use there?

Chris
0
 
LVL 16

Author Comment

by:ToddBeaulieu
ID: 24859519
Actually, in the original version I unzipped the file directly to a string using a 3rd party library. I then did my replacements on that string and wrote it out in a single operation. When I tried that approach with PS it was unbearable, gobbling up memory and resources before finally failing with out of memory. Again, I haven't tried this large file with the old system to see how it would respond, but I figure it doesn't matter, because I want the new system to be guaranteed to work.

Using your suggestion, I was able to chunk the file up and copy it to a new file (haven't started replacing yet). I'm surprised that the output file is a different size than the input, since I'm passing everything through. Do you know why this would be, given the code below? Encoding issue?

Because the files are so large and have no CR/LF, I've been unable to open either yet to try to compare them. Even my trusty TextPad is brought to its knees.

Input size: 240,141,602 bytes
Output size: 240,996,352 bytes

$if = new-object System.IO.StreamReader "C:\Test\input.xml"  
$of = new-object System.IO.StreamWriter "C:\Test\output.xml"  
 
[Char[]]$buffer = new-object char[] 1000000  
[int]$bytesRead
 
$bytesRead = $if.ReadBlock($buffer, $index, $buffer.Length)
 
while ($bytesRead -gt 0)
{
	[string]$Chunk = New-Object string(,$buffer)
	$of.Write($Chunk)
	
	$bytesRead = $if.ReadBlock($buffer, $index, $buffer.Length)
}

Open in new window

0
 
LVL 71

Accepted Solution

by:
Chris Dent earned 2000 total points
ID: 24860003

Very odd. It defaults to UTF8, but even if the writer is forced to use Unicode it still gives a smaller file. I'm not sure why it's showing a difference

Compare-Object shows the only difference in content is white space at the end (my test file is rather small :)). So you might want to watch for white space at the end as well. Perhaps .Trim() the $Chunk before writing it. Of course, that won't help with the file being smaller, I cannot explain that.

Chris
0
 
LVL 16

Author Closing Comment

by:ToddBeaulieu
ID: 31603425
Your comment of trimming made me realize something. The very last buffer was being written out, even if it were only a partial buffer. Therefore, even if it read just 1 byte in the final chunk, it would still write out one full buffer.

The final code is shown below.

Thanks!


$if = new-object System.IO.StreamReader "C:\Test\input.xml"
$of = new-object System.IO.StreamWriter "C:\Test\output.xml"  

[Char[]]$buffer = new-object char[] 5000000  

[int]$bytesRead = $if.ReadBlock($buffer, 0, $buffer.Length)

while ($bytesRead -gt 0)
{
    [int]$bytesToWrite = $bytesRead - 1
    [string]$Chunk = New-Object string($buffer, 0, $bytesToWrite)
   
    $of.Write($Chunk)
   
    [int]$bytesRead = $if.ReadBlock($buffer, 0, $buffer.Length)
}

$if.Close()
$if.Dispose()

$of.Close()
$of.Dispose()
0

Featured Post

Does Powershell have you tied up in knots?

Managing Active Directory does not always have to be complicated.  If you are spending more time trying instead of doing, then it's time to look at something else. For nearly 20 years, AD admins around the world have used one tool for day-to-day AD management: Hyena. Discover why

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This script can help you clean up your user profile database by comparing profiles to Active Directory users in a particular OU, and removing the profiles that don't match.
A quick Powershell script I wrote to find old program installations and check versions of a specific file across the network.
Exchange organizations may use the Journaling Agent of the Transport Service to archive messages going through Exchange. However, if the Transport Service is integrated with some email content management application (such as an antispam), the admini…
Please read the paragraph below before following the instructions in the video — there are important caveats in the paragraph that I did not mention in the video. If your PaperPort 12 or PaperPort 14 is failing to start, or crashing, or hanging, …

636 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question