Solved

Powershell: How to transform a very large  text file?

Posted on 2009-07-14
4
1,730 Views
Last Modified: 2012-05-07
I have a very large xml document. I wanted to load it into memory, do a bunch of regex replacements on it, extract some details from it and write the new version back out.

I did this in c#, reading in and passing along the entire file's contents before writing it back out.

In Powershell I'm getting an Out of Memory error. Admittedly, I don't know if I ever processed as large a file with the old application, so I can't be 100% sure this would work.

In any case, I've tried a number of things w/o luck.  There are no CRs in the document, so it's one long string. I tried the -ReadSize and -Encoding arguments, w/o any improvements.

(Get-Content "C:\Test\Input.xml" -Encoding Byte -ReadCount 10kb)  | 
		Set-Content -Encoding Byte "C:\Test\output.xml"

Open in new window

0
Comment
Question by:ToddBeaulieu
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
  • 2
4 Comments
 
LVL 71

Expert Comment

by:Chris Dent
ID: 24857616

How big is very large?

ReadCount is a line count, that 10Kb will be converted to 10240 rather than being treated as a size, doesn't help much.

You might consider using System.IO.StreamReader and the ReadBlock method. Although chances are you can use whatever you used in your C# code. What did you use there?

Chris
0
 
LVL 16

Author Comment

by:ToddBeaulieu
ID: 24859519
Actually, in the original version I unzipped the file directly to a string using a 3rd party library. I then did my replacements on that string and wrote it out in a single operation. When I tried that approach with PS it was unbearable, gobbling up memory and resources before finally failing with out of memory. Again, I haven't tried this large file with the old system to see how it would respond, but I figure it doesn't matter, because I want the new system to be guaranteed to work.

Using your suggestion, I was able to chunk the file up and copy it to a new file (haven't started replacing yet). I'm surprised that the output file is a different size than the input, since I'm passing everything through. Do you know why this would be, given the code below? Encoding issue?

Because the files are so large and have no CR/LF, I've been unable to open either yet to try to compare them. Even my trusty TextPad is brought to its knees.

Input size: 240,141,602 bytes
Output size: 240,996,352 bytes

$if = new-object System.IO.StreamReader "C:\Test\input.xml"  
$of = new-object System.IO.StreamWriter "C:\Test\output.xml"  
 
[Char[]]$buffer = new-object char[] 1000000  
[int]$bytesRead
 
$bytesRead = $if.ReadBlock($buffer, $index, $buffer.Length)
 
while ($bytesRead -gt 0)
{
	[string]$Chunk = New-Object string(,$buffer)
	$of.Write($Chunk)
	
	$bytesRead = $if.ReadBlock($buffer, $index, $buffer.Length)
}

Open in new window

0
 
LVL 71

Accepted Solution

by:
Chris Dent earned 500 total points
ID: 24860003

Very odd. It defaults to UTF8, but even if the writer is forced to use Unicode it still gives a smaller file. I'm not sure why it's showing a difference

Compare-Object shows the only difference in content is white space at the end (my test file is rather small :)). So you might want to watch for white space at the end as well. Perhaps .Trim() the $Chunk before writing it. Of course, that won't help with the file being smaller, I cannot explain that.

Chris
0
 
LVL 16

Author Closing Comment

by:ToddBeaulieu
ID: 31603425
Your comment of trimming made me realize something. The very last buffer was being written out, even if it were only a partial buffer. Therefore, even if it read just 1 byte in the final chunk, it would still write out one full buffer.

The final code is shown below.

Thanks!


$if = new-object System.IO.StreamReader "C:\Test\input.xml"
$of = new-object System.IO.StreamWriter "C:\Test\output.xml"  

[Char[]]$buffer = new-object char[] 5000000  

[int]$bytesRead = $if.ReadBlock($buffer, 0, $buffer.Length)

while ($bytesRead -gt 0)
{
    [int]$bytesToWrite = $bytesRead - 1
    [string]$Chunk = New-Object string($buffer, 0, $bytesToWrite)
   
    $of.Write($Chunk)
   
    [int]$bytesRead = $if.ReadBlock($buffer, 0, $buffer.Length)
}

$if.Close()
$if.Dispose()

$of.Close()
$of.Dispose()
0

Featured Post

Best Practices: Disaster Recovery Testing

Besides backup, any IT division should have a disaster recovery plan. You will find a few tips below relating to the development of such a plan and to what issues one should pay special attention in the course of backup planning.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

A procedure for exporting installed hotfix details of remote computers using powershell
The following article is intended as a guide to using PowerShell as a more versatile and reliable form of application detection in SCCM.
Exchange organizations may use the Journaling Agent of the Transport Service to archive messages going through Exchange. However, if the Transport Service is integrated with some email content management application (such as an antispam), the adminiā€¦

739 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question