Solved

PowerShell: How to transform a very large text file?

Posted on 2009-07-14
Last Modified: 2012-05-07
I have a very large XML document. I want to load it into memory, do a bunch of regex replacements on it, extract some details from it and write the new version back out.

I did this in C#, reading in and passing along the entire file's contents before writing it back out.

In Powershell I'm getting an Out of Memory error. Admittedly, I don't know if I ever processed as large a file with the old application, so I can't be 100% sure this would work.

In any case, I've tried a number of things without luck. There are no CRs in the document, so it's one long string. I tried the -ReadCount and -Encoding arguments, without any improvement.

(Get-Content "C:\Test\Input.xml" -Encoding Byte -ReadCount 10kb)  | 
		Set-Content -Encoding Byte "C:\Test\output.xml"

Question by:ToddBeaulieu
 
LVL 70

Expert Comment

by:Chris Dent
ID: 24857616

How big is very large?

ReadCount is a line count, so that 10Kb is simply converted to 10240 rather than being treated as a size, which doesn't help much.

You might consider using System.IO.StreamReader and the ReadBlock method. Although chances are you can use whatever you used in your C# code. What did you use there?
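
A quick way to see the point about 10Kb: PowerShell expands the literal to a plain integer before Get-Content ever receives it, so the cmdlet only sees a count. A minimal check (the variable name is just for illustration):

```powershell
# 10kb is a numeric literal with a multiplier suffix, not a size hint.
$count = 10kb

$count                  # 10240
$count.GetType().Name   # Int32
```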

Chris
 
LVL 16

Author Comment

by:ToddBeaulieu
ID: 24859519
Actually, in the original version I unzipped the file directly to a string using a 3rd party library. I then did my replacements on that string and wrote it out in a single operation. When I tried that approach with PS it was unbearable, gobbling up memory and resources before finally failing with out of memory. Again, I haven't tried this large file with the old system to see how it would respond, but I figure it doesn't matter, because I want the new system to be guaranteed to work.

Using your suggestion, I was able to chunk the file up and copy it to a new file (haven't started replacing yet). I'm surprised that the output file is a different size than the input, since I'm passing everything through. Do you know why this would be, given the code below? Encoding issue?

Because the files are so large and have no CR/LF, I've been unable to open either yet to try to compare them. Even my trusty TextPad is brought to its knees.

Input size: 240,141,602 bytes
Output size: 240,996,352 bytes

$if = new-object System.IO.StreamReader "C:\Test\input.xml"  
$of = new-object System.IO.StreamWriter "C:\Test\output.xml"  
 
[Char[]]$buffer = new-object char[] 1000000  
[int]$bytesRead
 
$bytesRead = $if.ReadBlock($buffer, $index, $buffer.Length)
 
while ($bytesRead -gt 0)
{
	[string]$Chunk = New-Object string(,$buffer)
	$of.Write($Chunk)
	
	$bytesRead = $if.ReadBlock($buffer, $index, $buffer.Length)
}

 
LVL 70

Accepted Solution

by:
Chris Dent earned 500 total points
ID: 24860003

Very odd. It defaults to UTF8, but even if the writer is forced to use Unicode it still gives a smaller file. I'm not sure why it's showing a difference.

Compare-Object shows the only difference in content is white space at the end (my test file is rather small :)). So you might want to watch for white space at the end as well. Perhaps .Trim() the $Chunk before writing it. Of course, that won't help with the file being smaller, I cannot explain that.

Chris
 
LVL 16

Author Closing Comment

by:ToddBeaulieu
ID: 31603425
Your comment of trimming made me realize something. The very last buffer was being written out, even if it were only a partial buffer. Therefore, even if it read just 1 byte in the final chunk, it would still write out one full buffer.
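
The partial-buffer effect can be reproduced at small scale. The sketch below is illustrative only: the 10-character input, the 4-character buffer and the System.IO.StringReader standing in for the real file are all assumptions chosen to make the last read easy to see. ReadBlock returns a character count, and on the final pass it is smaller than the buffer while the rest of the buffer still holds the previous chunk.

```powershell
# Hypothetical tiny input and buffer, used only to show the partial final read.
$reader = New-Object System.IO.StringReader '0123456789'
$buffer = New-Object char[] 4
$counts = @()
$padded = ''
$exact  = ''

$read = $reader.ReadBlock($buffer, 0, $buffer.Length)
while ($read -gt 0) {
    $counts += $read
    $padded += New-Object string (,$buffer)           # whole buffer, stale tail included
    $exact  += New-Object string ($buffer, 0, $read)  # only the characters just read
    $read = $reader.ReadBlock($buffer, 0, $buffer.Length)
}
$reader.Dispose()

$counts -join ','   # 4,4,2
$padded             # 012345678967
$exact              # 0123456789
```

The trailing "67" in the padded result is left over from the second read, which is the same kind of extra data that inflated the output file.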

The final code is shown below.

Thanks!


$if = new-object System.IO.StreamReader "C:\Test\input.xml"
$of = new-object System.IO.StreamWriter "C:\Test\output.xml"  

[Char[]]$buffer = new-object char[] 5000000  

[int]$bytesRead = $if.ReadBlock($buffer, 0, $buffer.Length)

while ($bytesRead -gt 0)
{
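    # ReadBlock returns a character count and the string constructor takes a
    # length (not an end index), so no "- 1" is needed; subtracting one would
    # drop the last character of every chunk.
    [int]$bytesToWrite = $bytesRead
    [string]$Chunk = New-Object string($buffer, 0, $bytesToWrite)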
   
    $of.Write($Chunk)
   
    [int]$bytesRead = $if.ReadBlock($buffer, 0, $buffer.Length)
}

$if.Close()
$if.Dispose()

$of.Close()
$of.Dispose()
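
Since the large files were too big to open and compare, one way to gain confidence in the loop is to run it on a small generated file and compare sizes. This is only a sketch under stated assumptions: the temp files, the sample content and the 3000-character buffer are all made up for the test, and the content is plain ASCII written without a byte order mark so that character and byte counts match. Closing the writer matters as well, because a StreamWriter buffers output and an unclosed writer may leave the end of the file unwritten.

```powershell
# Hypothetical temp files and a small buffer, used only to exercise the loop.
$inPath  = [System.IO.Path]::GetTempFileName()
$outPath = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllText($inPath, ('<a>x</a>' * 1000))  # 8000 chars, no line breaks

$reader = New-Object System.IO.StreamReader $inPath
$writer = New-Object System.IO.StreamWriter $outPath
try {
    $buffer = New-Object char[] 3000
    $read = $reader.ReadBlock($buffer, 0, $buffer.Length)
    while ($read -gt 0) {
        # Build the chunk from only the characters read on this pass.
        $chunk = New-Object string ($buffer, 0, $read)
        $writer.Write($chunk)
        $read = $reader.ReadBlock($buffer, 0, $buffer.Length)
    }
}
finally {
    # Dispose flushes the writer's buffered output and releases both files.
    $reader.Dispose()
    $writer.Dispose()
}

$inLength  = (Get-Item $inPath).Length
$outLength = (Get-Item $outPath).Length
$sameSize  = $inLength -eq $outLength
Remove-Item $inPath, $outPath
```

If the real input starts with a byte order mark or uses a non-UTF8 encoding, the sizes can still differ unless matching encodings are passed to both the reader and the writer.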