Why do I need to use Encoding when reading a flat file?

ToddBeaulieu
I'm processing xml files. Because of their size, I need to chunk them. From what I've determined, this means using a StreamReader.

I soon noticed a couple of strange "transformations". For instance, a node like "<Person attribute=value>" would come through as "<Person attribute=value />". Note the closing slash. Even stranger, actual closing slashes like "/>" were transformed to "/\>".

After quite a bit of struggling I discovered that if I specify the encoding on the StreamReader as UTF8 it seems to work as desired.

Can anyone explain what might be going on here? I assumed these files were simple "ASCII" encoded. Why would they be transformed?
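# Open the source file as UTF-8; $false turns off byte-order-mark detection.
$if = new-object System.IO.StreamReader -ArgumentList ([string]$sourceFile, [System.Text.Encoding]::UTF8, [Boolean]$false, [int]$bufferSize)

[Char[]]$buffer = new-object char[] $bufferSize

# ReadBlock returns the number of characters read (despite the variable name).
[int]$bytesRead = $if.ReadBlock($buffer, 0, $buffer.Length)

while ($bytesRead -gt 0)
{
    [string]$chunk = New-Object string($buffer, 0, $bytesRead)

    # (the posted code is cut off here; the processing of $chunk is not shown)

    # Assumed completion: read the next block so the loop ends at end of file.
    $bytesRead = $if.ReadBlock($buffer, 0, $buffer.Length)
}

$if.Close()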

Glanced up at my screen and thought I had coded the Matrix...  Turns out, I just fell asleep on the keyboard.
Most Valuable Expert 2011
Top Expert 2015
Commented:
XML files can be encoded however their producer chooses, so do not assume ASCII. I believe UTF-8 is the default encoding for XML, although UTF-16 (and I think UTF-32) are also used. You don't always have to specify the encoding, but it is safer to do so, as your example shows.
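To illustrate why the assumed encoding matters, here is a minimal sketch (not taken from the thread) that decodes the same bytes twice. The sample word and the choice of ISO-8859-1 as the "wrong" encoding are assumptions made purely for illustration.

```powershell
# Build the text from code points so this script's own file encoding doesn't matter.
$text  = 'caf' + [char]0xE9        # "cafe" with an accented e (U+00E9, outside ASCII)
$bytes = [System.Text.Encoding]::UTF8.GetBytes($text)

# Decode the same bytes twice: once with the right encoding, once with the wrong one.
$asUtf8   = [System.Text.Encoding]::UTF8.GetString($bytes)
$asLatin1 = [System.Text.Encoding]::GetEncoding(28591).GetString($bytes)   # ISO-8859-1

"Bytes on disk : {0}" -f (($bytes | ForEach-Object { $_.ToString('X2') }) -join ' ')
"Read as UTF-8 : {0} characters" -f $asUtf8.Length
"Read as Latin1: {0} characters" -f $asLatin1.Length
```

The accented character occupies two bytes in UTF-8, so a single-byte decoder turns it into two unrelated characters. Pure ASCII text would decode identically under both, which is why a mismatch often goes unnoticed until a non-ASCII character shows up.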

Author

Commented:
I've done a bit of searching but I'm still a bit confused. There is actually no encoding line in the xml file. In fact, that couldn't be it anyway, since my problem wasn't related to xml. I simply opened a StreamReader and copied it to a StreamWriter. This took me HOURS to find a solution to and frankly, I just don't understand the problem, never mind the solution!
Chris Dent, PowerShell Developer
Top Expert 2010
Commented:

> There is actually no encoding line in the xml file.

This is File Encoding, not XML Encoding. File Encoding defines how the text within the file (on your hard disk) is held / represented.

Since you're using StreamReader, you haven't had anything to do with the XML within the file; it's simply not relevant. Any Encoding statement in the XML, or the lack of one, would not be used.

Chris
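As a rough sketch of the point that StreamReader works from the file's bytes rather than from anything in the XML, the snippet below writes a small UTF-16 file and opens it while asking for UTF-8 with byte-order-mark detection switched on. The temporary file and its content are assumptions for illustration only.

```powershell
# Hypothetical sample file: UTF-16 LE text, written with a byte order mark.
$path = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllText($path, '<Person />', [System.Text.Encoding]::Unicode)

# Ask for UTF-8, but let the reader override that if it finds a BOM ($true).
$reader = New-Object System.IO.StreamReader -ArgumentList ($path, [System.Text.Encoding]::UTF8, $true)
try {
    $before  = $reader.CurrentEncoding.WebName   # the encoding that was requested
    $content = $reader.ReadToEnd()               # detection happens on the first read
    $after   = $reader.CurrentEncoding.WebName   # the encoding actually used
}
finally {
    $reader.Close()
    Remove-Item $path
}

"Requested: $before; detected: $after; content: $content"
```

CurrentEncoding is only reliable after the first read, which is when the reader inspects the leading bytes. With detection switched off, as in the question's code, the reader simply uses the encoding it was given.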

Author

Commented:
Right. That's exactly what I was thinking.

I just did an experiment. I found an example that determines a file's encoding by looking at the header bytes.

http://www.personalmicrocosms.com/Pages/dotnettips.aspx?c=15&t=17#tip

I ran it on three files:

1. File I created with my text editor: {System.Text.SBCSCodePageEncoding}
2. File I created with PowerShell (simple redirection): {System.Text.UnicodeEncoding}
3. File created by an external system (subject of this thread): {System.Text.UTF8Encoding}
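The linked example is not reproduced here, but the general idea of checking the header bytes can be sketched as follows. Get-BomEncodingName is a hypothetical helper written for this illustration; it only recognises the common byte order marks and returns 'none' when there isn't one.

```powershell
# Hypothetical helper: name the encoding implied by a byte order mark, if any.
function Get-BomEncodingName {
    param([byte[]]$Bytes)

    # Render the first four bytes as hex so the marks can be compared as strings.
    $hex = ($Bytes | Select-Object -First 4 | ForEach-Object { $_.ToString('X2') }) -join ''

    # Check the four-byte marks first: UTF-32 LE begins with the UTF-16 LE mark.
    if ($hex.StartsWith('FFFE0000')) { return 'utf-32 LE' }
    if ($hex.StartsWith('0000FEFF')) { return 'utf-32 BE' }
    if ($hex.StartsWith('EFBBBF'))   { return 'utf-8' }
    if ($hex.StartsWith('FFFE'))     { return 'utf-16 LE' }
    if ($hex.StartsWith('FEFF'))     { return 'utf-16 BE' }
    return 'none'
}

# Usage: feed it the preambles that .NET itself writes for each encoding.
Get-BomEncodingName ([System.Text.Encoding]::UTF8.GetPreamble())
Get-BomEncodingName ([System.Text.Encoding]::Unicode.GetPreamble())
Get-BomEncodingName ([System.Text.Encoding]::ASCII.GetPreamble())
```

A result of 'none' does not say what the encoding is. Files without a mark (plain ASCII, a single-byte code page, or UTF-8 saved without a BOM) cannot be told apart this way, so the reader falls back on whatever encoding it was told to use.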

While this was all quite interesting, in the end I found the problem to be an invalid search/replace.
