C#: How do you determine the encoding of a file?

trevor1940 asked:
Hi
I have a text file with words like "intimate°", "intimateý" & "tête-à-tête". There are probably others.

Is there a way to work out its encoding?  I'm guessing extended ASCII  

I've been using the code below, or similar StreamReader / StreamWriter code; like I said, this is a pure guess.

string[] AllLines = File.ReadAllLines(FileName, Encoding.GetEncoding("iso-8859-1"));



It looks OK on a sample file

Is there a difference between StreamReader & File.Read...?

I'm assuming that once you read a file into a string with a particular encoding, the string inherits that encoding & this is maintained if used in a class.
ste5an (Senior Developer)

Commented:
The idea here is to let .NET decide. A stream reader does its own (imperfect) detection. But something like this should work:

Encoding GetEncoding(string filePath)
{
	// true => detect the encoding from byte order marks
	using(var streamReader = new StreamReader(filePath, true))
	{
		// read at least one character so the reader actually inspects the file
		streamReader.Read();
		return streamReader.CurrentEncoding;
	}
}


You'll only need to override or specify an encoding when it's a format without marker bytes and/or detection is not possible for .NET.
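
A quick usage sketch (the path is just an example): note that this detection relies mainly on a BOM, so a plain ISO-8859-1 file will usually still be reported as the reader's default of UTF-8.

var detected = GetEncoding(@"C:\Temp\sample.txt");
Console.WriteLine("Detected encoding: {0}", detected.EncodingName);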

Commented:
There is no guaranteed way of doing it. Ultimately, a file is just a bunch of bytes. The encoding is simply the instruction manual on how to assemble those bytes into the right result. Determining an encoding from bytes is like trying to write an instruction manual by taking apart a piece of furniture. You MIGHT get it right, but there is no guarantee.

That said, there are a few things to consider:

1. Sometimes text files that use Unicode encoding like UTF-8 will have a few invisible bytes at the beginning. These bytes are called the byte order mark (BOM) and are there as a signpost for text readers to say, "This file is encoded as UTF-8 / UTF-16 / etc..."

2. UTF-8 is the default encoding for StreamReaders. It's the default because it is not only the most common encoding, but it's also mostly compatible with ASCII plain text files. The only time it will have trouble is if the file contains extended ASCII characters. So basically, if there are any special characters (any single-byte characters with values above 127), a UTF-8 reader won't be sure how to read those bytes since they will look like corrupted partial characters.

In any event, you can usually just use the default encoding and if there are corrupted characters in the output, you can try an ASCII encoding.

Worst case, you can look at the file with a hex editor and manually figure it out....
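
If you'd rather check for a BOM in code than with a hex editor, a rough sketch (path is just an example; needs System and System.IO) might look like this:

byte[] head = new byte[4];
using (var fs = File.OpenRead(@"C:\Temp\sample.txt"))
{
    fs.Read(head, 0, head.Length);
}

if (head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
    Console.WriteLine("UTF-8 BOM found");
else if (head[0] == 0xFF && head[1] == 0xFE)
    Console.WriteLine("UTF-16 little-endian BOM found");
else if (head[0] == 0xFE && head[1] == 0xFF)
    Console.WriteLine("UTF-16 big-endian BOM found");
else
    Console.WriteLine("No BOM found - back to guessing");
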
You may want to read the case study https://diveinto.org/python3/case-study-porting-chardet-to-python-3.html (the beginning of the text). It is in Python, but the first part is general. It also contains some references to the library by the Mozilla team. However, the book is old and the targets have moved.

But you can find the C# port here https://github.com/errepi/ude or here https://github.com/CharsetDetector/UTF-unknown (or elsewhere, just picked by chance).

The NuGet package https://www.nuget.org/packages/UDE.CSharp/ will probably be the fastest to use.

Author

Commented:
@ste5an  your code produced
Console.WriteLine("Encoding is: {0}", EncodingAscii.EncodingName); 
=>
Encoding is: Unicode (UTF-8)



This is clearly the wrong result.

To prove this I read the file in and wrote it out without specifying the encoding (@gr8gonzo: UTF-8 is the default encoding for StreamReaders), then used PSPad to compare the two. I knew these would be different as the file sizes are different.

affect° => affect�  corrupt

Encoding.Unicode (which I think is UTF-16?) produced something akin to 1980s Space Invaders:

㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭

I had similar with Encoding.GetEncoding("UTF-16") and Encoding.BigEndianUnicode


Encoding.GetEncoding("iso-8859-1") no difference

@gr8gonzo PSPad can open in hex; I couldn't see any header info, let alone an encoding.

So my original guess was correct
Commented:
1. That "corrupt" messaging in the UTF-8 version basically says, "I was able to read the word 'affect'  but then there was one byte that was above 127 so I assumed it was the beginning of a multibyte char, but there were no more bytes afterwards. Since you have left the reader encoding as the UTF-8 default, I will tell you that the last byte is a corrupt character."

2. If you specify UTF-16 (yes you are correct that Encoding.Unicode is the same thing - blame Microsoft for that confusing little tidbit), then the reader is going to read "af" as though those two bytes should equal a character. Then it will read "fe" and then "ct" and so on. Since none of those match up quite right to proper UTF-16 characters, you end up with characters that will get printed out looking like Space Invaders. Pew! Pew!

3. Technically speaking, there is no "extended ASCII" encoding. That term refers to several encodings that act very similarly. ISO-8859-1 is usually the one people are referring to when they say "extended ASCII" (or, if they save from MS Office, you'll end up with encodings like Windows-1252, which are almost the same). The extended ASCII charsets are all single byte, so they only have a maximum of 256 values. Special characters are mapped to byte values above 127, so you end up with 128 possible special characters (accented letters and stuff).

So given that UTF-8 was able to read everything except that last degree symbol, you're correct that the encoding was extended ASCII (ISO-8859-1).

4. That said, your original question was "is there a way to work out the original encoding?" So while we've been able to manually determine the encoding for this particular example, if you tried the same steps with a UTF-8-encoded text file with the same text, you'd get different results. When reading it using ISO-8859-1, you'd get 2 strange characters at the end while the degree symbol would show up properly when UTF-8 was specified as the encoding.

So at the end of the day, I re-iterate that it's impossible to perfectly guess the encoding each and every time via code. Things get more complicated if you venture internationally. For example, there are Russian character sets (non-Unicode) that are very similar and have valid overlapping bytes with different letters. The English equivalent would be having bytes that could be decoded as "Hello" or "Hallo" depending on which encoding you specified while reading. Technically both result in legitimate letters that could both be valid English words, so there's no way to know for sure.

That's why using UTF-8 is so popular - it is designed to allow for almost every character known to man without any character confusion. The only reason to ever not use UTF-8 is for situations where you have a LOT of characters that could be more efficiently stored. For example, if you were storing a book written in Chinese or Arabic, most of the characters would take up 3 bytes in UTF-8 while they would take up 2 bytes in UTF-16. But if you're working primarily in English or other Latin-type languages with limited use of special characters, then UTF-8 is by far the most efficient.

Getting back to your program, if you are planning to accept files from different sources, then I'd suggest you stick with the default UTF-8 and detect exceptions while reading and fall back to ISO-8859-1 if UTF-8 throws an exception. You'll have to use an instance of Utf8Encoding with true as the second parameter to the constructor as shown here:

https://docs.microsoft.com/en-us/dotnet/api/system.text.utf8encoding.-ctor?view=netframework-4.8#System_Text_UTF8Encoding__ctor_System_Boolean_System_Boolean_

If you're working primarily with English sources then that should give you near-100% success.
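
A rough sketch of that fall-back (path is just an example; needs System.IO and System.Text; the second constructor argument to UTF8Encoding turns on throwOnInvalidBytes):

string text;
try
{
    // Strict UTF-8: throw instead of silently substituting the replacement character.
    text = File.ReadAllText(@"C:\Temp\input.txt", new UTF8Encoding(false, true));
}
catch (DecoderFallbackException)
{
    // Not valid UTF-8 - assume extended ASCII instead.
    text = File.ReadAllText(@"C:\Temp\input.txt", Encoding.GetEncoding("iso-8859-1"));
}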

Commented:
Alternatively, if the file is being provided by a user via UI input, then add a combo box to allow the user to select the encoding and default it to UTF-8.

That way, they can pick the right encoding if they know it, or they can retry with different encodings if they don't and UTF-8 fails.
To add... The problem of detection is not simple. So, in your case, you can either choose to detect (ad hoc) the few encodings that you may expect from your sources, or use one of the ready-to-use detectors.

The following page shows the article from 2002 that analyses the problem: A composite approach to language/encoding detection

https://github.com/errepi/ude gives you the code with the latest update from May 2015, and https://www.nuget.org/packages/UDE.CSharp/ gives you the same as a NuGet package. The GitHub page also shows how the code can be used:

public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}", 
                 cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}



The tool is rather old, but you can still use it. Anyway, there is a newer project that deals with the problem -- see http://userguide.icu-project.org/conversion/detection, and their GitHub page https://github.com/google/compact_enc_det

Unfortunately (for you, maybe), the library is written in C++; so it may still be better for you to use the above-mentioned NuGet package for C#.

To add to gr8gonzo's gr8 comments, I second the orientation towards UTF-8 in any future project. It is true that Microsoft caused a great mess in the encoding world. The main problem of UTF-16 is (in my opinion) the byte ordering. While UTF-8 is byte-oriented and the same text is encoded to the same stream of bytes in every case, UTF-16 depends on little/big endianness. That is also why the BOM (byte order mark) was introduced. A BOM for UTF-8 was also introduced to "enhance" the situation. So a Microsoft UTF-8 file may contain a BOM that actually only complicates things.
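
To illustrate that last point, a small sketch (paths are just examples): in .NET the UTF-8 BOM is controlled by the first UTF8Encoding constructor argument, so the same text can be written with or without the three EF BB BF bytes at the front.

string text = "tête-à-tête";
// new UTF8Encoding(true) emits the EF BB BF preamble, new UTF8Encoding(false) does not.
File.WriteAllText(@"C:\Temp\withBom.txt", text, new UTF8Encoding(true));
File.WriteAllText(@"C:\Temp\noBom.txt", text, new UTF8Encoding(false));
Console.WriteLine(new FileInfo(@"C:\Temp\withBom.txt").Length); // three bytes longer
Console.WriteLine(new FileInfo(@"C:\Temp\noBom.txt").Length);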

Author

Commented:
Looking at gr8gonzo's comment, he is basically saying: if the input file is English, use UTF-8 (the default) unless it is known the input could use extended ASCII, in which case use ISO-8859-1. That makes sense.


I thought I'd test the Utf8Encoding link. The input file was opened both without specifying an encoding (UTF-8) and using iso-8859-1; it has characters with byte values above 127:
intimate° intimateý & tête-à-tête


Saved as UTF-8 I got this:
intimate� intimate� & t�te-�-t�te



So why doesn't the code below throw an exception?


using System;
using System.IO;
using System.Text;

namespace FileEncoding
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");

            string UTFtest = @"C:\Temp\UTFtest.txt";


            string UTFtestTxt = File.ReadAllText(UTFtest, Encoding.GetEncoding("iso-8859-1"));
            File.WriteAllText(@"C:\Temp\UTFtestOut.txt", UTFtestTxt);

            Console.WriteLine("UTFtestTxt iso-8859-1 {0}", UTFtestTxt);
            UTF8Test(UTFtestTxt);

            UTFtestTxt = File.ReadAllText(UTFtest);
            

            Console.WriteLine("UTFtestTxt Default {0}", UTFtestTxt);
            UTF8Test(UTFtestTxt);



        }

        private static void UTF8Test(string UTFtestTxt)
        {
            UTF8Encoding utf8 = new UTF8Encoding();
            UTF8Encoding utf8ThrowException = new UTF8Encoding(false, true);

            char[] chars = UTFtestTxt.ToCharArray();

            // The following method call will not throw an exception.
            Byte[] bytes = utf8.GetBytes(chars);
            ShowArray(bytes);
            Console.WriteLine();

            try
            {
                // The following method call will throw an exception.
                bytes = utf8ThrowException.GetBytes(chars);
                ShowArray(bytes);
            }
            catch (EncoderFallbackException e)
            {
                Console.WriteLine("{0} exception\nMessage:\n{1}",
                                  e.GetType().Name, e.Message);
            }


        }

        public static void ShowArray(Array theArray)
        {
            foreach (Object o in theArray)
                Console.Write("{0:X2} ", o);

            Console.WriteLine();
        }
    }

}



Console Output

Hello World!
UTFtestTxt iso-8859-1 intimate° intimatey & tête-à-tête
69 6E 74 69 6D 61 74 65 C2 B0 20 69 6E 74 69 6D 61 74 65 C3 BD 20 26 20 74 C3 AA 74 65 2D C3 A0 2D 74 C3 AA 74 65

69 6E 74 69 6D 61 74 65 C2 B0 20 69 6E 74 69 6D 61 74 65 C3 BD 20 26 20 74 C3 AA 74 65 2D C3 A0 2D 74 C3 AA 74 65
UTFtestTxt Default intimate? intimate? & t?te-?-t?te
69 6E 74 69 6D 61 74 65 EF BF BD 20 69 6E 74 69 6D 61 74 65 EF BF BD 20 26 20 74 EF BF BD 74 65 2D EF BF BD 2D 74 EF BF BD 74 65

69 6E 74 69 6D 61 74 65 EF BF BD 20 69 6E 74 69 6D 61 74 65 EF BF BD 20 26 20 74 EF BF BD 74 65 2D EF BF BD 2D 74 EF BF BD 74 65


Commented:
Because when you read in the file the second time with the default UTF-8 encoding, it changed the corrupted characters into the valid UTF-8 "unknown" replacement character "EF BF BD". So when you tried to re-encode that char array, you were already dealing with a valid UTF-8 string (with the special characters swapped out).

Author

Commented:
Yep, but wouldn't the file read in as iso-8859-1 contain invalid UTF-8 characters?

Commented:
Yes. The second time you read the file, it is automatically "fixing" any invalid characters so that the resulting string is valid UTF-8. So you're testing AFTER this has been done, which is why there is no exception thrown.

Instead of ReadAllText, try ReadAllBytes to get the raw byte array. Then use your Utf8Encoding to read that byte array.

Disclosure: I have a newborn and am functioning on very little sleep at the moment (and am only using my phone for all this) so there's a chance something might not be coming out right. The phone thing is much slower than being at my PC, so I am taking some shortcuts in my explanations. So bear with me if I make a mistake or just don't make sense. :)

Author

Commented:
The Utf8Encoding code that I nicked from the link provided wants to test each char, so ReadAllBytes involves converting each byte to a char, which you need the encoding for.

I attempted to reverse it to throw the exception when converting each byte to a char, but I kept hitting multiple errors. The effort didn't seem worth the gain, considering we've already established it's not possible to determine a file's encoding in code with 100% accuracy.

Before closing, may I ask how strings (and derived substrings etc.) are stored? Do they keep their initial encoding or get converted on the way in? I've seen that when writing back to a file you have to specify which encoding to use. Do you need to do the same for read/write to a database?
@trevor1940: I do not want to spoil your discussion too much. Just a note: a C# string is a Unicode string (a string as an abstraction, composed of abstract characters). It hides the implementation. When assigning to it, the characters should be converted to Unicode in advance. In contrast, the C++ std::string is a stream of bytes. It can be misused as a UTF-8 stream of bytes. In other words, if you want to make some corrections before storing the result in a C# string, you need to store the bytes in, say, a byte array, and only then convert the byte array into the string using the correct encoding.

Commented:
So the way I was suggesting looks something like this:

var bytes = System.IO.File.ReadAllBytes("C:\\iso88591.txt");

try
{
    var utf8 = new UTF8Encoding(false, true);
    var result = utf8.GetString(bytes);
}
catch (System.Text.DecoderFallbackException)
{
    // An exception should be thrown here if the file has special characters encoded with ISO-8859-1.
}

As far as string storage goes, it can be a little complicated to understand how it works, but all strings in the CLR (the .NET runtime) are stored as UTF-16, regardless of the original encoding. But that's not something you should worry about - it's a transparent layer and you can just work with the encodings the way you would EXPECT them to be stored in memory.

Author

Commented:
Ran this on "intimate° intimateý & tête-à-tête" saved to UTFtest.txt:
            var bytes = System.IO.File.ReadAllBytes(@"H:\Temp\UTFtest.txt");

            try
            {
                var utf8 = new UTF8Encoding(false, true);
                var result = utf8.GetString(bytes);
            }
            catch (Exception e)
            {

                Console.WriteLine("{0} exception\nMessage:\n{1}",
                                  e.GetType().Name, e.Message);
            }


Received this
Unable to translate bytes [B0] at index 8 from specified code page to Unicode.



That message is a bit misleading, as Microsoft's / .NET's definition of Unicode is UTF-16, not the UTF-8 we were attempting to encode to.

Anyway, can we conclude from this that if we suspect a file contains non-UTF-8 characters, we could build a test method based on this? However, depending on the application, size of files etc., it may be better to simply use ISO-8859-1 for file read/write.

Commented:
I definitely would NOT just use ISO-8859-1. That should be something you support reading for compatibility, not intentionally use for your own storage. The efficiency benefits of any extended ASCII / ANSI charsets are very minimal over UTF-8, and they will never have more than 128 special characters. You are very likely to run into Unicode characters in the real world - ones that cannot be represented by a single-byte charset.

If I were you, I would simply support reading ISO-8859-1 but convert everything to UTF-8 when saving the data.
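
As a sketch of that conversion (paths are just examples), it is only a matter of naming the legacy encoding on the way in and UTF-8 on the way out:

string text = File.ReadAllText(@"C:\Temp\legacy.txt", Encoding.GetEncoding("iso-8859-1"));
// Explicit UTF-8 on the write; new UTF8Encoding(false) also skips the BOM.
File.WriteAllText(@"C:\Temp\converted.txt", text, new UTF8Encoding(false));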

Commented:
Also, regarding that message, it's actually pretty accurate. For the most part, Microsoft correctly uses the "Unicode" term. The only confusing usage is that class name.

Remember, UTF-8 and UTF-16 are both just different ways of encoding the same Unicode character set, just like I could choose to write the letter A with a pen or with a crayon. The final result is still the same letter but I just used different instruments to get to the same result. UTF-8 and UTF-16 are simply instruments to reach the same Unicode letter.

So when the message says it can't translate a byte to Unicode, it is because it is expecting to encounter nothing but perfectly-valid Unicode characters but it has run into a byte that does not translate to a valid Unicode character.
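
A tiny sketch of that point - the same Unicode character, two different byte sequences depending on the "instrument":

char degree = '°'; // U+00B0
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(new[] { degree })));    // C2-B0
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(new[] { degree })));  // B0-00 (UTF-16 LE)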

Author

Commented:
Taking your analogy, when you read the A in crayon you're telling .NET this is crayon / ISO-8859-1.
Doesn't .NET convert the bytes & hold them as pen / Unicode UTF-16?

If so, when you're writing back to a file, why do you have to use the same crayon / ISO-8859-1 encoding?

I just tried to read a larger file in using the code below (the sample file saved correctly):
            // read it as iso-8859-1

            string OxfordTxt = File.ReadAllText(@"C:\Temp\UTFtestIn.txt", Encoding.GetEncoding("iso-8859-1"));
            // Write as UTF-8; there shouldn't be any difference
            File.WriteAllText(@"C:\Temp\UTFtestOut.txt", OxfordTxt);



affect°  => affect°
affectý  => affectý


Commented:
A better example may be:
ISO-8859-1 = Crayon
UTF-8 = Pen
Your brain remembers things in UTF-16.

When you read the crayon "A", you are using your understanding of crayon letters to read "A" into your brain. You don't have to explicitly think anything like "read this into my brain" - it's a transparent process.

Once it's in your brain, you can choose to output that letter somewhere else, and you can choose whatever means you want. So if you wanted it to be in crayon, you could pick the crayon and draw the letter. Again, you wouldn't have to explicitly think, "Convert this value from my brain cells into crayon and then draw" - you just pick the crayon and draw.

If someone told you to fill out a form but didn't specify what tool to use, you would use your favorite tool, your pen.

So what you did in your example was read the file correctly using ISO-8859-1, but you didn't specify the output encoding so when .NET wrote the file, it wrote it in UTF-8.
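
In other words (a sketch reusing the same example paths), if you want the round trip to stay in "crayon", name the encoding on the write as well:

string text = File.ReadAllText(@"C:\Temp\UTFtestIn.txt", Encoding.GetEncoding("iso-8859-1"));
// Write with the same encoding instead of the UTF-8 default.
File.WriteAllText(@"C:\Temp\UTFtestOut.txt", text, Encoding.GetEncoding("iso-8859-1"));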

Author

Commented:
Thanx @gr8gonzo for your help and this interesting discussion: you have a knack for explaining things simply.

Congratulations on the newborn

