trevor1940
 asked on

C#: How do you determine the encoding of a file

Hi
I have a text file with words like "intimate°", "intimateý" & "tête-à-tête". There are probably others.

Is there a way to work out its encoding? I'm guessing extended ASCII.

I've been using the code below, or similar StreamReader / StreamWriter code, but like I said this is a pure guess.

string[] AllLines = File.ReadAllLines(FileName,Encoding.GetEncoding("iso-8859-1"));



It looks OK on a sample file

Is there a difference between StreamReader & File.ReadAllLines?

I'm assuming that once you read a file into a string with a particular encoding, the string inherits that encoding, & that this is maintained if the string is used in a class.
.NET Programming, C#

Last Comment: trevor1940, 8/22/2022 (Mon)
ste5an

The idea here is to let .NET decide. A stream reader does its own (imperfect) detection. But something like this should work:

Encoding GetEncoding(string filePath)
{
    // true enables detection of the encoding from byte order marks (BOM).
    using (var streamReader = new StreamReader(filePath, true))
    {
        // CurrentEncoding is only accurate after the first read from the stream.
        streamReader.Read();
        return streamReader.CurrentEncoding;
    }
}


You'll only need to override or specify an encoding when the file has no marker bytes and/or detection isn't possible for .NET.
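When there is a BOM, you can also check for it by hand. A minimal sketch (the class and method names are my own, and only the three most common BOMs are covered):

```csharp
using System.IO;
using System.Text;

static class BomSniffer
{
    // Returns the encoding indicated by the file's byte order mark, or null if none.
    public static Encoding DetectBom(string path)
    {
        var bom = new byte[3];
        using (var fs = File.OpenRead(path))
            fs.Read(bom, 0, 3);

        if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) return Encoding.UTF8;   // UTF-8
        if (bom[0] == 0xFF && bom[1] == 0xFE) return Encoding.Unicode;                  // UTF-16 LE
        if (bom[0] == 0xFE && bom[1] == 0xFF) return Encoding.BigEndianUnicode;         // UTF-16 BE
        return null; // no recognizable BOM: fall back to a default or a detector
    }
}
```

Note that UTF-32 LE also starts with FF FE, so a complete sniffer would read four bytes; this sketch deliberately sticks to the common cases.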
gr8gonzo

There is no guaranteed way of doing it. Ultimately, a file is just a bunch of bytes. The encoding is simply the instruction manual on how to assemble those bytes into the right result. Determining an encoding from bytes is like trying to write an instruction manual by taking apart a piece of furniture. You MIGHT get it right, but there is no guarantee.

That said, there are a few things to consider:

1. Sometimes text files that use Unicode encoding like UTF-8 will have a few invisible bytes at the beginning. These bytes are called the byte order mark (BOM) and are there as a signpost for text readers to say, "This file is encoded as UTF-8 / UTF-16 / etc..."

2. UTF-8 is the default encoding for StreamReaders. It's the default because it is not only the most common encoding, but it's also mostly compatible with ASCII plain-text files. The only time it will have trouble is if the file contains extended ASCII characters. So basically, if there are any special characters (any single-byte values above 127), the UTF-8 decoder won't be sure how to read those bytes, since they will look like corrupted partial characters.

In any event, you can usually just use the default encoding and if there are corrupted characters in the output, you can try an ASCII encoding.

Worst case, you can look at the file with a hex editor and manually figure it out....
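If you'd rather do that hex inspection in code than in an external editor, a quick sketch (the helper name `HexHead` is made up for this example):

```csharp
using System;
using System.IO;
using System.Linq;

// Return the first 16 bytes of a file as hex, like the first row of a hex editor.
static string HexHead(string path)
{
    byte[] head = File.ReadAllBytes(path).Take(16).ToArray();
    return string.Join(" ", head.Select(b => b.ToString("X2")));
}

// A UTF-8 file with a BOM starts "EF BB BF ..."; UTF-16 LE starts "FF FE ...".
string demo = Path.GetTempFileName();
File.WriteAllBytes(demo, new byte[] { 0xEF, 0xBB, 0xBF, 0x68, 0x69 }); // UTF-8 BOM + "hi"
Console.WriteLine(HexHead(demo)); // EF BB BF 68 69
```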
pepr

You may want to read the case study at https://diveinto.org/python3/case-study-porting-chardet-to-python-3.html (the beginning of the text). It is in Python, but the first part is general. It also contains some references to the library by the Mozilla team. However, the book is old and the targets have moved.

But you can find the C# port here https://github.com/errepi/ude or here https://github.com/CharsetDetector/UTF-unknown (or elsewhere; these were just picked by chance).

Possibly the NuGet package https://www.nuget.org/packages/UDE.CSharp/ will be the fastest to use.
trevor1940

ASKER
@ste5an: your code produced
Console.WriteLine("Encoding is: {0}", EncodingAscii.EncodingName); 
=>
Encoding is: Unicode (UTF-8)



This is clearly the wrong result.

To prove this, I read the file in and wrote it out without specifying the encoding (@gr8gonzo: UTF-8 is the default encoding for StreamReaders), then used PSPad to compare the two. I knew these would be different, as the file sizes are different:

affect° => affect�  (corrupt)

Encoding.Unicode (which I think is UTF-16?) produced something akin to 1980s Space Invaders:

㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭㴭

I had similar results with Encoding.GetEncoding("UTF-16") and Encoding.BigEndianUnicode.


With Encoding.GetEncoding("iso-8859-1") there was no difference.

@gr8gonzo: PSPad can open files in hex; I couldn't see any header info, let alone an encoding.

So my original guess was correct.
ASKER CERTIFIED SOLUTION
gr8gonzo

gr8gonzo

Alternatively, if the file is being provided by a user via UI input, then add a combo box to allow the user to select the encoding and default it to UTF-8.

That way, they can pick the right encoding if they know it, or they can retry with different encodings if they don't and UTF-8 fails.
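The retry idea can also be automated: try a strict UTF-8 decode first and only fall back when it throws. A sketch under the assumption that ISO-8859-1 is an acceptable last resort (it accepts every byte, so it must come last; the method name `ReadWithFallback` is my own):

```csharp
using System;
using System.IO;
using System.Text;

static string ReadWithFallback(string path)
{
    byte[] bytes = File.ReadAllBytes(path);

    var candidates = new Encoding[]
    {
        // Strict UTF-8: throws DecoderFallbackException on invalid sequences
        // instead of silently substituting U+FFFD.
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true),
        // Single-byte encoding: accepts every byte, so it always "succeeds".
        Encoding.GetEncoding("iso-8859-1")
    };

    foreach (var enc in candidates)
    {
        try { return enc.GetString(bytes); }
        catch (DecoderFallbackException) { /* wrong guess, try the next one */ }
    }
    throw new InvalidDataException("No candidate encoding matched.");
}
```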
pepr

To add... The problem of detection is not simple. So, in your case, you can either detect (ad hoc) the few encodings that you may expect from your sources, or use one of the ready-to-be-used detectors.

The following article from 2002 analyses the problem: A composite approach to language/encoding detection.

https://github.com/errepi/ude gives you the code (latest update from May 2015), and https://www.nuget.org/packages/UDE.CSharp/ gives you that as a NuGet package. The GitHub page also shows how the code can be used:

public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}", 
                 cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}



The tool is rather old, but you can still use it. Anyway, there is a newer project that deals with the problem -- see http://userguide.icu-project.org/conversion/detection and the GitHub page https://github.com/google/compact_enc_det

Unfortunately (for you, maybe), that library is written in C++; so it may still be better for you to use the above-mentioned NuGet package for C#.

To add to gr8gonzo's gr8 comments, I second the orientation to UTF-8 in any future project. It is true that Microsoft caused a great mess in the encoding world. The main problem with UTF-16 is (in my opinion) the byte ordering. While UTF-8 is byte-oriented and the same text is encoded to the same stream of bytes in every case, UTF-16 depends on little/big endianness. Partly because of that, the BOM (Byte Order Mark) was introduced; and a BOM for UTF-8 was also introduced to "enhance" the situation. So Microsoft's UTF-8 may contain a BOM that actually only complicates things.
trevor1940

ASKER
Looking at gr8gonzo's comment, he is basically saying: if the input file is English, use UTF-8 (the default), unless it is known the input could use extended ASCII, in which case use ISO-8859-1. That makes sense.


I thought I'd test UTF8Encoding (from the link).

The input file, opened without an encoding (UTF-8) and also using iso-8859-1, has characters beyond byte 127:
intimate° intimateý & tête-à-tête


Saved as UTF-8, I got this:
intimate� intimate� & t�te-�-t�te



So why doesn't the code below throw an exception?


using System;
using System.IO;
using System.Text;

namespace FileEncoding
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");

            string UTFtest = @"C:\Temp\UTFtest.txt";


            string UTFtestTxt = File.ReadAllText(UTFtest, Encoding.GetEncoding("iso-8859-1"));
            File.WriteAllText(@"C:\Temp\UTFtestOut.txt", UTFtestTxt);

            Console.WriteLine("UTFtestTxt iso-8859-1 {0}", UTFtestTxt);
            UTF8Test(UTFtestTxt);

            UTFtestTxt = File.ReadAllText(UTFtest);
            

            Console.WriteLine("UTFtestTxt Default {0}", UTFtestTxt);
            UTF8Test(UTFtestTxt);



        }

        private static void UTF8Test(string UTFtestTxt)
        {
            UTF8Encoding utf8 = new UTF8Encoding();
            UTF8Encoding utf8ThrowException = new UTF8Encoding(false, true);

            char[] chars = UTFtestTxt.ToCharArray();

            // The following method call will not throw an exception.
            Byte[] bytes = utf8.GetBytes(chars);
            ShowArray(bytes);
            Console.WriteLine();

            try
            {
                // The following method call will throw an exception.
                bytes = utf8ThrowException.GetBytes(chars);
                ShowArray(bytes);
            }
            catch (EncoderFallbackException e)
            {
                Console.WriteLine("{0} exception\nMessage:\n{1}",
                                  e.GetType().Name, e.Message);
            }


        }

        public static void ShowArray(Array theArray)
        {
            foreach (Object o in theArray)
                Console.Write("{0:X2} ", o);

            Console.WriteLine();
        }
    }

}



Console Output

Hello World!
UTFtestTxt iso-8859-1 intimate° intimatey & tête-à-tête
69 6E 74 69 6D 61 74 65 C2 B0 20 69 6E 74 69 6D 61 74 65 C3 BD 20 26 20 74 C3 AA 74 65 2D C3 A0 2D 74 C3 AA 74 65

69 6E 74 69 6D 61 74 65 C2 B0 20 69 6E 74 69 6D 61 74 65 C3 BD 20 26 20 74 C3 AA 74 65 2D C3 A0 2D 74 C3 AA 74 65
UTFtestTxt Default intimate? intimate? & t?te-?-t?te
69 6E 74 69 6D 61 74 65 EF BF BD 20 69 6E 74 69 6D 61 74 65 EF BF BD 20 26 20 74 EF BF BD 74 65 2D EF BF BD 2D 74 EF BF BD 74 65

69 6E 74 69 6D 61 74 65 EF BF BD 20 69 6E 74 69 6D 61 74 65 EF BF BD 20 26 20 74 EF BF BD 74 65 2D EF BF BD 2D 74 EF BF BD 74 65


gr8gonzo

Because when you read in the file the second time with the default UTF-8 encoding, it changed the corrupted characters into the valid UTF-8 "unknown character" sequence EF BF BD. So when you then tested that string, you were already dealing with a valid UTF-8 string (with the special characters swapped out).
trevor1940

ASKER
Yep, but wouldn't the file read in as iso-8859-1 contain invalid UTF-8 characters?
gr8gonzo

Yes. The second time you read the file, it is automatically "fixing" any invalid characters so that the resulting string is valid UTF-8. So you're testing AFTER this has been done, which is why there is no exception thrown.

Instead of ReadAllText, try ReadAllBytes to get the raw byte array. Then use your Utf8Encoding to read that byte array.

Disclosure: I have a newborn and am functioning on very little sleep at the moment (and am only using my phone for all this) so there's a chance something might not be coming out right. The phone thing is much slower than being at my PC, so I am taking some shortcuts in my explanations. So bear with me if I make a mistake or just don't make sense. :)
trevor1940

ASKER
The UTF8Encoding example that I nicked from the link provided wants to test each char, so ReadAllBytes involves converting each byte to a char, which you need the encoding for.

I attempted to reverse it to throw the exception when converting each byte to a char, but I kept hitting multiple errors;
the effort didn't seem worth the gain, considering we've already established it's not possible to determine a file's encoding in code with 100% accuracy.

Before closing, may I ask how strings (and derived substrings etc.) are stored? Do they keep their initial encoding or get converted on the way in? I've seen that when writing back to a file you have to specify which encoding to use. Do you need to do the same for reading from / writing to a database?
pepr

@trevor1940: I do not want to spoil your discussion too much. Just a note: the C# string is a Unicode string (a string as an abstraction, composed of abstract characters). It hides the implementation. When assigning to it, the characters should be converted to Unicode in advance. In contrast, the C++ std::string is a stream of bytes, and it can be (mis)used to hold a UTF-8 stream of bytes. In other words, if you want to make some corrections before storing the result in a C# string, you need to store the bytes in, say, a byte array, and only then convert the byte array into the string using the correct encoding.
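As a tiny illustration of that point, with hard-coded bytes:

```csharp
using System;
using System.Text;

// The ISO-8859-1 bytes for "tête": 0xEA is 'ê' in that single-byte charset.
byte[] raw = { 0x74, 0xEA, 0x74, 0x65 };

// Keep the raw bytes around until you know the encoding, then convert once:
string decoded = Encoding.GetEncoding("iso-8859-1").GetString(raw);
Console.WriteLine(decoded); // tête

// From here on the text lives as a Unicode string; the original
// byte layout is gone, and any future output encoding is a fresh choice.
```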
gr8gonzo

So the way I was suggesting looks something like this:

var bytes = System.IO.File.ReadAllBytes("C:\\iso88591.txt");

try
{
    // false = no BOM, true = throw on invalid byte sequences.
    var utf8 = new UTF8Encoding(false, true);
    var result = utf8.GetString(bytes);
}
catch (System.Text.DecoderFallbackException)
{
    // ...exception should be thrown if the file has special characters encoded with ISO-8859-1...
}

As far as string storage goes, it can be a little complicated to understand how it works, but all strings in the CLR (the .NET runtime) are stored as UTF-16, regardless of the original encoding. But that's not something you should worry about: it's a transparent layer, and you can just work with the encodings the way you would EXPECT them to be stored in memory.
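To make the "everything is UTF-16 in memory" point concrete, a small sketch comparing the byte counts you get when the same 11-character string leaves the CLR through different encodings:

```csharp
using System;
using System.Text;

string s = "tête-à-tête"; // 11 chars

// In memory, each char is a UTF-16 code unit, i.e. 2 bytes per char here.
byte[] utf16 = Encoding.Unicode.GetBytes(s);                      // 22 bytes
// The encoding only matters at the boundary, when the string leaves memory:
byte[] utf8 = Encoding.UTF8.GetBytes(s);                          // 14 bytes (ê and à take 2 each)
byte[] latin1 = Encoding.GetEncoding("iso-8859-1").GetBytes(s);   // 11 bytes (1 each)

Console.WriteLine($"{utf16.Length} {utf8.Length} {latin1.Length}"); // 22 14 11
```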
trevor1940

ASKER
I ran this on "intimate° intimateý & tête-à-tête" saved to UTFtest.txt:
            var bytes = System.IO.File.ReadAllBytes(@"H:\Temp\UTFtest.txt");

            try
            {
                var utf8 = new UTF8Encoding(false, true);
                var result = utf8.GetString(bytes);
            }
            catch (Exception e)
            {

                Console.WriteLine("{0} exception\nMessage:\n{1}",
                                  e.GetType().Name, e.Message);
            }


I received this:
Unable to translate bytes [B0] at index 8 from specified code page to Unicode.



That message is a bit misleading, as Microsoft's / .NET's definition of "Unicode" is UTF-16, not the UTF-8 we were attempting to decode.

Anyway, can we conclude from this that, if we suspect a file contains non-UTF-8 characters, you could build a test method based on this? However, depending on the application, the size of the files etc., it may be better to simply use ISO-8859-1 for file read/write.
gr8gonzo

I definitely would NOT just use ISO-8859-1. That should be something you support reading for compatibility, not intentionally use for your own storage. The efficiency benefits of any extended ASCII / ANSI charsets are very minimal over UTF-8 and they will never have more than 127 special characters. You are very likely to run into Unicode characters in the real world - ones that cannot be represented by a single-byte charset.

If I were you, I would simply support reading ISO-8859-1 but convert everything to UTF-8 when saving the data.
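That read-as-legacy, save-as-UTF-8 approach might look like the sketch below (the paths and the simulated legacy input are purely illustrative):

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical file paths; substitute your own.
string inPath = Path.Combine(Path.GetTempPath(), "legacy.txt");
string outPath = Path.Combine(Path.GetTempPath(), "converted.txt");

// Simulate a legacy file: "tête" saved as ISO-8859-1 (one byte per char).
File.WriteAllText(inPath, "tête", Encoding.GetEncoding("iso-8859-1"));

// Read it with the encoding it was written in...
string text = File.ReadAllText(inPath, Encoding.GetEncoding("iso-8859-1"));

// ...and write it back out as UTF-8 (no BOM) for future-proof storage.
File.WriteAllText(outPath, text, new UTF8Encoding(false));

Console.WriteLine(new FileInfo(inPath).Length);  // 4 bytes (ISO-8859-1)
Console.WriteLine(new FileInfo(outPath).Length); // 5 bytes (ê takes two bytes in UTF-8)
```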
gr8gonzo

Also, regarding that message, it's actually pretty accurate. For the most part, Microsoft correctly uses the "Unicode" term. The only confusing usage is that class name.

Remember, UTF-8 and UTF-16 are both just different ways of encoding the same Unicode character set, just like I could choose to write the letter A with a pen or with a crayon. The final result is still the same letter but I just used different instruments to get to the same result. UTF-8 and UTF-16 are simply instruments to reach the same Unicode letter.

So when the message says it can't translate a byte to Unicode, it is because it is expecting to encounter nothing but perfectly-valid Unicode characters but it has run into a byte that does not translate to a valid Unicode character.
trevor1940

ASKER
Taking your analogy: when you read the A in crayon, you're telling .NET "this is crayon" / ISO-8859-1.
Doesn't .NET convert the bytes & hold them as pen / Unicode UTF-16?

If so, when you're writing back to a file, why do you have to use the same crayon / ISO-8859-1 encoding?

I just tried to read a larger file in using the code below (the sample file saved correctly):
            // read it as iso-8859-1

            string OxfordTxt = File.ReadAllText(@"C:\Temp\UTFtestIn.txt", Encoding.GetEncoding("iso-8859-1"));
            // Write as UTF-8; shouldn't be any difference
            File.WriteAllText(@"C:\Temp\UTFtestOut.txt", OxfordTxt);



affect°  => affect°
affectý  => affectý


gr8gonzo

A better example may be:
ISO-8859-1 = Crayon
UTF-8 = Pen
Your brain remembers things in UTF-16.

When you read the crayon "A", you are using your understanding of crayon letters to read "A" into your brain. You don't have to explicitly think anything like "read this into my brain" - it's a transparent process.

Once it's in your brain, you can choose to output that letter somewhere else, and you can choose whatever means you want. So if you wanted it to be in crayon, you could pick the crayon and draw the letter. Again, you wouldn't have to explicitly think, "Convert this value from my brain cells into crayon and then draw" - you just pick the crayon and draw.

If someone told you to fill out a form but didn't specify what tool to use, you would use your favorite tool, your pen.

So what you did in your example was read the file correctly using ISO-8859-1, but you didn't specify the output encoding so when .NET wrote the file, it wrote it in UTF-8.
trevor1940

ASKER
Thanx @gr8gonzo for your help and this interesting discussion: you have a knack for explaining things simply.

Congratulations on the newborn