djon2003 asked:

Converting UTF8 to UTF7.. seems easy, but look inside please !

Hi all world experts !

Let me introduce the problem. I'm using a StreamReader to read from the NetworkStream of a TcpClient connected to an email server (so, the POP3 protocol, collecting emails). Using the UTF-8 encoding works almost all the time, but one of my test emails has some weird characters in the message body. Reading this email in Outlook Express tells me that it is a French accented character, "é". So I tried using the UTF-7 encoding with the StreamReader... Wow, it works: I see the "é" character. But then a major problem appears: all the other emails get weird bugs, and all files attached to emails break. So I decided to keep UTF-8 as the encoding for the StreamReader.

So, if you read carefully: the UTF-7 encoding on the StreamReader fixes the accented-character problem, but it can't be used directly on the StreamReader. What I would like to do is convert the string I was given from UTF-8 to UTF-7. Using the code below doesn't make any change to the text. I'm probably doing something wrong.

Could someone help out on this one ?..

PS: The code currently reads line by line instead of the whole string. I tried different things, without success: GetEncoding(65000) and Encoding.UTF7 (tried both, no change), and converting from Unicode to UTF-8 to UTF-7 (no change).
Dim utf8Bytes As Byte() = System.Text.Encoding.UTF8.GetBytes(lines(i))
Dim enc As System.Text.Encoding = System.Text.Encoding.GetEncoding(65000)
Dim utf7Bytes As Byte() = System.Text.Encoding.Convert(System.Text.Encoding.UTF8, enc, utf8Bytes)
Dim text2 As String = enc.GetString(utf7Bytes)
lines(i) = text2
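
A small sketch, not taken from the thread, of why a conversion like the one above cannot change the text. A .NET String is already UTF-16, so encoding it to bytes, converting those bytes, and decoding them again with the matching encoding returns the same characters. The damage happens earlier, when the raw bytes are decoded: the single byte &HE9 ("é" in ISO-8859-1) is not valid UTF-8 on its own, so the default decoder replaces it with U+FFFD. The helper name DecodeWith and the sample bytes are assumptions made for illustration.

```vbnet
Imports System
Imports System.Text

Module RoundTripDemo
    ' Hypothetical helper: decodes raw bytes using the named encoding.
    Function DecodeWith(raw As Byte(), encodingName As String) As String
        Return Encoding.GetEncoding(encodingName).GetString(raw)
    End Function

    Sub Main()
        ' "café" as an ISO-8859-1 sender would transmit it: é is the single byte &HE9.
        Dim raw As Byte() = {&H63, &H61, &H66, &HE9}

        Dim asUtf8 As String = DecodeWith(raw, "utf-8")
        Dim asLatin1 As String = DecodeWith(raw, "iso-8859-1")

        ' Decoded as UTF-8, the lone &HE9 byte becomes the replacement character U+FFFD.
        Console.WriteLine(asUtf8.IndexOf(ChrW(&HFFFD)) >= 0)
        ' Decoded as ISO-8859-1, the accent survives.
        Console.WriteLine(asLatin1 = "caf" & ChrW(&HE9))
    End Sub
End Module
```

Once the original byte has been replaced during decoding, no later conversion of the string can bring it back; the choice of encoding has to be made while the bytes are still available.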


VBRocks

You know, I never use UTF 8 encoding.  I always use the Default encoding, which for US Windows is usually Western European.  Try that out and see if it makes a difference.

  Dim enc As System.Text.Encoding = System.Text.Encoding.Default

ASKER CERTIFIED SOLUTION
abel

This solution is only available to members.
fmmexpertexchange

Have you tried this?

Dim utf8Bytes As Byte() = System.Text.Encoding.UTF8.GetBytes(lines(i))
Dim enc As System.Text.Encoding = System.Text.Encoding.GetEncoding(65000)
Dim utf7Bytes As Byte() = System.Text.Encoding.Convert(System.Text.Encoding.UTF8, System.Text.Encoding.UTF7, utf8Bytes)
Dim text2 As String = enc.GetString(utf7Bytes)
lines(i) = text2


djon2003 (ASKER)

Thanks guys for your quick answers... First of all, let me tell you that I already handle the encoding declared in the email header (or in the header of each part). So this problem comes from one particular situation. All my other tests work fine. (This is mostly for abel.)

I cannot post the whole email, but the header in this case is:
Content-Type: text/plain;charset=iso-8859-1
Content-Transfer-Encoding: 8bit
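
For illustration only, a header like the one above could be mapped to a .NET encoding roughly as follows. The helper name EncodingFromContentType, the regular expression, and the choice of fallback are assumptions, not code from the thread; Encoding.GetEncoding throws an ArgumentException for names it does not recognise, which the sketch treats as "use the fallback".

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module CharsetFromHeader
    ' Hypothetical helper: returns the encoding named in a Content-Type header line,
    ' or the supplied fallback when no usable charset is present.
    Function EncodingFromContentType(headerLine As String, fallback As Encoding) As Encoding
        ' Matches charset=iso-8859-1 as well as charset="iso-8859-1".
        Dim m As Match = Regex.Match(headerLine, "charset\s*=\s*""?([^"";\s]+)", RegexOptions.IgnoreCase)
        If Not m.Success Then Return fallback
        Try
            Return Encoding.GetEncoding(m.Groups(1).Value)
        Catch ex As ArgumentException
            ' Unknown or unsupported charset name.
            Return fallback
        End Try
    End Function

    Sub Main()
        Dim enc As Encoding = EncodingFromContentType("Content-Type: text/plain;charset=iso-8859-1", Encoding.ASCII)
        Console.WriteLine(enc.WebName)
    End Sub
End Module
```

This only helps if the body bytes have not already been decoded with some other encoding by the time the header is inspected.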

VBRocks: I tried the default encoding. It doesn't work.
fmmexpertexchange: I already tried all combinations of UTF-8, UTF-7 and Unicode. No change.

I know I'm repeating myself, but even though the charset is iso-8859-1, using UTF-7 on the StreamReader makes it work. But I can't use this solution. (I also tried using a BinaryReader instead of the StreamReader, but it is awfully slow. That way I would have the raw bytes, which I could convert to UTF-7, UTF-8 or whatever I want, but it is unusable.) This is just to tell you that I tried a lot of things before asking you guys. Thanks again!!
So, to summarize: you have a situation where someone is sending you a message in the iso-8859-1 encoding, but one single character is encoded incorrectly. By luck you found that this single character decodes well when you apply UTF-7 decoding to that part (or to the message as a whole? That sounds weird, because then many more characters must be wrong).

There's a mechanism on all the encoding classes: the EncoderFallback/DecoderFallback properties. In short, you can write your own handler that deals with characters or bytes that cannot be converted. It's not even that hard. How to do it is explained in this post (you'll need to read through the cs code, because that's where the action is). In that post, unknown characters are escaped using character entities: http://blogs.msdn.com/shawnste/archive/2006/10/12/example-of-overriding-your-own-encoding.aspx.
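
As a simpler variant of the fallback idea (not the custom handler from the linked post), the built-in exception fallback can be used to detect bytes that are not valid UTF-8 and then decode them another way. This is a sketch under assumptions: the helper name DecodeUtf8OrLatin1 is made up, and falling back to ISO-8859-1 is only a guess that happens to match the header quoted earlier in this thread.

```vbnet
Imports System
Imports System.Text

Module StrictDecodeDemo
    ' Hypothetical helper: try strict UTF-8 first; if the bytes are not valid UTF-8,
    ' decode them as ISO-8859-1 instead (an assumption, not a general rule).
    Function DecodeUtf8OrLatin1(raw As Byte()) As String
        ' The second argument (throwOnInvalidBytes) makes the decoder throw
        ' instead of silently substituting U+FFFD.
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            Return strictUtf8.GetString(raw)
        Catch ex As DecoderFallbackException
            Return Encoding.GetEncoding("iso-8859-1").GetString(raw)
        End Try
    End Function

    Sub Main()
        Dim validUtf8 As Byte() = {&H63, &HC3, &HA9}   ' "cé" encoded as UTF-8
        Dim latin1Only As Byte() = {&H63, &HE9}        ' "cé" encoded as ISO-8859-1
        Console.WriteLine(DecodeUtf8OrLatin1(validUtf8) = DecodeUtf8OrLatin1(latin1Only))
    End Sub
End Module
```

Like any fallback handler, this needs access to the raw bytes; it cannot repair a string that a StreamReader has already decoded.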
Thanks abel for your reply. I will try it out, but seriously, I don't think the solution lies this way.

First, let me tell you that it is the whole message that I transform from UTF-8 to UTF-7. In the message, none of the French accents work, not just one character.

Why is your proposed solution probably not the correct one? Because using UTF-7 instead of UTF-8 on the StreamReader makes it OK! Which means that somehow the StreamReader is able to use the UTF-7 encoding to fix my problem. I added the code that receives from the email server below.

What I'm asking here is to TRANSFORM a UTF-8 string to another encoding which supports the French accents of this message (which UTF-8 does, but it doesn't work). I'm quite confident that the code I posted in my question should do this, but it doesn't.
Dim LireReponse As New StreamReader(oTcp.GetStream, System.Text.Encoding.UTF8, True)
'''' THE LINE ABOVE: IF I USE UTF7, MY MESSAGE IS OK. UNFORTUNATELY I CAN'T USE AN IF TO DECIDE WHICH ENCODING TO USE HERE, BECAUSE THE MESSAGE IS NOT YET DOWNLOADED FROM THE SERVER (SO THE ENCODING OF THE MESSAGE IS STILL UNKNOWN)
 
Threading.Thread.Sleep(500)
 
'getting first line
strReponse.Append(LireReponse.ReadLine)
collReponse.Add(strReponse.ToString)
 
Dim n As Integer = 0
Do While LireReponse.Peek <> -1
strReponse.Append(vbCrLf)
collReponse.Add(LireReponse.ReadLine)
strReponse.Append(collReponse.Item(collReponse.Count - 1).ToString)
totalBytes += LireReponse.CurrentEncoding.GetByteCount(collReponse.Item(collReponse.Count - 1).ToString.ToCharArray) 'Counts bytes
RaiseEvent BytesReceived(totalBytes)
 
n += 1
 
Loop


SOLUTION
This solution is only available to members.
Good morning abel!

I've retested the solution using a BinaryReader instead of the StreamReader. I added the code below. I tested two ways (for now; I'll transpose this later if it works). First, reading bytes and converting them immediately to a string (this one is commented out; I know this won't fix my problem, but I just want to be sure it works, then I'll convert the bytes later in the code). Second, reading bytes, adding them to a Generic.List(Of Byte) and, at the end, converting the whole thing to a string. Both ways are so slow that they fail during the identification process, which sends two commands (USER and PASS) and receives a response to each.

Using the StreamReader code above, I can download all the emails I want, but with some encoding problems. Now, using the BinaryReader, I can't even download one email. The server closes the connection (probably due to a timeout, but why should I change the timeout if the first method works with the same timeout?).

Thanks abel for looking into this. This part of my software has been deactivated for a while.
Dim LireReponse2 As New BinaryReader(oTcp.GetStream)
 
            'Threading.Thread.Sleep(500)
 
            'get the first line
            'Dim reponse As String = System.Text.Encoding.UTF8.GetString(LireReponse2.ReadBytes(4096))
            'strReponse.Append(reponse)
            'strReponse.Append(LireReponse.ReadLine)
            collReponse.Add(strReponse.ToString)
            Dim curBytes As New Generic.List(Of Byte)
            curBytes.AddRange(LireReponse2.ReadBytes(4096))
 
            Dim n As Integer = 0
            'Do While LireReponse.Peek <> -1
            Do While LireReponse2.PeekChar <> -1
                'strReponse.Append(vbCrLf)
                'collReponse.Add(LireReponse.ReadLine)
                'strReponse.Append(collReponse.Item(collReponse.Count - 1).ToString)
                'totalBytes += LireReponse.CurrentEncoding.GetByteCount(collReponse.Item(collReponse.Count - 1).ToString.ToCharArray) 'Counts bytes
                totalBytes += 4096
                'reponse = System.Text.Encoding.UTF8.GetString(LireReponse2.ReadBytes(4096))
                'strReponse.Append(reponse)
                curBytes.AddRange(LireReponse2.ReadBytes(4096))
 
                RaiseEvent BytesReceived(totalBytes)
 
                Application.DoEvents()
                n += 1
            Loop


It's not really getting better, is it? I read up a bit on BinaryReader and BufferedReader. Apparently, Peek is extremely slow; used in a loop, it can slow things down by a large factor.

At the moment I can't copy your method to try it at home, because I do not know enough about the original stream (some TCP stream, apparently), which may not work well with the settings we use. My suggestion may not have been the best choice considering how you are doing it. Instead of a BinaryReader, you could use a MemoryStream.

But before trying to read your data more efficiently, let's have a look at that other method. I experimented a bit but I need your help on whether the input is correct. Can you try your wrong code with the following string and check whether the end result is the same (wrong) output? Then I know whether the method I use to correct your wrong output is the right one.

Input string (which is in your case the expected output string, or, the string that the email was supposed to contain):

La cathédrale Notre-Dame de Constance est l’ancien siège de l’évêque de Constance.
Output string, after wrong encoding-transformation into UTF8 (the way you do it now):


La cathédrale Notre-Dame de Constance est l’ancien siège de l’évêque de Constance.
Since there are many encoding problems with the E-E website, make sure to check the screenshots (and even there, you can see that the Visual Studio font uses different glyphs than my Office program).

I created this "wrongly" encoded string by treating the UTF-7 bytes of the same string as if they were UTF-8.

ScreenShot172.png
ScreenShot173.png
funny, these encoding issues... and... EE.

So, apparently the text shows correctly when I pasted it inside the Attach File comment. Sorry for that link, and the text. That happens sometimes with the EE RTE, it removes anything that came before a part that is pasted.

I meant to say, "La cathédrale Notre-Dame de Constance est l'ancien siège de l'évêque de Constance, en Allemagne." did not get well encoded by EE, so here's a screenshot. The text came from http://fr.wikipedia.org/wiki/Cath%C3%A9drale_Notre-Dame_de_Constance.

(apologies for the clutter)
Hi abel, I'm really glad that you are participating this much. I read on another website that someone had the same speed problem with BinaryReader, and the answerer told the asker to use the stream directly (with BeginRead, etc.).

Then I thought, woo... I already have my ChatClient, which does exactly that, but converts the bytes using the ASCII encoding (because there I know exactly what is transmitted, since it's an internal software conversation). So I will turn this ChatClient into a more general object, which I'll be able to use instead of my protocol object (the one the last code snippet comes from). It will take me some time, so I may not reply today or tomorrow. But I will!

Getting the bytes directly, I'll then be able to go from bytes to whatever I want, instead of going from a string back to bytes and back to a string.

Thanks again, man. It's really appreciated. News soon.
SOLUTION
This solution is only available to members.
BTW: my earlier attempts at finding out what actually went wrong can be ignored; I made a mistake in creating the strings. Bytes from any UTF-7 stream can never be higher than 127, which means they are all printable characters (which is why UTF-7 was invented in the first place).
Hi abel!

So, as I said, I changed my ChatClient into a more generic class called TCPClientPlus. Now, with this new approach, I receive the bytes and the email extraction then decides which encoding to use. Fine: the problem with my test email is fixed! Better still, other emails seem to work too. But, as usual, there is always a glitch. I don't know why, but with this code emails are sometimes incomplete, so of course the email extraction then fails.

Now I have two versions of my software:
- the one using UTF-8 with the StreamReader, which works fine except for some accents;
- the one using the NetworkStream directly and returning bytes, which works fine too, except for incomplete emails.

Unfortunately, I'm quite tired today, so I won't try your new solution now. I'll do it tomorrow. Still, if you read this before tomorrow, I'd like your advice.

Which method would you choose? Personally, I think the second one is the best, but it is still buggy.

I've attached the code used to receive the bytes... it seems that the DataAvailable property isn't always correct?

I know this looks like another problem... let me know if it's too much.
    Public Sub Read()
        Threading.Thread.Sleep(200) 'Wait for server to have time to respond !?
        Me.lastBytesReceived.Clear()
        SyncLock MyBase.GetStream
            While MyBase.GetStream.DataAvailable
                MyBase.GetStream.Read(readBuffer, 0, READ_BUFFER_SIZE)
                Me.lastBytesReceived.AddRange(readBuffer)
                RaiseEvent BytesReceived(Me, readBuffer)
            End While
            RaiseEvent EndStream(Me, EventArgs.Empty)
        End SyncLock
    End Sub
 
    Public Function ReadToEnd() As Byte()
        Read()
 
        Return lastBytesReceived.ToArray
    End Function
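
Two things in a receive loop like the one above can plausibly produce incomplete or padded data, and the sketch below (not code from the thread) shows one way to avoid both. First, Stream.Read returns the number of bytes it actually placed in the buffer, which can be less than the buffer size, so only that many bytes should be kept. Second, DataAvailable only reports whether bytes have already arrived, not whether the server has finished sending; a POP3 multi-line response (for example the reply to RETR) ends with a line containing only a dot. The helper names are assumptions, and the sketch deliberately ignores single-line replies such as -ERR, timeouts, and dot-unstuffing.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module Pop3ReadDemo
    ' Hypothetical helper: reads from any Stream until the data ends with
    ' CRLF "." CRLF, or until the remote side closes the connection.
    Function ReadMultiLineResponse(source As Stream) As Byte()
        Dim received As New List(Of Byte)
        Dim buffer(4095) As Byte
        Do
            Dim count As Integer = source.Read(buffer, 0, buffer.Length)
            If count = 0 Then Exit Do ' end of stream
            ' Keep only the bytes that were actually read.
            For i As Integer = 0 To count - 1
                received.Add(buffer(i))
            Next
        Loop Until EndsWithTerminator(received)
        Return received.ToArray()
    End Function

    ' True when the data ends with the bytes CR LF "." CR LF.
    Private Function EndsWithTerminator(data As List(Of Byte)) As Boolean
        Dim terminator As Byte() = {13, 10, 46, 13, 10}
        If data.Count < terminator.Length Then Return False
        For i As Integer = 0 To terminator.Length - 1
            If data(data.Count - terminator.Length + i) <> terminator(i) Then Return False
        Next
        Return True
    End Function
End Module
```

Because the function takes a plain Stream, it can be tried against a MemoryStream before being pointed at a NetworkStream.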


New development: solution #1 now works, because in fact downloading the headers is no problem. Even better, I was already fetching the header before the whole message, for nothing. Now I use that step to check the encoding and pass it to my POP3 object when reading the whole message. It works just fine.

Even so, I would prefer using my new object, because there is one last problem which seems to persist. One email contains a reply to another one, and the quoted reply contains an "é" that shows as a square. (OE reads this email correctly.) The bug occurs because it's a multi-part email. I handle these with no problem, but the stream is still opened with the main charset (and in this case there is none until the sub-part). So downloading the bytes and deciding afterwards is a better choice, because then I can apply whatever encoding is required to the right parts. The sub-parts (text/plain and text/html) are base64 encoded (just a detail).
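
A rough sketch of the "decide afterwards" idea for a base64-encoded sub-part, assuming the part body and its charset name have already been extracted from the multi-part message (that parsing is not shown). The helper name DecodeBase64Part is made up for illustration.

```vbnet
Imports System
Imports System.Text

Module PartDecodeDemo
    ' Hypothetical helper: turns the base64 body of one MIME part into text,
    ' using the charset declared in that part's own Content-Type header.
    Function DecodeBase64Part(base64Body As String, charsetName As String) As String
        ' Step 1: undo the transfer encoding to get the original bytes.
        Dim raw As Byte() = Convert.FromBase64String(base64Body)
        ' Step 2: interpret those bytes with the part's charset.
        Return Encoding.GetEncoding(charsetName).GetString(raw)
    End Function

    Sub Main()
        ' "6Q==" is the base64 form of the single byte &HE9 ("é" in ISO-8859-1).
        Console.WriteLine(DecodeBase64Part("6Q==", "iso-8859-1") = ChrW(&HE9).ToString())
    End Sub
End Module
```

Since base64 text is plain ASCII, the encoding used on the stream does not affect it; only the second step depends on the part's charset.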

But seriously, now tell me: is this a new question? If yes, I'll award you the points and start a new question, though I'll admit that I don't have any more points. So I leave the choice to you. Anyhow, I could get free access by answering some questions, but time is a bit short.
Do I understand correctly that the crazy, ill-formed email now comes in correctly? That's great news!

I don't really understand the new problem. Are you saying that parts of the email are in different formats? That happens quite often, check your incoming mail in Outlook, Eudora or Thunderbird. Whenever a mail passes by a non-compliant email server (or one that's wrongly configured), it can go wrong, and it goes wrong often, even though the basics are really so simple!

In all honesty, if this is indeed a new problem, then I would suggest that you open a new thread. Though I also understand that you are under time pressure. But subscription points only cost a handful of dollars, so surely that's not the problem? ;-)

I leave it up to you. I can do only so much and I am not sure whether I am the right person for the new question. If you want others to be notified, it is probably best to open up a new thread.

-- Abel --
Thanks a lot mister. I saved the solution #2 into different files so I'll be able to work on it later. Solution #1 is working perfect (well enough for my purpose). Solution #2 downloads email partially.
> Thanks a lot mister. I saved the solution #2 into different files so I'll be able to
> work on it later. Solution #1 is working perfect (well enough for my purpose).

You're welcome. It was nice working with you. It is not often that you get such thorough feedback during a thread discussion and that's nice and makes it possible to work structurally towards a solution. Tx,

-- Abel --