Solved

Converting Ascii to UTF8

Posted on 2009-02-20
Last Modified: 2012-05-06
I'm converting ASCII to UTF-8.
When I get a value of 251, Encoding.UTF8.GetBytes(data)
adds a 194 at the start of the array & replaces 251 with 195 & 187.
Why is this?
byte[] buffer = null;
string data = "something";
buffer = Encoding.UTF8.GetBytes(data);

Question by:u2envy1
12 Comments
 
LVL 39

Expert Comment

by:abel
ID: 23692244
The first 128 characters (code points 0 to 127) are encoded identically in UTF-8 and in ASCII. After that, there are differences. This is necessary because UTF-8 has to represent many more characters using 8-bit units: it is a variable-length encoding that uses two, three or four bytes depending on the character, and only for those first 128 characters does it use the same single-byte bit pattern as ASCII.
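To make the variable-length behaviour concrete, here is a minimal sketch that prints the UTF-8 bytes .NET produces for a given character value. The class and method names (Utf8Demo, Utf8Of, Describe) are made up for illustration, and the helper only handles values that fit in a single non-surrogate char.

```csharp
using System;
using System.Text;

public static class Utf8Demo
{
    // Returns the UTF-8 bytes for one character given by its numeric value.
    // Assumption: the value fits in a single non-surrogate char.
    public static byte[] Utf8Of(int value)
    {
        return Encoding.UTF8.GetBytes(new[] { (char)value });
    }

    // Formats the value and its UTF-8 bytes in decimal, e.g. "251 -> 195,187".
    public static string Describe(int value)
    {
        return value + " -> " + string.Join(",", Utf8Of(value));
    }
}
```

Describe(65) gives "65 -> 65", Describe(191) gives "191 -> 194,191" and Describe(251) gives "251 -> 195,187". So the 194 at the start of the array most likely comes from the 191 that opens the command, not from the 251 itself.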
 
LVL 39

Expert Comment

by:abel
ID: 23692306
Btw, note that ASCII proper is only the original set of the first 128 characters (codepage name usually US-ASCII or ISO-646) and that the wider sets are extensions of US-ASCII, like Latin-1 (ISO-8859-1) etc. These use the eighth bit to fill the whole range of one byte (two nibbles).
 
LVL 39

Expert Comment

by:abel
ID: 23692361
Btw2: here's a table that shows "C3 & BB" (195 & 187 in decimal) as the encoding for "ASCII" codepoint 251: http://kellyjones.netfirms.com/webtools/ascii_utf8_table.shtml. They don't say so, but if you check, you find that the table actually used is ISO-8859-1; see Unicode table entry FB: http://www.unicode.org/charts/PDF/U0080.pdf

Author Comment

by:u2envy1
ID: 23692365
The last digit is the checksum that the clock recognizes. If it is split in two, the device does not recognize the command. How can I bypass this?
 
LVL 39

Expert Comment

by:abel
ID: 23692400
You got me lost here for a moment. If you convert it to UTF-8, you will end up with two bytes. But if it is a checksum, it should be treated as bytes, shouldn't it, and not as a string that needs to be translated to another codepage... What is the actual task you are trying to accomplish?

Note that *any* character higher than 127 (dec) will result in two or more bytes, so you may run into this problem more often.
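Treating the command as bytes could look something like the sketch below. It assumes the device really expects one raw byte per value, which the thread has not confirmed. RawFrame, Build and FromLatin1String are hypothetical names, and the checksum is simply passed in because its algorithm is not given here.

```csharp
using System;
using System.Text;

public static class RawFrame
{
    // Builds the frame directly as bytes. No text encoding is involved,
    // so every value from 0 to 255 stays exactly one byte.
    public static byte[] Build(byte command, byte arg1, byte arg2, byte checksum)
    {
        return new[] { command, arg1, arg2, checksum };
    }

    // If the values already live in a string (one char per value, all <= 255),
    // ISO-8859-1 maps each char to the single byte with the same value.
    // Chars above 255 cannot be represented and come out as '?'.
    public static byte[] FromLatin1String(string data)
    {
        return Encoding.GetEncoding("ISO-8859-1").GetBytes(data);
    }
}
```

With that, something like mSocket.Send(RawFrame.Build(191, 1, 6, 251)) would put exactly four bytes on the wire, with 251 standing in for whatever the real checksum is.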
 

Author Comment

by:u2envy1
ID: 23692484
The device only accepts input in UTF-8. How do I send 191, 1, 6, Checksum to the clock without things being altered?
I'm using sockets.
public override void SendData(string data)
{
    buffer = Encoding.UTF8.GetBytes(data);
    mSocket.Send(buffer);
}

 
LVL 39

Expert Comment

by:abel
ID: 23692956
If you want to convert something to Unicode UTF-8 without altering it, you should not convert it. That only works, of course, if the bytes you mention already form a valid UTF-8 sequence. However, the byte 191 (dec) is 10111111 (bin), which is only valid as a continuation byte: the second byte of a two-byte UTF-8 character, or the second or third byte of a three-byte UTF-8 character (and likewise for four-byte characters).

The sequence 191-1-6 is not a valid UTF-8 sequence and as such cannot be sent unchanged if the device can only accept valid UTF-8.
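Both points can be checked in code. The sketch below tests the 10xxxxxx bit pattern of a continuation byte and uses a strict UTF8Encoding (one that throws on invalid input instead of substituting a replacement character) to test a whole sequence. Utf8Check and its method names are hypothetical.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // A UTF-8 continuation byte has the bit pattern 10xxxxxx.
    public static bool IsContinuationByte(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Returns true if the bytes decode as UTF-8 without errors.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // First argument: no byte order mark. Second argument: throw on invalid bytes.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

IsContinuationByte(191) is true, and IsValidUtf8(new byte[] { 191, 1, 6 }) is false because the sequence starts with a continuation byte. The converted form 194, 191, 1, 6 passes the check.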

What is that device, the "clock" you are talking about? Do you have documentation? Maybe I can have a look and help you from there; maybe there's a misunderstanding about the terminology here.
 
LVL 39

Accepted Solution

by:
abel earned 2000 total points
ID: 23692960
Btw, a reference on how to build valid UTF-8 sequences: http://www.python.org/doc/2.5.2/lib/encodings-overview.html
 

Author Comment

by:u2envy1
ID: 23708786
This clock is an access control device that was created in-house & has no documentation. I had to read Clarion code to rewrite the SDK in C#. I convert the data to send to Unicode and remove all leading char 0. If the converted data contains a 0, that is removed as well. How can I remove the added spaces that Unicode adds but not the char 0 values?
 

Author Closing Comment

by:u2envy1
ID: 31549235
Thx
 
LVL 39

Expert Comment

by:abel
ID: 23754771
Ah, I missed that last comment of yours, sorry. Unicode does not add spaces or null values. The link I showed you also shows that the UTF-8 encoding (which is an encoding for Unicode) accepts a zero byte, but it then represents the legal character NUL. It is legal in Unicode, though not necessarily legal in an application.
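One possible source of the extra zero bytes, offered as a guess only: in .NET, Encoding.Unicode is UTF-16 (little endian), which writes two bytes per character, so every character below 256 is followed by a 0 byte. Whether the "convert to Unicode" step in the question actually used Encoding.Unicode is an assumption. The sketch below (ZeroBytes is a hypothetical name) compares the two encodings.

```csharp
using System;
using System.Text;

public static class ZeroBytes
{
    // UTF-8 bytes of s, in decimal, comma separated.
    public static string Utf8(string s)
    {
        return string.Join(",", Encoding.UTF8.GetBytes(s));
    }

    // UTF-16 little-endian bytes of s (what .NET calls Encoding.Unicode).
    public static string Utf16(string s)
    {
        return string.Join(",", Encoding.Unicode.GetBytes(s));
    }
}
```

ZeroBytes.Utf8("A\0") gives "65,0": the only zero is the real NUL character. ZeroBytes.Utf16("A\0") gives "65,0,0,0", where the zeros from the encoding cannot be told apart from the zeros in the data once they are stripped blindly.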
 

Author Comment

by:u2envy1
ID: 23754820
No prob. Is there any website that explains ASCII, UTF-8 & the rest in detail & shows comparisons?
Question has a verified solution.
