Solved

how to separate UNICODE data from ANSI

Posted on 2002-05-08
Last Modified: 2013-11-20
Hi.
I am working in VC++. I have some data that is a mixture of UNICODE and ANSI.
Can anyone tell me how I can separate one from the other?
Thanks in advance.
Question by:Ultpak
24 Comments
 
LVL 32

Expert Comment

by:jhance
ID: 6996843
Please explain what you mean by "MIXTURE".  Perhaps an example of what you mean.
0
 

Author Comment

by:Ultpak
ID: 6996892
By mixture I mean that some characters are ANSI, then some are UNICODE, then maybe one ANSI, then UNICODE, then ANSI, then UNICODE,
like this:
      ax?sdf///????asdf//asdf??F?FD???DSF?DF??F?????DF?DSF?
Consider ???? as UNICODE and the others as ANSI.
This is the situation.
Kindly help me, it is very urgent.
0
 
LVL 32

Expert Comment

by:jhance
ID: 6996911
In a single buffer?  

I don't see any way to do this since there is no way to distinguish any two ANSI characters from any one UNICODE character.  

In other words, the set of all pairs of ANSI characters taken together has an INTERSECTION with the set of UNICODE characters.
0
 
LVL 86

Expert Comment

by:jkr
ID: 6996935
What about 'IsTextUnicode()'?
0
 

Author Comment

by:Ultpak
ID: 6996959
I think you don't understand the question.
I have some characters in a buffer.
In that buffer there are UNICODE characters as well as ANSI ones.
Now I want to get all the ANSI characters into one buffer and all the UNICODE characters into another buffer, to display them properly.
They are not intermixed with each other. Perhaps you are thinking that two ANSI characters are mixed up to form a UNICODE character.
This is not the case.
0
 
LVL 32

Accepted Solution

by:
jhance earned 100 total points
ID: 6996990
Your last comment has now confused me.  In the earlier comment you said the ANSI and UNICODE characters were mixed as in:

"ax?sdf///????asdf//asdf??F?FD???DSF?DF??F?????DF?DSF?"

But now you say:

"those are not inter mixed with each other. either u are thinking that two ansi characters are mixed
up to form a unicode character.
this is not the case"


My understanding of what you are saying is in conflict.  Please clarify.
0
 
LVL 32

Expert Comment

by:jhance
ID: 6996998
BTW, what I'm saying is that the 2 BYTE sequence:

0x55 0x56

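Could be either the ANSI sequence "UV" or a single UNICODE character: 0x5655 when the bytes are read as little-endian UTF-16 (the Windows layout), or 0x5556 when read as big-endian.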
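
To make the ambiguity concrete, here is a minimal sketch that reads the same two bytes three ways. The helper names (asUtf16LE, asUtf16BE, asAnsi) are made up for this illustration and are not part of any Windows API.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: reads two bytes as one little-endian UTF-16
// code unit (low byte first, the layout Windows uses in memory).
inline std::uint16_t asUtf16LE(unsigned char lo, unsigned char hi) {
    return static_cast<std::uint16_t>(lo | (hi << 8));
}

// Hypothetical helper: reads the same two bytes as one big-endian
// UTF-16 code unit (high byte first).
inline std::uint16_t asUtf16BE(unsigned char first, unsigned char second) {
    return static_cast<std::uint16_t>((first << 8) | second);
}

// Hypothetical helper: reads the same two bytes as two single-byte
// (ANSI) characters.
inline std::string asAnsi(unsigned char b0, unsigned char b1) {
    return std::string{static_cast<char>(b0), static_cast<char>(b1)};
}
```

All three readings of 0x55 0x56 are legal, and nothing in the bytes themselves says which one the writer intended.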
0
 
LVL 32

Expert Comment

by:jhance
ID: 6997004
jkr,

IsTextUnicode() can be easily fooled, and the above scenario is one that is very likely to confuse it.
0
 

Author Comment

by:Ultpak
ID: 6997023
Yes, that is the actual problem.
That's why I am confused about how to separate the two.
0
 
LVL 32

Expert Comment

by:jhance
ID: 6997036
How are these characters getting into this mess?  Sometimes the best approach is to keep a mess from happening in the first place.

I think there is no solution to this problem as you have framed it.  There is no reliable way to separate ANSI and UNICODE text which have been intermixed in such a way as this.
0
 

Author Comment

by:Ultpak
ID: 6997048
Can you at least tell me: if there is a space between two UNICODE words, will that space be an ANSI character or a UNICODE character?
0
 
LVL 32

Expert Comment

by:jhance
ID: 6997054
In a UNICODE string, ALL the characters will be UNICODE.  In your example, who knows??

Again, how are you getting into this mess?  Perhaps there is a better way...
0
 

Author Comment

by:Ultpak
ID: 6997057
It must be an ANSI character.
Then when we convert the data, the conversion will spoil all the formatting, as there is an ANSI character between two UNICODE words.
Then when I display it, it will show something like |||||||||||||||||||||||||
Now what can I do in this situation?
0
 
LVL 32

Expert Comment

by:jhance
ID: 6997069
No, you are incorrect.  A UNICODE string is all UNICODE.  Consider the following:

The C++ source statement:

WCHAR *wszTest = L"This is a UNICODE test";

Causes the following pattern to be generated as a UNICODE constant:

DB 'T'
DB     00H, 'h', 00H, 'i', 00H, 's', 00H, ' ', 00H, 'i', 00H, 's', 00H
DB     ' ', 00H, 'a', 00H, ' ', 00H, 'U', 00H, 'N', 00H, 'I', 00H, 'C'
DB     00H, 'O', 00H, 'D', 00H, 'E', 00H, ' ', 00H, 't', 00H, 'e', 00H
DB     's', 00H, 't', 00H, 00H, 00H               ;
CONST     ENDS

So you get:

Note that ALL the characters are UNICODE characters and that the string is terminated with a UNICODE NULL or "0x0000".
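
The same byte layout can be reproduced portably. This is a small sketch, assuming little-endian byte order as on Windows; toUtf16LEBytes is a name invented for the example. It shows that a space inside a UNICODE string occupies two bytes (0x20 0x00) like every other character.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: serializes UTF-16 code units to bytes in
// little-endian order (low byte first) and appends the terminating
// 0x0000 that a C-style wide string carries.
std::vector<unsigned char> toUtf16LEBytes(const std::u16string& text) {
    std::vector<unsigned char> bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF));
        bytes.push_back(static_cast<unsigned char>(unit >> 8));
    }
    bytes.push_back(0x00);
    bytes.push_back(0x00);
    return bytes;
}
```

For example, toUtf16LEBytes(u"Hi U") yields 'H', 00, 'i', 00, ' ', 00, 'U', 00, 00, 00, which matches the interleaved pattern in the listing above.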
0
 
LVL 1

Expert Comment

by:Mukki
ID: 6997087
You can read something about Unicode here:
http://www.unicode.org/

You may try checking whether a particular character (char) is a real displayable ASCII character; if not, it may be Unicode. If it may be Unicode, then the next byte will belong to the same Unicode character (in fact, one Unicode character has two bytes). This method _MAY_ sometimes work.
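
A rough sketch of that heuristic follows. It is only an illustration: SplitResult, isPrintableAscii and splitByHeuristic are invented names, the byte pairs are assumed to be little-endian, and "displayable" is taken to mean the range 0x20 to 0x7E (so tabs and newlines would be misclassified).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical result type for the sketch.
struct SplitResult {
    std::string ansi;                 // bytes judged to be printable ASCII
    std::vector<std::uint16_t> wide;  // byte pairs judged to be 16-bit units
};

inline bool isPrintableAscii(unsigned char b) {
    return b >= 0x20 && b <= 0x7E;
}

// Walks the buffer once. A printable byte is taken as ANSI; any other
// byte is combined with the following byte into one 16-bit unit
// (assumed little-endian). A trailing unpaired byte is kept as ANSI.
SplitResult splitByHeuristic(const std::vector<unsigned char>& buf) {
    SplitResult out;
    std::size_t i = 0;
    while (i < buf.size()) {
        if (isPrintableAscii(buf[i]) || i + 1 == buf.size()) {
            out.ansi.push_back(static_cast<char>(buf[i]));
            i += 1;
        } else {
            out.wide.push_back(
                static_cast<std::uint16_t>(buf[i] | (buf[i + 1] << 8)));
            i += 2;
        }
    }
    return out;
}
```

The weakness is easy to trigger: the pair 0x55 0x56 comes out as the ANSI text "UV" even if it was written as one 16-bit character, so this cannot be relied on in general.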

BTW, see MBCS too. From MSDN: "Languages that use MBCS, such as Japanese, are also unique. Since a character may consist of _one_ or _two_ bytes, you should always manipulate both bytes at the same time"

As jhance wrote, try to solve the problem by removing its source. For instance, create only Unicode strings instead of mixing two character encodings.
According to MSDN: "Take care if you mix ANSI (8-bit) and Unicode (16-bit) characters in your application. It’s possible to use ANSI characters in some parts of your program and Unicode characters in others, but you cannot mix them in the same string."

Mukki
0
 
LVL 2

Expert Comment

by:Lockias
ID: 6997252
Sounds like a UTF-8 string, which uses a variable number of bytes per character (one byte for ASCII, more for other characters).  In which case you could use the MultiByteTo* functions (e.g. MultiByteToWideChar with CP_UTF8).
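
One way to test the UTF-8 guess before converting is to check whether the whole buffer is well-formed UTF-8. The sketch below is a portable illustration, not a Windows API; isValidUtf8 is a name chosen for the example. If it returns true for real data, the "mixture" is probably just UTF-8 and can be converted as a whole.

```cpp
#include <cstddef>
#include <string>

// Returns true if every byte sequence in `s` is well-formed UTF-8.
// Rejects truncated sequences, stray continuation bytes, overlong
// forms, surrogates, and values above U+10FFFF.
bool isValidUtf8(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;              // invalid lead byte
        if (i + len > n) return false;  // sequence runs past the buffer
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Smallest code point that legitimately needs `len` bytes.
        static const unsigned long minForLen[5] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < minForLen[len]) return false;            // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // out of range
        i += len;
    }
    return true;
}
```

Pure ASCII also passes this check, so a true result only says the data is consistent with UTF-8, not that it contains any non-ASCII characters.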

Or maybe I am completely wrong.

Lockias
0
 

Author Comment

by:Ultpak
ID: 6999276
Thanks to all.
I have converted all the data by reading it byte by byte,
separating the UNICODE from the ANSI.
Thanks again to all for taking so much interest in my problem.
ult
0
 
LVL 49

Expert Comment

by:DanRollins
ID: 7000073
For future reference, can you please tell us:

How were you able to determine whether two consecutive bytes were a single UNICODE character or two ANSI characters?  

Inquiring minds want to know.

-- Dan
0
 
LVL 1

Expert Comment

by:Moondancer
ID: 7008961
ADMINISTRATION WILL BE CONTACTING YOU SHORTLY.  Moderators Computer101, Netminder or Mindphaser will return to finalize these if they are still open in 7 days.  Experts, please post closing recommendations before that time.

Below are your open questions as of today.  Questions which have been inactive for 21 days or longer are considered to be abandoned and for those, your options are:
1. Accept a Comment As Answer (use the button next to the Expert's name).
2. Close the question if the information was not useful to you, but may help others. You must tell the participants why you wish to do this, and allow for Expert response.  This choice will include a refund to you, and will move this question to our PAQ (Previously Asked Question) database.  If you found information outside this question thread, please add it.
3. Ask Community Support to help split points between participating experts, or just comment here with details and we'll respond with the process.
4. Delete the question (if it has no potential value for others).
   --> Post comments for expert of your intention to delete and why
   --> YOU CANNOT DELETE A QUESTION with comments; special handling by a Moderator is required.

For special handling needs, please post a zero point question in the link below and include the URL (question QID/link) that it regards with details.
http://www.experts-exchange.com/jsp/qList.jsp?ta=commspt
 
Please click this link for Help Desk, Guidelines/Member Agreement and the Question/Answer process.  http://www.experts-exchange.com/jsp/cmtyHelpDesk.jsp

Click your Member Profile to view your question history and please keep them updated. If you are a KnowledgePro user, use the Power Search option to find them.  

Questions which are LOCKED with a Proposed Answer but do not help you, should be rejected with comments added.  When you grade the question less than an A, please comment as to why.  This helps all involved, as well as others who may access this item in the future.  PLEASE DO NOT AWARD POINTS TO ME.

To view your open questions, please click the following link(s) and keep them all current with updates.
http://www.experts-exchange.com/questions/Q.20293209.html
http://www.experts-exchange.com/questions/Q.20114573.html
http://www.experts-exchange.com/questions/Q.20298407.html
http://www.experts-exchange.com/questions/Q.20298409.html

To view your locked questions, please click the following link(s) and evaluate the proposed answer.
http://www.experts-exchange.com/questions/Q.20298831.html

*****  E X P E R T S    P L E A S E  ******  Leave your closing recommendations.
If you are interested in the cleanup effort, please click this link
http://www.experts-exchange.com/jsp/qManageQuestion.jsp?ta=commspt&qid=20274643
POINTS FOR EXPERTS awaiting comments are listed in the link below
http://www.experts-exchange.com/commspt/Q.20277028.html
 
Moderators will finalize this question if in @7 days Asker has not responded.  This will be moved to the PAQ (Previously Asked Questions) at zero points, deleted or awarded.
 
Thanks everyone.
Moondancer
Moderator @ Experts Exchange
0
 
LVL 32

Expert Comment

by:jhance
ID: 7017294
Who knows?  This user is so confused, I'm not sure he knows what he was asking.....
0
 
LVL 1

Expert Comment

by:Moondancer
ID: 7023107
No response, corrected.
0
