Validating multibyte strings

Posted on 2006-11-14
Last Modified: 2010-04-15
I'm trying to create a function that validates a string.
The string must not contain '<', '>', or '/'.
This function works great for English.
For Chinese strings, the function works on Windows, but on Unix it returns false even when the string contains none of the three characters.

int is_valid_name (char *name){
    char *cp;

    /* reject NULL and empty names */
    if (name == NULL || *name == '\0')
      return 0;

    for (cp = name; *cp != '\0'; cp++) {
      if (*cp == '/' || *cp == '<' || *cp == '>')
          return 0;
    }
    return 1;
}

Is it because Chinese uses multibyte characters, so that one byte of a multibyte character can have the same value as one of the three characters?
What can I do so that I can validate multibyte strings?

Question by:jamie_lynn

Accepted Solution

bpmurray earned 2000 total points
ID: 17942334
For multibyte characters, you have to know the encoding. In fact, you should be doing this for ALL text, including western. Since the byte patterns differ from one encoding to the next, you have to be very careful. For example, even a simple algorithm for the Windows code pages varies depending on locale.

Japanese (CP 932):
for (cp=name; *cp; cp++) {
   unsigned char c = (unsigned char)*cp; /* plain char may be signed */
   if (c == '/' || c == '<' || c == '>')
      return 0;
   /* if it's a double-byte lead byte, skip its trail byte too */
   if (((c > 0x80 && c < 0xA0) || (c > 0xDF && c < 0xFD)) && cp[1] != '\0')
      cp++;
}

Korean (CP 949):
for (cp=name; *cp; cp++) {
   unsigned char c = (unsigned char)*cp; /* plain char may be signed */
   if (c == '/' || c == '<' || c == '>')
      return 0;
   /* if it's a double-byte lead byte, skip its trail byte too */
   if (c > 0x80 && c < 0xFF && cp[1] != '\0')
      cp++;
}
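
The two loops above differ only in the lead-byte test, so one way to share the scanning logic is to pass that test in as a function. This is only a sketch: the names is_cp932_lead, is_cp949_lead and is_valid_name_dbcs are made up for illustration, the ranges are the ones from the loops above, and the caller still has to know which encoding the string is in.

```c
#include <stddef.h>

/* Lead-byte tests, using the same ranges as the loops above. */
int is_cp932_lead(unsigned char c)
{
    return (c > 0x80 && c < 0xA0) || (c > 0xDF && c < 0xFD);
}

int is_cp949_lead(unsigned char c)
{
    return c > 0x80 && c < 0xFF;
}

/* Hypothetical helper: returns 1 if name is non-empty and contains none of
   '/', '<', '>' as a single-byte character; returns 0 otherwise, including
   when the string ends in the middle of a double-byte character. */
int is_valid_name_dbcs(const char *name, int (*is_lead)(unsigned char))
{
    const char *cp;

    if (name == NULL || *name == '\0')
        return 0;

    for (cp = name; *cp != '\0'; cp++) {
        unsigned char c = (unsigned char)*cp; /* plain char may be signed */

        if (c == '/' || c == '<' || c == '>')
            return 0;
        if (is_lead(c)) {
            if (cp[1] == '\0')
                return 0; /* lead byte with no trail byte */
            cp++;         /* skip the trail byte */
        }
    }
    return 1;
}
```

A call would look like is_valid_name_dbcs(name, is_cp932_lead) for Japanese input or is_valid_name_dbcs(name, is_cp949_lead) for Korean.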

As you can see, there isn't a one-size-fits-all solution. For example, there are about 6 common encodings in use in Japan, all different. The reality is that you MUST know the encoding before you do this kind of thing. A good solution is to always use Unicode (UTF-16) internally, converting when you read data and when you write data. That way the only time you have to manage different encodings is at I/O, while internally everything is the same for all languages.
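
If you don't want to hand-write lead-byte tests, the standard C library can do the decoding for you. The sketch below uses mbrtowc to turn the string into wide characters one at a time and compares those instead of raw bytes. The function name is_valid_name_mb is made up for illustration. It assumes the program has already called setlocale(LC_CTYPE, "") and that the string really is in that locale's encoding. Note that wchar_t is not guaranteed to be UTF-16, so this is the same idea rather than a literal UTF-16 conversion.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if name is non-empty, decodes cleanly in the
   current LC_CTYPE locale, and contains none of '/', '<', '>'. */
int is_valid_name_mb(const char *name)
{
    mbstate_t state;
    size_t remaining;
    const char *cp;

    if (name == NULL || *name == '\0')
        return 0;

    memset(&state, 0, sizeof state); /* start in the initial shift state */
    remaining = strlen(name);
    cp = name;

    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, cp, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return 0; /* invalid or incomplete multibyte sequence */
        if (n == 0)
            break;    /* decoded a null character; stop scanning */
        if (wc == L'/' || wc == L'<' || wc == L'>')
            return 0;

        cp += n;
        remaining -= n;
    }
    return 1;
}
```

Because the comparison happens on whole decoded characters, a trail byte that happens to equal one of the three characters is never looked at on its own.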

Since you're doing stuff cross-platform, I'd suggest your best solution is to use the functionality in a reasonably standard lib. Have you looked at ICU? See http://icu.sourceforge.net/ - it's backed by IBM and many other big companies, so it's pretty much the de facto cross-platform standard.

Expert Comment

ID: 17948833
You may want to use Unicode.
There are libs for Unicode.

