Validating multibyte strings

Hi,
I'm trying to create a function that validates a string.
The string cannot contain '<', '>', and '/'.
This function works great for English.
For Chinese strings, the function works on Windows, but on Unix it returns false even when the string does not contain any of the three characters.

i.e.
int is_valid_name(char *name) {
    char *cp;

    if (name == NULL || *name == '\0')
      return 0;

    for (cp = name; *cp != '\0'; cp++) {
      if (*cp == '/' || *cp == '<' || *cp == '>')
          return 0;
    }
    return 1;
}


Is it because Chinese uses multibyte characters, and one byte of a multibyte character can have the same value as one of the three characters?
What can I do so that I can validate multibyte strings?

Thanks
Jamie
Asked by: jamie_lynn
 
bpmurray commented:
For multibyte characters, you have to know the encoding. In fact, you should know it for ALL text, including Western text. Since the byte patterns differ from one encoding to the next, you have to be very careful. For example, even a simple algorithm for the Windows encodings changes depending on the locale.

Japanese (CP 932):
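for (cp = name; *cp; cp++) {
   unsigned char c = (unsigned char)*cp; /* plain char may be signed, so compare as unsigned */
   if (c == '/' || c == '<' || c == '>')
      return 0;
   /* lead byte of a double-byte char (0x81-0x9F or 0xE0-0xFC): skip its trail byte as well */
   if (((c > 0x80 && c < 0xA0) || (c > 0xDF && c < 0xFD)) && cp[1] != '\0')
      cp++;
}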

Korean (CP 949):
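for (cp = name; *cp; cp++) {
   unsigned char c = (unsigned char)*cp; /* plain char may be signed, so compare as unsigned */
   if (c == '/' || c == '<' || c == '>')
      return 0;
   /* lead byte of a double-byte char (0x81-0xFE): skip its trail byte as well */
   if (c > 0x80 && c < 0xFF && cp[1] != '\0')
      cp++;
}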

As you can see, there isn't a one-size-fits-all solution. For example, there are about six common encodings in use in Japan, all different. The reality is that you MUST know the encoding before you do this kind of thing. A good solution is to always use Unicode (UTF-16) internally, converting when you read data and when you write data. That way the only place you have to manage different encodings is at I/O, while internally everything is the same for all languages.
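The two loops above differ only in how they recognize a lead byte, so one way to keep the encoding knowledge in a single place is to pass that test in as a parameter. The following is a minimal sketch under that assumption; the names `lead_byte_fn`, `is_lead_cp932`, `is_lead_cp949` and `is_valid_name_dbcs` are made up for illustration, and the byte ranges are simply the ones used in the loops above.

```c
#include <stddef.h>

/* Hypothetical helper type: returns nonzero if c is the lead byte of a
   double-byte character in some particular encoding. */
typedef int (*lead_byte_fn)(unsigned char c);

/* Lead-byte ranges as used in the CP 932 loop above. */
static int is_lead_cp932(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Lead-byte range as used in the CP 949 loop above. */
static int is_lead_cp949(unsigned char c)
{
    return c >= 0x81 && c <= 0xFE;
}

/* Returns 1 if name is non-empty and contains no '/', '<' or '>' as a
   single-byte character; returns 0 otherwise. Trail bytes of double-byte
   characters are skipped without being compared. */
static int is_valid_name_dbcs(const char *name, lead_byte_fn is_lead)
{
    const unsigned char *p = (const unsigned char *)name;

    if (p == NULL || *p == '\0')
        return 0;

    while (*p != '\0') {
        if (is_lead(*p)) {
            if (p[1] == '\0')
                return 0;   /* truncated double-byte character */
            p += 2;         /* skip lead byte and trail byte together */
            continue;
        }
        if (*p == '/' || *p == '<' || *p == '>')
            return 0;
        p++;
    }
    return 1;
}
```

A caller that knows the data is CP 932 would write `is_valid_name_dbcs(name, is_lead_cp932)`. The caller still has to know the encoding; the sketch only moves that decision to one argument.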

Since you're working cross-platform, I'd suggest your best solution is to use the functionality in a reasonably standard library. Have you looked at ICU? See http://icu.sourceforge.net/ - it's backed by IBM and many other big companies, so it's pretty much the de facto cross-platform standard.
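If pulling in a library is not an option, the standard C library can also step through a string one character at a time using the current locale's encoding. This is not ICU, just standard `mbrtowc`, and it is only a sketch: it assumes the program has already called `setlocale(LC_CTYPE, "")` (or named a specific locale) and that the platform actually provides a locale matching the encoding of the data. The function name `is_valid_name_mb` is made up for illustration.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Returns 1 if name is a non-empty, well-formed multibyte string in the
   current LC_CTYPE locale and contains none of '/', '<', '>'; returns 0
   otherwise. Assumes the caller has selected a suitable locale. */
static int is_valid_name_mb(const char *name)
{
    mbstate_t st;
    size_t left;

    if (name == NULL || *name == '\0')
        return 0;

    memset(&st, 0, sizeof st);      /* start in the initial conversion state */
    left = strlen(name);

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, name, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return 0;               /* invalid or truncated sequence */
        if (n == 0)
            break;                  /* decoded a null character; stop */
        if (wc == L'/' || wc == L'<' || wc == L'>')
            return 0;               /* compare whole characters, not bytes */
        name += n;
        left -= n;
    }
    return 1;
}
```

Because the comparison is made on decoded characters, a trail byte can never be mistaken for one of the three forbidden characters. The price is that the result depends on the process locale, which is exactly the "you must know the encoding" point made above.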
   
 
_iskywalker_ commented:
You may want to use Unicode.
There are libraries for Unicode.
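To illustrate why Unicode helps here: if the strings are stored as UTF-8 (an assumption, since the thread does not say which Unicode form would be used), every byte of a multibyte sequence has its high bit set, so a byte equal to '/', '<' or '>' can only ever be that ASCII character. The sketch below, with the made-up name `is_valid_name_utf8`, performs the same check and additionally rejects malformed UTF-8 such as overlong encodings.

```c
#include <stddef.h>

/* Returns 1 if name is non-empty, well-formed UTF-8 and contains none of
   '/', '<', '>'; returns 0 otherwise. */
static int is_valid_name_utf8(const char *name)
{
    const unsigned char *p = (const unsigned char *)name;

    if (p == NULL || *p == '\0')
        return 0;

    while (*p != '\0') {
        unsigned char c = *p;
        int trail, i;

        if (c < 0x80) {                 /* plain ASCII byte */
            if (c == '/' || c == '<' || c == '>')
                return 0;
            p++;
            continue;
        }
        if (c >= 0xC2 && c <= 0xDF)
            trail = 1;                  /* two-byte sequence */
        else if (c >= 0xE0 && c <= 0xEF)
            trail = 2;                  /* three-byte sequence */
        else if (c >= 0xF0 && c <= 0xF4)
            trail = 3;                  /* four-byte sequence */
        else
            return 0;   /* stray continuation byte, 0xC0/0xC1, or above 0xF4 */

        /* Each trail byte must look like 10xxxxxx; a '\0' fails this test,
           so the loop never reads past the terminator. */
        for (i = 1; i <= trail; i++)
            if ((p[i] & 0xC0) != 0x80)
                return 0;

        /* Reject overlong forms, UTF-16 surrogates and values above U+10FFFF. */
        if ((c == 0xE0 && p[1] < 0xA0) || (c == 0xED && p[1] > 0x9F) ||
            (c == 0xF0 && p[1] < 0x90) || (c == 0xF4 && p[1] > 0x8F))
            return 0;

        p += trail + 1;
    }
    return 1;
}
```

Rejecting overlong forms matters for a validator like this one, because the two-byte sequence 0xC0 0xAF is a non-standard spelling of '/' that a lenient decoder might accept.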
 
jamie_lynn (author) commented:
Thanks!
Jamie
Question has a verified solution.