Solved

Validating multibyte strings

Posted on 2006-11-14
Last Modified: 2010-04-15
Hi,
I'm trying to create a function that validates a string.
The string cannot contain '<', '>', and '/'.
This function works great for English.
For Chinese strings, the function works on Windows, but on Unix it returns false even when the string does not contain any of the three characters.

i.e.
int is_valid_name(const char *name) {
    const char *cp;

    if (name == NULL || *name == '\0')
      return 0;

    /* Byte-by-byte scan: only correct for single-byte encodings. */
    for (cp = name; *cp != '\0'; cp++) {
      if (*cp == '/' || *cp == '<' || *cp == '>')
          return 0;
    }
    return 1;
}


Is it because Chinese uses multibyte characters, and one byte of a multibyte character can have the same value as one of the three characters?
What can I do so that I can validate multibyte strings?

Thanks
Jamie
Question by:jamie_lynn

Accepted Solution

by:
bpmurray earned 500 total points
ID: 17942334
For multibyte characters, you have to know the encoding. In fact, you should be doing this for ALL text, including western. Since the byte patterns differ from one encoding to the next, you have to be very careful. For example, even a simple scan over the Windows double-byte code pages has to change with the code page.

Japanese (CP 932):
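for (cp=name; *cp; cp++) {
   unsigned char c = (unsigned char)*cp; /* plain char may be signed, so compare as unsigned */
   if (c == '/' || c == '<' || c == '>')
      return 0;
   /* lead byte of a double-byte char: skip the trail byte, unless the string ends here */
   if (((c > 0x80 && c < 0xA0) || (c > 0xDF && c < 0xFD)) && cp[1] != '\0')
      cp++;
}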

Korean (CP 949):
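for (cp=name; *cp; cp++) {
   unsigned char c = (unsigned char)*cp; /* plain char may be signed, so compare as unsigned */
   if (c == '/' || c == '<' || c == '>')
      return 0;
   /* lead byte of a double-byte char: skip the trail byte, unless the string ends here */
   if (c > 0x80 && c < 0xFF && cp[1] != '\0')
      cp++;
}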
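
The two loops differ only in the lead-byte test, so one way to keep them together is to factor that test out. This is just a rough sketch: the enum, the tag names and both function names are made up for illustration, the lead-byte ranges are the ones from the loops above, and treating a lead byte at the very end of the string as invalid is my own choice rather than anything the code pages require.

```c
#include <stddef.h>

/* Hypothetical encoding tags for this sketch; not from any library. */
enum dbcs_encoding { ENC_SINGLE_BYTE, ENC_CP932, ENC_CP949 };

/* Returns nonzero if byte c starts a double-byte character in enc. */
static int is_lead_byte(enum dbcs_encoding enc, unsigned char c)
{
    switch (enc) {
    case ENC_CP932:
        return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
    case ENC_CP949:
        return c >= 0x81 && c <= 0xFE;
    default:
        return 0;   /* single-byte encoding: no lead bytes */
    }
}

/* Returns 1 if name is non-empty and contains no '/', '<' or '>'
   character; returns 0 otherwise. */
int is_valid_name_dbcs(const char *name, enum dbcs_encoding enc)
{
    const unsigned char *cp;

    if (name == NULL || *name == '\0')
        return 0;

    /* Scan as unsigned char so bytes >= 0x80 compare correctly. */
    for (cp = (const unsigned char *)name; *cp != '\0'; cp++) {
        if (is_lead_byte(enc, *cp)) {
            if (cp[1] == '\0')
                return 0;   /* assumption: truncated character is invalid */
            cp++;           /* skip the trail byte */
            continue;
        }
        if (*cp == '/' || *cp == '<' || *cp == '>')
            return 0;
    }
    return 1;
}
```

A caller would pass the encoding it already knows the data to be in, for example is_valid_name_dbcs(name, ENC_CP932). The function cannot guess the encoding for you.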

As you can see, there isn't a one-size-fits-all solution. For example, there are about 6 common encodings in use in Japan, all different. The reality is that you MUST know the encoding before you do this kind of thing. A good solution is to always use Unicode (UTF-16) internally, converting when you read data and when you write data. That way the only time you have to manage different encodings is at I/O, while internally everything is the same for all languages.
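
If you don't want to pull in a library yet, the standard C library can do a version of the "convert first, then validate" idea. The following is only a sketch under some assumptions: the program has already called setlocale(LC_CTYPE, "") (or similar) and the locale's encoding really matches the data, and the function name is_valid_name_wide is invented here. Note that wchar_t is not UTF-16 everywhere (it is 16 bits on Windows and typically 32 bits on Unix), so this is the same idea in spirit rather than literally UTF-16.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Returns 1 if name is non-empty, decodes cleanly in the current
   locale, and contains none of '/', '<', '>'; returns 0 otherwise. */
int is_valid_name_wide(const char *name)
{
    size_t len;
    wchar_t *wide;
    int ok;

    if (name == NULL || *name == '\0')
        return 0;

    /* The byte length is an upper bound on the character count. */
    len = strlen(name);
    wide = malloc((len + 1) * sizeof *wide);
    if (wide == NULL)
        return 0;

    /* Convert from the locale's multibyte encoding to wide characters. */
    if (mbstowcs(wide, name, len + 1) == (size_t)-1) {
        free(wide);
        return 0;   /* invalid byte sequence for this locale */
    }

    /* Once decoded, each character is one wchar_t, so a plain search works. */
    ok = (wcspbrk(wide, L"/<>") == NULL);
    free(wide);
    return ok;
}
```

The point of doing it this way is that the check itself no longer cares which encoding the bytes came in; only the conversion step does.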

Since you're working cross-platform, I'd suggest your best solution is to use the functionality in a reasonably standard lib. Have you looked at ICU? See http://icu.sourceforge.net/ - it's backed by IBM and many other big companies, so it's pretty much the de facto cross-platform standard.
   

Expert Comment

by:_iskywalker_
ID: 17948833
You may want to use Unicode.
There are libraries for it.

Author Comment

by:jamie_lynn
ID: 17951080
Thanks!
Jamie