Recognize Chinese Multibyte Character

Hi

I'd like to ask how to recognize Chinese characters (multibyte characters) in a passage that contains both Chinese and single-byte characters, such as English letters and numbers.

When I use a pointer, it steps through the passage byte by byte, so it cannot tell whether a given byte belongs to a multibyte character.

Is there a way to:
1. Extract the Chinese characters from the passage, OR
2. Advance a pointer character by character (no matter whether the character is multibyte or single-byte), OR
3. Convert all of the characters to multibyte characters?

Your suggestions will be much appreciated! Thanks!
happy_emily Asked:
 
pb_india Commented:
Hi,

I think what you can do is:
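Convert all the characters from narrow (multibyte) to wide, then count the characters in the wide string, where every character occupies one fixed-size element regardless of how many bytes it took in the original passage.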

Use:
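mbstate_t ps;  // conversion state; zero-initialize it before the first call
size_t mbsrtowcs(wchar_t* wide, const char** narrow, size_t len, mbstate_t* ps);  // declared in <wchar.h>

*narrow will be your string from the passage (note that the function takes a pointer to that pointer).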
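
As a rough illustration of the mbsrtowcs call above, here is a minimal sketch that wraps it in a helper. The name to_wide is made up for this example, and the code assumes the passage is encoded in the current C locale's multibyte encoding, so std::setlocale (from <clocale>) has to be called with a suitable locale before Chinese text will convert. In the default "C" locale only ASCII is guaranteed to work.

```cpp
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert a multibyte string to a wide string using
// the encoding of the current C locale (set it with std::setlocale first).
std::wstring to_wide(const std::string& narrow) {
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const char* src = narrow.c_str();

    // First pass: with a null destination, mbsrtowcs only reports how many
    // wide characters are needed (terminator not included).
    std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for this locale");

    // Second pass: do the actual conversion into a buffer of len + 1 elements.
    std::vector<wchar_t> buf(len + 1);
    state = std::mbstate_t();
    src = narrow.c_str();
    std::mbsrtowcs(buf.data(), &src, buf.size(), &state);
    return std::wstring(buf.data(), len);
}
```

After the conversion, to_wide(passage).size() is the number of characters, and indexing the result steps one character at a time, provided wchar_t is wide enough to hold each character on your platform.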

Then use code with logic like the following. (You will need to modify it for your own use.)
[I can develop the program for you, but 125 points is too little for that much work.]


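#include <fstream>
#include <iostream>

int main() {
  // Note: get(c) reads one byte at a time, so a multibyte Chinese
  // character adds more than 1 to numchars until this is adapted.
  std::ifstream f1("test");
  if (!f1) {
    std::cerr << "Cannot open file \"test\"" << std::endl;
    return 1;
  }

  char c;
  int numchars = 0;
  int numlines = 0;

  f1.get(c);
  while (f1) {
    // Count everything up to the end of the current line.
    while (f1 && c != '\n') {
      numchars = numchars + 1;
      f1.get(c);
    }
    numlines = numlines + 1;
    f1.get(c);
  }
  std::cout << "The file has " << numlines << " lines and "
            << numchars << " characters" << std::endl;
  return 0;
}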
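
The program above still advances one byte per step. One way to make the pointer move one character per step instead is std::mbrlen, which reports how many bytes the next character occupies. The sketch below is only an illustration: count_characters is a hypothetical name, and it assumes the text is in the current C locale's encoding (call std::setlocale from <clocale> with a matching locale before passing Chinese text).

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: count characters (not bytes) by letting std::mbrlen
// decide how far the pointer moves for each character.
std::size_t count_characters(const std::string& text) {
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const char* p = text.data();
    const char* end = p + text.size();
    std::size_t count = 0;

    while (p < end) {
        std::size_t n = std::mbrlen(p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2))
            throw std::runtime_error("invalid or truncated multibyte character");
        if (n == 0)
            n = 1;       // an embedded '\0' still occupies one byte
        p += n;          // skip the whole character, however many bytes it has
        ++count;
    }
    return count;
}
```

A character is multibyte exactly when the value returned by std::mbrlen is greater than 1, so the same loop can also be used to pick the Chinese characters out of the passage.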

 
pb_india Commented:
You can use, depending on your need:
1. wcsrtombs(char* dst, const wchar_t** src, size_t len, mbstate_t* ps); // wide to multibyte, declared in <wchar.h>

2. _mbbtombc // Microsoft-specific: converts a single-byte multibyte character to the corresponding double-byte multibyte character
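
For the first option, a minimal sketch of the wide-to-multibyte direction with wcsrtombs could look like this. The helper name to_narrow is invented for the example, and the output encoding is whatever the current C locale uses, so the locale must be set (std::setlocale in <clocale>) before non-ASCII characters will convert.

```cpp
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert a wide string back to a multibyte string in
// the encoding of the current C locale.
std::string to_narrow(const std::wstring& wide) {
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const wchar_t* src = wide.c_str();

    // First pass: a null destination makes wcsrtombs return the number of
    // bytes required (terminator not included) without writing anything.
    std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable in this locale");

    // Second pass: convert into a buffer with room for the terminator.
    std::vector<char> buf(len + 1);
    state = std::mbstate_t();
    src = wide.c_str();
    std::wcsrtombs(buf.data(), &src, buf.size(), &state);
    return std::string(buf.data(), len);
}
```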
 
happy_emily (Author) Commented:
Can you show me some example programs demonstrating the use of these functions? (I am a newbie at C++ programming.)
Say, for example, the passage is "abcdefXXXX23", where XXXX are the Chinese characters.

Thanks!
 
pb_india Commented:
Sure.

What exactly are you trying to do? Just read these characters from a file or something and output them?
Or do you just want to separate the Chinese characters from the English ones?
 
hellohelloworld Commented:
In fact, what I am trying to do is count the number of occurrences of every character (Chinese characters must be counted) that appears in the passage, which consists of different types of characters (i.e. English + Chinese + numbers).

What I can think of is using pointers to do so. However, I ran into the problem mentioned above. So I am wondering whether I should convert all the characters in the passage to double-byte first and then increment the pointer by 2 each time I read a character, or separate the multibyte characters (Chinese) from the single-byte ones (English + numbers) and then count them separately.

Do you have any idea?
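
To make the counting goal concrete, here is one possible sketch that needs neither conversion nor separation: it reads the lead byte of each character to find out how many bytes belong to it, then uses those bytes as a map key. It assumes the passage is UTF-8, which this thread does not confirm; Big5 or GBK text would need a different lead-byte rule or a locale-based function. The names utf8_length and count_occurrences are made up for the example.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Number of bytes in the UTF-8 character whose first byte is b.
// ASSUMPTION: the passage is UTF-8; other encodings need another rule.
std::size_t utf8_length(unsigned char b) {
    if (b < 0x80) return 1;            // single-byte (ASCII)
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;  // most Chinese characters
    if ((b & 0xF8) == 0xF0) return 4;
    throw std::runtime_error("not a UTF-8 lead byte");
}

// Count how often each character occurs. The key is the character's own
// byte sequence, so single-byte and multibyte characters share one map.
std::map<std::string, int> count_occurrences(const std::string& text) {
    std::map<std::string, int> counts;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t n = utf8_length(static_cast<unsigned char>(text[i]));
        if (i + n > text.size())
            throw std::runtime_error("truncated UTF-8 character");
        ++counts[text.substr(i, n)];   // one whole character per step
        i += n;
    }
    return counts;
}
```

The index advances by the character's length rather than by a fixed 1 or 2, which is the "character by character" stepping asked about in the original question.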
 
happy_emily (Author) Commented:
PS whoops! hellohelloworld is my second account