• Status: Solved

Unicode support in a C CGI script.


I was writing a C CGI program that uses an application-specific API to write the incoming form data into a DB.
Now I have to add Unicode support to it, and I don't know what to do. Is changing char to a Unicode character type alone enough? I read the form data from the input stream; what happens if there are Unicode characters in the stream? Do I have to make sure the API and the DB support Unicode? How should I implement Unicode compatibility in a C program?
Is there any site or article that can guide me in this?
Any help in this regard will be greatly appreciated.
The incoming form data is effectively in Unicode. Just set up IE 5.0 on WinNT, install a Russian keyboard (i.e. just the driver) and type in a few characters. You'll get "Unicode" (actually as numeric character references of the form &#number;) in the input.
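If the input does arrive as decimal references such as &#1046;, turning one into UTF-8 bytes could look roughly like the sketch below. The function names utf8_encode and ncr_to_utf8 are made up for illustration, and the sketch assumes the reference is decimal and sits at the start of the string.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes plus a terminating NUL into out (5 bytes needed).
 * Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(unsigned long cp, char out[5])
{
    size_t n;

    if (cp < 0x80) {
        out[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;                          /* beyond the Unicode range */
    }
    out[n] = '\0';
    return n;
}

/* Hypothetical helper: decode a decimal reference like "&#1046;" that
 * starts at s. Returns the UTF-8 length, or 0 if s is not such a reference. */
size_t ncr_to_utf8(const char *s, char out[5])
{
    char *end;
    unsigned long cp;

    if (s[0] != '&' || s[1] != '#' || !isdigit((unsigned char)s[2]))
        return 0;
    cp = strtoul(s + 2, &end, 10);
    if (*end != ';')
        return 0;
    return utf8_encode(cp, out);
}
```

For example, "&#1046;" (the Cyrillic letter Ж, U+0416) becomes the two bytes D0 96.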
   The next problem is your database. If you are using simple varchars it may be worthwhile just converting to UTF-8 rather than 16-bit Unicode (UTF-16). UTF-8 takes two bytes for each ANSI character above 127, and you'll have to either change the HTML pages to UTF-8 or convert between UTF-8 and the ISO-8859-1 used by the HTML. The biggest advantage of UTF-8 is that C strings remain C strings, so you don't have to change a lot of program logic.
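As a rough illustration of that conversion, here is a minimal sketch that turns an ISO-8859-1 string into UTF-8. The name latin1_to_utf8 is hypothetical, and it assumes the input really is ISO-8859-1 (not a Windows code page) and NUL-terminated.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to a
 * newly allocated UTF-8 string. The caller frees the result.
 * Returns NULL if memory cannot be allocated. */
char *latin1_to_utf8(const char *in)
{
    const unsigned char *s = (const unsigned char *)in;
    /* Worst case: every input byte becomes two output bytes. */
    char *out = malloc(2 * strlen(in) + 1);
    char *p = out;

    if (out == NULL)
        return NULL;
    for (; *s != '\0'; s++) {
        if (*s < 0x80) {
            *p++ = (char)*s;                    /* ASCII is unchanged */
        } else {
            *p++ = (char)(0xC0 | (*s >> 6));    /* lead byte */
            *p++ = (char)(0x80 | (*s & 0x3F));  /* continuation byte */
        }
    }
    *p = '\0';                                  /* still an ordinary C string */
    return out;
}
```

The result is still a plain char array with a terminating NUL, so strlen, strcpy and friends keep working; only the byte count per character changes.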
   So, what exactly is your environment and we'll proceed from there.
i think you should also post this Q in the CGI forum at..

krishcharysudhar (Author) commented:
Hi BigRat,
Thanks a lot for your support, but honestly I would like you to explain that in a bit more detail. My environment is like this.
Point 1:
My HTTP request (with form data) for my CGI DOES NOT COME FROM A BROWSER; rather, a standalone application sends its form data via HTTP over the internet.

Point 2:
My backend is a C CGI (exe), linked to an application-specific API, which then writes to SQL Server, all running on NT / IIS.

Point 3:
I have so far only used char in the code. Do I have to change it, and if so, to what?

Point 4:
I am not writing the sending client, so I have no idea how it is going to support Unicode. In any case, as far as the HTTP request is concerned, what difference will I see (other than knowing the data will be Unicode)?

Point 5:
I send the data to the DB using the API. Should I make sure the API and the DB support Unicode?

Point 6:
Right now the form data is POSTed.
I get the CONTENT_LENGTH value, malloc a buffer of that size, and read that many bytes from stdin. What should I change here to support Unicode? I don't think allocating twice the size and reading it is enough. Please clarify.
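On Point 6: CONTENT_LENGTH counts bytes in the request body, not characters, so a buffer of that size (plus one byte for a terminator) already holds the whole body whatever the encoding. A minimal sketch of the read, with read_post_body as a made-up name and no upper limit on the body size (a real CGI would want one):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a POST body of content_length bytes from in.
 * content_length is the CONTENT_LENGTH string, which counts bytes.
 * Returns a malloc'd, NUL-terminated buffer (caller frees) and stores the
 * number of bytes actually read in *out_len; returns NULL on error. */
char *read_post_body(FILE *in, const char *content_length, size_t *out_len)
{
    char *end;
    char *buf;
    unsigned long n;
    size_t got;

    if (content_length == NULL)
        return NULL;
    n = strtoul(content_length, &end, 10);
    if (end == content_length || *end != '\0')
        return NULL;                    /* not a plain decimal number */

    buf = malloc((size_t)n + 1);        /* +1 for the terminator only */
    if (buf == NULL)
        return NULL;

    got = fread(buf, 1, (size_t)n, in); /* raw bytes, no decoding here */
    buf[got] = '\0';
    *out_len = got;
    return buf;
}
```

In the CGI this would be called as read_post_body(stdin, getenv("CONTENT_LENGTH"), &len). With the Microsoft C runtime, stdin is in text mode by default, so it is probably wise to call _setmode(_fileno(stdin), _O_BINARY) first (declared in io.h and fcntl.h) so that bytes are not translated on the way in.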

I hope I have put down all I can think of; please let me know if you have any more questions.
Any help will be greatly appreciated.
thanks for your time,

The major design question is 16-bit or 8-bit. I'd tend to go for 8-bit Unicode support and use UTF-8. This diverges from normal ANSI use in that every ANSI character at or above hex 80 turns into two bytes instead of one. This is a problem for:

1) Legacy data - needs to be converted.
   (Applies only if the data currently contains characters like à è á ä ö ü etc.)
2) Sorting and searching
   Creating indexes on characters such as ä ü ö and so on works differently in UTF-8 than in ANSI.

If you currently have NO ä ü ö characters then the support will be simple. What is the current status in this regard?
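One way to answer that question for existing data is to scan it for any byte with the high bit set. Pure ASCII data is already valid UTF-8 and needs no conversion. A small sketch (is_pure_ascii is a hypothetical name):

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the first len bytes of s are all
 * 7-bit ASCII (and therefore already valid UTF-8), 0 otherwise. */
int is_pure_ascii(const char *s, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if ((unsigned char)s[i] >= 0x80)
            return 0;   /* an accented or other non-ASCII character */
    }
    return 1;
}
```

Running something like this over the legacy rows would show whether item 1) above applies at all.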
krishcharysudhar (Author) commented:
Hi BigRat,

I am not sure if we have those characters, but I know the data could come from Chinese, Japanese, Korean and other Far Eastern languages, where each character may be represented by a series of bytes. Hope this gives you an idea.
To clarify:
Point 1.
I am assuming that the input stream can be any character encoding (ascii, high ascii, double-byte)?

Point 2.
You haven't mentioned what SQL server is being used (Oracle 8.0.5 or MS SQL7?) - most newer SQL DBs support Unicode, but support for these character sets has to be defined during setup.

Point 3.
You should switch to portable data types and use the _TCHAR data type defined in the MS run-time library include file 'tchar.h' (check the MSDN reference library for the international functions included).

Point 4.
If you are certain the stream of data is Unicode - great. If not, Unicode input/output functions will assume Multi-byte and you can do any necessary conversion with appropriate functions from the run-time library.

Point 5.
Yes, you should make certain - specifically (as stated above) check the database for support of Unicode (see OracleNet for National Language Support [NLS] and Microsoft MSDN for MS SQL).

Point 6.
Be careful when passing string lengths to functions: some expect a length in characters and others a length in bytes.
Where a byte count is needed, pass not the plain 'string.length' result but 'string.length * sizeof(_TCHAR)' (_TCHAR being the MS type from tchar.h).

Defining your build as unicode (depending on your build platform) should include the variant libraries for unicode functions.
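To illustrate the character-count versus byte-count distinction from Point 6 without depending on tchar.h, here is a sketch using standard wchar_t for the 16-bit style and a separate counter for UTF-8. Both function names are made up for this example.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: size in bytes of a wide string, excluding the
 * terminator. wcslen() returns wide characters, not bytes. */
size_t wide_string_bytes(const wchar_t *s)
{
    return wcslen(s) * sizeof(wchar_t);
}

/* Hypothetical helper: number of code points in a UTF-8 string.
 * Continuation bytes have the bit pattern 10xxxxxx and are skipped,
 * so only the first byte of each character is counted.
 * Assumes s is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With wide strings the character count is smaller than the byte count by a factor of sizeof(wchar_t); with UTF-8 the factor varies per character, so strlen gives bytes and a counter like the one above gives characters.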

Hope this helps.

krishcharysudhar (Author) commented:
Hi camough,

Thanks for your response. I hope I can use your suggestions to get out of this problem.
I will go ahead and look into the code with them in mind and get back to this. In the meantime, please don't hesitate to add more info on this issue.
Camough: You are proposing a rewrite to use 16-bit encoding. If krish has an 8-bit system where he HAS NOT used characters above 127 (i.e. none of the European accented characters, only basic ASCII), then a switch to UTF-8 is trivial. He does not have to change much.

UTF-8 indexes just like ANSI (i.e. not very well), so there is little change there, and some databases support the format directly. If the application is completely browser based, there is very little to change in the CGI programs/HTML pages (only setting the charset in the content type to UTF-8).
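For a browser-based setup, the content type change in a C CGI amounts to one header line. A minimal sketch, assuming the program emits HTML and with write_utf8_header as a hypothetical name:

```c
#include <stdio.h>

/* Hypothetical helper: write a CGI response header that declares the
 * body as UTF-8 encoded HTML, followed by the blank line that ends the
 * header block. Returns 0 on success, -1 on a write error. */
int write_utf8_header(FILE *out)
{
    if (fputs("Content-Type: text/html; charset=utf-8\r\n\r\n", out) < 0)
        return -1;
    return 0;
}
```

The CGI would call write_utf8_header(stdout) before printing any of the page body.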

16-bit programming is a real pain and is only really supported by MS WinNT.
krishcharysudhar (Author) commented:
Hi BigRat,
I agree with you. In the meantime, my CGI is working fine with non-English characters without any modifications. Since the content length sent with the request varies accordingly, the CGI still gets the data fine. But since the interface API I am using in the middle tier doesn't support Unicode, we are not pursuing this support any further.
Thanks anyway for your advice.
Hope to get more help from you in the future.
krishcharysudhar (Author) commented:
thanx, BigRat
Certainly. Thanks for the cheese!
Question has a verified solution.
