Unicode support in a C CGI script.

Posted on 2000-04-18
Medium Priority
Last Modified: 2010-08-05

I was writing a C CGI program which uses an application-specific API to write the incoming form data into a DB.
Now I have to add Unicode support to it, and I don't know what to do. Is changing char to a Unicode character type alone enough? I read the form data from the input stream; what happens if there are Unicode characters in the stream? Do I have to make sure the API and the DB support Unicode? How should I implement Unicode compatibility in a C program?
Is there any site or article that can guide me in this?
Any help in this regard will be greatly appreciated.
Question by:krishcharysudhar
LVL 27

Expert Comment

ID: 2730371
The incoming form data is effectively in Unicode. Just set up IE 5.0 on WinNT, install a Russian keyboard (i.e. just the driver) and hack in a few characters. You'll get "Unicode" (actually as &#number; references) in the input.
   The next problem is your database. If you are using simple varchars it may be worthwhile just converting to UTF-8 rather than 16-bit Unicode (UTF-16). UTF-8 will entail two bytes for each of the ANSI characters above 127, and you'll have to either change the HTML pages to UTF-8 or convert between UTF-8 and the ISO-8859-1 used by the HTML. The biggest advantage of UTF-8 is that C strings remain C strings, so you don't have to change a lot of program logic.
   So, what exactly is your environment and we'll proceed from there.
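   For what it's worth, here is a rough sketch of turning those &#number; references into UTF-8 bytes. The function names utf8_encode and decode_numeric_refs are made up for illustration, and it assumes the client really does send decimal references of that form, which should be checked against real input before relying on it.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Replace decimal references such as "&#1046;" with UTF-8 bytes, in place.
 * A reference is always at least as long as its UTF-8 form, so the string
 * can only shrink. Anything that is not a valid reference is copied as-is. */
void decode_numeric_refs(char *s)
{
    char *src = s, *dst = s;

    while (*src) {
        if (src[0] == '&' && src[1] == '#' && src[2] >= '0' && src[2] <= '9') {
            char *end;
            char buf[4];
            unsigned long cp = strtoul(src + 2, &end, 10);
            size_t n = (*end == ';' && cp != 0) ? utf8_encode(cp, buf) : 0;

            if (n > 0) {
                memcpy(dst, buf, n);
                dst += n;
                src = end + 1;            /* skip past the ';' */
                continue;
            }
        }
        *dst++ = *src++;
    }
    *dst = '\0';
}
```

   The result is still an ordinary NUL-terminated char string, so strlen, strcpy and friends keep working on it, which is the point about C strings remaining C strings.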
LVL 16

Expert Comment

ID: 2730800
i think you should also post this Q in the CGI forum at..


Author Comment

ID: 2730810
Hi BigRat,
Thanks a lot for your support, but honestly I would like you to explain that in a bit more detail. My environment is like this.
Point 1:
My HTTP request (with form data) for my CGI DOES NOT COME FROM A BROWSER; rather, a stand-alone application sends its form data via HTTP over the internet.

Point 2:
My back end is a C CGI (exe), linked to an application-specific API which then writes to SQL Server, all running on NT / IIS.

Point 3:
So far I have only used char in the code. Do I have to change it, and if so, to what?

Point 4:
I am not writing the client that sends the data, so I have no idea how it is going to support Unicode. In any case, as far as the HTTP request is concerned, what difference will I see? (I know the data will be Unicode, but other than that?)

Point 5:
I send the data to the DB using the API. Should I make sure the API and the DB support Unicode?

Point 6:
Right now the form data is POSTed.
I get the CONTENT_LENGTH value, malloc a buffer of that size and read that many bytes from stdin. What should I change here to support Unicode? I don't think allocating twice the size and reading it is enough. Please clarify.

I hope I have put down all I can think of. Please let me know if you have any more questions.
Any help will be greatly appreciated.
Thanks for your time,
LVL 27

Expert Comment

ID: 2732169
The major question of design is 16-bit or 8-bit. I'd tend to go for 8-bit Unicode support and use UTF-8. This diverges from normal ANSI use in that all ANSI characters above hex 7F turn into two or more bytes instead of one. This is a problem for :-

1) Legacy data - needs to be converted.
   (Applies only if the data currently contains characters like à è á ä ö ü etc)
2) Sorting and searching
   Creating indexes on characters such as ä ü ö and so on works differently in UTF-8 than in ANSI

If you currently have NO ä ü ö characters then the support will be simple. What is the current status in this regard?
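
A quick way to answer that question for existing data is to scan it for bytes above hex 7F. The sketch below does that and also shows the legacy conversion for the simplest case. The names is_pure_ascii and latin1_to_utf8 are invented here, and the conversion assumes the legacy data is strictly ISO-8859-1; Windows code page 1252 differs in the range hex 80 to 9F and would need a lookup table instead.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if every byte of s is 7-bit ASCII, 0 otherwise.
 * Pure ASCII data is already valid UTF-8 and needs no conversion. */
int is_pure_ascii(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    for (; *p; p++)
        if (*p > 0x7F)
            return 0;
    return 1;
}

/* Convert an ISO-8859-1 string to a newly allocated UTF-8 string.
 * Each byte above 0x7F becomes exactly two bytes, so 2 * strlen(s) + 1
 * is always enough room. The caller frees the result.
 * Returns NULL if the allocation fails. */
char *latin1_to_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    char *out = malloc(2 * strlen(s) + 1);
    char *q = out;

    if (out == NULL)
        return NULL;
    for (; *p; p++) {
        if (*p < 0x80) {
            *q++ = (char)*p;
        } else {
            *q++ = (char)(0xC0 | (*p >> 6));
            *q++ = (char)(0x80 | (*p & 0x3F));
        }
    }
    *q = '\0';
    return out;
}
```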

Author Comment

ID: 2732235
Hi BigRat,

  I am not sure if we have those characters. But I know the data could come from Chinese, Japanese, Korean and other Far Eastern languages, where each character may be represented by a series of bytes. Hope this gives you an idea.

Expert Comment

ID: 2742101
To clarify:
Point 1.
I am assuming that the input stream can be any character encoding (ascii, high ascii, double-byte)?

Point 2.
You haven't mentioned what SQL server is being used (Oracle 8.0.5 or MS SQL7?) - most newer SQL DBs support Unicode, but support for these character sets has to be defined through setup.

Point 3.
You should switch to portable data types, and use the _TCHAR data type defined in the MS run-time library include file 'tchar.h' (check the MSDN reference library for the international functions included).

Point 4.
If you are certain the stream of data is Unicode - great. If not, Unicode input/output functions will assume Multi-byte and you can do any necessary conversion with appropriate functions from the run-time library.

Point 5.
Yes, you should make certain - specifically (as stated above) check the database for support of Unicode (see OracleNet for National Language Support [NLS] and Microsoft MSDN for MS SQL)

Point 6.
Be careful when passing string lengths to functions: some depend on character lengths and others on byte lengths.
You can pass the byte length not by doing a 'string.length' type call, but by 'string.length * sizeof(_TCHAR)' (_TCHAR being the MS type).
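
As a rough illustration of the byte-versus-character distinction, here is a small sketch. The helper names are hypothetical, wchar_t is used as a portable stand-in for _TCHAR in a Unicode build, and the UTF-8 counter assumes well-formed input.

```c
#include <stddef.h>
#include <string.h>

/* Count characters (code points) in a NUL-terminated UTF-8 string by
 * skipping continuation bytes, which always look like 10xxxxxx.
 * strlen() on the same string returns the number of BYTES instead. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Bytes needed to hold nchars wide characters plus the terminator.
 * This is the "length * sizeof(character type)" rule applied to wchar_t. */
size_t wide_buffer_bytes(size_t nchars)
{
    return (nchars + 1) * sizeof(wchar_t);
}
```

For a two-character Japanese string encoded as UTF-8, strlen reports 6 while utf8_char_count reports 2, so buffer sizes should come from the byte count and display widths or column limits from the character count.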

Defining your build as unicode (depending on your build platform) should include the variant libraries for unicode functions.

Hope this helps.


Author Comment

ID: 2743738
Hi camough,

Thanks for your response. I hope I can
use these suggestions to get out of this problem.
I will go ahead and look into the code with your suggestions in mind and get back to this. In the meantime, please don't hesitate to add more info on this issue.
LVL 27

Accepted Solution

BigRat earned 150 total points
ID: 2748799
Camough: You are proposing a rewrite to use 16-bit encoding. If krish has an 8-bit system where he HAS NOT used characters above 127 (i.e. none of the European accented characters - only basic ASCII) then a switch to UTF-8 format is trivial. He does not have to change much.

UTF-8 indexes just like ANSI (i.e. not very well) so there is little change there. And some databases will support the format directly. If the application is completely browser based there is very little to change in the CGI programs/HTML pages (only setting the charset in the content type to UTF-8).
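
Setting the content type from a C CGI might look something like the sketch below. The function name write_utf8_header is made up, and text/html is only an assumed media type. One caveat for NT / IIS: stdout is normally in text mode there and translates newlines, so it may be necessary to switch it to binary mode (the MS run-time has _setmode for this) or adjust the line endings.

```c
#include <stdio.h>

/* Write a CGI response header that declares the body as UTF-8.
 * The blank line (second CRLF) ends the header block.
 * Returns the fprintf result: negative on error. */
int write_utf8_header(FILE *out)
{
    return fprintf(out, "Content-Type: text/html; charset=UTF-8\r\n\r\n");
}
```

In the CGI itself this would be called as write_utf8_header(stdout) before any of the page body is written.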

16-bit programming is a real pain and is only really supported by MS WinNT.

Author Comment

ID: 2748946
Hi BigRat,
I agree with you. In the meantime, my CGI is working fine with non-English characters without any modifications. Since the content length sent with the request varies accordingly, the CGI still gets the data fine. But since the interface API I am using in the middle tier doesn't support Unicode, we are not bothering with this support anymore.
Thanks anyway for your advice.
Hope to get more help from you in
the future.
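
For anyone reading later, the reason it keeps working is that CONTENT_LENGTH counts bytes, not characters, so reading exactly that many bytes picks up multi-byte data unchanged. A minimal sketch of that read follows; the names parse_content_length and read_body are invented for illustration and this is not the actual code from my CGI. On NT it assumes stdin has been put into binary mode so that no newline translation happens.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse the CONTENT_LENGTH value (a decimal byte count).
 * Returns 0 and stores the count in *out, or -1 if value is missing
 * or not a plain decimal number. */
int parse_content_length(const char *value, size_t *out)
{
    char *end;
    unsigned long n;

    if (value == NULL || *value < '0' || *value > '9')
        return -1;
    n = strtoul(value, &end, 10);
    if (*end != '\0')
        return -1;
    *out = (size_t)n;
    return 0;
}

/* Read exactly len bytes from in into a new NUL-terminated buffer.
 * The buffer size is len + 1 BYTES; no doubling is needed for UTF-8.
 * The caller frees the result. Returns NULL on allocation failure
 * or if fewer than len bytes were available. */
char *read_body(FILE *in, size_t len)
{
    char *buf = malloc(len + 1);
    size_t got = 0;

    if (buf == NULL)
        return NULL;
    while (got < len) {
        size_t n = fread(buf + got, 1, len - got, in);
        if (n == 0)
            break;                        /* EOF or read error */
        got += n;
    }
    if (got != len) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```

Typical use would be parse_content_length(getenv("CONTENT_LENGTH"), &len) followed by read_body(stdin, len).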

Author Comment

ID: 2748949
thanx, BigRat
LVL 27

Expert Comment

ID: 2749030
Certainly. Thanks for the cheese!
