Solved

Unicode support in a C CGI script.

Posted on 2000-04-18
Last Modified: 2010-08-05
Hi,

I am writing a C CGI that uses an application-specific API to write the incoming form data into a DB.
Now I have to add Unicode support to it, and I don't know what to do. Is changing char to a Unicode character type enough on its own? I read the form data from the input stream; what happens if there are Unicode characters in the stream? Do I have to make sure the API and the DB support Unicode? How should I implement Unicode compatibility in a C program?
Is there any site or article that can guide me in this?
Any help in this regard will be greatly appreciated.
thanks,
sudhar
Question by:krishcharysudhar
11 Comments
 
LVL 27

Expert Comment

by:BigRat
The incoming form data is effectively in Unicode already. Just set up IE 5.0 on WinNT, install a Russian keyboard (i.e. just the driver), and type in a few characters. You'll get "Unicode" (actually as &#number;) in the input.
   The next problem is your database. If you are using simple varchars, it may be worthwhile just converting to UTF-8 rather than 16-bit Unicode (UTF-16). UTF-8 takes two bytes for each ANSI character above 127, and you'll have to either change the HTML pages to UTF-8 or convert between UTF-8 and the ISO-8859-1 used in the HTML. The biggest advantage of UTF-8 is that C strings remain C strings, so you don't have to change a lot of program logic.
   So, what exactly is your environment? We'll proceed from there.
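
For illustration, here is a minimal sketch of turning those numeric references into UTF-8 bytes. It assumes the references arrive in decimal form such as "&#1078;"; the function names utf8_encode and decode_numeric_refs are made up for this example and are not part of any library.

```c
#include <assert.h>
#include <stdlib.h>

/* Encode one Unicode code point as UTF-8. Writes up to 4 bytes to out and
   returns the number of bytes written, or 0 if cp cannot be encoded.
   (Hypothetical helper for this sketch.) */
static size_t utf8_encode(unsigned long cp, char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Replace decimal references such as "&#1078;" in src with UTF-8 bytes.
   dst needs room for strlen(src) + 1 bytes, because a reference is never
   shorter than the bytes it decodes to. Malformed references and "&#0;"
   are copied through unchanged. */
static void decode_numeric_refs(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src) {
        if (src[0] == '&' && src[1] == '#' && src[2] >= '0' && src[2] <= '9') {
            char *end;
            unsigned long cp = strtoul(src + 2, &end, 10);
            size_t n = (*end == ';' && cp != 0) ? utf8_encode(cp, dst) : 0;
            if (n > 0) {
                dst += n;
                src = end + 1;
                continue;
            }
        }
        *dst++ = *src++;
    }
    *dst = '\0';
}
```

The result is still an ordinary NUL-terminated C string, which is the point made above about UTF-8: only the byte count changes.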
 
LVL 16

Expert Comment

by:maneshr
I think you should also post this question in the CGI forum at:

http://www.experts-exchange.com/Computers/WWW/CGI/
 

Author Comment

by:krishcharysudhar
Hi BigRat,
Thanks a lot for your support, but honestly I would like you to explain that in a bit more detail. My environment is like this.
Point 1:
The HTTP request (with form data) for my CGI DOES NOT COME FROM A BROWSER; rather, a standalone application sends its form data via HTTP over the internet.

Point 2:
My back end is a C CGI (exe) linked to an application-specific API, which then writes to SQL Server, all running on NT / IIS.

Point 3:
So far I have only used char in the code. Do I have to change it, and if so, to what?

Point 4:
I am not writing the client that sends the data, so I have no idea how it is going to support Unicode. In any case, as far as the HTTP request is concerned, what difference will I see (other than knowing the data will be Unicode)?

Point 5:
I send the data to the DB using the API. Should I make sure the API and the DB support Unicode?

Point 6:
Right now the form data is POSTed.
I get the CONTENT_LENGTH value, malloc a buffer of that size, and read that many bytes from stdin. What should I change here to support Unicode? I don't think allocating twice the size and reading it is enough; please clarify.

I hope I have put down all I can think of; please let me know if you have any more questions.
Any help will be greatly appreciated.
Thanks for your time,
sudhar
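
The read described in Point 6 can be sketched roughly as follows. CONTENT_LENGTH counts bytes, not characters, so a buffer of that size (plus one for a terminator) is enough whatever the encoding. The names read_body and parse_content_length are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse the text of CONTENT_LENGTH. Returns 0 and sets *out on success,
   -1 if the value is missing or not a plain decimal number. */
static int parse_content_length(const char *s, size_t *out)
{
    char *end;
    unsigned long v;
    assert(out != NULL);
    if (s == NULL || *s < '0' || *s > '9')
        return -1;
    v = strtoul(s, &end, 10);
    if (*end != '\0')
        return -1;
    *out = (size_t)v;
    return 0;
}

/* Read exactly len bytes of request body from in. Returns a malloc'd,
   NUL-terminated buffer that the caller frees, or NULL on a short read
   or allocation failure. len is a byte count, not a character count. */
static char *read_body(FILE *in, size_t len)
{
    char *buf;
    assert(in != NULL);
    buf = malloc(len + 1);
    if (buf == NULL)
        return NULL;
    if (fread(buf, 1, len, in) != len) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```

In a CGI this would be called as read_body(stdin, n) after parsing getenv("CONTENT_LENGTH"). On NT the MS C runtime opens stdin in text mode, so it is probably worth calling _setmode(_fileno(stdin), _O_BINARY) (declared in io.h and fcntl.h) first, so that bytes which happen to look like line endings or Ctrl-Z are not altered.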
 
LVL 27

Expert Comment

by:BigRat
The major design question is 16-bit or 8-bit. I'd tend to go for 8-bit Unicode support and use UTF-8. This differs from normal ANSI use in that every ANSI character above hex 7F turns into two bytes instead of one. This is a problem for:

1) Legacy data, which needs to be converted.
   (Applies only if the data currently contains characters like à è á ä ö ü etc.)
2) Sorting and searching.
   Creating indexes on characters such as ä ü ö and so on works differently in UTF-8 than in ANSI.

If you currently have NO ä ü ö characters then the support will be simple. What is the current status in this regard?
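
As a rough sketch of the legacy conversion in item 1, assuming the existing data is ISO-8859-1 (Windows code page 1252 differs in the hex 80 to 9F range and would need a lookup table instead). The function name latin1_to_utf8 is invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a NUL-terminated ISO-8859-1 string to UTF-8.
   dst must hold at least 2 * strlen(src) + 1 bytes, since every byte
   at or above hex 80 becomes two bytes.
   Returns the number of bytes written, not counting the terminator. */
static size_t latin1_to_utf8(const char *src, char *dst)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    for (; *src; ++src) {
        unsigned char c = (unsigned char)*src;
        if (c < 0x80) {
            dst[n++] = (char)c;                    /* ASCII is unchanged */
        } else {
            dst[n++] = (char)(0xC0 | (c >> 6));    /* lead byte */
            dst[n++] = (char)(0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    dst[n] = '\0';
    return n;
}
```

Data that is pure ASCII passes through byte for byte, which is why the conversion is a non-issue when there are no accented characters.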
 

Author Comment

by:krishcharysudhar
Hi BigRat,

  I am not sure if we have those characters, but I know the data could come from Chinese, Japanese, Korean and other Far Eastern languages, where each character may be represented by a series of bytes. Hope this gives you an idea.
thanks,
sudhar
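
To make the "series of bytes" point concrete: if the data is UTF-8, a CJK character in the Basic Multilingual Plane is three bytes, so byte length and character count differ. A small sketch follows, assuming valid UTF-8 input; utf8_count is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters (code points) in a NUL-terminated UTF-8 string by
   counting every byte that is not a continuation byte (10xxxxxx).
   Assumes the input is valid UTF-8; no validation is done. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the two-character string "\xE6\x97\xA5\xE6\x9C\xAC", strlen reports 6 bytes while utf8_count reports 2 characters. Buffers and CONTENT_LENGTH deal in the first number; column widths and display logic care about the second.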

Expert Comment

by:camough
To clarify:
Point 1.
I am assuming that the input stream can be in any character encoding (ASCII, high ASCII, double-byte)?

Point 2.
You haven't mentioned which SQL server version is being used (Oracle 8.0.5 or MS SQL 7?). Most newer SQL DBs support Unicode, but support for these character sets has to be defined during setup.

Point 3.
You should switch to portable data types, and use the _TCHAR data type defined in the MS run-time library include file 'tchar.h' (check the MSDN reference library for the international functions included).

Point 4.
If you are certain the stream of data is Unicode, great. If not, Unicode input/output functions will assume multi-byte, and you can do any necessary conversion with the appropriate functions from the run-time library.

Point 5.
Yes, you should make certain. Specifically (as stated above), check the database for Unicode support (see OracleNet for National Language Support [NLS] and Microsoft MSDN for MS SQL).

Point 6.
Be careful when passing string lengths to a function: check whether it expects a character count or a byte length.
For a byte length, don't pass the result of a 'string.length' type call as-is; multiply it by sizeof(_TCHAR).

Hints:
Defining your build as Unicode (depending on your build platform) should pull in the Unicode variants of the library functions.

Hope this helps.
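
The length point in Point 6 can be sketched like this. To keep the example portable it uses the standard wchar_t and wcslen rather than tchar.h; in an MS build defined as Unicode, _TCHAR and _tcslen would play the same roles. wide_bytes is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Number of bytes needed to store the wide string s, terminator included.
   wcslen returns a character count, so it has to be scaled by the size
   of one character before it can be used as a byte length. */
static size_t wide_bytes(const wchar_t *s)
{
    assert(s != NULL);
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

Passing wcslen(s) where a byte count is expected (malloc, memcpy, a write call) would copy or allocate too little as soon as a character is wider than one byte.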


 

Author Comment

by:krishcharysudhar
Hi camough,

Thanks for your response. I hope I can use it to get out of this problem.
I will go ahead and look into the code with your suggestions in mind and get back to this. In the meantime, please don't hesitate to add more info on this issue.
thanks,
sudhar
 
LVL 27

Accepted Solution

by:
BigRat earned 50 total points
Camough: You are proposing a rewrite to use 16-bit encoding. If krish has an 8-bit system where he HAS NOT used characters above 127 (i.e. none of the European accented characters, only basic ASCII) then a switch to UTF-8 is trivial. He does not have to change much.

UTF-8 indexes just like ANSI (i.e. not very well), so there is little change there. And some databases will support the format directly. If the application is completely browser based, there is very little to change in the CGI programs/HTML pages (only setting the content type to UTF-8).

16-bit programming is a real pain and is only really supported by MS WinNT.
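
For the browser-based case, "setting the content type to UTF-8" amounts to one header line in the CGI response. A minimal sketch, assuming the response is HTML; write_utf8_header is a name invented for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Write a CGI response header that declares the body as UTF-8 HTML,
   followed by the blank line that ends the header section.
   Returns 0 on success, -1 if the write failed. */
static int write_utf8_header(FILE *out)
{
    assert(out != NULL);
    if (fputs("Content-Type: text/html; charset=utf-8\r\n\r\n", out) < 0)
        return -1;
    return 0;
}
```

A CGI would call write_utf8_header(stdout) before printing the page. If stdout is left in text mode on NT, the C runtime already expands "\n" to CR LF, so the explicit "\r" would have to be dropped there.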
 

Author Comment

by:krishcharysudhar
Hi BigRat,
I agree with you. In the meantime, my CGI is working fine with non-English characters without any modifications: since the content length sent with the request varies accordingly, the CGI still gets the data fine. But since the interface API I am using in the middle tier doesn't support Unicode, we are not pursuing this support any more.
Thanks anyway for your advice.
Hope to get more help from you in the future.
bye,
sudhar
 

Author Comment

by:krishcharysudhar
thanx, BigRat
 
LVL 27

Expert Comment

by:BigRat
Certainly. Thanks for the cheese!
