Solved

flat database search with Far Eastern characters

Posted on 2002-03-03
Last Modified: 2013-12-25
I have a flat database online where visitors can input data or search the database. I now have to expand the search capability to include Far Eastern characters (Japanese).
The problem is, the script I am using currently is written in Python, which does not fully support Far Eastern characters.

However, I noted that it is possible to input Japanese characters properly, and they are stored correctly in the flat database. The search script garbles them, however.
Is there a simple solution to make a Python script handle Far Eastern characters? Or, as an alternative, can anyone suggest another search script that can search Japanese in a pipe-delimited database and display the result as HTML?
Question by:ppblue
13 Comments
 

Expert Comment

by:BigRat
I would handle Far Eastern characters by first converting them to UTF-8. In fact, UTF-8 will handle ALL character sets, so converting the database will be worth it. You then have a byte stream in which each character maps to a unique byte sequence.

You'll have to ensure that the character set used for your HTML pages is also UTF-8. Luckily all UTF-8 bytes numerically less than 128 are US-ASCII so the authoring of the pages is not too difficult.
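
As a rough illustration of the conversion described above, here is a minimal Python sketch. It assumes the existing flat file is stored as Shift-JIS (the thread suggests this but does not confirm it); the function name `reencode` and the sample record are hypothetical.

```python
def reencode(data: bytes, source: str = "shift_jis", target: str = "utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target one."""
    return data.decode(source).encode(target)


# Hypothetical record from a pipe-delimited file, stored as Shift-JIS (an assumption).
sjis_record = "東京|日本語".encode("shift_jis")
utf8_record = reencode(sjis_record)

# Plain ASCII text is byte-for-byte identical in UTF-8.
ascii_unchanged = reencode(b"name|city") == b"name|city"

# In UTF-8 every byte of a non-ASCII character is 0x80 or above, so an
# ASCII delimiter such as "|" can only ever appear as itself.
pipe_count = utf8_record.count(b"|")
```

The same function could be applied once to the whole .txt file, after taking a backup, before switching the pages to UTF-8.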

The next problem is the text input / textarea field in the HTML search form. You could use the ACCEPT-CHARSET attribute on the form so that the data is submitted as UTF-8 (I have not done this), or you could convert the text from 16-bit Unicode (the JavaScript standard) into UTF-8 on submit, using a bit of script (the charAt function) and a hidden field (I have done that).
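
The bit arithmetic behind such a conversion can be sketched as follows. This is written in Python rather than JavaScript and works on whole code points, which is an assumption: script that reads 16-bit values would first have to combine surrogate pairs for characters outside the Basic Multilingual Plane. The function name is hypothetical.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp < 0x80:
        # US-ASCII: one byte, unchanged.
        return bytes([cp])
    if cp < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx (most Japanese characters)
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def encode_text(text: str) -> bytes:
    """Encode a whole string one code point at a time."""
    return b"".join(encode_code_point(ord(ch)) for ch in text)
```

In practice the built-in `str.encode("utf-8")` does the same job; the hand-written version only shows what the script has to compute.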

Your Python code then thinks that it is dealing with 8-bit ISO-8859-1 whereas it is actually UTF-8 (which looks strikingly similar!).

HTH

Author Comment

by:ppblue
Thanks for answering.
The flat database is a txt file. I do not know if the above method is practical in this case.
Also, most Japanese web sites declare no encoding at all. They are usually in either Shift-JIS (mostly) or EUC encoding. Maybe I am wrong, but this seems to mean that their browsers have no way of detecting a page's encoding, which in turn probably means that their browsers are set to Shift-JIS by default and must be switched to EUC manually if a page is not legible. I have some doubts whether they would detect UTF encoding. I know nothing about Python, but I would prefer to be on the safe side.

Expert Comment

by:BigRat
"browsers detecting the encoding..."

This WAS a problem in the old days, because the server sent no Content-Type header. In fact, if you just send Content-Type text/html without specifying the character set, you get into this old "browser must detect the character set" problem.

Strictly speaking internally in the browser the character set is Unicode. Javascript's charAt function actually returns a 16-bit value for the characters which is actually UTF-16.

The problem is the octet stream returned by the server. This byte or octet stream is encoded in what is called the transport character set, which says how the "Unicode" characters are encoded. In HTTP/1.1 and HTML 4 there should be a charset parameter on the Content-Type response header, and if it is missing the character set defaults to ISO-8859-1.
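
To make the defaulting rule concrete, here is a small Python sketch that reads the charset parameter from a Content-Type value and falls back to ISO-8859-1 when it is missing. It uses the standard library's `email.message.Message` as a header parser; the function name `charset_of` is hypothetical.

```python
from email.message import Message


def charset_of(content_type: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, lower-cased,
    or the default when the header carries no charset."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=default)


declared = charset_of("text/html; charset=UTF-8")
undeclared = charset_of("text/html")
```

A page served with the bare `text/html` value is therefore the case in which the browser either applies the default or has to guess.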

Because the servers do not set this correctly, and the default may not apply, one can often set this in the browser. Indeed, it is often assumed that the browser and the server are using the same character set (which probably explains why no one in Japan surfs in Russia).

The real problem with multi-language sites is getting the correct characters displayed. There are two approaches. First the "least-common-denominator" approach, secondly the UTF approach.

The second approach I have described. This involves setting the HTTP response to be UTF-8 and always processing this sort of data. This is always an 8-bit byte stream which looks almost identical to ISO-8859-x (indeed the US-ASCII part is the same), and programming and scripting languages are transparent to it. IE was UTF-enabled from 3.0, and I believe Netscape from 4.0, but I'm not sure. In any event it is a W3C/Unicode standard.

The first approach involves converting every character which is not US-ASCII and is being sent to the browser into an HTML entity (i.e. ampersand, number sign, number, semicolon). And of course you'd have to convert everything back again on input.
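
A minimal Python sketch of this entity approach, in both directions, using only the standard library. The function names are hypothetical, and note one limitation: `html.unescape` also decodes named entities such as `&amp;`, so the round trip is only exact for text that contains no entities of its own.

```python
import html


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_entities(text: str) -> str:
    """Turn character references back into the characters they stand for."""
    return html.unescape(text)


encoded = to_entities("日本")   # "&#26085;&#26412;"
decoded = from_entities(encoded)
```

The output is pure US-ASCII, so it displays the same regardless of which character set the page declares.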

I personally would make the change to UTF-8. Notepad on Win2K will read a .txt file and store it in UTF-8. It inserts a three-byte UTF-8 header (the byte order mark, EF BB BF) at the beginning of the file, which can be detected in scripts.
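
Detecting that three-byte header in a script could look like the following Python sketch. The helper name `strip_utf8_bom` and the sample record are hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove the UTF-8 byte order mark if the data starts with one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical file content as an editor might save it, with the header.
record = "名前|住所"
saved = codecs.BOM_UTF8 + record.encode("utf-8")

without_bom = strip_utf8_bom(saved).decode("utf-8")

# Alternatively, the "utf-8-sig" codec skips the header while decoding.
also_without_bom = saved.decode("utf-8-sig")
```

Skipping the header matters for a pipe-delimited file because otherwise the mark ends up glued to the first field of the first record.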

Author Comment

by:ppblue
Thanks BigRat!
I have two scripts: one for inputting into the flat database from a form, the other for performing the search. But I do not know Python! I heard that Python is weak in handling Far Eastern languages, at least in earlier versions. Unfortunately, the company I got the script from seems to have folded. How would I go about doing the conversions you suggest? Do you know Python?

Expert Comment

by:BigRat
Sorry, no Python. But the input script assumes ANSI 8-bit, so if we send UTF 8-bit it won't know the difference. So no change.

The search script is interesting. Post it. It might do character conversions before searching and that we might have to change.

Lastly, you'll need to change the HTML pages for the UTF character set.

Author Comment

by:ppblue
Maybe I was wrong and the other part of the script set was in Python. Looks as if it were in Perl.
trans-direct.net/test/searchscript.txt

Expert Comment

by:BigRat
Not being familiar with Perl ModDB, I can't see where the actual searching takes place. I mean where an input term is tested against a field in the database. It is in such places that implicit normalization of characters takes place. And it is in that position that one might have to make changes for UTF-8.

The statement :-

print "Content-type: text/html\n\n";

will need to be changed to add a charset parameter to the Content-Type entity header (as they are called), specifying UTF-8.
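
In the Perl script itself, the changed line would presumably read `print "Content-type: text/html; charset=UTF-8\n\n";`. The same idea as a Python sketch, for a CGI script that builds its whole response as bytes (the function name `render_response` is hypothetical, and the body must really be encoded in the charset the header names):

```python
def render_response(body: str, charset: str = "utf-8") -> bytes:
    """Build a CGI response whose header names the charset the body uses."""
    header = f"Content-Type: text/html; charset={charset}\n\n"
    # The header is plain ASCII; the body is encoded in the declared charset.
    return header.encode("ascii") + body.encode(charset)


response = render_response("<p>検索結果</p>")

# A CGI script would then write the bytes unchanged, for example:
#   import sys
#   sys.stdout.buffer.write(response)
```

If the database stays in Shift-JIS, passing `charset="shift_jis"` keeps the header and the bytes consistent with each other.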

Expert Comment

by:maneshr
ppblue,

Did you get the solution you were looking for?

What solution, if any, did you use?

Let us know.

Author Comment

by:ppblue
maneshr,

No solution to the problem yet. I am now using a second (different) script to display the whole db, but I have not found a "simple" solution for searching it. I may have messed up the script during configuration, as it seemed to work.

Expert Comment

by:maneshr
ppblue,

"..No solution to the problem yet..."

Hmm....Do you think it would make sense to delete this question & post a fresh new one providing additional details?

Since you posted this question, I am sure you have gained better insight into the problem.

Also, posting a new question will bring it to the top of the heap in the topic area.

my 0.02 cents.

Thanks.

Expert Comment

by:BigRat
Why is this question "pending deletion" without a response to my last input, namely setting the returning character set correctly?

Author Comment

by:ppblue
BigRat,

I really appreciate your willingness to help, but I do not understand much of either Perl or Python, though I can modify parts of a script when I am told exactly what to do. What I mean is that I need to know concretely how to modify the script; I cannot modify the code myself. I took your answer to mean that you cannot help me with my specific request. If you can tell me the string or strings to change, I will try. I really need search capability for the database for Far Eastern characters.
Currently, I just put the matter off until there are sufficient entries, by outputting the whole database to an HTML file, because with that script I could modify encoding settings myself.
The status quo is that I need to search a flatfile db with Far Eastern characters, either searching the text file or the HTML file will do, preferably without modifying the input to the db. The search file I mentioned above is for searching the db directly (the original). The HTML output at http://www.trans-direct.net/tad/jp/freel/jp_freel_db.html (no content yet) is done by a display script. This is an interim solution only.
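
For readers with the same requirement, here is a minimal Python sketch of searching a pipe-delimited flat file that contains Japanese and rendering the hits as an HTML table. It is not the script discussed in this thread; the function names and sample records are hypothetical, and it assumes the file is stored as Shift-JIS.

```python
import html


def search_records(raw: bytes, term: str, encoding: str = "shift_jis"):
    """Return the records (lists of fields) in which any field contains term."""
    # Decode first, so that multi-byte characters are never cut apart.
    text = raw.decode(encoding)
    matches = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("|")  # split on the delimiter only after decoding
        if any(term in field for field in fields):
            matches.append(fields)
    return matches


def render_html(records) -> str:
    """Render matching records as an HTML table, escaping markup characters."""
    rows = "\n".join(
        "<tr>" + "".join(f"<td>{html.escape(f)}</td>" for f in rec) + "</tr>"
        for rec in records
    )
    return f"<table>\n{rows}\n</table>"


# Hypothetical database content, encoded as Shift-JIS.
raw = "田中|東京|翻訳\nポール|大阪|通訳\n".encode("shift_jis")
hits = search_records(raw, "大阪")

# Splitting the raw bytes instead breaks the second record: in Shift-JIS the
# second byte of "ポ" has the same value as the ASCII "|" delimiter.
naive_fields = raw.split(b"\n")[1].split(b"|")
```

Decoding before splitting matters because Shift-JIS trail bytes can coincide with ASCII punctuation such as `|` and `\`. Whether that is what garbles the search in this particular script cannot be told from the thread, but it is a common failure in byte-oriented scripts.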

Accepted Solution

by:
BigRat earned 200 total points
What I meant was that I'm no Python programmer. What I did say was that I could look at the search code. I see from the HTML pages that you use Shift-JIS as character encoding in the page, which probably means that the entire operation will be conducted in this character set, which does not cover the Big-5.

You need the help of a professional programmer to redesign the system properly and this here is not the correct forum for that. So you can delete the question if you want. I was just a bit upset with the deletion without any response to my last comment.

Good luck!