sam15
asked on
unicode_columnTypes
I have a 9i database in WE8ISO8859P1
A new requirement came in to store some data like book titles in Russian, arabic, and other foreign languages.
What would be the best thing to do?
a) Change the whole database to UTF-8. That would require creating a new DB, migrating the data, and testing everything - a lot of work, I expect.
b) Create new columns with NVARCHAR2(4000) or NCLOB type. This would require many code changes in the client code.
c) Change existing columns that need to store foreign data from VARCHAR2 to NVARCHAR2.
Would c) work? Isn't it the best and easiest solution? Would there be any consequences for data stored in the DB, or for the PL/SQL web client (SQL) or PowerBuilder client accessing the data?
ASKER
Option b was creating new columns, so basically I would add a new NVARCHAR2 column to store foreign titles while the regular VARCHAR2 column stores the English version. That makes things complex when coding.
I think you meant option c. If Oracle has no ALTER for the column type, I can still create a new NVARCHAR2 column and copy the data. Would that cause any issues for the client when you have a SQL statement selecting several columns, some VARCHAR2 and some NVARCHAR2?
The client can be a browser or PowerBuilder. PB clients, I think, run in Unicode.
So just to be clear, option B would be to have both a VARCHAR2 and a NVARCHAR2 column in the table storing the same data with the client application figuring out which column to use depending on the data element? That would seem like an awful lot of work to try to keep that straight.
It is certainly possible to build an application along these lines. But all your applications would need to be able to handle NVARCHAR2 columns. That depends on the client language (and I'm no PowerBuilder guru). I know that it's quite possible to handle NVARCHAR2 columns in Java, for example, but it is something that trips up a lot of developers. And third party tools will often have difficulty handling NVARCHAR2 data.
ASKER
What would be the difference if the whole DB is set to UTF-8, versus the column being NVARCHAR2 with a Unicode national character set?
It is a similar thing: the data will be stored and rendered in multibyte form for the client.
If the database character set is UTF-8 (presumably AL32UTF8), all the data would still be stored in a VARCHAR2 column. So client applications don't need to know how to handle NVARCHAR2 columns, you don't have to deal with string constants that may not be representable in the database character set, etc.
ASKER
Yes it will, but it is multibyte storage instead of single-byte storage.
I am not sure what you mean by how clients handle data.
The client usually converts the data to whatever the client's character set is, I thought.
Are you saying the client will get confused when you send single-byte field data and multibyte field data?
I'm saying that the client applications have to be aware that certain columns are NVARCHAR2 and certain columns are VARCHAR2 independent of whether the VARCHAR2 columns happen to contain single or multi-byte characters. They have to bind the data a bit differently, they have to handle string constants a bit differently, etc.
A Java application, for example (barring some real corner cases), will always request Unicode data from the database regardless of the database character set. So Java applications inherently handle multi-byte data. But handling NVARCHAR2 columns creates a number of problems for a Java developer. The calls that the Java app needs to make to tell Oracle to treat one bind variable as an NVARCHAR2 and another as a VARCHAR2 can be a bit verbose and many libraries that a Java developer may want to use won't be aware of how to do that. So while it is perfectly possible to write a Java application that correctly handles databases where some columns are VARCHAR2 and others are NVARCHAR2, you're introducing a number of extra hurdles and potentially eliminating some libraries.
The same goes with third party tools. Even tools that happily support multi-byte data may not have been coded to know how to handle NVARCHAR2 columns correctly. Some will, of course, but you're setting yourself up down the line to not be able to use a tool that you would like to use because it supports Unicode data but not Oracle's national character set implementation (tools that are designed to support many different databases are often the slowest to adopt this sort of Oracle-specific functionality).
ASKER
Oh, I think I see what you're saying.
In PL/SQL, for example, I have to change the variable type from VARCHAR2(n) to NVARCHAR2(n) unless I use anchored data types like
my_var table.column%TYPE;
I am wondering, if I have a SQL statement like this, would it fetch the data as multibyte or single-byte?
select varchar2_column, nvarchar2_column from table;
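The %TYPE anchoring mentioned above is usually the simplest route in PL/SQL, since the variable inherits VARCHAR2 or NVARCHAR2 from the column automatically. A minimal sketch, assuming a hypothetical BOOKS table with a TITLE column:

```sql
DECLARE
  -- Anchored declaration: picks up VARCHAR2 or NVARCHAR2 (and the length)
  -- from the column, so this block keeps working if the column type changes.
  v_title books.title%TYPE;
BEGIN
  SELECT title INTO v_title FROM books WHERE book_id = 1;
END;
/
```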
The character set that is returned to the client depends on the client's NLS settings (which introduces another set of wrinkles since those settings don't work exactly the same way for NVARCHAR2 and VARCHAR2 columns). Either the VARCHAR2 or the NVARCHAR2 column could be returned as a single-byte or as a multi-byte string.
There is relatively little that needs to be done for PL/SQL to support NVARCHAR2 and VARCHAR2 columns because data never gets converted from one character set to another and Oracle internally handles all the issues with binding the right data type for a bind variable. If you want to hard code values that cannot be represented in the database character set (say, you want your code to return an NVARCHAR2 error message), that introduces some complexity.
If you're writing a Java application, however, every time you bind a variable that is not the default (which you can set to VARCHAR2 or NVARCHAR2), you have to issue an additional call to setFormOfUse to tell the JDBC driver to treat the column as whichever option is not the default. So every time you pass an NVARCHAR2 bind variable from the client application, the client code has to make an extra call to tell Oracle that the column is NVARCHAR2 that doesn't have to be made if everything is stored in a VARCHAR2 column. If your library/ framework/ etc. doesn't know how to do that, it won't work with NVARCHAR2 columns. And if you happen to have applications where developers are generating SQL statements by concatenating strings together rather than using bind variables, things get even more complicated because SQL statements have to be representable in the database character set.
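For the hard-coded-values case mentioned above, Oracle provides N'...' literals and the UNISTR function, which builds a national-character-set string from Unicode escape sequences so the literal survives even when the source code itself is stored in a single-byte character set. A sketch (the function name is made up for illustration):

```sql
CREATE OR REPLACE FUNCTION get_error_msg RETURN NVARCHAR2 IS
BEGIN
  -- UNISTR returns NVARCHAR2 data built from \xxxx Unicode code points;
  -- these are the Cyrillic letters spelling "Ошибка" ("Error").
  RETURN UNISTR('\041E\0448\0438\0431\043A\0430');
END;
/
```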
ASKER
I misunderstood their requirement.
I found out they need to keep the English book titles, narrators, and authors, but want to add a second set of foreign data for these titles so foreign users can read it and search by it.
It sounds like the easiest thing is adding a FOREIGN_DATA_TABLE with a set of columns like
TITLE NVARCHAR2(4000) or NCLOB
NARRATOR_FNAME NVARCHAR2(100)
NARRATOR_LNAME NVARCHAR2(100)
The other English data stays in WE8ISO8859P1.
I can use the Oracle CONVERT function here if I want to convert it to UTF-8 before it goes to the client.
What do you think of this solution?
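A sketch of that table, using the column names above; the BOOK_ID key back to the English row and the BOOKS parent table are assumptions for illustration. Note that with an AL16UTF16 national character set on 9i, NVARCHAR2 tops out at 2000 characters (4000 bytes), so longer titles would need NCLOB:

```sql
CREATE TABLE foreign_data_table (
  book_id         NUMBER NOT NULL,    -- assumed FK to the English book row
  title           NVARCHAR2(2000),    -- or NCLOB for titles beyond the limit
  narrator_fname  NVARCHAR2(100),
  narrator_lname  NVARCHAR2(100),
  CONSTRAINT fk_fdt_book FOREIGN KEY (book_id) REFERENCES books (book_id)
);
```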
You wouldn't want to use the CONVERT function. The Oracle networking layer will automatically convert data to the character set the client requests (a Java application, for example, would want UTF-16 data rather than UTF-8). But your application code would need to know how to bind NCHAR/ NVARCHAR2 parameters and fetch NCHAR/ NVARCHAR2 data which is going to require some client configuration changes. For PowerBuilder, the code changes to support NCHAR/ NVARCHAR2 columns are documented on the Sybase site. Other libraries/ languages/ clients will have similar requirements (assuming they support NCHAR/ NVARCHAR2).
ASKER
So you're saying that creating a table like the above using the national character set, and writing the client (regardless of language) to handle it, will enable us to store, manipulate, and retrieve foreign data.
There is no need to convert the whole database to Unicode format.
Correct?
There is no need to convert the entire database to Unicode.
In my experience, though, you'll save yourself time and frustration over the long run by converting the entire database to Unicode. It is all but inevitable that over time the amount of non-English data is going to grow and the number of columns that will store non-English data will grow. You'll find that some library or reporting tool that you want to use either doesn't support columns using the national character set or that configuring it to do so is highly esoteric. Some developers will struggle with character set issues because they didn't bind to an NVARCHAR2 column correctly, etc. None of these may be fatal. But over the long run, my bias would be that converting the entire database to Unicode, while more work immediately, will pay dividends over the life of your data.
ASKER
So you agree with creating one Unicode table and handling it in the client?
Converting the whole database, I think, would need a lot of testing. I am thinking a two-month project for a 300-table database with 600,000 lines of code. What do you think?
ASKER CERTIFIED SOLUTION
Option b can work. You'll need to move the data, which is going to be similar to the work required for option a if you're trying to convert any fraction of the columns. You can't just do an ALTER that changes the data type of a column from VARCHAR2 to NVARCHAR2. You would need to create a new NVARCHAR2 column, copy the data, and then remove the old VARCHAR2 column and rename the new column to the old column name. Plus, NVARCHAR2 columns introduce a reasonable amount of additional complexity in the client code. I don't know off the top of my head whether PowerBuilder even supports NVARCHAR2 columns (it may depend on the version).
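The create/copy/rename sequence described above might look like this, assuming a hypothetical BOOKS table with a VARCHAR2 TITLE column (RENAME COLUMN requires 9i Release 2; on earlier releases you would recreate the table instead):

```sql
-- 1. Add the new column in the national character set.
ALTER TABLE books ADD (title_n NVARCHAR2(400));

-- 2. Copy the data; Oracle converts between the database and
--    national character sets automatically on assignment.
UPDATE books SET title_n = title;

-- 3. Drop the old column and move the new one into its place.
ALTER TABLE books DROP COLUMN title;
ALTER TABLE books RENAME COLUMN title_n TO title;
```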