Solved

Reading file stored in utf8 format

Posted on 2004-10-04
7
1,424 Views
Last Modified: 2012-08-13
Hi,
   I would like to know if this code must be changed. Our character set was changed to UTF-8.
   Since the change, some programs do not work correctly...
   For example, I developed a program that reads a file from a BLOB field, and I use code like this:
...
lv_len_blob:=dbms_lob.getlength(lv_blob);
...
dbms_lob.read(lv_blob,ofsetlen,p_seek_point,lv_data);
...
lv_line:=utl_raw.cast_to_varchar2(lv_data);
...
  I would like to know if this code is compatible with UTF-8.

Thanks.
Marcelo
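Whether this kind of code survives a move to UTF-8 comes down to whether a fixed-size byte read can cut a multi-byte character in half. A minimal Python sketch of the hazard and of chunk-safe decoding (nothing Oracle-specific here; the chunk size and sample data are assumptions for illustration only):

```python
import codecs

# Hypothetical stand-in for a BLOB holding UTF-8 text:
blob = ("données réseau " * 300).encode("utf-8")
chunk_size = 4000

# Naive per-chunk decode can fail when a chunk boundary splits a
# multi-byte sequence -- the same hazard as casting a fixed-size
# RAW chunk to VARCHAR2 in a UTF-8 database.
naive_failed = False
try:
    blob[:chunk_size].decode("utf-8")
except UnicodeDecodeError:
    naive_failed = True   # the 4000th byte lands mid-character here

# An incremental decoder buffers a trailing partial character
# and completes it with the next chunk, so any chunk size works.
decoder = codecs.getincrementaldecoder("utf-8")()
text = []
for i in range(0, len(blob), chunk_size):
    text.append(decoder.decode(blob[i:i + chunk_size]))
text.append(decoder.decode(b"", final=True))
result = "".join(text)
assert result == "données réseau " * 300
```

The analogous fix on the Oracle side is to read character data with character-aware APIs (a CLOB with DBMS_LOB character offsets) instead of casting raw byte chunks.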
Question by:mosorio
7 Comments
 
LVL 47

Expert Comment

by:schwertner
ID: 12217787
Binary LOBs are not meant for storing character data.
There are differences between binary and character data.
I found this on the net:

A flowchart to see if you have incorrect characters in the database.
-------------------------------------------------------------------
 
* If following this flowchart takes you further than step 2,
  forget about going directly to another character set, even UTF8 or so.
  First make sure that you correct your current *incorrect* setup, even if this
  means going first to a character set that you don't want....
 
* If you are following this note because, after a migration to a new version *without* a change of character set,
  you have problems with some characters, then please follow this note using your OLD NLS situation
  (= the NLS_LANG used on clients before the migration to the new version of Oracle)
 
* In this flowchart we use, in addition to common sense/logic, the character set scanner.
 
  IMPORTANT:
  Please *DO* install version 1.2 or higher from TechNet for your version
  http://technet.oracle.com/software/tech/globalization/content.html
  and use this.
 
   Use the Scanner Utility before Altering the Database Character Set.
  This tool allows you to isolate problematic data and will give further confirmation of the situation.
 
* If following this note results in changing a database running on a Unix system to a
  "MSWIN" character set, then this is NOT a problem.
  On Unix you can use a database with a "Windows" character set like WE8MSWIN1252.
  The only restriction is that you cannot use EBCDIC character sets (as used on AS/400 etc.)
  on ASCII-based platforms (as used on Unix and Windows) (or the inverse) for the
  database character set.
 
Please do follow this *step by step*:
-------------------------------------
 
0)a)Find out the character set of your client.
   To find the client's character set see [NOTE:179133.1] for Windows, or [NOTE:158577.1],
   "chapter 4.1.1 Identify the character set /codepage used by your clients."
   for other platforms, and write down the correct character set name
   as Oracle uses it before going further.
   This is the "correct" client's character set.
 
  b)Find out the current NLS_LANG used on your client, see note [NOTE:158577.1]
   "chapter 4.3 How can I Check the Client's NLS_LANG Setting?"
   For clarity: the NLS_LANG on your *server* has no impact here
   unless you're using your server as a client (e.g. you log in with a telnet
   session on your Unix box and start an application *on the server*).
 
1)a)IF on your client the current NLS_LANG is set to the correct
   character set of the client go to 2)
 
  b)IF on your client the current NLS_LANG is *NOT* set to the correct
   character set of the client go to 3)
   
2)a)IF the CLIENT NLS_LANG setting is the same as the database character set  
   then you have a correct setup.
 
   You find the current database character set with this select:
   select * from nls_database_parameters where parameter like '%CHARACTERSET%';
 
   Double check with the character set scanner that there is no incorrect
   data in your system with this syntax:
      csscan FULL=Y FROMCHAR=<character_set> TOCHAR=<character_set> LOG=charcheck CAPTURE=Y
   where <character_set> is the character set you found out in step 0).
 
   This will create 3 files :
 
   charcheck.out a logging of the output of csscan
   charcheck.txt a Database Scan Summary Report  
   charcheck.err a log file that normally should contain the rowids of
                 the rows of the tables reported in charcheck.txt
 
   If you have NO entries in charcheck.txt under the Convertible or Exceptional columns
   then you can now see how to go to a new character set by using the flow in
   [NOTE:225912.1]  Changing the Database Character Set - an Overview
 
   *stop*
 
  b)IF the CLIENT NLS_LANG setting is NOT the same as the database character set,
   then there will be conversion between the client and the server, but it's a
   100% correct setup. Of course you need to make sure that the character set you use
   for your database supports ALL symbols used by your client.
   Otherwise it's possible that you see some characters replaced by the
   replacement symbol (? or ¿). This is technically correct but something that you might
   not want to happen. See [NOTE:158577.1] NLS_LANG Explained for more info.
 
   You find the current database character set with this select:
   select * from nls_database_parameters where parameter like '%CHARACTERSET%';
 
   Double check with the character set scanner that there is no incorrect
   data in your system with this syntax:
      csscan FULL=Y FROMCHAR=<character_set> TOCHAR=<character_set> LOG=charcheck CAPTURE=Y
   where <character_set> is the character set you found out in step 0).
 
   This will create 3 files :
 
   charcheck.out a logging of the output of csscan
   charcheck.txt a Database Scan Summary Report  
   charcheck.err a log file that normally should contain the rowids of
                 the rows of the tables reported in charcheck.txt
 
   If you have NO entries in charcheck.txt under the Convertible or Exceptional columns
   then you can now see how to go to a new character set by using the flow in
   [NOTE:225912.1]  Changing the Database Character Set - an Overview
 
   You only need to be sure that all characters known by the client are
   also known in the database character set, as discussed in [NOTE:158577.1]
 
   *stop*
 
3)a)IF the CLIENT NLS_LANG setting is the same as the database character set then you have stored
   INCORRECT data in the database.
 
   You find the current database character set with this select:
   select * from nls_database_parameters where parameter like '%CHARACTERSET%';
 
   go to 4)
 
  b)IF the CLIENT NLS_LANG setting is NOT the same as the database character set, then you
   have stored INCORRECTLY CONVERTED data in the database.
 
   You find the current database character set with this select:
   select * from nls_database_parameters where parameter like '%CHARACTERSET%';
 
   go to 5)
   
4)a)IF ALL your CLIENTS have the SAME character set then proceed to 6) for a solution
 
  b)IF you have 2 or more CLIENTS with a DIFFERENT character set then proceed to 7)
   
5)a)IF ALL your CLIENTS have the SAME character set then proceed to 8) for a solution
 
  b)IF you have 2 or more CLIENTS with a DIFFERENT character set then proceed to 9)
   
6) The solution is to use the "alter database" command; this changes the current
   database character set definition WITHOUT changing the actual data.
   Follow the steps in
   [NOTE:66320.1]  Changing the Database Character Set or the Database National Character Set
   with these differences:
   instead of using "ALTER DATABASE CHARACTER SET <new_character_set>;" in step 3)4.
   use ALTER DATABASE CHARACTER SET INTERNAL_USE <new_character_set>;
   where <new_character_set> is the character set you found out in step 0).
   This corrects the problem on the database side.
 
   Before doing this you can double check with the character set scanner if
   the new character set you found is the correct one with this syntax:
      csscan FULL=Y FROMCHAR=<new_character_set> TOCHAR=<new_character_set> LOG=charcheck CAPTURE=Y
   where <new_character_set> is the character set you found out in step 0).
 
   This will create 3 files :
 
   charcheck.out a logging of the output of csscan
   charcheck.txt a Database Scan Summary Report  
   charcheck.err a log file that normally should contain the rowids of
                 the rows of the tables reported in charcheck.txt
 
   If you have NO entries in charcheck.txt under the Convertible or Exceptional columns
   then you can continue.
 
   If you want to "test-drive" this, test it on a *restore* of a hot or cold
   backup. DO NOT test this by creating a new db and using export/import.
   DO NOT, again, DO NOT use the "INTERNAL_USE" clause if you are NOT sure.
   TEST this first.
 
   Now you still need to adapt ALL the clients to use the CORRECT
   NLS_LANG as found in step 0).
 
   Once this is done you have a correct environment.
   Now you can see how to go to a new character set by using the flow in
   [NOTE:225912.1]  Changing the Database Character Set - an Overview
 
   *stop*
   
7) Here you need to do it in 2 or 3 passes, depending on the number of clients with
   *different* character sets (not the number of clients (!)).
 
   You need to identify the schemas used by each client with a different character set
   and perform the next steps *for each client one by one*, then export the data.
   (So you back up the database, do the change for character set 1, export that data,
   then restore your database and start over for the next type of client.)
   This will result in X dump files containing the corrected data, ready
   to be imported.
 
   These are the steps:
 
   Double check with the character set scanner if the new character set you found
   is the correct one with this syntax:
      csscan FULL=Y FROMCHAR=<new_character_set> TOCHAR=<new_character_set> LOG=charcheck CAPTURE=Y
   where <new_character_set> is the character set you found out in step 0).
 
   This will create 3 files :
 
   charcheck.out a logging of the output of csscan
   charcheck.txt a Database Scan Summary Report  
   charcheck.err a log file that normally should contain the rowids of
                 the rows of the tables reported in charcheck.txt
 
   If you have NO entries in charcheck.txt under the Convertible or Exceptional columns
   for the schema involved then you can continue.
 
   The solution is to use the "alter database" command; this changes the current
   database character set definition WITHOUT changing the actual data.
   Follow the steps in
   [NOTE:66320.1]  Changing the Database Character Set or the Database National Character Set
   with these differences:
   Instead of using "ALTER DATABASE CHARACTER SET <new_character_set>;" in step 3)4.
   use ALTER DATABASE CHARACTER SET INTERNAL_USE <new_character_set>;
   where <new_character_set> is the character set you found out in step 0).
   Now export the dataset with the NLS_LANG at the prompt set to
   NLS_LANG=AMERICAN_AMERICA.<new_character_set>
   Restore your database.
   Start over for the next schema....
 
   If you want to "test-drive" this, test it on a *restore* of a hot or cold
   backup. DO NOT test this by creating a new db and using export/import.
   DO NOT, again, DO NOT use the "INTERNAL_USE" clause if you are NOT sure.
   TEST this first.
 
   Once all dump files are created, create a NEW database with a
   character set that contains ALL the character sets you found
   (most of the time it will be UTF8).
   Now import every dump with the NLS_LANG at the prompt set to
   NLS_LANG=AMERICAN_AMERICA.<new_character_set>, where
   <new_character_set> is the same character set used for the export of that
   dump file.
 
   See also [NOTE:15095.1] Export/Import and NLS Considerations.
 
   Now you still need to adapt ALL the clients to use the CORRECT
   NLS_LANG as found in step 0).
 
   Once this is done you have a correct environment.
   Now you can see how to go to a new character set by using the flow in
   [NOTE:225912.1]  Changing the Database Character Set - an Overview
 
   *stop*
 
8) Here you have again 2 possible scenarios:
   A) The wrong NLS_LANG is a close match to the correct character set of the client:
      it contains *fewer* code points than the correct client character set,
      AND there are no identical code points being used for another character.
      (Every code point in the character set used in the NLS_LANG means/represents
       the same character as in the correct one for that client.)
 
      An example of this is a Windows-1252 client with the NLS_LANG set
      to WE8ISO8859P1 and a *non*-WE8ISO8859P1 database character set.
      ALL the code points passed to Oracle from the 1252 OS environment
      that are *NOT* known in the P1 character set are simply stored as the
      default replacement character (? or ¿).
      (These are all code points between 128 and 159 in this example.
      See http://www.microsoft.com/globaldev/reference/cphome.mspx
      to compare the Windows with the ISO character sets.)
 
      So in fact you have a 1252 client acting like a P1 client, hence there
      are characters that you cannot store, but the data *in* the database
      is correct, so all you need to do is adapt the NLS_LANG on the
      client and look for "?" or "¿" replacement characters in your data.
 
      Double check with the character set scanner that there is no incorrect
      data in your system with this syntax:
      csscan FULL=Y FROMCHAR=<character set> TOCHAR=<character set> LOG=charcheck CAPTURE=Y
      where <character set> is the current character set of your database.
 
      This will create 3 files :
 
      charcheck.out a logging of the output of csscan
      charcheck.txt a Database Scan Summary Report  
       charcheck.err a log file that normally should contain the rowids of
                 the rows of the tables reported in charcheck.txt
 
       If you have NO entries in charcheck.txt under the Convertible or Exceptional columns
      then you have a correct environment.
      Now you can see how to go to a new character set by using the flow in
      [NOTE:225912.1]  Changing the Database Character Set - an Overview
 
      But if you have any doubt, contact us with all the information you have.
      For an Arabic Windows client (WE8MSWIN1256) with an NLS_LANG set to WE8ISO8859P6,
      for example, this is *not* true; if you compare those you will see they
      match point 8)B).
 
      *stop*
 
   B) The NLS_LANG is not a close match to the correct character set of the client:
      it contains *more* code points than the correct client character set,
      OR there are identical code points being used for another character.
      (You have code point xx in the correct one that means "F" and the
      same code point xx that means "L" in the character set used in the wrong
      NLS_LANG setting.)
 
      If you have this, contact us with all the info you found out and ask to route the problem  
      to the NLS / globalization competency for assistance.
 
      *stop*
 
9) Contact us with all the info you found out and ask to route the problem
   to the NLS / globalization competency for assistance.
 
      *stop*
 
 
 
* If there is no csscan tool V1.2 or higher for your version or platform,
  then you can also check whether a table contains characters not defined in your
  character set with a syntax like:
 
  select rowid , <col> from <table> where <col> <> convert(convert(<col>,'UTF8'),'<your character set>','UTF8');
 
  Note: This will NOT work if your current database character set is UTF8.
 
  If you have a UTF8 database use this (will only work from 9i onwards):
  select rowid , <col> from <table> where <col> <> convert(convert(<col>,'AL16UTF16'),'<your character set>','AL16UTF16');
 
  Those 2 selects will return all rows that contain invalid code points for the "<your character set>" used.
  This is the same as the "Exceptional" column in the csscan output.
 
  The next select lists the "Convertible" AND "Exceptional" code points between your
  current character set and another character set:
 
  select rowid , <col> from <table> where <col> <> convert(<col>,'<new character set>');
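The CONVERT() round-trip selects above work by checking whether the stored bytes survive a round trip through the target character set unchanged. The same idea can be sketched in Python (an illustration of the round-trip test only, not a replacement for csscan; the function name and the sample rows are made up):

```python
# Round-trip check analogous to the CONVERT(...) selects above:
# bytes that survive decode/encode through the target character
# set unchanged are clean; anything else is "Exceptional".
def invalid_rows(rows, charset):
    """Return (rowid, value) pairs whose bytes are not valid
    in the given character set (names here are illustrative)."""
    bad = []
    for rowid, raw in rows:
        try:
            roundtrip = raw.decode(charset).encode(charset)
        except UnicodeDecodeError:
            bad.append((rowid, raw))
            continue
        if roundtrip != raw:
            bad.append((rowid, raw))
    return bad

rows = [
    ("AAA1", "café".encode("latin-1")),   # valid Latin-1, not valid UTF-8
    ("AAA2", "café".encode("utf-8")),     # valid UTF-8
    ("AAA3", b"\xff\xfe\x00bad"),         # not valid UTF-8 at all
]
print(invalid_rows(rows, "utf-8"))        # flags AAA1 and AAA3
```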

 
LVL 9

Expert Comment

by:konektor
ID: 12217957
You should convert all data stored in binary columns to the new character set. You can do it this way: use some external utility to save the data from the BLOBs and RAWs into files, convert the files from your original character set to UTF-8, and store them back into the BLOBs/RAWs...
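The export/convert/re-import idea above can be sketched like this (a hedged Python illustration only; the file names and the assumed source character set, WE8ISO8859P1 / ISO-8859-1, are examples, and the actual BLOB dump and reload would still be done with your own export tooling):

```python
# Sketch of the export -> convert -> re-import step above,
# assuming the BLOB contents were written as ISO-8859-1
# (WE8ISO8859P1) and must become UTF-8.
def convert_file(src_path, dst_path, src_charset="iso-8859-1"):
    with open(src_path, "rb") as f:
        data = f.read()
    # Reinterpret the original bytes, then re-encode as UTF-8.
    with open(dst_path, "wb") as f:
        f.write(data.decode(src_charset).encode("utf-8"))

# Usage (after dumping each BLOB to a file, names hypothetical):
# convert_file("blob_00042.dat", "blob_00042.utf8.dat")
```

This is essentially what the Unix `iconv` utility does; the point is that the bytes themselves must be rewritten, not just reinterpreted.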
 

Author Comment

by:mosorio
ID: 12218200
To konektor ...
 
So, should I understand that the change can produce some errors?
 
I was thinking that I would have to use some specific libraries from Oracle.
 
LVL 23

Expert Comment

by:seazodiac
ID: 12218682
Mosorio:
 
When you said "our character set is changed to UTF8 format",
 
you meant the OS character set, right?
 
Because a database charset change has nothing to do with this code.
 
But if you have changed the charset at the OS level, this code will not work.
 
I think you probably need to make sure the database charset is the same as, or at least a superset of, UTF8.
 

Author Comment

by:mosorio
ID: 12292859
to konektor:
 
my code looks like this...
 
   dbms_lob.read(lv_blob,ofsetlen,p_seek_point,lv_data);
   lv_line:=CONVERT(utl_raw.cast_to_varchar2(lv_data),'UTF8');
 
but the binary data is already stored in UTF-8...
 
ofsetlen is a number := 4000;
lv_data is a raw(4000)...
lv_line is a varchar2(4000)
 
if I change ofsetlen to 2000, my program works...
 
with 2001 the program doesn't work...
 
why???
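The 2000-works/2001-fails symptom is exactly what you would expect if a byte offset lands in the middle of a multi-byte UTF-8 character. A small Python illustration (nothing Oracle-specific; it just shows that an odd byte offset can split a two-byte character, loosely mirroring a fixed-size RAW chunk being treated as text):

```python
# Each 'é' below occupies two bytes in UTF-8, so a cut at an odd
# byte offset lands in the middle of a character.
data = ("é" * 1500).encode("utf-8")   # 3000 bytes, 1500 characters

ok_part = data[:2000].decode("utf-8")  # even offset: decodes fine

try:
    data[:2001].decode("utf-8")        # odd offset: splits an 'é'
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(split_ok)   # False: the 2001-byte slice ends mid-character
```

So any byte amount/offset passed to a chunked read of UTF-8 data has to fall on a character boundary; character-aware reads (e.g. via a CLOB) avoid the problem entirely.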






 

Accepted Solution

by:
modulo earned 0 total points
ID: 12653399
PAQed with no points refunded (of 500)

modulo
Community Support Moderator
