Character set conversions in Database

Posted on 1998-07-30
Last Modified: 2008-02-01
I have created a database in Access, which I access through the JDBC-ODBC bridge from a Java application. I have the following problem: Access seems to store the Greek characters as bytes using the encoding "Cp1253". I have managed to correctly convert these bytes to String and read the contents of the database. But I don't know how to convert the input Strings and store them to the database. I know that String.getBytes("Cp1253") will give me a correctly converted byte array, but how can this array be used in a SQL statement for a field that expects strings?
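For readers following along, here is a minimal sketch of the two conversions the question refers to, with a made-up byte sequence standing in for data read from the database:

```java
import java.io.UnsupportedEncodingException;

public class Cp1253RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Bytes as Access stores them; 0xC1 0xE8 0xDE 0xED 0xE1 is
        // "Athens" in Greek ("\u0391\u03B8\u03AE\u03BD\u03B1") in Cp1253.
        byte[] fromDb = { (byte) 0xC1, (byte) 0xE8, (byte) 0xDE,
                          (byte) 0xED, (byte) 0xE1 };

        // Reading: interpret the bytes as Cp1253 to obtain a Unicode String.
        String decoded = new String(fromDb, "Cp1253");

        // Writing: the reverse conversion the question asks about.
        byte[] toDb = decoded.getBytes("Cp1253");

        System.out.println(toDb.length);   // 5 -- one byte per character
    }
}
```

The open problem is the last step: `toDb` is a byte array, but the SQL statement's text field wants a String.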
Question by:incah

Expert Comment

ID: 1229172
What happens if you simply insert the string into the database without converting it to a byte array? And what do you mean when you say that Access stores the Greek characters this way? How does Access know that they are Greek characters? In Java you can indeed convert between Unicode and a byte-array representation using an encoding name, but can you specify an encoding name when dealing with Access? If not, it will store whatever you send it. If the field accepts a String you have to send a String; if it accepts binary data you can send a byte array. But it seems that it is up to you how to interpret the data.

Expert Comment

ID: 1229173
I am very sorry to disappoint you. The JDBC-ODBC bridge is implemented in native code. In Java, characters are Unicode (2 bytes per char). When a native (C) function is called, the Unicode strings are converted to C strings by ignoring the upper byte of each two-byte character! So if your Greek encoding is 8-bit you are fine; otherwise the characters will be neatly truncated (especially in the case of Japanese).
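A pure-Java model of that truncation (this only simulates what the native layer does; the bridge's actual C code is not shown):

```java
public class TruncationDemo {
    public static void main(String[] args) {
        String greek = "\u0391\u0392\u0393";   // Greek capital Alpha, Beta, Gamma

        // Model of the naive Unicode -> C-string conversion:
        // keep only the low byte of each 16-bit char.
        byte[] cString = new byte[greek.length()];
        for (int i = 0; i < greek.length(); i++) {
            cString[i] = (byte) greek.charAt(i);   // high byte 0x03 is lost
        }

        // The result is 0x91 0x92 0x93 -- not the Cp1253 codes
        // 0xC1 0xC2 0xC3 that the database expects for these letters.
        for (int i = 0; i < cString.length; i++) {
            System.out.print("0x" + Integer.toHexString(cString[i] & 0xFF) + " ");
        }
        System.out.println();
    }
}
```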

You can try one thing: set the default locale in Control Panel --> Regional Settings to Greek, reboot, and try running the program. This may work sometimes.


Author Comment

ID: 1229174
The database contains records (with Greek characters in their fields) that were not inserted with my Java application, but with Access. These are not read correctly unless I read them as a byte array (in the encoding "Cp1253") and then convert that to a String.

Now let's assume that I want to change the content of a field, no matter what the field initially contained.
I created a query like this:
query="UPDATE di00010 SET di0nam='"+t2.getText()+"' WHERE di0cod=1";
t2 is a JTextField and t2.getText() returns a Unicode String with the user's input (which is in Greek). Please note that if I print this query to the standard output I don't see the Greek characters correctly, but this may be expected behaviour, as the DOS boxes use the encoding "Cp737".
I expected (as you do too) that if I stored a Unicode String in the database and then got it back (with a SELECT statement), I would get back the same string that I stored. But this is not happening; I get garbage instead. So I assumed that an implicit conversion is done in the background when the SQL UPDATE statement is executed, and the only logical conversion here would be to the Windows default encoding of the system, i.e. to "Cp1253".

I can't tell you what would have happened if I had inserted the records directly from the Java app. The execution of an "INSERT INTO" statement always fails on my system and I don't yet know why.

Author Comment

ID: 1229175
To evijay:

As you can see in my previous comment, I imagined that something like this was happening. That's why I need to do the conversion myself and store a byte array in the correct encoding in the database. But how can I use a byte array in a SELECT statement?
Or should I change the field type in the database to some type other than text?

Expert Comment

ID: 1229176
Please note that Greek characters in Unicode occupy the range U+0370 through U+03FF.

The mapping that should be used for converting between Unicode characters and 8-bit Cp1253 bytes is:

        "\u20AC\uFFFD\u201A\u0192\u201E\u2026\u2020\u2021" +       // 0x80 - 0x87
        "\uFFFD\u2030\uFFFD\u2039\uFFFD\uFFFD\uFFFD\uFFFD" +       // 0x88 - 0x8F
        "\uFFFD\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +       // 0x90 - 0x97
        "\uFFFD\u2122\uFFFD\u203A\uFFFD\uFFFD\uFFFD\uFFFD" +       // 0x98 - 0x9F
        "\u00A0\u0385\u0386\u00A3\u00A4\u00A5\u00A6\u00A7" +       // 0xA0 - 0xA7
        "\u00A8\u00A9\uFFFD\u00AB\u00AC\u00AD\u00AE\u2015" +       // 0xA8 - 0xAF
        "\u00B0\u00B1\u00B2\u00B3\u0384\u00B5\u00B6\u00B7" +       // 0xB0 - 0xB7
        "\u0388\u0389\u038A\u00BB\u038C\u00BD\u038E\u038F" +       // 0xB8 - 0xBF
        "\u0390\u0391\u0392\u0393\u0394\u0395\u0396\u0397" +       // 0xC0 - 0xC7
        "\u0398\u0399\u039A\u039B\u039C\u039D\u039E\u039F" +       // 0xC8 - 0xCF
        "\u03A0\u03A1\uFFFD\u03A3\u03A4\u03A5\u03A6\u03A7" +       // 0xD0 - 0xD7
        "\u03A8\u03A9\u03AA\u03AB\u03AC\u03AD\u03AE\u03AF" +       // 0xD8 - 0xDF
        "\u03B0\u03B1\u03B2\u03B3\u03B4\u03B5\u03B6\u03B7" +       // 0xE0 - 0xE7
        "\u03B8\u03B9\u03BA\u03BB\u03BC\u03BD\u03BE\u03BF" +       // 0xE8 - 0xEF
        "\u03C0\u03C1\u03C2\u03C3\u03C4\u03C5\u03C6\u03C7" +       // 0xF0 - 0xF7
        "\u03C8\u03C9\u03CA\u03CB\u03CC\u03CD\u03CE\uFFFD" +       // 0xF8 - 0xFF
        "\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007" +       // 0x00 - 0x07
        "\b\t\n\u000B\f\r\u000E\u000F" +       // 0x08 - 0x0F
        "\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017" +       // 0x10 - 0x17
        "\u0018\u0019\u001A\u001B\u001C\u001D\u001E\u001F" +       // 0x18 - 0x1F
        "\u0020\u0021\"\u0023\u0024\u0025\u0026\'" +       // 0x20 - 0x27
        "\u0028\u0029\u002A\u002B\u002C\u002D\u002E\u002F" +       // 0x28 - 0x2F
        "\u0030\u0031\u0032\u0033\u0034\u0035\u0036\u0037" +       // 0x30 - 0x37
        "\u0038\u0039\u003A\u003B\u003C\u003D\u003E\u003F" +       // 0x38 - 0x3F
        "\u0040\u0041\u0042\u0043\u0044\u0045\u0046\u0047" +       // 0x40 - 0x47
        "\u0048\u0049\u004A\u004B\u004C\u004D\u004E\u004F" +       // 0x48 - 0x4F
        "\u0050\u0051\u0052\u0053\u0054\u0055\u0056\u0057" +       // 0x50 - 0x57
        "\u0058\u0059\u005A\u005B\\\u005D\u005E\u005F" +       // 0x58 - 0x5F
        "\u0060\u0061\u0062\u0063\u0064\u0065\u0066\u0067" +       // 0x60 - 0x67
        "\u0068\u0069\u006A\u006B\u006C\u006D\u006E\u006F" +       // 0x68 - 0x6F
        "\u0070\u0071\u0072\u0073\u0074\u0075\u0076\u0077" +       // 0x70 - 0x77
        "\u0078\u0079\u007A\u007B\u007C\u007D\u007E\u007F";       // 0x78 - 0x7F

This table should be used by the ODBC driver when converting from a Java string to a C string. The problem is that JNI has two functions,

makeCString and makePlatformCString, and as far as I can tell the JDBC-ODBC bridge uses the former (makeCString), which simply ignores the upper byte of each Unicode character. So, for example, the Greek capital alpha \u0391 will be converted into 0x91 in the C code, with the upper byte thrown away. This is bad, and it happens for every function that passes strings from Java to native code. It would have been nicer if they had used makePlatformCString instead of makeCString.
Since almost all JDBC statements take String arguments, the default JDBC-ODBC bridge will not help.

The solution is to use a third-party JDBC-ODBC bridge (say JDBC/KONA or JDirect) in which you can tell the driver the character set in which the database is stored!
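Applied by hand, the table above can decode and encode single bytes like this. Note its layout: the first 128 entries cover bytes 0x80-0xFF, the next 128 cover 0x00-0x7F (the method names here are illustrative):

```java
public class Cp1253Table {
    // Same table as quoted above: bytes 0x80-0xFF first, then 0x00-0x7F.
    static final String TABLE =
        "\u20AC\uFFFD\u201A\u0192\u201E\u2026\u2020\u2021" +       // 0x80 - 0x87
        "\uFFFD\u2030\uFFFD\u2039\uFFFD\uFFFD\uFFFD\uFFFD" +       // 0x88 - 0x8F
        "\uFFFD\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +       // 0x90 - 0x97
        "\uFFFD\u2122\uFFFD\u203A\uFFFD\uFFFD\uFFFD\uFFFD" +       // 0x98 - 0x9F
        "\u00A0\u0385\u0386\u00A3\u00A4\u00A5\u00A6\u00A7" +       // 0xA0 - 0xA7
        "\u00A8\u00A9\uFFFD\u00AB\u00AC\u00AD\u00AE\u2015" +       // 0xA8 - 0xAF
        "\u00B0\u00B1\u00B2\u00B3\u0384\u00B5\u00B6\u00B7" +       // 0xB0 - 0xB7
        "\u0388\u0389\u038A\u00BB\u038C\u00BD\u038E\u038F" +       // 0xB8 - 0xBF
        "\u0390\u0391\u0392\u0393\u0394\u0395\u0396\u0397" +       // 0xC0 - 0xC7
        "\u0398\u0399\u039A\u039B\u039C\u039D\u039E\u039F" +       // 0xC8 - 0xCF
        "\u03A0\u03A1\uFFFD\u03A3\u03A4\u03A5\u03A6\u03A7" +       // 0xD0 - 0xD7
        "\u03A8\u03A9\u03AA\u03AB\u03AC\u03AD\u03AE\u03AF" +       // 0xD8 - 0xDF
        "\u03B0\u03B1\u03B2\u03B3\u03B4\u03B5\u03B6\u03B7" +       // 0xE0 - 0xE7
        "\u03B8\u03B9\u03BA\u03BB\u03BC\u03BD\u03BE\u03BF" +       // 0xE8 - 0xEF
        "\u03C0\u03C1\u03C2\u03C3\u03C4\u03C5\u03C6\u03C7" +       // 0xF0 - 0xF7
        "\u03C8\u03C9\u03CA\u03CB\u03CC\u03CD\u03CE\uFFFD" +       // 0xF8 - 0xFF
        "\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007" +       // 0x00 - 0x07
        "\b\t\n\u000B\f\r\u000E\u000F" +                           // 0x08 - 0x0F
        "\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017" +       // 0x10 - 0x17
        "\u0018\u0019\u001A\u001B\u001C\u001D\u001E\u001F" +       // 0x18 - 0x1F
        "\u0020\u0021\"\u0023\u0024\u0025\u0026\'" +               // 0x20 - 0x27
        "\u0028\u0029\u002A\u002B\u002C\u002D\u002E\u002F" +       // 0x28 - 0x2F
        "\u0030\u0031\u0032\u0033\u0034\u0035\u0036\u0037" +       // 0x30 - 0x37
        "\u0038\u0039\u003A\u003B\u003C\u003D\u003E\u003F" +       // 0x38 - 0x3F
        "\u0040\u0041\u0042\u0043\u0044\u0045\u0046\u0047" +       // 0x40 - 0x47
        "\u0048\u0049\u004A\u004B\u004C\u004D\u004E\u004F" +       // 0x48 - 0x4F
        "\u0050\u0051\u0052\u0053\u0054\u0055\u0056\u0057" +       // 0x50 - 0x57
        "\u0058\u0059\u005A\u005B\\\u005D\u005E\u005F" +           // 0x58 - 0x5F
        "\u0060\u0061\u0062\u0063\u0064\u0065\u0066\u0067" +       // 0x60 - 0x67
        "\u0068\u0069\u006A\u006B\u006C\u006D\u006E\u006F" +       // 0x68 - 0x6F
        "\u0070\u0071\u0072\u0073\u0074\u0075\u0076\u0077" +       // 0x70 - 0x77
        "\u0078\u0079\u007A\u007B\u007C\u007D\u007E\u007F";        // 0x78 - 0x7F

    // Decode one Cp1253 byte to a Unicode char.
    static char toUnicode(byte b) {
        // byte 0x80 sits at index 0; byte 0x00 at index 0x80
        return TABLE.charAt((b - 0x80) & 0xFF);
    }

    // Encode one Unicode char to a Cp1253 byte ('?' if unmappable).
    static byte toCp1253(char c) {
        int i = (c == '\uFFFD') ? -1 : TABLE.indexOf(c);
        return (i < 0) ? (byte) '?' : (byte) ((i + 0x80) & 0xFF);
    }
}
```

For example, `toUnicode((byte) 0xC1)` yields the Greek capital alpha `\u0391`, and `toCp1253('\u0391')` maps it back to 0xC1.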


Author Comment

ID: 1229177
To evijay:

Thanks to the map you gave me, I managed to do the conversion myself by substituting each Greek character with a character between 0x80 and 0xFF (so that the high byte is always 0x00), writing the data to the database, and then retrieving them and doing the reverse substitution. This worked!
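In code, that substitution trick might look like the sketch below (method names are illustrative). The key point is that every character sent through the bridge has a zero high byte, so the native truncation becomes harmless:

```java
import java.io.UnsupportedEncodingException;

public class BridgeWorkaround {
    // Before an UPDATE/INSERT: map the Unicode text to a string whose
    // characters are all in 0x00-0xFF, so the bridge's high-byte
    // truncation produces exactly the Cp1253 bytes Access expects.
    static String toDatabase(String unicode) throws UnsupportedEncodingException {
        byte[] cp1253 = unicode.getBytes("Cp1253");
        StringBuffer sb = new StringBuffer(cp1253.length);
        for (int i = 0; i < cp1253.length; i++) {
            sb.append((char) (cp1253[i] & 0xFF));   // high byte is always 0x00
        }
        return sb.toString();
    }

    // After a SELECT: reverse the substitution.
    static String fromDatabase(String stored) throws UnsupportedEncodingException {
        byte[] cp1253 = new byte[stored.length()];
        for (int i = 0; i < stored.length(); i++) {
            cp1253[i] = (byte) stored.charAt(i);
        }
        return new String(cp1253, "Cp1253");
    }
}
```

A round trip through both methods returns the original Greek text unchanged.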

Please post an answer (anything) and I will grade it.

PS: Where do you find information like what you gave me?


Accepted Solution

evijay earned 150 total points
ID: 1229178
The latest JDBC-ODBC bridge now supports a charSet property.

See this abstract from the webpage

What's new with the JDBC-ODBC Bridge?

    The Bridge has been re-implemented using the Java Native Interface API, for improved performance and stability.

    A jdbc:odbc: connection can now have a charSet property, to specify a Character Encoding Scheme other than the client default. For possible values,
    see the JDK 1.1 Internationalization specification on the Javasoft Web Site.

    The Bridge now assumes that ODBC drivers are able to handle multi-threaded access. This will improve performance. If you need to use multi-threading and
    your ODBC package does not support it, your client program will have to implement locking.
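Connecting with that property might look roughly like this (the DSN name `greekdb` is made up; the driver class name is the Sun bridge's documented one):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class CharSetConnect {
    public static void main(String[] args) throws Exception {
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");

        Properties props = new Properties();
        props.put("charSet", "Cp1253");   // encoding the Access database uses

        // "greekdb" is a hypothetical ODBC data source name
        Connection con = DriverManager.getConnection("jdbc:odbc:greekdb", props);
        // ... use Statement/PreparedStatement as usual; the bridge now
        // converts between Unicode and Cp1253 itself.
        con.close();
    }
}
```

With the charSet property set, the manual substitution workaround above should no longer be necessary.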


Author Comment

ID: 1229179
Thanks again!
