Using the Easysoft ODBC-Oracle Driver with Unicode Data

Contents

What is Unicode?
Types of Unicode
Unicode and ODBC
Unicode and Unix
Unicode and Oracle®
Non-Unicode Character Sets
Character Set Naming Conventions
What Character Sets does my Database Use?
Oracle® Client Character Sets
Character Set Conversions
Oracle®, ODBC and Unicode
Summary

What is Unicode?

A good definition can be found at http://www.unicode.org:

"Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption."

Types of Unicode

The previous definition appears to be saying that Unicode is a single way of encoding different character sets. And that is the hope of the Unicode Consortium that produced the above definition. However, the current reality is not as perfect as we would hope.

The Unicode Consortium has associated a unique number with every character in use worldwide. So, for example, U+0041 is the code for Latin capital letter A, and U+1033A is the code for Gothic letter KUSMA.

This all looks good, and work is underway to ensure that all characters are defined in this way. What is not obvious from this, however, is the method used to encode a Unicode character, and this is where many of the practical problems with the use of Unicode arise.

The two main methods of encoding Unicode characters are Unicode Transformation Format (UTF) and Universal Character Set (UCS). Examples of these encoding forms include UTF-8, UTF-16, UTF-32, UCS-2 and UCS-4.

So now we can see that from the original ideal of one standard number (i.e. Unicode code point) for every character, we have a multitude of encoding types.

Unicode encoding forms use either a fixed or a variable number of bytes to represent each character. For example, UCS-2 is a fixed-width encoding where each character is 2 bytes in size. UTF-8 is a variable-width encoding where one character can be 1, 2, 3 or 4 bytes.
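
To make the difference concrete, the following standalone sketch returns the number of bytes UTF-8 needs for a given Unicode code point; the cut-off values follow directly from the UTF-8 encoding rules (surrogate code points and values above U+10FFFF are ignored here).

#include <stdio.h>

/* number of bytes needed to encode a Unicode code point in UTF-8 */
static int utf8_len( unsigned long cp )
{
    if ( cp < 0x80 )    return 1;   /* the ASCII range            */
    if ( cp < 0x800 )   return 2;   /* e.g. U+0170, used below    */
    if ( cp < 0x10000 ) return 3;   /* the rest of the BMP        */
    return 4;                       /* e.g. U+1033A, Gothic KUSMA */
}

int main( void )
{
    printf( "U+0041  -> %d byte(s)\n", utf8_len( 0x0041 ) );
    printf( "U+0170  -> %d byte(s)\n", utf8_len( 0x0170 ) );
    printf( "U+1033A -> %d byte(s)\n", utf8_len( 0x1033A ) );

    return 0;
}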

A consequence of the variable-length nature of some UTF encoding forms is that there is the potential for illegal sequences of bytes.
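
For example, in UTF-8 a lead byte must be followed by the correct number of continuation bytes, each of the form 10xxxxxx. The following is a minimal sketch of that check only; a complete validator would also have to reject overlong encodings and other malformed forms.

#include <stdio.h>

/* return non-zero if b has the 10xxxxxx pattern of a UTF-8 continuation byte */
static int is_continuation( unsigned char b )
{
    return ( b & 0xC0 ) == 0x80;
}

int main( void )
{
    unsigned char good[] = { 0xC5, 0xB0 };   /* U+0170, a valid two-byte sequence  */
    unsigned char bad[]  = { 0xC5, 0x41 };   /* lead byte followed by 'A': illegal */

    printf( "good: %s\n", is_continuation( good[ 1 ] ) ? "valid" : "illegal" );
    printf( "bad:  %s\n", is_continuation( bad[ 1 ] ) ? "valid" : "illegal" );

    return 0;
}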

Unicode and ODBC

Since the 3.51 release of the Microsoft Windows ODBC driver manager, ODBC has included support for Unicode. This support takes the form of Unicode data types (e.g. SQL_WCHAR), and Unicode versions of the ODBC API functions that take and return Unicode data (e.g. SQLPrepareW). In terms of ODBC, Unicode means UCS-2: a fixed-width encoding in which each character is a single 16-bit value.
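
To make the two forms concrete, here is a minimal sketch of the same operation made through both the ANSI and the Unicode entry points. It is not a complete program: the statement handle is assumed to have been allocated and connected elsewhere, and UCTEST is the test table created later in this document.

#include <sql.h>
#include <sqlext.h>
#include <sqlucode.h>

void prepare_both( SQLHSTMT hstmt )
{
    SQLRETURN ret;

    /* 8-bit form: the statement text is a sequence of SQLCHAR bytes */
    SQLCHAR  sql_a[] = "select C1 from UCTEST";

    /* 16-bit form: each element is one UCS-2 code unit (SQLWCHAR) */
    SQLWCHAR sql_w[] = { 's','e','l','e','c','t',' ','C','2',' ',
                         'f','r','o','m',' ','U','C','T','E','S','T', 0 };

    ret = SQLPrepare( hstmt, sql_a, SQL_NTS );
    ret = SQLPrepareW( hstmt, sql_w, SQL_NTS );
}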

Because UTF-8 encoded data can be treated as a sequence of bytes, many of the ODBC functions that take single-byte characters can work with UTF-8 data. However, it must be remembered that ODBC does not know that the data is UTF-8. To ODBC it is just a sequence of single-byte characters, and ODBC knows nothing about the relationship between multi-byte sequences and the characters they represent.

This can lead to problems. For example, if a column in a database is defined as CHAR(20), the user would expect it to provide storage for twenty characters. However, if the database does not understand UTF-8, only twenty bytes may be stored; in the worst case of a sequence of 4-byte characters, only five characters can actually be stored.

The problem is potentially worse if the database is aware of UTF-8 and a column defined as CHAR(20) can actually contain twenty 4-byte sequences. Imagine an application using ODBC to return data from the table. The application may query the database to find the definition of the column, and will discover that its length is reported as twenty. So it allocates space for twenty bytes and calls the database to return the data. Only the first five 4-byte sequences can be returned, as there is only space for that number of bytes.
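
The distinction between bytes and characters is easy to demonstrate outside the database. The following standalone sketch counts both for the seven-character UTF-8 sequence used in the next section: strlen(), like any code that is unaware of UTF-8, sees fourteen bytes where there are only seven characters.

#include <stdio.h>
#include <string.h>

/* count the characters in a valid UTF-8 string by skipping
   continuation bytes (those of the form 10xxxxxx) */
static size_t utf8_chars( const char *s )
{
    size_t n = 0;

    for ( ; *s; s++ ) {
        if (( *s & 0xC0 ) != 0x80 ) {
            n++;
        }
    }
    return n;
}

int main( void )
{
    /* the seven two-byte characters used in the next section */
    char seq[] = { 0xC5, 0xB0, 0xC5, 0x84, 0xC4, 0xA9, 0xC4, 0x87,
                   0xC5, 0x8D, 0xC4, 0x91, 0xC4, 0x95, 0x00 };

    printf( "bytes: %lu, characters: %lu\n",
            (unsigned long) strlen( seq ),
            (unsigned long) utf8_chars( seq ) );

    return 0;
}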

Also, because ODBC is not aware of the multi-byte sequences, it cannot maintain the boundaries between valid byte sequences. It may be that the twenty-byte buffer provided has space for the first nine characters because they are encoded as two-byte sequences, but the tenth character requires a three-byte sequence. The ODBC specification gives no guideline as to what should happen here. The driver cannot return a short buffer and return the next character later, as this would lead the application to believe that all the data had been returned. If it returns the first two bytes of the three-byte encoding, the last character in the buffer will be an illegal UTF-8 sequence.

The best way to use Unicode with ODBC is to stick to the original design: use 8-bit ASCII encoding for the normal API calls and UCS-2 for the W calls.

Unicode and Unix

Let’s look at a small sample program:

#include <stdio.h>

int main( void )
{
    char seq[] = { '>', 0x55, 0x6E, 0x69, 0x63, 0x6F, 0x64, 0x65,
                   '<', 0x00 };
    int i;

    i = puts( seq );

    printf( "put %d characters\n", i );

    return 0;
}

When compiled and run, this program takes the byte sequence and displays it on the terminal. It then displays the number returned by puts, which here is the number of characters written (including one for the newline).

So, running this program produces (as expected):

>Unicode<
put 10 characters

Now we will create a sequence of characters that use values above 127, the point at which ASCII and UTF-8 start to differ. The sequence of code points is U+0170, U+0144, U+0129, U+0107, U+014D, U+0111 and U+0115. Converting that sequence to UTF-8 gives a sequence of bytes that we can insert into the test program.

#include <stdio.h>

int main( void )
{
    char seq[] = { '>', 0xC5, 0xB0, 0xC5, 0x84, 0xC4, 0xA9, 0xC4,
                0x87, 0xC5, 0x8D, 0xC4, 0x91, 0xC4, 0x95, '<', 0x00 };
    int i;

    i = puts( seq );

    printf( "put %d characters\n", i );

    return 0;
}

We can see how a character code such as U+0170 becomes the two bytes 0xC5 and 0xB0. This is an example of a two-byte UTF-8 sequence.
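
The arithmetic behind that conversion is straightforward. Code points in the range U+0080 to U+07FF are split across two bytes: the lead byte 110xxxxx carries the top five bits and the continuation byte 10yyyyyy carries the bottom six. A standalone sketch:

#include <stdio.h>

int main( void )
{
    unsigned int  cp = 0x0170;                 /* the code point U+0170 */
    unsigned char b1 = 0xC0 | ( cp >> 6 );     /* 110xxxxx -> 0xC5      */
    unsigned char b2 = 0x80 | ( cp & 0x3F );   /* 10yyyyyy -> 0xB0      */

    printf( "U+%04X -> 0x%02X 0x%02X\n", cp, b1, b2 );

    return 0;
}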

Now let’s run the program:

>       <
put 17 characters

What we see now is that the characters have not been displayed, and the call to puts returns the number of bytes written rather than the number of characters. But looking at the distance (in character positions) between the > and < characters, we see that there is space for seven characters. So at least part of the system has understood that the fourteen bytes represent seven characters, but something else in the system has not been able to render those seven characters as visible display elements. The name sometimes given to the visual representation of a character is a glyph. This term originates from printing, and is defined as "the shape given in a particular typeface to a specific grapheme or symbol".

So now we have seen that a particular character can take on several forms, all being the same character, but appearing to be very different. We start with the character code 0x0170. This can be represented in a given encoding as a sequence of bytes, in the case of UTF-8 as 0xC5, 0xB0. And that encoding may be displayed (or not in this case) as a particular visual symbol or glyph.

Just to show how different systems can display the same sequence of characters in different ways, we ran the above program on another Linux machine, this time with version 2.3 of glibc instead of the 2.2 used previously. The same program now shows the Unicode characters:

>Űńĩćōđě<
put 17 characters

Unicode and Oracle®

When dealing with Unicode, there are a number of layers, each of which plays a part in the support for Unicode in the Oracle® database.

Since version 8 of Oracle®, a number of Unicode encodings can be associated with a database. The terminology used for these builds on terms introduced earlier in this document.

These encodings include UTF8, AL32UTF8 and AL16UTF16.

An Oracle® database can have two of these Unicode encoding forms associated with it: one as the database character set (used by the CHAR, VARCHAR2 and CLOB types) and one as the national character set (used by the NCHAR, NVARCHAR2 and NCLOB types).

The CHAR type may be set to use UTF8 or AL32UTF8 but not AL16UTF16. The NCHAR type may be set to use UTF8 or AL16UTF16, but not AL32UTF8.

Non-Unicode Character Sets

Oracle® also has a large number of non-Unicode character sets. Although at first sight these may appear to provide Unicode-like encoding, they do not: each one only supports the limited set of characters that fall within a particular group. These character sets are not limited to 255 characters and may use 16-bit encoding, but it must be remembered that they are not Unicode character sets.

Character Set Naming Conventions

The naming convention used by Oracle® for character sets is:

<region><number of bits used to represent a character><standard character set name>[S|C]

The region describes the geographical area the character set covers. For example, US7ASCII is US, WE8ISO8859P1 is Western European, JA16SJIS is Japanese and AL16UTF16 covers All Languages (and so it is true Unicode).

The number of bits may be either the fixed bit width in the case of fixed length sets or the maximum sequence length in the case of variable length encoding. For example, WE8ISO8859P1 is 8-bit, US7ASCII is 7-bit, JA16SJIS is 16-bit and AL32UTF8 is a maximum of 32 bits per character.

The standard character set name does not define what "standard" the name comes from, but is intended to be descriptive. For example, WE8ISO8859P1 is ISO8859 Part 1, AL16UTF16 is UTF-16 and JA16SJIS is Japanese 16-bit Shift Japanese Industrial Standard (SJIS).

The optional S or C specifies whether the character set may only be used on the client or on the server. We will consider client character sets later in this document.

What Character Sets does my Database Use?

One way to find the character sets in use in a particular database is to query the schema with the following:

select * from NLS_DATABASE_PARAMETERS
where PARAMETER in ( 'NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET' )

This will produce a result something like this:

+-------------------------------+---------------------------------------+
| PARAMETER                     | VALUE                                 |
+-------------------------------+---------------------------------------+
| NLS_CHARACTERSET              | WE8ISO8859P1                          |
| NLS_NCHAR_CHARACTERSET        | AL16UTF16                             |
+-------------------------------+---------------------------------------+

Oracle® Client Character Sets

In addition to the database character sets, the Oracle® client also has a number of National Language Support (NLS) settings, including the client character set. These can be specified in several ways, for example as environment variables on Linux and UNIX platforms or as registry settings on Windows.

In the context of Unicode, the important variable is the NLS_LANG. The format of this variable is:

NLS_LANG = language_territory.charset

The components have the following meanings. The language specifies the language used for Oracle® messages and for day and month names. The territory specifies the default date, monetary and numeric formats. The charset specifies the character set used by the client application.

Examples of NLS_LANG settings are:

NLS_LANG = AMERICAN_AMERICA.WE8MSWIN1252

NLS_LANG = FRENCH_CANADA.WE8ISO8859P1

NLS_LANG = JAPANESE_JAPAN.JA16EUC

All parts of the definition are optional; any item that is not specified uses its default value. If you specify territory or character set, then you must include the preceding delimiter [underscore (_) for territory, period (.) for character set]. Otherwise, the value is parsed as a language name.

For example, to set only the territory portion of NLS_LANG, use the following format: NLS_LANG=_JAPAN.

Within the scope of this document, the relevant part of the above definition is the charset.

Character Set Conversions

As can now be seen, it is possible to have different character sets in use at the same time in the server and client parts of a session. This means that conversions may need to take place between the client and server (and vice versa) as data is transferred. Because an encoding that is valid in one character set may be invalid in another, data loss can occur in the conversion process. For example, if the client character set were JA16EUC but the database character set were US7ASCII, many of the valid characters in the client character set would have no meaningful representation in the database character set, so attempting to insert such characters would cause data corruption.

By default, Oracle® will not warn about the loss of data during a conversion. However, by setting the NLS_NCHAR_CONV_EXCP parameter to TRUE, either in the initialisation parameters for the database or via ALTER SESSION, the database will report an error if data loss occurs.
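
From an ODBC application, the same setting can be changed for the current session by executing an ALTER SESSION statement through the driver. The following is a sketch only: the connection handle is assumed to have been allocated and connected elsewhere, and error handling is kept to a minimum.

#include <sql.h>
#include <sqlext.h>

/* ask Oracle® to raise an error instead of silently losing data
   in NCHAR character set conversions */
SQLRETURN enable_conversion_errors( SQLHDBC hdbc )
{
    SQLHANDLE hstmt;
    SQLRETURN ret;

    ret = SQLAllocHandle( SQL_HANDLE_STMT, hdbc, &hstmt );
    if ( !SQL_SUCCEEDED( ret )) {
        return ret;
    }

    ret = SQLExecDirect( hstmt,
            (SQLCHAR *) "ALTER SESSION SET NLS_NCHAR_CONV_EXCP = TRUE",
            SQL_NTS );

    SQLFreeHandle( SQL_HANDLE_STMT, hstmt );

    return ret;
}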

When converting from character set A to character set B, characters that are not available in character set B are converted to replacement characters. Replacement characters are often specified as ? or ¿ or as a character that is related to the unavailable character. For example, ä (a with an umlaut) can be replaced by a. Replacement characters are defined by the target character set.

Oracle®, ODBC and Unicode

Now that we have all the information needed, we can put it together and look at how Oracle®, ODBC and Unicode work in practice. We have seen that any attempt to display data that contains Unicode encodings is likely to be confused by the local operating system attempting to convert the character encoding to a local glyph. This means that convenient text-based utilities such as the isql tool provided with unixODBC cannot be used with any accuracy to investigate Unicode issues. Because of this, we will use sample C code to show how it all works.

To show the operation of Unicode, we will use a database that uses UTF8 (AL32UTF8) as its CHAR character set, and AL16UTF16 as its NCHAR character set.

We create a test table as follows:

CREATE TABLE UCTEST (
         I INTEGER,
        C1 CHARACTER VARYING( 100 ),
        C2 NATIONAL CHARACTER VARYING( 100 )
);

So we have both a CHAR and NCHAR field to insert into.

First, let’s set the client character set to be UTF8 as we know that this is what will be needed to insert Unicode.

NLS_LANG=AMERICAN_AMERICA.UTF8
export NLS_LANG

Then we will use sample code to insert the 7-bit characters we used before into the table, both using 8-bit SQL_C_CHAR and 16-bit SQL_C_WCHAR types:

#include <stdio.h>
#include <sql.h>
#include <sqlext.h>
#include <sqlucode.h>
/*
* insert a number of character values into a database
*/
int main( void )
{
   SQLHANDLE henv, hdbc, hstmt;
   SQLRETURN ret;
   SQLCHAR cval[] = { '>', 0x55, 0x6E, 0x69, 0x63,
                      0x6F, 0x64, 0x65, '<' };
   SQLWCHAR uval[] = { '>', 0x0055, 0x006E, 0x0069, 0x0063,
                       0x006F, 0x0064, 0x0065, '<' };
   SQLLEN len1, len2, len3;
   SQLINTEGER ival;

   ret = SQLAllocHandle( SQL_HANDLE_ENV, SQL_NULL_HANDLE, &henv );
   ret = SQLSetEnvAttr( henv, SQL_ATTR_ODBC_VERSION, (SQLPOINTER) 3, 0 );

   ret = SQLAllocHandle( SQL_HANDLE_DBC, henv, &hdbc );
   ret = SQLConnect( hdbc, (SQLCHAR *) "ORACLE", SQL_NTS,
                     (SQLCHAR *) "system", SQL_NTS,
                     (SQLCHAR *) "master", SQL_NTS );

   ret = SQLAllocHandle( SQL_HANDLE_STMT, hdbc, &hstmt );

   ret = SQLPrepare( hstmt,
                     (SQLCHAR *) "insert into UCTEST values ( ?, ?, ? )",
                     SQL_NTS );

   ret = SQLBindParameter( hstmt, 1, SQL_PARAM_INPUT, SQL_C_LONG,
                           SQL_INTEGER, 0, 0, &ival, 0, &len1 );
   ret = SQLBindParameter( hstmt, 2, SQL_PARAM_INPUT, SQL_C_CHAR,
                           SQL_CHAR, 100, 0, &cval, sizeof( cval ),
                           &len2 );
   ret = SQLBindParameter( hstmt, 3, SQL_PARAM_INPUT, SQL_C_WCHAR,
                           SQL_WCHAR, 100, 0, &uval, sizeof( uval ),
                           &len3 );

   ival = 1;
   len1 = 4;
   len2 = sizeof( cval );
   len3 = sizeof( uval );

   ret = SQLExecute( hstmt );

   ret = SQLFreeHandle( SQL_HANDLE_STMT, hstmt );
   ret = SQLDisconnect( hdbc );
   ret = SQLFreeHandle( SQL_HANDLE_DBC, hdbc );
   ret = SQLFreeHandle( SQL_HANDLE_ENV, henv );

   return 0;
}

The code inserts low (<127) value characters, both as 8- and 16-bit encodings.

After we have run the test code, we can use isql to see what has been inserted. (isql is safe to use at this point as we are only using 7-bit codes. The casts are just used to reduce the width of the displays.)

SELECT CAST( I AS VARCHAR( 5 )) AS I, CAST( C1 AS VARCHAR( 20 )) AS C1, 
       CAST( C2 AS VARCHAR( 20 )) AS C2 FROM UCTEST
+------+---------------------+---------------------+
| I    | C1                  | C2                  |
+------+---------------------+---------------------+
| 1    | >Unicode<           | >Unicode<           |
+------+---------------------+---------------------+

Now let’s insert the extended characters we used before, both as UTF-8 and UCS-2.

The code to do this is the same as before, but the definition of the two arrays of character encodings now contains codes above 127.

SQLCHAR cval[] = { '>', 0xC5, 0xB0, 0xC5, 0x84, 0xC4, 0xA9, 0xC4,
                    0x87, 0xC5, 0x8D, 0xC4, 0x91, 0xC4, 0x95, '<' };
SQLWCHAR uval[] = { '>', 0x0170, 0x0144, 0x0129,
                    0x0107, 0x014D, 0x0111, 0x0115, '<' };

After we have run this code, when we try to look at the table by using isql, we may get strange characters displayed. This is because we are expecting the operating system to map extended characters to glyphs, and it may not be able to do this.

So we now write another test sample to read the data from the table:

#include <stdio.h>
#include <sql.h>
#include <sqlext.h>
#include <sqlucode.h>
/*
* read a number of character values from a database
*/

int main( void )
{
   SQLHANDLE henv, hdbc, hstmt;
   SQLRETURN ret;
   SQLCHAR cval[ 200 ];
   SQLWCHAR uval[ 200 ];
   SQLINTEGER ival;
   SQLLEN len;
   int i;

   ret = SQLAllocHandle( SQL_HANDLE_ENV, SQL_NULL_HANDLE, &henv );
   ret = SQLSetEnvAttr( henv, SQL_ATTR_ODBC_VERSION, (SQLPOINTER) 3, 0 );

   ret = SQLAllocHandle( SQL_HANDLE_DBC, henv, &hdbc );
   ret = SQLConnect( hdbc, "ORACLE", SQL_NTS, "system",
                     SQL_NTS, "master", SQL_NTS );

   ret = SQLAllocHandle( SQL_HANDLE_STMT, hdbc, &hstmt );

   ret = SQLExecDirect( hstmt,
                        (SQLCHAR *) "select I, C1, C1, C2, C2 from UCTEST",
                        SQL_NTS );

   while (( ret = SQLFetch( hstmt )) != SQL_NO_DATA )
   {
        printf( "reading row: " );
        ret = SQLGetData( hstmt, 1, SQL_C_LONG, &ival, sizeof( ival ), &len );
        printf( "%d, ", (int) ival );

        ret = SQLGetData( hstmt, 2, SQL_C_CHAR, cval, sizeof( cval ), &len );
        printf( "\n\tcolumn length %d { ", (int) len );
        for ( i = 0; i < len; i ++ ) {
                printf( "%02X ", cval[ i ] );
        }
        printf( "}, " );

        ret = SQLGetData( hstmt, 3, SQL_C_WCHAR, uval, sizeof( uval ), &len );
        printf( "\n\tcolumn length %d { ", (int) len );
        for ( i = 0; i < len / sizeof( SQLWCHAR ); i ++ ) {
                printf( "%04X ", uval[ i ] );
        }
        printf( "}, " );

        ret = SQLGetData( hstmt, 4, SQL_C_CHAR, cval, sizeof( cval ), &len );
        printf( "\n\tcolumn length %d { ", (int) len );
        for ( i = 0; i < len; i ++ ) {
                printf( "%02X ", cval[ i ] );
        }
        printf( "}, " );

        ret = SQLGetData( hstmt, 5, SQL_C_WCHAR, uval, sizeof( uval ), &len );
        printf( "\n\tcolumn length %d { ", (int) len );
        for ( i = 0; i < len / sizeof( SQLWCHAR ); i ++ ) {
                printf( "%04X ", uval[ i ] );
        }
        printf( "}, " );

        printf( "\n" );
   }
    ret = SQLFreeHandle( SQL_HANDLE_STMT, hstmt );
    ret = SQLDisconnect( hdbc );
    ret = SQLFreeHandle( SQL_HANDLE_DBC, hdbc );
    ret = SQLFreeHandle( SQL_HANDLE_ENV, henv );

    return 0;
}

This code reads the data from the table, both as 8-bit and as 16-bit characters.

Running the code allows us to see what has been inserted into the database:

reading row: 1, 
   column length 9 { 3E 55 6E 69 63 6F 64 65 3C }, 
   column length 18 { 003E 0055 006E 0069 0063 006F 0064 0065 003C }, 
   column length 9 { 3E 55 6E 69 63 6F 64 65 3C }, 
   column length 18 { 003E 0055 006E 0069 0063 006F 0064 0065 003C }, 
reading row: 2, 
   column length 16 { 3E C5 B0 C5 84 C4 A9 C4 87 C5 8D C4 91 C4 95 3C }, 
   column length 18 { 003E 0170 0144 0129 0107 014D 0111 0115 003C }, 
   column length 16 { 3E C5 B0 C5 84 C4 A9 C4 87 C5 8D C4 91 C4 95 3C }, 
   column length 18 { 003E 0170 0144 0129 0107 014D 0111 0115 003C },

Here we can see that the data has been inserted as both UTF-8 and UCS-2, and is returned in the same way. It can also be seen that the ODBC-Oracle Driver and Oracle® client will convert from one form to the other.

The code samples work because the client knows what character set to use. If we set the client character set to a non-Unicode set, for example:

NLS_LANG = AMERICAN_AMERICA.US7ASCII

and then run the test code again, we get a very different result:

reading row: 1, 
   column length 9 { 3E 55 6E 69 63 6F 64 65 3C }, 
   column length 18 { 003E 0055 006E 0069 0063 006F 0064 0065 003C }, 
   column length 9 { 3E 55 6E 69 63 6F 64 65 3C }, 
   column length 18 { 003E 0055 006E 0069 0063 006F 0064 0065 003C }, 
reading row: 2, 
   column length 9 { 3E 55 6E 3F 63 3F 64 3F 3C }, 
   column length 18 { 003E 0055 006E 003F 0063 003F 0064 003F 003C }, 
   column length 9 { 3E 55 6E 3F 63 3F 64 3F 3C }, 
   column length 18 { 003E 0055 006E 003F 0063 003F 0064 003F 003C },

We can see that all the characters below 127 (row 1) are still fine, but all the extended characters (row 2) are now mapped either to the nearest equivalent the system can find or to the replacement character ‘?’ (0x3F). Again, we can see this in isql. It is safe to use isql with NLS_LANG set to a 7-bit character set: we know we will only get 7-bit characters back, although they may not be the characters actually stored in the database.

SELECT CAST( I AS VARCHAR( 5 )) AS I, CAST( C1 AS VARCHAR( 20 )) AS C1, 
       CAST( C2 AS VARCHAR( 20 )) AS C2 FROM UCTEST
+------+---------------------+---------------------+
| I    | C1                  | C2                  |
+------+---------------------+---------------------+
| 1    | >Unicode<           | >Unicode<           |
| 2    | >Un?c?d?<           | >Un?c?d?<           |
+------+---------------------+---------------------+

Summary

Unicode and Oracle® can often be a subject of confusion. However, as this document has shown, once the basic principles of Unicode and the processes involved are understood, the combination of Unicode, Oracle® and ODBC can be simple and problem-free.

Key points:

In ODBC, Unicode means UCS-2: use the W functions and the SQL_C_WCHAR type for Unicode data, and 8-bit characters with the normal API calls.

An Oracle® database has two character sets: the database character set, used by the CHAR types, and the national character set, used by the NCHAR types.

The client character set is controlled by the charset part of NLS_LANG. If it cannot represent the data being transferred, characters are silently converted to replacement characters (unless NLS_NCHAR_CONV_EXCP is set to TRUE).

Take care when examining Unicode data with text-based tools such as isql: what is displayed depends on the ability of the operating system to render the characters as glyphs.



Oracle is a registered trademark of Oracle Corporation and/or its affiliates.