Some Common Unicode Problems and Solutions using Perl DBD::ODBC and MS SQL Server

Introduction

In this article, I have tried to collect some useful information to help you use Unicode in MS SQL Server from Perl DBD::ODBC. Many of the problems listed came from questions on the dbi-users list, PerlMonks or emails sent directly to me.

I have tried very hard to make all the examples work on Windows and Unix but the facts are:

In writing this, I discovered a bug in DBD::ODBC (when inserting into char/varchar columns) which affects all versions from when Unicode was introduced up until 1.46_1 when it was fixed. I've tried to highlight the issue in the following examples.

Terminology

In this document, I repeatedly use some terminology that needs further explanation.

Some DBD::ODBC Background/History

DBD::ODBC has not always supported Unicode. It all started with a patch from Alexander Foken. Around version 1.14, Alexander's original patch was adapted to include optional Unicode support for Unix.

For DBD::ODBC, building without Unicode support really means build as DBD::ODBC worked before Unicode support was added, to maintain backwards compatibility.

The initial Unicode support was for Windows only and allowed you to send/retrieve nchar/nvarchar columns as SQL_WCHARs meaning DBD::ODBC:

Since then Unicode support has grown to include:

For full documentation on Unicode in DBD::ODBC see the Unicode section in the DBD::ODBC pod.

Using non-Unicode Aware Components

The ODBC API has two main sets of APIs; the ANSI API (SQLxxxA APIs) and the Unicode API (SQLxxxW APIs). By default, DBD::ODBC uses the Unicode API on Windows and the ANSI API on non-Windows platforms. There are good historical reasons for this beyond the scope of this article. If you want to read/write/update/select Unicode data with DBD::ODBC and MS SQL Server from non-Windows platforms you need:

How Can You Tell What Support for the Unicode ODBC API You Have Already Got?
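One quick check you can do from Perl itself: a Unicode build of DBD::ODBC exposes the odbc_has_unicode attribute on a connection handle. Here is a minimal sketch (like the other examples in this article, it assumes DBI->connect with no arguments picks up a usable DSN from the DBI_DSN/DBI_USER/DBI_PASS environment variables):

# check_unicode.pl - was DBD::ODBC built with Unicode support?
use strict;
use warnings;
use DBI;

my $h = DBI->connect or die $DBI::errstr;
print "DBD::ODBC version: $DBD::ODBC::VERSION\n";
print "Unicode build: ", ($h->{odbc_has_unicode} ? "yes" : "no"), "\n";
$h->disconnect;

This only tells you about DBD::ODBC itself; whether your ODBC driver manager and driver support the wide (SQLxxxW) APIs is a separate question for their documentation.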

What Happens if I Try To Use Unicode from Perl DBD::ODBC and a Component in the Chain Does Not Support the Unicode APIs?

The simple answer is you will not be able to insert/update/delete/select Unicode data from MS SQL Server, but to be honest this is too simplistic and it is worth looking at some examples.

Example 1: Simple Insert/Select with non-Unicode Built DBD::ODBC

# ex1.pl
use 5.008001;
use strict;
use warnings;
use DBI qw{:utils};

my $unicode = "\x{20ac}";       # Unicode euro symbol
my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;

eval {$h->do(q/drop table unicode_test/)};
$h->do(q/create table unicode_test (a nvarchar(20))/);

my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
$s->execute($unicode);

my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
my $data = $r->[0];
print "DBI describes data as: ", data_string_desc($data), "\n";
print "Data Length: ", length($data), "\n";
print "hex ords: ";
foreach my $c(split(//, $data)) {
    print sprintf("%x,", ord($c));
}
print "\n";

which outputs:

DBI describes data as: UTF8 off, non-ASCII, 3 characters 3 bytes
Data Length: 3
hex ords: e2,82,ac,

As you can see, we attempted to insert a Unicode Euro symbol, and when we selected it back we got 3 characters and 3 bytes instead of 1 character and 3 bytes; the fact that the Perl data contains the UTF-8 encoding for a Euro confirms what happened.
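You can verify that those three bytes are simply the UTF-8 encoding of the Euro with a couple of lines of Perl (a small sketch using the core Encode module):

use Encode qw(encode);
my $bytes = encode("UTF-8", "\x{20ac}");   # encode the euro as UTF-8
print join(",", map { sprintf "%x", ord($_) } split(//, $bytes)), "\n";
# prints: e2,82,ac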

An explanation of what happened above:

  1. The column was created as an nvarchar, so MS SQL Server should be happy to accept Unicode characters for the column data.
  2. DBD::ODBC prepared the SQL and asked the ODBC driver to describe the parameter. DBD::ODBC was told it was a UNICODE VARCHAR of length 20 characters. However, it bound the parameter with a value type of SQL_C_CHAR and a parameter type of SQL_WVARCHAR, so the driver interpreted each byte of Perl's internal UTF-8 encoding as a separate character.
  3. When we read the data back, we got the bytes back as Perl had encoded the Euro internally (UTF-8).

You might be asking yourself at this point why DBD::ODBC bound the data as a value type of SQL_C_CHAR and the answer is backwards compatibility i.e., that is what it did for a long time before support for the Unicode API was added.

So what if we force DBD::ODBC to bind the data as SQL_WCHAR?

Example 2: Insert/Select with non-Unicode Built DBD::ODBC Forcing SQL_WCHAR

The code for this is nearly identical to the above except we add a bind_param call and import :sql_types from DBI.

use 5.008001;
use strict;
use warnings;
use DBI qw{:utils :sql_types};

my $unicode = "\x{20ac}";       # Unicode euro symbol
my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;

eval {$h->do(q/drop table unicode_test/)};
$h->do(q/create table unicode_test (a nvarchar(20))/);

my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
$s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
$s->execute($unicode);

my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
my $data = $r->[0];
print "DBI describes data as: ", data_string_desc($data), "\n";
print "Data Length: ", length($data), "\n";
print "hex ords: ";
foreach my $c(split(//, $data)) {
    print sprintf("%x,", ord($c));
}
print "\n";

and the output is:

DBI describes data as: UTF8 off, non-ASCII, 3 characters 3 bytes
Data Length: 3
hex ords: e2,82,ac,

Exactly the same as before. Why? The TYPE argument passed to bind_param sets the SQL type (the parameter type) in the SQLBindParameter call, not the value type, so the data was still bound with a value type of SQL_C_CHAR.

Reading Properly Written Unicode in non-Unicode Built DBD::ODBC

Now what if the Unicode Euro was inserted correctly by something else and we want to read it using a non-Unicode built DBD::ODBC?

Example 3: Unicode from non-Unicode Built DBD::ODBC

We now have a valid Unicode Euro symbol in an nvarchar column in the database (do not worry about how for now; this just shows what happens when the data in the database is correct but you use the wrong method to retrieve it).

use 5.008001;
use strict;
use warnings;
use DBI qw{:utils};

my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;

my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
my $data = $r->[0];
print "DBI describes data as: ", data_string_desc($data), "\n";
print "Data Length: ", length($data), "\n";
print "hex ords: ";
foreach my $c(split(//, $data)) {
    print sprintf("%x,", ord($c));
}
print "\n";

which outputs:

DBI describes data as: UTF8 off, non-ASCII, 1 characters 1 bytes
Data Length: 1
hex ords: 80,

To be honest, what you get back in data here very much depends on the ODBC driver and platform. On Windows, you would probably get the above because 0x80 is the Windows-1252 character for a Euro. (If it had been something not in Windows-1252, it would probably have returned a question mark.) With some Unix MS SQL Server ODBC drivers you could get any of the following (and perhaps more):

The point of the illustration is that you do not really want to do any of the above unless you have absolutely no choice.

You might be saying to yourself: yes, but you can set a type in the bind_col method, so you can control how the data is returned to you. That is mostly untrue for just about every Perl DBD I know of, and although DBD::ODBC does let you override the default type in a bind_col call, it only honours the override for decimals and timestamps.
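For illustration, this is what such an attempted override looks like (a sketch only; as just described, DBD::ODBC ignores this TYPE override for anything other than the decimal/timestamp cases, and SQL_VARCHAR comes from DBI's :sql_types import tag as used in Example 2):

my $s = $h->prepare(q/select a from unicode_test/);
$s->execute;
my $data;
# DBI allows a TYPE attribute here, but DBD::ODBC only honours it
# for decimals and timestamps, so this changes nothing for our column
$s->bind_col(1, \$data, {TYPE => SQL_VARCHAR});
while ($s->fetch) {
    print $data, "\n";
}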

Using Varchar Columns Instead of Nvarchar Columns for Unicode Data

If you are using DBD::ODBC before 1.46_1, do not do this. There is a bug in DBD::ODBC before 1.46_1 that means it does not look at the Perl scalars you are binding for input and it always binds them using the type the driver describes the column as (which will always be SQL_CHAR for a varchar column).

Generally speaking, you should use nchar/nvarchar columns when you need to support multiple languages in the same column, although even that is not always necessary, e.g., you can support English, German and Italian in one Windows codepage. A better recommendation would be to use nchar/nvarchar columns for user-provided data that is unconstrained, and varchar columns for data that is constrained and which you control, e.g., a number plate, a serial number, etc.
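A sketch of that recommendation as DDL (the table and column names here are invented for illustration):

# user-provided, unconstrained data goes in nchar/nvarchar;
# constrained data you control can safely be char/varchar
$h->do(q/create table vehicle (
    owner_name   nvarchar(100),  -- any language, any script
    number_plate varchar(10)     -- constrained, ASCII only
)/);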

However, in the spirit of describing why, let us look at some examples. These examples assume we are now using a Unicode built DBD::ODBC (see above) and you have a Unicode-aware ODBC driver.

So we return to our first simple example, but now run it with a Unicode built DBD::ODBC, use a varchar column and try 2 different bind types (the default and an overridden one):

Example 4: Simple Insert/Select with Unicode Built DBD::ODBC Using Varchar

use 5.008001;
use strict;
use warnings;
use DBI qw{:utils :sql_types};

my $unicode = "\x{20ac}";       # unicode euro symbol
my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;

eval {$h->do(q/drop table unicode_test/)};
$h->do(q/create table unicode_test (a varchar(20))/);

my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
$s->execute($unicode);
$s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
$s->execute($unicode);

my $r = $h->selectall_arrayref(q/select a from unicode_test/);
foreach my $r (@$r) {
    my $data = $r->[0];
    print "DBI describes data as: ", data_string_desc($data), "\n";
    print "Data Length: ", length($data), "\n";
    print "hex ords: ";
    foreach my $c(split(//, $data)) {
        print sprintf("%x,", ord($c));
    }
    print "\n";
}

which outputs:

DBI describes data as: UTF8 on, non-ASCII, 3 characters 6 bytes
Data Length: 3
hex ords: e2,201a,ac,
DBI describes data as: UTF8 on, non-ASCII, 1 characters 3 bytes
Data Length: 1
hex ords: 20ac,

Here again, you will get different results depending on platform and driver.

I imagine this is really going to make you wonder what on earth has happened here. Bear in mind, in Perl, the Euro is internally encoded in UTF-8 as 0xe2,0x82,0xac.

In the first insert, DBD::ODBC did what it always did: it asked the database what the column type was, the database returned SQL_CHAR, and the Euro was bound as a SQL_CHAR (the bug). In the second case, we overrode DBD::ODBC and told it to bind the data as SQL_WVARCHAR.

When we retrieved the data, DBD::ODBC bound the column as SQL_WCHAR (which it always does in a Unicode build).

As far as MS SQL Server is concerned, this is a varchar column; we asked it to insert 3 characters with codes 0xe2, 0x82 and 0xac, and reading them back as binary data confirms that is what is in the database. However, where did the character with code 0x201A come from? When DBD::ODBC read the data back, it bound the column as SQL_C_WCHAR and hence asked SQL Server to convert the characters in the varchar column to wide (UCS-2 or UTF-16) characters, and character 0x82 in the Windows-1252 character set (which I was using when running this code) is a curved quote with Unicode value U+201A. Characters 0xe2 and 0xac in Windows-1252 map to the same numeric code points in Unicode (U+00E2 and U+00AC), which is why they appear unchanged.

In the second row, we bound the data as SQL_WCHAR for the insert (and the database mapped those Unicode characters to the codepage for the table) and the data was bound as SQL_WCHAR for the select. The driver maps the codepage characters retrieved back to Unicode, and we get back what we inserted (so long as all the characters we inserted fit in the current codepage). However, had we tried to insert a character not in the Windows-1252 codepage, SQL Server would have substituted that character with a '?'. We should not have had to override the bind type here, and that was the bug in DBD::ODBC pre 1.46_1.

This last point is really important. On Windows, the driver knows your codepage and can do this, but on Unix it is not uncommon for you to have to tell the driver what codepage you want to use for the conversion. E.g., with the Easysoft SQL Server ODBC Driver you could set "Client_CSet = WINDOWS-1252" in your data source, or add "use_lcid=Yes" to your connection and the driver will work out the correct codepage for you.
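A connection sketch for a Unix box (the DSN name and credentials here are invented; the attribute names are those documented for the Easysoft driver and usually live in your odbc.ini data source, but they can also be passed in the connection string; adjust for your own driver):

use DBI;

# hypothetical DSN "mssql"; substitute your own user and password
my $h = DBI->connect(
    "dbi:ODBC:DSN=mssql;use_lcid=Yes",  # or e.g. Client_CSet=WINDOWS-1252
    "username", "password",
) or die $DBI::errstr;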

Here is a Windows specific version of the above test with a few more enhancements:

#
# A simple demonstration of why you cannot use char and varchar columns
# in MS SQL Server to store Unicode. char and varchar columns use a codepage
# and Unicode characters inserted into them are converted to the codepage.
# If a conversion does not exist in the codepage the characters which don't
# convert will become '?'
#
# Show the difference between binding as SQL_CHAR and SQL_WCHAR.
#
# See http://msdn.microsoft.com/en-us/library/bb330962.aspx#intlftrql2005_topic2
#
use 5.008.001;
use strict;
use warnings;
use DBI qw(:sql_types);
use Data::Dumper;
use Win32::API;

# chcp on my machines normally gives 850
# http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/chcp.mspx?mfr=true
#
# we want windows-1252 so run: chcp 1252 first

#use open qw( :encoding(Windows-1252) :std );
binmode STDOUT, ":encoding(cp1252)";

# get active codepage and ensure it is cp1252
# http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using
Win32::API::More->Import("kernel32", "UINT GetConsoleOutputCP()");
my $cp = GetConsoleOutputCP();
print "Current active console code page: $cp\n";
if ($cp != 1252) {
    print "Please change to codepage 1252 - run chcp 1252\n";
    die "Incompatible active codepage - please change to codepage 1252 by running chcp 1252\n";
}

my $h = DBI->connect() or die $DBI::errstr;
$h->{RaiseError} = 1;
$h->{PrintError} = 1;

eval {$h->do(q/drop table varchar_test/)};
$h->do(q/create table varchar_test(a varchar(20) collate Latin1_General_CI_AS)/);

# note codepage 1252 is 255 chrs including the euro at 0x80
# windows-1252 does not contain U+0187 but it does contain
# the euro symbol (U+20AC), the curved quotes (U+201A),
# Latin Small Letter F with hook (U+0192) and dagger (U+2020)
# mixing code pages in SQL Server is not recommended
my $insert = $h->prepare(q/insert into varchar_test (a) values(?)/);
my $data = "\x{20ac}\x{201A}\x{192}\x{2020}\x{187}";
{
    use bytes;
    print "encoded length of our data is:", length($data), "\n";
    print "encoded data in hex is:";
    foreach my $b(split(//, $data)) {
        print sprintf("%x,", ord($b));
    }
    print "\n";
}
# this execute will discover the column is varchar and bind the perl scalar
# as SQL_CHAR, meaning the UTF-8 encoded data in the perl scalar
# will be inserted as separate characters, not all of which will even
# be translatable to the current codepage.
$insert->execute($data);
# Now we force DBD::ODBC to insert the parameter as SQL_WVARCHAR
$insert->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
$insert->execute($data);

print "\nNotice in the first row (which was inserted as SQL_CHAR) the UTF-8 stored in the\n",
      "perl scalar is mostly stored as individual characters, but then you will be wondering\n",
      "why a few of the characters seem to come back as Unicode. Windows sees the individual\n",
      "bytes in the UTF-8 sequence as characters in the windows-1252 codepage, and the UTF-8\n",
      "sequence contains some bytes which are windows-1252 characters that map back to\n",
      "Unicode characters. e.g., the UTF-8 sequence for the euro is e2,82,ac and Windows\n",
      "sees the 82 as the curved quote in windows-1252, but when you ask for it back as\n",
      "wide/Unicode characters it maps it to U+201A (the curved quote Unicode code point).\n";
print "\nNotice how in the second row the last character is a '?'. That is because U+0187\n",
      "does not exist in the windows-1252 codepage our column is using.\n";
my $r = $h->selectall_arrayref(q/select a from varchar_test/);
print Dumper($r);
foreach my $row (@$r) {
    print $row->[0], "\n";
}
$h->disconnect;

The Correct Way to do Unicode with DBD::ODBC and SQL Server and Why

When retrieving rows on Windows or Unix (and using DBD::ODBC built for Unicode support), your char, varchar, nchar and nvarchar columns should all be correct. Even when you use char and varchar that use a codepage, because DBD::ODBC asks for the data as SQL_WCHAR, SQL Server will convert any character in the codepage to a Unicode codepoint and DBD::ODBC will encode them as UTF-8 and mark them Unicode to Perl.

When inserting Unicode, DBD::ODBC will normally just do the right thing, i.e., use SQL_WCHAR for nchar/nvarchar columns, but if your column is a char/varchar then prior to 1.46_1 it may do the wrong thing by default. Until 1.46_1, DBD::ODBC ignored your Perl data and bound it as the type the driver reported for the parameter; in 1.46_1 and beyond, DBD::ODBC first looks at the scalar you supplied for the parameter to see if it has the utf8 flag set.
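If you have to insert Unicode into char/varchar columns with a DBD::ODBC older than 1.46_1 (or just want to be explicit), the workaround shown in Example 4 applies: override the parameter type yourself. A minimal sketch, reusing the unicode_test table from the earlier examples:

use DBI qw(:sql_types);

my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
# force the parameter to be sent as wide characters rather than the
# SQL_CHAR the driver describes a varchar column as
$s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
$s->execute("\x{20ac}");

Remember the caveat above: SQL Server will still map those Unicode characters into the column's codepage, so characters with no codepage equivalent become '?'.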

Surrogate pairs (or Unicode Code Points Above U+FFFF)

ODBC supports Unicode in the form of SQL_WCHAR types and Unicode versions of the ODBC API. The encoding form that ODBC expects for data used with Unicode API functions is UCS-2 (or at least that is how it looked when Windows and MS SQL Server were using UCS-2).

In anticipation of things changing, when Unicode support was added to DBD::ODBC, the internal code actually converts Perl UTF-8 encoded strings into UTF-16, which for characters in the Basic Multilingual Plane is identical to UCS-2. As there were no Unicode drivers supporting supplementary characters (above U+FFFF) at the time, this could not be proven to be a good decision. However, at worst, it meant code using Unicode outside the Basic Multilingual Plane would just insert what looked to the database like more characters.
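You can see what that UTF-16 conversion produces for a supplementary character using the core Encode module; for example, U+2070E encodes as the surrogate pair 0xD841,0xDF0E (a small sketch):

use Encode qw(encode);
my $utf16 = encode("UTF-16LE", "\x{2070E}");   # what gets sent to the driver
print length($utf16), " bytes: ",
    join(",", map { sprintf "%04x", $_ } unpack("v*", $utf16)), "\n";
# prints: 4 bytes: d841,df0e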

Older Versions of MS SQL Server and Surrogate Pairs

As it turned out, the decision in DBD::ODBC to use UTF-16 anticipated exactly what MS did, and versions of MS SQL Server like 2000, 2005 and 2008 are described as surrogate-neutral rather than surrogate-aware. MS had this to say at the time:

Both SQL Server 2000 and SQL Server 2005 can store surrogate pairs, even though UCS-2 is not aware of surrogates. SQL Server treats the surrogate pairs as two undefined Unicode characters rather than as a single character. Such applications are usually referred to as surrogate-neutral or surrogate-safe, meaning that there is no intrinsic ability to interact with the data, but at least the data can be stored without loss.

However, there are a few things you should be aware of when using these older MS SQL Server versions that are only surrogate-neutral:

Newer Versions of MS SQL Server and Surrogate Pairs

Newer versions of SQL Server (2012 and later, version >= 11) support surrogate pairs, but you must set the collation to one ending in "_SC", e.g., Latin1_General_100_CI_AS_SC. When you do this, string functions will recognise surrogate pairs and all of the problems listed above for older SQL Servers are fixed.

Is my SQL Server Surrogate-Neutral or Surrogate-Aware?

Here is a small script you can use to test whether your SQL Server is surrogate-neutral or surrogate-aware:

# Test to see if your SQL Server is surrogate-aware or just surrogate-neutral
use 5.008.001;
use strict;
use warnings;
use DBI qw(:utils);

my $h = DBI->connect() or die $DBI::errstr;
$h->{RaiseError} = 1;
$h->{PrintError} = 0;

eval {$h->do(q/drop table surrogate_pairs/)};

# It is possible to set the collation at instance or database level.
# Set it on the column to make sure that, initially, we are using a
# non supplementary character collation.
$h->do(q/create table surrogate_pairs (a nvarchar(20) collate Latin1_General_100_CI_AI)/);

my $insert = $h->prepare(q/insert into surrogate_pairs values(?)/);

# Insert test supplementary character
print "Inserting Unicode character U+2070E into db\n";
$insert->execute("\x{2070E}");

# now read it back and see what we get
print "\nNote when we select this character back it is still 1 Unicode character and 4 bytes\n",
      "and the ord is correct at 0x2070E. This is because DBD::ODBC received a buffer of\n",
      "SQL_WCHAR characters back from SQL Server which it then decoded as UTF-16, which\n",
      "recognises the surrogate pair. This is why SQL Servers using this collation (and\n",
      "older SQL Servers) are known as surrogate-neutral.\n";
my $r = $h->selectrow_arrayref(q/select a from surrogate_pairs/);
print data_string_desc($r->[0]), "\n";
print "ord(chr): ", sprintf("0x%x", ord($r->[0])), "\n";

# This is a non _SC collation, so the length function returns "2".
print "\nNote in the following that len(a) returns 2 not 1 as SQL Server has not recognised this as a surrogate pair.\n";
$r = $h->selectrow_arrayref(q/select len(a) from surrogate_pairs/);
print "length in database is: ", $r->[0], "\n";

# now try to alter the table to change the collation to Latin1_General_100_CI_AS_SC
# which only later SQL Servers (>= version 11, i.e., 2012) can do.
# Unfortunately older SQL Servers don't error if you try to change the collation
# to one it does not support so we cannot test by just trying to change to a
# surrogate aware collation.
$h->do(q/alter table surrogate_pairs alter column a nvarchar(20) collate Latin1_General_100_CI_AS_SC/);

$r = $h->selectrow_arrayref(q/SELECT SERVERPROPERTY('ProductVersion')/);
my $version = $r->[0];
print "\nYour SQL Server is version: $version\n\n";

if ((split(/\./, $version))[0] >= 11) {
    print "Your SQL Server is surrogate-aware\n";
    $r = $h->selectrow_arrayref(q/select a from surrogate_pairs/);
    print data_string_desc($r->[0]), "\n";
    print "ord(chr): ", sprintf("0x%x", ord($r->[0])), "\n";

    print "\nNote in the following that len(a) returns 1 as SQL Server in this collation recognises surrogate pairs\n";
    $r = $h->selectrow_arrayref(q/select len(a) from surrogate_pairs/);
    print "length in database is: ", $r->[0], "\n";
} else {
    print "Your SQL Server is surrogate-neutral but not surrogate-aware\n";
}
$h->disconnect;

References

Resources
