Easysoft ODBC-SQL Server Driver

How do I insert Unicode supplementary characters into SQL Server from Perl?

Article:
01043
Last Reviewed:
22nd January 2013
Revision:
1

Unicode is an international character set standard that is capable of representing characters in all written languages. The Unicode standard specifies a numeric value (code point) and a name for each of its characters. The most frequently used characters have code point values that will fit into a 16-bit word in memory and on disk. (A word is the native unit of storage on a particular machine.) Characters whose code point values are larger than 0xFFFF require two consecutive 16-bit words. These characters are called supplementary characters, and the two consecutive 16-bit words are called surrogate pairs.

There are a number of Unicode encoding forms. For example, UTF-8, UTF-16, UTF-32 and UCS-2. An encoding form defines how a Unicode code point is stored as a sequence of bytes.

UCS-2 is a subset of UTF-16. UCS-2 is identical to UTF-16 except that UTF-16 also supports supplementary characters.

ODBC supports Unicode in the form of Unicode data types and Unicode versions of the ODBC API. The encoding form that ODBC expects for data used with Unicode API functions is UCS-2.

The Perl DBD::ODBC module, when built with Unicode support (the -u switch specified when configuring DBD::ODBC, i.e. perl Makefile.PL -u) is an ODBC application that uses the Unicode ODBC APIs and data types and passes UTF-16 encoded data to these APIs. As mentioned, for characters whose code point values are smaller than 0xFFFF (the Basic Multilingual Plane), UTF-16 is identical to UCS-2.

Because Perl DBD::ODBC uses UTF-16, it is possible to use the module to insert supplementary characters. However, this does not mean that the target database will treat them as a single Unicode character.

SQL Server 2008 and earlier does not treat supplementary characters as a single Unicode character. For example, its length function returns 2 for a supplementary character. SQL Server 2012 introduced a set of supplementary character collations (which have the suffix "_SC") that support supplementary characters.

To illustrate this, the following Perl script uses the SQL Server ODBC driver to insert a supplementary character into a SQL Server database. Initially, the collation for the target column is a non _SC collation. If SQL Server instance is 2012 or later, the collation is changed to an _SC one. The length of the supplementary character is reported before and after changing the collation:

$ perl supplementary_character.pl
2
1
#!/usr/bin/perl -w
use strict;
use DBI;

# Replace datasource_name with the name of your data source.
# Replace database_username and database_password
# with the SQL Server database username and password.
my $data_source = q/dbi:ODBC:datasource_name/;
my $user = q/database_username/;
my $password = q/database_password/;

my $dbh = DBI->connect($data_source, $user, $password)
    or die "Can't connect to $data_source: $DBI::errstr";

# It is possible to set the collation at instance or database level.
# Set it on the column to make sure that, initially, we are using a
# non supplementary character collation.
$dbh->do(q/create table unicode (a nchar(20) collate Latin1_General_100_CI_AI)/);

# Insert test supplementary character taken from:
# http://www.i18nguy.com/unicode/supplementary-test.html
my $sth = $dbh->prepare(q/insert into unicode values(?)/);
$sth->execute("\x{2070E}");

# This is a non _SC collation, so the length function returns "2".
$sth = $dbh->prepare(q/select len(a) from unicode/);
$sth->execute();

print $sth->fetchrow_array . "\n";

# _SC collations were introduced in SQL Server 2012.
$sth = $dbh->prepare(q/SELECT SERVERPROPERTY('ProductVersion')/);

$sth->execute;

my ($version) = $sth->fetchrow_array;
my @major_version = split (/\./,$version);

# If we are connecting to a SQL Server 2012 instance or later, change the
# column collation to an _SC one. The column length is now "1".
if ($major_version[0] >= 11) {
   $dbh->do(q/alter table unicode alter column a nchar(20) collate Latin1_General_100_CI_AS_SC/);
   $sth = $dbh->prepare(q/select len(a) from unicode/);
   $sth->execute();
   print $sth->fetchrow_array . "\n";

}

$sth->finish;

See Also
Applies To

Knowledge Base Feedback

* Did this content help you?
* Please select one option based on your first choice:

(* Required Fields)