Mandalika's scratchpad: Sun Studio C/C++: Support for UTF-16 String Literals

Text has traditionally been stored as ASCII codes. ASCII is a 7-bit code that defines 128 characters, and each character is usually stored in one byte (8 bits); even the 8-bit extensions built on top of it can represent at most 256 (2⁸) characters. Because of this limit, ASCII cannot accommodate the characters needed for languages such as Chinese, Japanese and the Indic languages. ASCII is adequate as long as we restrict ourselves to the English alphabet and the most commonly used punctuation and technical symbols, but not if we want our applications to support a wide range of international languages.

Unicode

This problem can be alleviated by increasing the number of bits used to store each character. The Unicode standard evolved to specify how text is represented in modern software products and standards, so that data can be transported through many different systems without corruption.

Unicode provides a unique number for every character, no matter what the platform, the program or the language. The Unicode standard defines three encoding forms (UTF-8, UTF-16 and UTF-32) that allow the same data to be transmitted in a byte (8-bit), word (16-bit) or double word (32-bit) oriented format. All three can be efficiently transformed into one another without any loss of data. Of the three, UTF-16 is extremely popular because most of the heavily used characters fit into a single 16-bit code unit, which can take 65,536 (2¹⁶) distinct values and occupies only 2 bytes in memory. The less common characters outside that range are encoded as a pair of 16-bit code units (a surrogate pair).
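To make the 16-bit encoding form concrete, here is a minimal sketch in standard C of how one Unicode code point maps to UTF-16 code units. The helper name utf16_encode is hypothetical; it is not part of Sun Studio or of any standard library, and the sketch assumes unsigned short is at least 16 bits wide.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode code point as UTF-16 code units.
 * Writes 1 or 2 units into out[] and returns how many were written,
 * or 0 if cp is not a valid code point to encode.
 * utf16_encode is a hypothetical helper, not a library routine. */
size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    assert(out != NULL);

    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return 0;                       /* reserved for surrogates */
    if (cp > 0x10FFFFUL)
        return 0;                       /* beyond the Unicode range */

    if (cp < 0x10000UL) {
        out[0] = (unsigned short)cp;    /* fits in one code unit */
        return 1;
    }

    cp -= 0x10000UL;                    /* 20 bits remain */
    out[0] = (unsigned short)(0xD800UL + (cp >> 10));     /* high surrogate */
    out[1] = (unsigned short)(0xDC00UL + (cp & 0x3FFUL)); /* low surrogate */
    return 2;
}
```

For instance, U+0041 ('A') comes out as the single unit 0x0041, while U+1F600 needs the two units 0xD83D and 0xDE00.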

A single char cannot hold a 16-bit Unicode code unit, and hence we cannot use the string routines supplied with the standard C library on UTF-16 data. The unsigned short data type can be used to represent UTF-16 code units instead, since it occupies 2 bytes of storage on the platforms these compilers target. This also avoids the need to introduce a new data type into the compilers.
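A small sketch of why the byte-oriented routines do not work on such data: every ASCII-range character stored in a 16-bit unit carries one zero byte, so strlen stops almost immediately. The array below is spelled out element by element so that it compiles without any compiler extension; the names hello16 and strlen_on_utf16 are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hello" as 16-bit code units, followed by a terminating zero unit. */
const unsigned short hello16[] = { 'H', 'e', 'l', 'l', 'o', 0 };

/* Deliberately misuse strlen on 16-bit data to show what goes wrong.
 * Each ASCII-range unit contains a zero byte (the first byte on a
 * big-endian machine, the second on a little-endian one), and strlen
 * treats that byte as the end of the string. */
size_t strlen_on_utf16(const unsigned short *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

The array holds 5 code units before its terminator, yet strlen_on_utf16(hello16) reports 0 or 1 depending on byte order, never 5.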

Sun's support for UTF-16 string literals

16-bit character string literals are not yet part of the C or C++ standards. To address the needs of customers who develop or support internationalized applications, Sun introduced limited support for string literals of 16-bit (UTF-16 and UCS-2) characters as a language extension in Sun Studio Compiler Collection 8.

By default, the C and C++ compilers do not recognize 16-bit character string literals. To make them recognize such literals and convert them to UTF-16 strings in the object file, use the compiler switch -xustr=ascii_utf16_ushort. Since the U"ASCII_String" syntax may become standard in the future, Sun adopted it for 16-bit character string literals. Note that a non-ASCII character in a character or string literal is an error.

e.g., U"SomeString";

A character literal (U'c') has type const unsigned short, and a string literal (U"string") has type array of const unsigned short.

Be aware that there is no library of supporting routines for such strings or characters; users have to write their own string handling and I/O routines. One obvious reason for the lack of a supporting library is that the feature is non-standard, and it is not easy to predict what the standards committees will eventually adopt. In the worst case, Sun could end up supporting a library that conflicts with a standard one.
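As a rough idea of what such hand-written routines might look like, here is a minimal sketch of a length function, a comparison function and a narrowing helper for output. The names ustr_len, ustr_cmp and ustr_to_ascii are hypothetical and chosen for illustration; nothing here is supplied by Sun Studio.

```c
#include <assert.h>
#include <stddef.h>

/* Number of 16-bit code units before the terminating zero unit. */
size_t ustr_len(const unsigned short *s)
{
    const unsigned short *p = s;
    assert(s != NULL);
    while (*p != 0)
        ++p;
    return (size_t)(p - s);
}

/* Compare two strings unit by unit; returns <0, 0 or >0 like strcmp. */
int ustr_cmp(const unsigned short *a, const unsigned short *b)
{
    assert(a != NULL && b != NULL);
    while (*a != 0 && *a == *b) {
        ++a;
        ++b;
    }
    return (*a > *b) - (*a < *b);
}

/* Narrow a string into a char buffer so it can be printed with the
 * usual stdio routines. Units above 0x7F are replaced with '?'.
 * Writes at most cap-1 characters plus '\0'; returns the count written. */
size_t ustr_to_ascii(const unsigned short *s, char *out, size_t cap)
{
    size_t n = 0;
    assert(s != NULL && out != NULL && cap > 0);
    for (; s[n] != 0 && n + 1 < cap; ++n)
        out[n] = (s[n] <= 0x7F) ? (char)s[n] : '?';
    out[n] = '\0';
    return n;
}
```

Note that ustr_cmp orders strings by code unit value, which is not the same as code point order once surrogate pairs are involved. When compiling with -xustr=ascii_utf16_ushort, the arguments to these routines could be U"..." literals rather than hand-built arrays.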

Here's an example:

% cat unicode.c
const unsigned short *dummy = U"dummy";
const unsigned short unicodestr[] = U"UnicodeString";

const unsigned short *greet() {
        return U"Hello!";
}

This code has to be compiled with the -xustr=ascii_utf16_ushort option for the compiler to recognize these string literals and convert them to UTF-16 strings in the object file. The option is the same for both the C and C++ compilers.

C

% cc -w -c unicode.c
"unicode.c", line 1: undefined symbol: U
"unicode.c", line 1: non-constant initializer: op "NAME"
"unicode.c", line 1: syntax error before or at: "dummy"
"unicode.c", line 2: non-constant initializer: op "NAME"
"unicode.c", line 2: syntax error before or at: "UnicodeString"
"unicode.c", line 5: syntax error before or at: "Hello!"
cc: acomp failed for unicode.c

% cc -w -c -xustr=ascii_utf16_ushort unicode.c

% ls -l unicode.o
-rw-rw-r--   1 gmandali ccuser      1376 Jul 29 18:19 unicode.o

C++

% CC -c unicode.c
"unicode.c", line 1: Error: U is not defined.
"unicode.c", line 1: Error: Badly formed expression.
"unicode.c", line 2: Error: U is not defined.
"unicode.c", line 2: Error: Badly formed expression.
"unicode.c", line 5: Error: U is not defined.
"unicode.c", line 5: Error: Badly formed expression.
6 Error(s) detected.

% CC -c -xustr=ascii_utf16_ushort unicode.c

% ls -l unicode.o
-rw-rw-r--   1 gmandali ccuser      1256 Jul 29 18:20 unicode.o

__________________
Technorati tags: Sun Studio | C | C++

Friday, July 29, 2005
