Unicode Explained

Reviewed by Major Keary

Unicode or Unicon? That is a question outside the scope of this review, but it is worth noting that two decades ago ISO found itself with multiple standards for the same character sets, a situation brought about by different "streams of interest": data processing and library automation. A decision was taken to develop a single code that would accommodate every known writing system; it would be a four-byte code, enabling every alphabet, syllabary, or character set to be represented in its own stand-alone section of the encoding space. Alas, DIS 10646 was shot down, primarily by American vested interests, and in effect we have been left with a two-byte system: Unicode. It has been criticised on a number of issues that are discussed in Unicode Explained.

A two-byte system provides for 65,536 code points, not all of which are available for allocation. That, said the Unicode lobby, is sufficient to cover all known writing systems. The trick, they claimed, was to unify the Han characters (those of Chinese origin). In the event, Unicode now has over 100,000 characters, which it accommodates through a hybrid arrangement (UTF-16's surrogate pairs) that stretches the original two-byte design to four bytes where needed.
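
To see what that hybrid arrangement means in practice, here is a minimal Python sketch (the particular ideograph is chosen purely for illustration): a character beyond the 65,536-place limit is stored in UTF-16 as a pair of 16-bit "surrogate" code units.

    # A character outside the original two-byte range (U+20000,
    # a CJK ideograph in a supplementary plane).
    ch = "\U00020000"
    print(hex(ord(ch)))                 # 0x20000
    # UTF-16 represents it as two 16-bit code units: a surrogate pair.
    utf16 = ch.encode("utf-16-be")
    units = [utf16[i:i + 2].hex() for i in range(0, len(utf16), 2)]
    print(units)                        # ['d840', 'dc00']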

Readers may also be interested to know that so-called ASCII was not an American initiative. In 1960 ISO decided to form a technical committee on computers and information processing, TC 97, and set up working groups. Various international organisations were asked to participate, including the European Computer Manufacturers Association (ECMA) and the American Standards Association (ASA), now known as ANSI. The American response was to set up the X3 Committee, made up of ten manufacturers' representatives, eleven general interest members, and ten user group members. Among the manufacturers was Minneapolis-Honeywell, and it is their records of X3's work and proceedings that illuminate the history of events. Those records are now part of the archives of the Charles Babbage Institute, University of Minnesota, Minneapolis.

ECMA was the first to come up with an agreed standard: "Technical Committee TC1 of ECMA met for the first time in December 1960 to prepare standard codes for Input/Output purposes. On 30th April 1965, Standard ECMA-6 was adopted by the General Assembly of ECMA." [Standard ECMA-6]

American proprietary interests were still bickering over things like inclusion of the reverse solidus. A first version of ASCII was agreed upon by X3 in 1963, but it did not include lower case letters (as specified by ISO and provided for in ECMA-6); inclusion of both upper and lower case letters did not occur in ASCII until 1967.

The definitive standard is ISO 646—a seven-bit encoding—first published in 1973 after agreement between all the players had been negotiated. As at 1988 there were thirty-seven national variants of ISO 646—some countries having more than one version (Yugoslavia had four).

Well, since then we have come to Unicode, and the problem is: how to apply it? The first thing that has to be grasped is that Unicode is not a gigantic 'font' or character set, and is not about glyphs (the particular shape or form of a given character, symbol, ideograph, etc.). It is a system of specifying code positions for characters; the final output depends on which glyph the selected font supplies for each code position.
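
A small Python sketch illustrates the distinction: the standard library can report the code position and standard name of each character, but what the character actually looks like is entirely a matter of the font used to display it.

    import unicodedata

    # Unicode assigns each character a code position and a name;
    # the glyph you see depends on the font that renders it.
    for ch in "Aé€":
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0041  LATIN CAPITAL LETTER A
    # U+00E9  LATIN SMALL LETTER E WITH ACUTE
    # U+20AC  EURO SIGN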

The question, "how do I apply Unicode?", does not have a simple answer. The reason for that is best articulated in Unicode Explained, a text quite different from the Unicode Consortium's somewhat abstract official documentation, which focuses on specifications. Unicode Explained deals with the issues of application. It also deals with problems that users may encounter, even to the extent of advising that, "for example, in a contract on building some software you might wish to specify that the product will conform to the Unicode standard …".

The intended audience includes anyone who is an "… end user of multilingual or specialised text-related applications … [such as one who] work[s] with texts containing mathematical or special symbols, or uses a multilingual database"; anyone who may be faced with text conversion tasks; a developer of internationalised software; and teachers or students of computer science. It is not necessary to have a programming background to use the book, even though the final chapters are about programming for Unicode compatibility.

Don't expect a panacea. It's one thing to have a specification, but quite another to have it written into software. There has been a considerable time lag in providing compatibility in applications such as word processors, database software, and web authoring programs. Unicode compliance has not been on many lists of must-have features and gets few index entries in the Linux literature; nor does Linux appear in the index of Unicode Explained. Linux does, however, have two Unicode commands, unicode_start and unicode_stop, which switch the console in and out of Unicode (UTF-8) mode. Even though the book discusses Unicode in the context of MS Windows programs, what it has to say is relevant to anyone developing open source software. A useful supplementary resource for Linux-specific information is Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux (http://cl.cam.ac.uk/~mgk25/unicode.html), which includes links to further information.
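
For readers heading to that FAQ, a minimal Python sketch shows the point of UTF-8 on Unix-like systems: the same code point takes different byte forms under the different Unicode encodings (the separator argument to .hex() assumes Python 3.8 or later).

    # One code point, U+20AC (EURO SIGN), in three Unicode encodings.
    s = "\u20ac"
    print(s.encode("utf-8").hex(" "))      # e2 82 ac
    print(s.encode("utf-16-be").hex(" "))  # 20 ac
    print(s.encode("utf-32-be").hex(" "))  # 00 00 20 ac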

In essence, there is no easy road to Unicode nirvana. The 'standard' changes; some software vendors are either slow to respond, or they take deviant positions; and there has been no real pressure to embrace Unicode. However, the tide is flowing and developers should make themselves aware of Unicode, its current state of development, and how various programming tools can—or will—handle it.

Unicode Explained is a primary source of information that is essential to an understanding of the 'standard'. The book puts the specification into the context of real-world application, throwing its net of coverage wide to include discussions of the structure of Unicode, character properties, the various encodings, Internet issues, and Unicode characters in programming.

Not a book of solutions per se, but a resource essential to finding solutions. Recommended as a library acquisition.

Jukka Korpela: Unicode Explained
ISBN 0-596-10121-X
Published by O'Reilly, 658 pp., RRP AU$ 110.00

The Australian distributor of Unicode Explained is Woodslane (www.woodslane.com.au)

