CP951 — Big5HKSCS encoding on Windows

CP950, the implementation of the Big5 character mapping in Traditional Chinese Windows, is well known among Chinese Windows users (actual mapping table available here). In contrast, CP951, its counterpart for the Big5HKSCS encoding (which adds Hong Kong specific characters to Big5), is almost unknown to casual Windows users, even in Hong Kong. This is partly because it is rarely mentioned (there is no known documentation on the internet, not even on the Microsoft web site), and partly because no effort has been spent on advertising the Microsoft HKSCS package (香港增補字符集).

Put simply, CP951 becomes available after you install HKSCS support on top of Traditional Chinese Windows, which is necessary for displaying Hong Kong characters under Windows. After the package is installed and activated, the code page in use silently changes from CP950 to CP951. This causes compatibility problems with other Big5 extensions, such as the so-called Unicode 補完計劃, which also replaces the CP950 code page.

The CP951 code page is the same across Windows 2000 and XP: it is equivalent to HKSCS-2001, as stated on the Microsoft website, BUT without the compatibility points defined in this table published by the Hong Kong government. Besides, it maps all non-BMP characters to the Unicode PUA (U+E000 - U+F8FF) instead of the Unicode SIP (U+20000 - U+2FFFF). In other words, it is HKSCS-2001 under the ISO/IEC 10646-1:2000 standard (equivalent to Unicode 3.0) without GCCS compatibility.
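To make the PUA versus SIP distinction concrete, here is a minimal Python sketch that classifies a code point using the two ranges quoted above. The function name classify_code_point and the labels it returns are made up for illustration; they are not part of any Microsoft or HKSCS tooling.

```python
def classify_code_point(cp):
    """Classify a Unicode code point using the ranges quoted in the text.

    The labels are illustrative only (hypothetical helper).
    """
    if 0xE000 <= cp <= 0xF8FF:
        return "PUA"        # BMP Private Use Area, where CP951 puts non-BMP HKSCS characters
    if 0x20000 <= cp <= 0x2FFFF:
        return "SIP"        # Supplementary Ideographic Plane (plane 2)
    if 0 <= cp <= 0xFFFF:
        return "BMP"        # any other Basic Multilingual Plane code point
    return "other"          # remaining planes, or out of range
```

Running every Unicode value in the extracted mapping through such a check would show how many CP951 entries land in the PUA.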

HKSCS-2004 support, available since Vista, is a bit different: it exists only in Unicode form. That is, there is no Big5HKSCS encoding any more, and there are fewer compatibility problems as a result.

But getting useful information takes some effort, as there are no tools that read .nls files directly. Here is the procedure to assemble the CP951 mapping as text:

  1. Download the HKSCS support file and extract c_951.nls from the zip file. (Also download here)
  2. Use the following Unix command to get useful information (thanks to Konstantin Kazarnovsky’s NLS-file structure page):
    od -v -w2 -A n -t x2 -j 1056 -N 64511 c_951.nls | awk 'BEGIN {a=0x8100} {printf "0x%X\t", a++; print "U+" toupper($1)}' | grep -v 'U+003F' > cp951.txt
    (download cp951.txt here)
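The od option -w is a GNU extension, and hexadecimal constants in awk are not supported by every implementation, so here is a rough Python sketch of the same pipeline. It assumes the table is stored as little-endian 16-bit words, which is how od -t x2 reads it on an x86 machine. The offset, length and starting code are taken straight from the command above; the function names are made up for illustration.

```python
import struct

def dump_dbcs_table(data, offset=1056, length=64511, first_code=0x8100):
    """Return (big5hkscs_code, unicode_code_point) pairs from raw .nls bytes.

    Mirrors the od/awk/grep pipeline: skip `offset` bytes, read `length`
    bytes as 16-bit words, number them from `first_code`, drop U+003F.
    Assumption: words are little-endian.
    """
    chunk = data[offset:offset + length]
    if len(chunk) % 2:
        chunk += b"\x00"  # od pads a trailing odd byte with zero; do the same
    values = struct.unpack("<%dH" % (len(chunk) // 2), chunk)
    pairs = []
    for index, code_point in enumerate(values):
        if code_point == 0x003F:  # same filter as grep -v 'U+003F'
            continue
        pairs.append((first_code + index, code_point))
    return pairs

def format_pairs(pairs):
    """Format pairs in the same 0xXXXX<TAB>U+XXXX layout as cp951.txt."""
    return "".join("0x%X\tU+%04X\n" % pair for pair in pairs)
```

To use it, read c_951.nls in binary mode, pass the bytes to dump_dbcs_table, and write the result of format_pairs to cp951.txt.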

The resulting text file contains the Big5HKSCS → Unicode mapping of c_951.nls; the Unicode → Big5HKSCS mapping can be extracted with a similar method. This document will be updated when the necessary Unix command is found. As an interesting side note, there is a field in the .nls file header that gives the code page number. Despite the file name (c_951.nls), the internal code page number in the .nls file is 950, since it is supposed to be a drop-in replacement for CP950.
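The header field can be checked with a few lines of Python. This sketch assumes that the header starts with two little-endian 16-bit words, the header size followed by the code page number. That layout is an assumption on my part rather than something stated on this page, so verify it against the NLS-file structure page before relying on it. The function name is made up for illustration.

```python
import struct

def read_nls_code_page(data):
    """Return the code page number stored in an .nls header.

    Assumption: word 0 is the header size and word 1 is the code page,
    both little-endian 16-bit values.
    """
    _header_size, code_page = struct.unpack_from("<HH", data, 0)
    return code_page
```

If the layout assumption holds, calling this on the bytes of c_951.nls should return 950 rather than 951, as described above.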
