README-unicode

Modula-3 support for full-range Unicode characters.  

The language has these changes:

ORD(LAST(WIDECHAR)) = 16_10FFFF, the entire code point range specified
by the Unicode standard.

Wide character and text literals can have Unicode escapes of the form
\Uhhhhhh or \uhhhhhh, where each h is a hexadecimal digit, in either 
upper or lower case.  It is a static error if the code point value 
exceeds the range of Unicode-sized WIDECHAR.  When the compiler is
configured for 16-bit WIDECHAR, values outside the 16-bit range are
converted to the Unicode "replacement" character, VAL(16_FFFD,WIDECHAR), 
with a warning.

The subtype and assignability rules are relaxed as if CHAR and
WIDECHAR were the same base type.  This allows assignments between
these and subranges thereof, as with subranges of a single base type.
Runtime range checks are performed when necessary, in the same way.

Implementation changes.

BITSIZE(WIDECHAR) = 32. 

Encoding, decoding, and streams.

Package libunicode contains new procedures for handling encoding of
various encodings of characters.  It will compile only by a compiler
configured for Unicode-range WIDECHAR.  However, its m3makefile does
not attempt to compile it unless Unicode_WIDECHAR (see below) is 
non-empty, so build scripts work even with a compiler configured for
16-bit WIDECHARs.  

There are 9 different encodings possible, including the 5 defined in
the Unicode standard, the two that older Modula-3 systems use, and two
transitional UCS encodings.  See UniEncoding.i3 for their definitions.

Interface UniCodec.i3 provides lower-level, single character encoding
and decoding procedures for the various encodings.

Interfaces UniWr.i3 and UniRd.i3 are for entire streams.  These act as
filters on a preexisting stream.  They are connected at open time to a
stream, then used as a stream substitute.  They are designed to be as
close as reasonable to Wr.i3 and Rd.i3.  Many, though not all, calls
on procedures in Wr and Rd can be straightforwardly replaced by
same-named procedures in the new interfaces.

The aforementioned interfaces do their own synchronization, and thus
provide atomic operations.  The Unsafe* interfaces provide equivalent
functions, but do not synchronize, expecting their callers to do it, or
ensure it is unnecessary.

Consistency of WIDECHAR size.

It would be chaotic if code compiled with different sizes of WIDECHAR
were to exchange values thereof.  The compiler prevents such code from
being linked together, without attempting to check whether WIDECHAR
values are actually exchanged.  Within a package, it automatically
recompiles the entire package, if it was previously compiled with a
different size WIDECHAR.  When doing so, it displays a message to this
effect.  Between different packages, this is not possible, so it just
detects the difference, displays a message, and stops.  Older
compilers will not complete successfully in either of these cases, but
the message will not be informative:

Configuring the size of WIDECHAR 

By default, the compiler will make WIDECHAR 16-bit.  To change this,
add the line Unicode_WIDECHAR="TRUE" (or any other non-empty text
value) to any Quake code that will be interpreted before compilation
starts.  The easiest place is in cm3.cfg, in the bin directory where
the compiler executable is installed, usually /usr/local/cm3/bin.
Assigning "" to Unicode_WIDECHAR or leaving it undefined will revert
to 16-bit WIDECHAR.

Because of its insistence that all linked-together code have the same
size WIDECHAR, when the compiler is reconfigured for a different size
WIDECHAR, it is necessary to recompile all libraries used, starting
with m3core, which every program implicitly uses.  The size can be
overridden by command line options -widechar-uni or widechar-16.
However, this is likely to be tedious because of the consistency
requirement.

Compatibility of WIDECHARs in pickles.

Programs with different WIDECHAR size can interchange pickles
containing WIDECHARs in either direction, if both are linked to a
post-Unicode libm3core, which understands both sizes of WIDECHAR.  A
code point outside the 16-bit range, when read from a pickle into a
program compiled with 16-bit WIDECHAR is converted to the Unicode
"replacement" character, VAL(16_FFFD,WIDECHAR).

Pickles written by pre-Unicode libm3core and containing WIDECHARs
(they will be 16-bit) can be read by post-Unicode libm3core.  The
reverse is not true.  A pre-Unicode libm3core reading a pickle that
was written by a program compiled with Unicode-sized WIDECHAR will
raise an exception, even if the pickle contains no actual WIDECHARs.

Network object compatibility.

The compatibility rules for WIDECHARs transferred by network object
calls are the same as for pickles.  But remember that network object
calls involve two-way transfer of data, and thus will, in general,
require the two-way compabibility of post-Unicode libm3core and
post-Unicode m3netobj to work.