Unicode #48
Comments
I'll study your text ASAP and call back. |
@igor-liferenko has written UTF-8 code for |
can i contribute on this issue?? |
Why not? It's open source. At the moment, however, I have no time to dig into this issue. I'm still processing the contributions in |
Hi The UTF-8 implementation is in files comm-utf8.ch, cweav-utf8.ch, ctang-utf8.ch, ASCII.w and mapping.w https://github.com/igor-liferenko/cweb (the last change in ctang-utf8.ch assumes that your compiler supports utf8) |
i am a beginner can you guide me from where should i start? |
First run Second read the changes for “TeX Live”, i.e., the |
@igor-liferenko Is collation (according to the Unicode collation algorithm) still an open question? |
Simply store characters in xchr array in the necessary order. |
It is in file mapping.w |
CWEB now runs with LuaTeX. Time permitting, I'll try to write a set of |
There are a couple open issues about UTF-8 and Unicode. I was going to write this as a comment on one of them, but I wanted to make a new issue to address Unicode support in general.
(I'm happy to begin working on Unicode implementation, as soon as the issues mentioned below are discussed.)
I have been contemplating what it would take to integrate Unicode into CWEB. There are several things to consider. I am assuming that UTF-8 is the only input/output encoding that need be supported.
What should the internal representation of characters be?
UTF-8. Most declarations could remain char without fear; however, at some point a certain amount of decoding is required. The full extent depends on how much error checking we want to do and on the preferred action of CTANGLE. As an aesthetic choice, eight_bits or a new, synonymous type octet could be substituted for char when the value is an octet of UTF-8 input.
UTF-32. Most declarations related to characters would have to use uint_fast32_t or whatever (probably hidden behind a code_point typedef) instead of char. The other major issue is that ASCII characters, which constitute the majority of typical C text, unconditionally occupy four times more storage than is necessary. But this isn't the greatest concern nowadays. It is convenient that every character takes up a single value.
The programs often advance to the next character in a string by incrementing a pointer by 1. If UTF-8 is chosen as the internal representation, then all such increments will have to be adjusted to compensate. Using UTF-32 would avoid this problem.
In summary, storing characters in UTF-32 form takes up more space, forces encoding/decoding, and requires altering most declarations related to characters; storing characters in UTF-8 saves space and allows declarations to remain unchanged, but most operations on characters would have to be changed.
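To make the trade-off concrete, here is a minimal sketch of what "advance to the next character" looks like if UTF-8 is kept internally. The names utf8_decode and utf8_count are hypothetical (nothing like them exists in CWEB today); code_point is the typedef name floated above. This is one possible shape under those assumptions, not a proposal for the final interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_fast32_t code_point; /* typedef name taken from the discussion above */

/* Decode one code point from the n octets starting at s.
   On success, store it in *out and return the number of octets consumed (1..4).
   Return 0 if the input is empty, truncated, or not well-formed UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t n, code_point *out)
{
    /* smallest code point that legitimately needs a sequence of each length */
    static const code_point min_value[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len, i;
    code_point c;

    assert(s != NULL && out != NULL);
    if (n == 0) return 0;
    if (s[0] < 0x80) { *out = s[0]; return 1; }       /* ASCII */
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;        /* stray continuation or invalid lead octet */

    if (n < len) return 0;                            /* truncated */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;          /* not a continuation */
        c = (c << 6) | (code_point)(s[i] & 0x3F);
    }
    if (c < min_value[len]) return 0;                 /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;         /* surrogate */
    if (c > 0x10FFFF) return 0;                       /* beyond Unicode */
    *out = c;
    return len;
}

/* Count the code points in a buffer: what a "p++" loop turns into.
   Returns (size_t)-1 if the buffer is not well-formed UTF-8. */
size_t utf8_count(const unsigned char *s, size_t n)
{
    size_t count = 0, used;
    code_point c;

    while (n > 0) {
        used = utf8_decode(s, n, &c);
        if (used == 0) return (size_t)-1;
        s += used;
        n -= used;
        count++;
    }
    return count;
}
```

With UTF-32 storage the second function collapses to the buffer length, which is the convenience argument made above.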
Encoding or decoding could happen at the following points:
When CTANGLE is reading @'…'. We probably want to extend the notation so that it “expands” into the ordinal value of any single character in the string, provided that that character corresponds to one code point. (Thus no notice is taken of combining characters.)
When CTANGLE is converting names for output, if it must transliterate (see the heading “Transliteration” below).
It might be easier to do encoding/decoding manually, not by trying to use any of C's “wide character” facilities. (Frankly, I find them obnoxious. Also, many uses of C input/output functions would have to be changed.)
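As a rough idea of how little code "doing it manually" takes, here is a sketch of a hand-written encoder that uses no wide-character facilities. The function name utf8_encode is made up for illustration; it is not part of CWEB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point c as UTF-8 into buf, which must hold at least 4 octets.
   Returns the number of octets written (1..4), or 0 if c is a surrogate
   or lies beyond U+10FFFF. */
size_t utf8_encode(uint_fast32_t c, unsigned char buf[4])
{
    assert(buf != NULL);
    if (c < 0x80) {
        buf[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (c >> 6));
        buf[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        if (c >= 0xD800 && c <= 0xDFFF) return 0; /* surrogates are not characters */
        buf[0] = (unsigned char)(0xE0 | (c >> 12));
        buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c <= 0x10FFFF) {
        buf[0] = (unsigned char)(0xF0 | (c >> 18));
        buf[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+20AC this writes the octets E2 82 AC, which are 342 202 254 in octal; the code point itself is 20254 in octal, so each continuation octet carries exactly two octal digits of it.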
One good thing about UTF-8 is that it is quite naturally expressed in octal, so CWEB's preference for octal could be maintained through the transition.
Unicode character data
In any case, the hardest part about supporting Unicode beyond simple encoding and decoding is dealing with the Unicode character database. Unicode 13.0 assigns (gives meaning to) 143 859 out of 1 114 112 possible code points. Every character has many properties that describe it.
Unicode distributes a bunch of plain text files that contain the property data for all characters. Unfortunately, there is no file that consolidates all information into one place, except for the Unicode XML database.
I'm going to ignore the task of reading the data in for now. The more interesting problem is this: How do we store information about every character? A full implementation of Unicode would be forced to have a way to get the value of any property, but CWEB needs only a limited set.
Width.
CWEB's error reporting routine indicates the current position in the buffer by splitting the offending line at that position and printing the second part on a new line, indented so that it starts in the column where the first part ended.
The problem is that the code assumes that all characters occupy the same amount of horizontal space. In reality, some characters have no width, some are wider than one column, etc. The amount of effort it would take to get this correct probably far outweighs the utility of the feature. But it's certainly possible; GCC handles cursor position in Unicode input just fine.
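For a sense of what "getting this correct" would involve, here is a deliberately tiny sketch that counts display columns instead of characters. The function names are hypothetical, and the four ranges below are only a hand-picked sample; a real implementation would need the full width data from the Unicode character database.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Columns occupied by one code point. ASSUMPTION: only a few sample
   ranges are covered; everything else is treated as one column wide. */
static int cp_width(uint_fast32_t c)
{
    if (c >= 0x0300 && c <= 0x036F) return 0; /* combining diacritical marks */
    if (c == 0x200B) return 0;                /* zero width space */
    if (c >= 0x4E00 && c <= 0x9FFF) return 2; /* CJK unified ideographs */
    if (c >= 0xAC00 && c <= 0xD7A3) return 2; /* Hangul syllables */
    return 1;
}

/* Number of columns needed to indent past the first n code points of s. */
size_t display_columns(const uint_fast32_t *s, size_t n)
{
    size_t cols = 0, i;

    assert(s != NULL || n == 0);
    for (i = 0; i < n; i++)
        cols += (size_t)cp_width(s[i]);
    return cols;
}
```

The error routine would then indent by display_columns of the text already read, rather than by its length.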
Transliteration.
For CTANGLE, we must be able to associate some string of text with a character, defining its transliteration. All that's needed is a char *.
C99 and C++98 added a syntactic feature called a “universal character name”, which is basically a four- or eight-digit hexadecimal character code embedded in regular source text. For example, a\u200Bb gives you ab, where the two characters are separated by a zero-width space. According to Annex D of the C standard and lex.name.allowed in the C++ standard, this is a perfectly valid identifier. However, both languages prohibit many characters from appearing as universal character names in identifiers. It is tempting to change CTANGLE's default transliteration to insert an equivalent universal character name, but the restrictions complicate matters.
Normalization.
Some strings of Unicode characters are effectively identical while not being exactly (i.e., numerically) equal. For example, a precomposed character like “ü” (U+00FC LATIN SMALL LETTER U WITH DIAERESIS) should usually be treated identically to its decomposed counterpart “ü” (U+0075 LATIN SMALL LETTER U and U+0308 COMBINING DIAERESIS).
Therefore Unicode defines (in UAX 15) a process of normalization, which converts strings to a canonical form. There are a few kinds of normalization, depending on whether you want to tend towards decomposing characters or towards composing characters and how you want to handle compatibility characters.
Several properties are associated with normalization, including Canonical_Combining_Class (a nonnegative integer below 256), Decomposition_Type (one of sixteen values), and Decomposition_Mapping (a string of at most eighteen code points).
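To illustrate why the precomposed and decomposed spellings of “ü” only compare equal after normalization, here is a toy sketch of canonical decomposition. The two-entry table is a stand-in for the real Decomposition_Mapping data, and the names are invented for this example. It performs a single decomposition step and does no canonical reordering, so it is far from a complete implementation of UAX 15.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_fast32_t code_point;

struct decomp { code_point from, base, mark; };

/* ASSUMPTION: a tiny sample; the real table has thousands of entries. */
static const struct decomp tiny_decomp[] = {
    {0x00E9, 0x0065, 0x0301}, /* e with acute = e + combining acute accent */
    {0x00FC, 0x0075, 0x0308}, /* u with diaeresis = u + combining diaeresis */
};

/* Copy in[0..n-1] to out, replacing each precomposed character found in
   the table by its two-character decomposition. Returns the output
   length, or (size_t)-1 if out (capacity cap) is too small. */
size_t decompose(const code_point *in, size_t n, code_point *out, size_t cap)
{
    size_t i, k, len = 0;

    assert(cap == 0 || out != NULL);
    for (i = 0; i < n; i++) {
        const struct decomp *d = NULL;
        for (k = 0; k < sizeof tiny_decomp / sizeof tiny_decomp[0]; k++)
            if (tiny_decomp[k].from == in[i]) d = &tiny_decomp[k];
        if (len + (d ? 2u : 1u) > cap) return (size_t)-1;
        if (d) {
            out[len++] = d->base;
            out[len++] = d->mark;
        } else {
            out[len++] = in[i];
        }
    }
    return len;
}
```

Running both spellings through decompose yields the same two code points, U+0075 U+0308, so a plain element-by-element comparison works afterwards.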
It would probably be best for CWEB to normalize all strings before entering them into the character/byte memory.
Identifiers.
If we want “extended characters” to be allowed in identifiers, we need to know exactly which code points can begin an identifier and which code points can continue an identifier. Luckily there are properties just for this, thanks to UAX 31. Specifically, if a character has the property XID_Start, it can begin an identifier, and if a character has the property XID_Continue, it can be a part of an identifier.
(There are also ID_Start and ID_Continue. The X variants are adjusted so that identifiers remain identifiers after normalization.)
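Assuming the two properties are available as predicates, the identifier scan itself is short. In this sketch xid_start and xid_continue are hypothetical stand-ins that know only ASCII plus two sample code points; real versions would consult DerivedCoreProperties.txt. The extra allowance for a leading underscore is C's rule, not Unicode's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: stand-in for a real XID_Start lookup. */
static bool xid_start(uint_fast32_t c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || c == 0x00FC;                     /* u with diaeresis, a letter */
}

/* ASSUMPTION: stand-in for a real XID_Continue lookup. */
static bool xid_continue(uint_fast32_t c)
{
    return xid_start(c) || (c >= '0' && c <= '9') || c == '_'
        || c == 0x0308;                     /* combining diaeresis */
}

/* True if s[0..n-1] is a nonempty identifier: an XID_Start character
   (or '_', as C permits) followed by XID_Continue characters. */
bool is_identifier(const uint_fast32_t *s, size_t n)
{
    size_t i;

    assert(s != NULL || n == 0);
    if (n == 0) return false;
    if (!xid_start(s[0]) && s[0] != '_') return false;
    for (i = 1; i < n; i++)
        if (!xid_continue(s[i])) return false;
    return true;
}
```

Note that a combining mark may continue an identifier but may not begin one, which is one reason the two properties are kept separate.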
Collation.
Here's the big one. The entirety of CWEAVE's Phase III is devoted to sorting and outputting an index. Sorting the index involves putting names in order, according to a collating sequence; in the current version of CWEAVE, the collation is represented by the collate array. Unicode collation is much more complex, due to the expanded character set.
Full details of the Unicode collation algorithm can be found in UTS 10. It is based on four levels of comparison between strings. The specification requires that strings be normalized before comparison.
Collation needs a collation element table to work. The Default Unicode Collation Element Table (DUCET) is distributed as allkeys.txt; like the rest of the Unicode data, it is stored in a plain text file. In the DUCET, only three of the four levels of comparison are used, in order to allow implementations to extend the order for whatever internal reason. Other collation element tables exist for specific languages or conventions.
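The level-by-level idea can be sketched as follows, assuming each string has already been normalized and mapped to a sequence of collation elements. The struct layout and function names are invented for this example, and it handles only the three levels the DUCET uses; it skips everything else UTS 10 specifies (contractions, expansions, variable weighting).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One collation element: primary, secondary, tertiary weight.
   A weight of 0 means "ignorable at this level". */
struct coll_elem { uint16_t w[3]; };

/* Compare two element sequences at one level, skipping zero weights. */
static int compare_level(const struct coll_elem *x, size_t nx,
                         const struct coll_elem *y, size_t ny, int level)
{
    size_t i = 0, j = 0;

    assert(level >= 0 && level < 3);
    for (;;) {
        while (i < nx && x[i].w[level] == 0) i++;
        while (j < ny && y[j].w[level] == 0) j++;
        if (i == nx && j == ny) return 0;
        if (i == nx) return -1;             /* x ran out first: x sorts earlier */
        if (j == ny) return 1;
        if (x[i].w[level] != y[j].w[level])
            return x[i].w[level] < y[j].w[level] ? -1 : 1;
        i++;
        j++;
    }
}

/* Negative, zero, or positive, like strcmp: the first level at which
   the strings differ decides the order. */
int coll_compare(const struct coll_elem *x, size_t nx,
                 const struct coll_elem *y, size_t ny)
{
    int level, r;

    for (level = 0; level < 3; level++) {
        r = compare_level(x, nx, y, ny, level);
        if (r != 0) return r;
    }
    return 0;
}
```

The effect is that an accent only matters when the base letters tie, and case only matters when the accents tie as well, which is quite unlike indexing through a single collate array.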
Storing the data.
In general, we want a way to map a twenty-one-bit number (probably held in a thirty-two-bit integer) to some data structure containing the character properties we are interested in. Storing all the needed information straightforwardly in a statically-allocated array would occupy about 45 megabytes on a sixty-four-bit system. I'm counting one field of each of the following types per code point:
char *
uint8_t
short
char * or code_point * (depending on the internal representation of characters)
bool
bool
struct { uint16_t a, b, c, d; }
We would have the transliteration string be NULL if no transliteration was given; then CTANGLE would compute it automatically.
I think that more attributes must be stored for normalization, so 45 megabytes is really a lower bound.
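The 45-megabyte figure can be checked with a sketch like the one below. The field types are the ones counted above, in the same order; the field names, and the guesses in the comments about which property each field holds, are mine and should be read as assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CODE_POINTS 0x110000UL           /* 1 114 112 possible code points */

/* One record per code point. Field names are hypothetical. */
struct char_props {
    char *translit;                      /* transliteration, NULL if none given */
    uint8_t ccc;                         /* guess: Canonical_Combining_Class */
    short decomp_type;                   /* guess: Decomposition_Type */
    char *decomp_map;                    /* guess: Decomposition_Mapping */
    bool xid_start;                      /* guess: XID_Start */
    bool xid_continue;                   /* guess: XID_Continue */
    struct { uint16_t a, b, c, d; } coll; /* guess: collation weights */
};

/* Size in bytes of a flat array with one record for every code point. */
size_t props_table_bytes(void)
{
    assert(sizeof(struct char_props) <= (size_t)-1 / CODE_POINTS);
    return sizeof(struct char_props) * CODE_POINTS;
}
```

On a typical sixty-four-bit ABI the fields add up to 29 bytes, which alignment pads to 40; 40 times 1 114 112 is 44 564 480 bytes, in line with the estimate.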
There are many ways of compressing this, of course. Full Unicode implementations typically use a kind of trie for looking up properties, because the entire set of properties for a single character takes up a lot of space. Compression is also possible because long runs of characters tend to share properties.
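A two-stage table is about the simplest trie of this kind. The sketch below is only an illustration, with made-up names and a single property byte per character: the code point's high bits select a block, and every block that holds nothing but default values shares block 0. Real implementations also merge identical non-default blocks, which this sketch does not attempt.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_BITS 8
#define BLOCK_SIZE (1u << BLOCK_BITS)
#define NUM_BLOCKS (0x110000u >> BLOCK_BITS) /* 4352 first-stage entries */
#define MAX_DISTINCT 16                      /* ASSUMPTION: small demo limit */

static uint16_t stage1[NUM_BLOCKS];          /* 0 = the shared all-default block */
static uint8_t stage2[MAX_DISTINCT][BLOCK_SIZE];
static unsigned blocks_used = 1;             /* block 0 always exists */

/* Look up the property byte of cp: two array accesses, no search. */
uint8_t props_get(uint_fast32_t cp)
{
    assert(cp < 0x110000u);
    return stage2[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)];
}

/* Set the property byte of cp, giving its block private storage on the
   first write. Returns 0 on success, -1 if the demo limit is reached. */
int props_set(uint_fast32_t cp, uint8_t value)
{
    uint16_t *slot;

    assert(cp < 0x110000u);
    slot = &stage1[cp >> BLOCK_BITS];
    if (*slot == 0) {
        if (blocks_used == MAX_DISTINCT) return -1;
        *slot = (uint16_t)blocks_used++;     /* new block starts out all zero */
    }
    stage2[*slot][cp & (BLOCK_SIZE - 1)] = value;
    return 0;
}
```

With these sizes the tables occupy under 13 kilobytes until more blocks are needed, because the large unassigned stretches of the code space cost only their first-stage entries.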
Since CWEAVE doesn't do transliteration, and since CTANGLE doesn't do collation, the two areas of storage could be put into a union.
Actually getting the data.
I glossed over this earlier, but it's important. How can CWEB read the character information into memory? There is far too much to compile directly into the programs; should it be read at initialization? Ideally we could do what TeX does and save the program's state after initialization, but I'm not sure if there is a good, portable way.
The property information we want is found in the files UnicodeData.txt, DerivedCoreProperties.txt, allkeys.txt, and DerivedNormalizationProps.txt. Thus if CWEAVE or CTANGLE is starting up from scratch, it must read in four very large files.
Alternatively, we could write a program to extract only the relevant data from the relevant files and write it in an especially compact form to a new file, which would be read by CWEAVE and CTANGLE. I think that the most recent version of such a file should be distributed with CWEB, but I can certainly see arguments to the contrary.
[The program could be a more general utility (serving as another example of CWEB) that creates a compressed file containing a specified set of properties for each character. For instance, you might want to know only the names and aliases of characters; you can open the program, enter “name,alias”, and it would output a file accordingly.]
Or use a library.
I'm against this option. One of CWEB's appeals is that it is very easy to set up. It has no dependencies except on the C standard library; all you need is a C compiler to run CWEB. Existing Unicode implementations are bulky and annoying, and they wouldn't fit in with the rest of CWEB.