Add Uchar module to the standard library. #80
@since 4.03 *)

type t
Should it be abstract or a `private int`?
Good question. I personally have no problem in doing
match Uchar.to_int u with
| 0x000A -> ...
...
I don't know what the dev team's stance is on using so-called "language extensions" in the stdlib.
You would still need to write `Uchar.to_int u` or write a coercion if `t` was defined as `private int`. Having `t = private int` is more of an optimization: if the compiler knows that `Uchar.t` is always represented by an immediate value, the code generator can skip calls to `caml_modify` and/or float array checks.
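To make the `private int` alternative concrete, here is a minimal sketch. The module name `Uchar_private` and the helper `is_line_feed` are hypothetical and not part of the patch; the point is only that a coercion to `int` is still needed before matching on integer literals.

```ocaml
(* Hypothetical variant of the proposed module where [t] is exposed as
   [private int] instead of being fully abstract. *)
module Uchar_private : sig
  type t = private int
  val of_int : int -> t
end = struct
  type t = int

  (* Unicode scalar values: 0x0000..0xD7FF and 0xE000..0x10FFFF. *)
  let is_scalar_value i =
    (0x0000 <= i && i <= 0xD7FF) || (0xE000 <= i && i <= 0x10FFFF)

  let of_int i =
    if is_scalar_value i then i else invalid_arg "Uchar_private.of_int"
end

(* Clients can read the integer through a coercion, but cannot forge a [t]
   without going through [of_int]. *)
let is_line_feed (u : Uchar_private.t) : bool =
  match (u :> int) with
  | 0x000A -> true
  | _ -> false
```

With a fully abstract `t`, the `(u :> int)` coercion would simply be replaced by a call to `to_int`.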
I like this idea of only adding standard types in the compiler library. It makes interoperability much easier and still doesn't require Inria people to support and maintain such complicated things as comprehensive Unicode support... I don't see any drawback to this PR.
I think this is an excellent idea!
(** [equal u u'] is [u = u']. *)

val compare : t -> t -> int
(** [compare u u'] is [Pervasives.compare u u']. *)
Could you add a hash function? Just an alias for `to_int`, but it is useful for applying the `Hashtbl.Make` functor.
Right, added a hash function.
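As an illustration of the request above, a sketch of a hash table keyed by Unicode scalar values. It assumes a compiler whose standard library ships a `Uchar` module like the one proposed here; `Uchar_key`, `Utbl` and `names` are made-up names, and `hash` is defined locally as the `to_int` alias the reviewer describes.

```ocaml
(* Key module for [Hashtbl.Make]: it only needs [t], [equal] and [hash]. *)
module Uchar_key = struct
  type t = Uchar.t
  let equal = Uchar.equal
  (* A scalar value is already a small non-negative integer. *)
  let hash = Uchar.to_int
end

module Utbl = Hashtbl.Make (Uchar_key)

(* Hypothetical table mapping a few scalar values to their names. *)
let names : string Utbl.t = Utbl.create 16

let () =
  Utbl.replace names (Uchar.of_int 0x00E9) "LATIN SMALL LETTER E WITH ACUTE";
  Utbl.replace names (Uchar.of_int 0x03A9) "GREEK CAPITAL LETTER OMEGA"
```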
I agree it is a nice idea to add the abstract datatype in the standard library, and only that. What is the opinion of other Unicode OCaml library makers? @yoriyuki @alainfrisch
I do not see the point of adding a Uchar module without a standard Unicode string data type and literals. They are needed for precisely the same reason as Uchar: interoperability between Unicode processing libraries. We do not need normalization etc. inside the stdlib. That said, adding Uchar is a good step toward more satisfactory Unicode support in OCaml. I have only minor comments.
On Monday, 14 July 2014 at 12:57, Yoriyuki Yamagata wrote:
I disagree with that. If you introduce a Unicode string data type and literals, then you most likely also want pattern matching on them. And if you want pattern matching on them you need to take normalization into account; in particular you want to be able to specify in which normalization form your literal is supposed to be, otherwise it is useless, deceiving and could even be the source of a new class of potential security bugs. Formal Unicode string literals without normalization would be irresponsible IMHO. It is currently perfectly possible to write unnormalized UTF-8 literals in OCaml, which is entirely sufficient for many programs out there, and a function away from translating into the representation of your particular library, at a negligible initial runtime cost. Introducing the Uchar module greatly enhances the possibility of modular implementations of Unicode and allows, for example, ulex to talk to uunf with strong invariants guaranteed by the abstraction.
"Noncharacter code points are reserved for internal use, such as for sentinel values. They should never be interchanged. They do, however, have well-formed representations in Unicode encoding forms and survive conversions between encoding forms. This allows sentinel values to be preserved internally across Unicode encoding forms, even though they are not designed to be used in open interchange."
"Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD replacement character, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters."
As such it's better if we have a way to represent these characters, since UTF-X decoders can then pass them to the application, which is then free to take the appropriate context-dependent action.
Best, Daniel
[1] https://sympa.inria.fr/sympa/arc/caml-list/2007-10/msg00475.html
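The quoted recommendation can be sketched on top of the proposed type. This assumes a standard library providing `Uchar.to_int` and `Uchar.of_int`; `is_noncharacter` and `sanitize` are illustrative names, not part of the patch. The noncharacters are U+FDD0..U+FDEF plus the last two code points of every plane.

```ocaml
(* True for the 66 Unicode noncharacters: U+FDD0..U+FDEF and every code
   point whose low 16 bits are 0xFFFE or 0xFFFF. *)
let is_noncharacter (u : Uchar.t) : bool =
  let i = Uchar.to_int u in
  (0xFDD0 <= i && i <= 0xFDEF) || (i land 0xFFFE = 0xFFFE)

(* One possible application-level policy: replace noncharacters received
   from the outside by U+FFFD rather than deleting them. *)
let sanitize (u : Uchar.t) : Uchar.t =
  if is_noncharacter u then Uchar.of_int 0xFFFD else u
```

Because noncharacters are still scalar values, a decoder can hand them over as `Uchar.t` and leave this decision to the application.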
On the latter two points, I now concur. I am not against merging your patch. On the first point:
If you mean that comparison and pattern matching should always respect canonical equivalence, and that all string literals should be in normal form, then I disagree with you. Code-point comparison has its place, for example the comparison used in binary trees such as OCaml's Set. String literals in non-normalized form also have their place, for example when passing strings to legacy encodings. Unicode security is a complex issue. Leave it to the programmer; we should be satisfied that the compiler and libraries provide the necessary tools.
Using a raw UTF-8 encoded byte string as an alternative to a proper Unicode string is a troubling tendency. The UTF-8 encoding can be broken, which creates serious security issues. It is much worse than your normalization apocalypse. But this topic (whether we need a standard Unicode string or not) is not related to your patch. If you want to continue the discussion, let us move to caml-list.
On Monday, 14 July 2014 at 16:08, Yoriyuki Yamagata wrote:
That's exactly not what I said. First, I never talked about comparison at all; pattern matching is about equality, and what I was precisely suggesting is that the equality you'd like (i.e. the underlying Unicode equivalence) depends on context, which is why literals should be able to indicate the normal form you want them to be in, in order to be useful in pattern matching. You could say we want the literal notation without the pattern matching, but that would feel odd as this would mismatch all other literal notations we have in the language.
That's precisely the aim of this proposal.
I don't think so. You are not supposed to, and cannot, use them blindly: if you do any processing with them you must have them go through some validating function (which will detect malformed sequences), if only to be able to normalize them so that you can match them against normalized user-provided input. Best, Daniel
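A sketch of such a validating function, under a loud assumption: it uses `String.get_utf_8_uchar` and the `Uchar.utf_decode_*` accessors, which only exist in OCaml 4.14 and later and therefore postdate this discussion. At the time, an external decoder such as `uutf` would have played this role. The name `decode_utf_8` is made up.

```ocaml
(* Validate a UTF-8 encoded byte string and return its scalar values,
   or [None] as soon as a malformed sequence is found.
   Assumes OCaml >= 4.14 for the decoding primitives. *)
let decode_utf_8 (s : string) : Uchar.t list option =
  let rec loop i acc =
    if i >= String.length s then Some (List.rev acc)
    else
      let d = String.get_utf_8_uchar s i in
      if not (Uchar.utf_decode_is_valid d) then None
      else
        loop (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  loop 0 []
```

Normalization is deliberately left out: it needs Unicode data tables and belongs in an external library.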
Comparison has a broader meaning, which includes equality tests, I think. Although my Set example uses comparison in the narrow sense, there are plenty of cases in which code-point equality tests are used (say, hash tables). As for pattern matching, code-point comparison is enough. If you need canonical equivalence or something else, you can preprocess the input and put literals into normal form by hand, or use when clauses.
Of course we must have them validated, but the type system gives no guarantee that such validation is performed. Having an abstract Unicode string type enforces validation and increases safety.
On Monday, 14 July 2014 at 17:48, Yoriyuki Yamagata wrote:
I think you are making this discussion more confusing than it should be. Binary comparison which includes binary equality has its uses, especially when you have normalized your inputs including your string literals and you actually know in which normal form they are.
Well it's enough if you want people to write broken Unicode programs. Making a normal form by hand is certainly painful and when clauses are impossible: you need to normalize the literal constant of the pattern, otherwise you are just acting on variables, which you can already do perfectly well right now:
let ustr nf s = (* function that validates the UTF-8 encoded s and normalizes to nf *)
match ustr `NFD x with
Overall I think that Unicode string literals without pattern matching and normalization are just a waste of time for everybody. Daniel
I think you miss my points.
My point here is that there are cases where binary comparison and equality are enough, or even necessary, without normalization. The first example is data structures that only require a consistent equality or ordering over Unicode strings. The second example is interacting with a legacy encoding which, say, distinguishes Ω (unit) and Greek Ω.
Again, you miss my point. My point is that, by introducing an abstract Unicode string type, we can have the type system enforce that the internal representation of a Unicode string (say, UTF-8) is valid. We need string literals just as a convenience for writing down values of such an abstract data type. We do not need pattern matching for this purpose. Besides, if you use a UTF-8 encoded byte string to represent a Unicode string, a.[0], a.[1]... are bytes of the UTF-8 encoded string, not the first and second Unicode characters. I think it is conceptually ugly.
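A minimal sketch of the abstract string type argued for here, to show what "enforced by the type system" would mean. `Utf8_string` and its functions are hypothetical names, and the validity check relies on `String.is_valid_utf_8`, which is only available from OCaml 4.14 (an assumption, since it did not exist when this was written).

```ocaml
(* Hypothetical abstract type: a value of type [Utf8_string.t] can only be
   built through [of_string], so it is always valid UTF-8. *)
module Utf8_string : sig
  type t
  val of_string : string -> t option
  val to_string : t -> string
end = struct
  type t = string
  let of_string s = if String.is_valid_utf_8 s then Some s else None
  let to_string s = s
end

(* Indexing a plain [string] yields bytes, not characters: U+00E9 is
   encoded on two bytes. *)
let e_acute = "\xC3\xA9"
let byte_length = String.length e_acute
let first_byte = e_acute.[0]
```

Whether such a type should also come with literals and pattern matching is exactly what the two participants disagree on.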
They are certainly not the average case. There may be a few specific cases, or some data sets may give you the illusion that this is the case, until you fall on a damned decomposed é. Even if you want to deal with something "relatively simple" like latin1 characters it's not going to be enough. Better not to lure programmers into fallacies; it seems they already have a hard enough time understanding all of this. I think you miss both the social and technical point here.
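The "decomposed é" pitfall can be reproduced with plain byte strings and no library at all. In this sketch the names are illustrative; the two literals are the UTF-8 encodings of U+00E9 and of U+0065 followed by U+0301, which are canonically equivalent but differ byte for byte.

```ocaml
(* Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE. *)
let precomposed = "\xC3\xA9"

(* Decomposed form: U+0065 followed by U+0301 COMBINING ACUTE ACCENT. *)
let decomposed = "e\xCC\x81"

(* Matching on a literal is a byte-wise comparison, so only one of the two
   canonically equivalent spellings is recognized. *)
let is_e_acute (s : string) : bool =
  match s with
  | "\xC3\xA9" -> true
  | _ -> false
```

Both strings render as é, yet `is_e_acute decomposed` is false unless the input is normalized first.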
I perfectly get that point: it has the same basis as this very proposal, on which we agree. Sure, it would be useful. But then it's much more contentious; for example I expect there will already be disagreement over the actual internal representation (e.g. I would make them immutable arrays of ints, not UTF-8 encoded strings) and over what the minimal support should be (as we have right at the moment). Then if you want to introduce literals you will need to hook a UTF-8 decoder into the compiler, then you will need to find an actual syntax in the very crowded surface syntax of OCaml, and this for not much gain in my opinion, that is unless we get pattern matching and normalization, which, unlike what you suggest, is a basic need in most cases to perform correct Unicode processing. I prefer nothing to broken things that will confuse everyone. I prefer small things that improve my coding life to nothing because the change was too invasive.
I don't like the idea of having literals on which you cannot pattern match. This is conceptually ugly.
As I already said on the caml-list, indexing Unicode characters is worthless in general. From an abstract character point of view, for layout purposes, etc., direct indexing doesn't bring you anything, so I don't really care about that, and in real programs it has never been a problem for me not to have direct indexing. The UTF-8 encoded source files/strings may not be a perfect solution but they work well enough in real programs. Having that as a basis we can move to consolidate it, step by step.
It's not a concept! It was not made for that… It's a way to move forward. Progress is made in small steps. I'm already glad we don't have the conceptual mess other languages have with their Unicode support. Again, rather have nothing than broken things. The actual literal notation you'd like is a function call away; from a pragmatic point of view I'd say it is not at the moment (if ever) worth pursuing the idea (that is unless the dev team is willing to commit to some form of useful Unicode string support in the compiler).
(** [to_int u] is [u] as an integer. *)

val is_char : t -> bool
(** [is_char u] is [true] iff [u] is a latin1 OCaml character. *)
It was suggested that this function should be named `is_valid` because we don't want to encourage opening this module, and `Uchar.is_char` is ugly.
I don't see the connection to opening the modules. Why not another name, but `Uchar.is_valid` wouldn't make sense at all: we are talking about a function that checks whether [u] can be represented by `char`. Maybe `is_latin1`? That would make it less consistent with `Uchar.of_char` and `Uchar.to_char`, but why not. What do you think?
I think the question was rather on is_uchar.
Ah! Makes more sense. OK, I'll rename it.
Oops, sorry for the misleading typo...
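For reference, the `is_char` check discussed in this thread is meant to guard the conversion back to `char`. A short sketch, assuming the `Uchar.is_char`, `Uchar.to_char` and `Uchar.of_char` functions named in the thread; `to_latin1_opt` is a made-up helper.

```ocaml
(* Convert a scalar value to an OCaml [char] when it fits in latin1
   (U+0000..U+00FF), and report failure otherwise. *)
let to_latin1_opt (u : Uchar.t) : char option =
  if Uchar.is_char u then Some (Uchar.to_char u) else None
```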
Daniel, in your first comment, you emphasized "in the standard library". Can you provide some more justification for that? (In particular, with the advent of OPAM simplifying the writing of new libraries, could we put this in a "base Unicode" library that the other Unicode libraries all depend on?)
On Tuesday, 4 November 2014 at 11:50, Mark Shinwell wrote:
We could of course publish this module separately but it would be a real maintenance burden (not code-wise, infrastructure-wise) for such small functionality: 31 loc which are basically cast in stone. In the end every program using some form of Unicode character (and which doesn't these days?) would end up with this tiny package in its dependency list and the only benefit would be, in my opinion, to introduce noise in the whole infrastructure; e.g. if you take Best, Daniel
Renamed
Removed UTF-8 comment as per request.
I'm in favor of adding this to the stdlib.
Is there anything blocking this from being merged into trunk now? It would be useful to be able to start depending on it, and putting a transitional package into OPAM for older compiler revisions (as we did for
I wouldn't mind merging it if there was a clear consensus in favor, but right now I'm not sure there is; apparently it wasn't discussed at the last developer meeting? Maybe you could ask other developers for their opinion.
It seems this PR goes against the very idea of the stdlib. So let's just close this.
Reopening. Whilst I appreciate Daniel's frustration, this is a pull request with fairly broad support that I would very much like to see merged.
(** [compare u u'] is [Pervasives.compare u u']. *)

val hash : t -> int
(** [hash u] associates a non negative integer to [u]. *)
I knew something was wrong with this otherwise stellar pull request: "non negative" should be either "non-negative" or "nonnegative" (in case you find hyphens outrageous). Thank God we caught this early!
Thanks. Dash added.
Let's merge it now.
@damiendoligez any reason not to merge it yourself?
Minor nitpicks: can you add an entry to Changes and update copyright headers to 2015 for new files?
@alainfrisch @damiendoligez If the Changes and copyright changes are holding up a merge, I can submit a separate PR with those changes after this gets in.
Add Uchar module to the standard library.
That would be very nice of you!
I don't see why copyright dates should be changed; they all correspond to the year when the code was written.
Yeah ok, what matters is really the Changes file.
As I already made clear in previous discussions on the caml-list, I find that OCaml's current support for Unicode is outstanding (literally as well as figuratively).
I don't think introducing a Unicode string data structure and a corresponding syntax for literals would be a good thing to do. If one wanted to do that in a correct and useful way, it would entail importing a good deal of the Unicode processing machinery (e.g. normalization) into the compiler, and I really think it's better to leave that outside the compiler. Unicode processing can perfectly well be left to a set of modularized, external libraries. I also think it's actually a good idea to proceed that way, as libraries are in a better position to evolve with the standard (e.g. newly encoded characters on Unicode standard updates may imply changes to normalization results and would entail updates to the compiler).
There is however one thing that I really find missing to get utterly excellent Unicode support in OCaml: an abstract datatype, in the standard library, to represent a Unicode scalar value (by abusing terminology: a Unicode character). A Unicode scalar value is simply an integer in the ranges `0x0000…0xD7FF` or `0xE000…0x10FFFF`.
Such a data type would allow independent libraries dealing with Unicode characters (e.g. `ulex`, `camomile`, `uutf`, `uunf`, `uucp`, `uucd`) to interchange data without relying on `int`s and as such strengthen the abstractions and guarantees a bit: avoid documentation warnings blabla that the given `int`s need to be in the above range, avoid needless (re)checks if data flows among modules, well you get the idea, the basic advantages of data abstraction...
This proposal simply adds such a minimal data type along with a few functions which by themselves don't do much except integrating with the standard library; doing real Unicode processing is left to external libraries, as it should be.
One question is whether a `Pervasives.uchar` type equal to `Uchar.t` should be introduced (not part of this proposal). I don't think it's essential, it could be a nice touch though.
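The range condition in the description can be written out as follows. This is a sketch assuming the `Uchar.is_valid` and `Uchar.of_int` functions of the standard library module that resulted from this proposal; `is_scalar_value` and `uchar_of_int_opt` are illustrative names.

```ocaml
(* A Unicode scalar value is any code point outside the surrogate range
   0xD800..0xDFFF. *)
let is_scalar_value (i : int) : bool =
  (0x0000 <= i && i <= 0xD7FF) || (0xE000 <= i && i <= 0x10FFFF)

(* Checked construction: the range test happens once, at the boundary, and
   every function taking a [Uchar.t] afterwards can rely on it. *)
let uchar_of_int_opt (i : int) : Uchar.t option =
  if Uchar.is_valid i then Some (Uchar.of_int i) else None
```

This is the "avoid needless (re)checks" argument in practice: a decoder checks the range when it builds the value, and a normalizer receiving a `Uchar.t` does not have to check it again.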