Better Unicode handling plan

Siddharth Agarwal edited this page May 20, 2016 · 3 revisions

The problem

Watchman is currently entirely oblivious to Unicode: on POSIX platforms, it treats all filenames as raw bytes internally, and on Windows it converts to and from UTF-8.

This has served us well so far, but has a number of issues:

  • Keys don't follow this rule: they're always UTF-8 (and in many cases ASCII-only).
  • Warnings are logically text and are typically ASCII, but we can include filenames in them, and we don't know what encoding those filenames are in. This is sometimes referred to as the makefile problem. This problem has no general, portable solution.
  • Python 2 is a big consumer of Watchman, and while it is efficient at representing ASCII text, it has an inefficient Unicode type. It's likely that any attempt to use Unicode strings will cause a noticeable performance regression.
  • Python 3 forces consumers to treat text and bytes as completely separate entities. That causes problems for things like warnings, which are partly text and partly bytes.
  • For filenames in results, reasonable programs in Python 3 will want to either:
      • Receive them as Unicode strings with surrogateescape, similar to what os.listdir('directory') would return.
      • Receive them as raw bytes, similar to what os.listdir(b'directory') would return.
  • For warnings, reasonable programs in Python 3 will want to either:
      • Receive them as valid Unicode strings.
      • Receive them as bytestrings, which might or might not make sense in a given encoding (in particular, they might or might not make sense in the local encoding).
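
To make the two filename options concrete, here is a minimal Python 3 sketch of how surrogateescape lets an undecodable filename survive a round trip through str. The filename and the choice of UTF-8 as the encoding are illustrative assumptions, not something Watchman prescribes.

```python
# A filename encoded as Latin-1; the byte 0xE9 is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

# Option 1: a Unicode string with surrogateescape, like os.listdir('directory').
# The undecodable byte 0xE9 is mapped to the lone surrogate U+DCE9.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# The original bytes can be recovered exactly with the same error handler.
round_tripped = text_name.encode("utf-8", errors="surrogateescape")

# Option 2: raw bytes, like os.listdir(b'directory'); nothing to decode.
bytes_name = raw_name
```

Note that a string containing lone surrogates cannot be encoded with the default strict handler, which is one reason warnings that embed such filenames are awkward to present as valid Unicode.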

The BSER layer in the clients doesn't currently have enough information to figure out that filenames and warnings need to be decoded in different ways. Only the Watchman server has enough context.

The solution

Protocol upgrade

Introduce a new version of BSER, called BSERv2. BSERv2 is the same as BSER, with the following changes:

  • Add a new type representing known-Unicode text encoded as UTF-8. Rationale: These strings should always be treated as Unicode strings, possibly with special treatment when they are ASCII-only.
  • The existing string type becomes a "bytestring" type. Rationale: Unicode-oblivious programs like Git and Mercurial want filenames as raw bytes.

All servers and clients that support BSERv2 should also support BSERv1.

Communication and negotiation

Every BSERv2 PDU should be prefixed with the magic string \x00\x02. Whether we need a full-fledged header is currently undecided.

BSERv1 doesn't have a header other than the magic string \x00\x01, so communication about whether a server supports BSERv2 must be done out-of-band.
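
As a rough illustration, a reader could tell the two versions apart by inspecting the first two bytes of a PDU. The function and constant names below are hypothetical, and the sketch assumes BSERv2 uses nothing more than the two-byte magic, which the plan leaves undecided.

```python
BSER_V1_MAGIC = b"\x00\x01"
BSER_V2_MAGIC = b"\x00\x02"  # proposed in this plan


def detect_bser_version(pdu):
    """Return 1 or 2 based on the PDU's magic prefix (hypothetical helper)."""
    magic = pdu[:2]
    if magic == BSER_V1_MAGIC:
        return 1
    if magic == BSER_V2_MAGIC:
        return 2
    raise ValueError("not a BSER PDU: unexpected magic %r" % (magic,))
```

This only identifies what a peer has sent. It does not tell a client whether the server can accept BSERv2, which is why that negotiation has to happen out-of-band.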

Watchman server

Inside Watchman's data structures:

  • Add a new type representing known-Unicode text.
  • Add a new type representing warnings and other messages, which can contain non-Unicode text. Rationale: This will be used for warnings and other error messages that are intended to be shown directly to the user. Some programs (like Mercurial) will want raw bytes, while others (like Buck) will want a Unicode string.
  • Retain the current "string" (now bytestring) type for filenames.

Allow clients to specify whether they want warnings as text or as bytes, via some to-be-determined mechanism.

  • If warnings are wanted as text, escape any bytes that are not valid UTF-8 so that the resulting string is valid UTF-8.
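
The plan does not specify an escaping scheme. As one possible illustration only, the Python sketch below uses the standard backslashreplace error handler (available for decoding since Python 3.5) to turn a warning containing arbitrary filename bytes into a string that is always valid Unicode. The helper name and the sample warning are hypothetical, and the real server would implement its own scheme in its own language.

```python
def warning_as_text(raw):
    """Decode a warning, escaping undecodable bytes as \\xNN (assumed scheme)."""
    return raw.decode("utf-8", errors="backslashreplace")


# A warning that embeds a Latin-1 filename, which is not valid UTF-8.
raw_warning = b"cannot stat caf\xe9.txt"
text_warning = warning_as_text(raw_warning)

# Unlike a surrogateescape result, this string encodes cleanly as UTF-8.
utf8_warning = text_warning.encode("utf-8")
```

An escape like this is meant for display rather than round-tripping: a literal backslash sequence already present in a filename is indistinguishable from an escaped byte.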

pywatchman on Python 2

While deserializing a BSER PDU:

  1. Continue to deserialize bytestrings as str (bytestrings) by default. Clients can optionally use the value_encoding and value_errors parameters to decode bytes to Unicode strings.
  2. If a text string is ASCII, keep it as a bytestring by default. Rationale: Python 2's unicode type is much less efficient than str, and no more correct for ASCII strings.
  3. If a text string is not ASCII, decode to Unicode by default.
  4. Make the text string behavior controllable via a flag. Allow settings for all_unicode, all_bytes, and ascii_bytes.
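
The four rules for text strings could be sketched as a single decoding helper. This is an assumption-laden illustration, not pywatchman's implementation: the function name and the mode parameter are hypothetical, and only the three setting names come from the plan. The code is written to run on both Python 2 and Python 3.

```python
def decode_text_string(raw, mode="ascii_bytes"):
    """Decode a BSERv2 text string according to the requested mode.

    all_bytes:   always return the UTF-8 bytes unchanged.
    all_unicode: always decode to a Unicode string.
    ascii_bytes: keep ASCII-only strings as bytes, decode everything else.
    """
    if mode not in ("all_unicode", "all_bytes", "ascii_bytes"):
        raise ValueError("unknown mode: %r" % (mode,))
    if mode == "all_bytes":
        return raw
    # bytearray yields integers on both Python 2 and Python 3.
    is_ascii = all(byte < 0x80 for byte in bytearray(raw))
    if mode == "ascii_bytes" and is_ascii:
        return raw
    return raw.decode("utf-8")
```

With the default ascii_bytes mode, callers get the cheaper bytestring representation in the common ASCII case and only pay for a Unicode object when the text actually needs one.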

pywatchman on Python 3

While deserializing a BSER PDU:

  1. Deserialize bytestrings as str (Unicode strings) with the local encoding and surrogateescape by default. Clients can optionally use the value_encoding and value_errors parameters to change this behavior.
  2. Decode all text strings to str by default. Allow settings for all_unicode and all_bytes.
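
A minimal sketch of the bytestring rule on Python 3 follows. The helper name is hypothetical, and using sys.getfilesystemencoding() as the "local encoding" is an assumption; only the value_encoding and value_errors parameter names come from the plan.

```python
import sys


def decode_bytestring(raw, value_encoding=None, value_errors=None):
    """Decode a BSER bytestring to str (hypothetical helper).

    Defaults to the local encoding with surrogateescape, so that any byte
    sequence decodes without error and can be encoded back losslessly.
    """
    encoding = value_encoding or sys.getfilesystemencoding()
    errors = value_errors or "surrogateescape"
    return raw.decode(encoding, errors)


# A Latin-1 filename decoded as UTF-8: the stray byte becomes U+DCE9.
escaped = decode_bytestring(b"caf\xe9", value_encoding="utf-8")

# A client that knows the real encoding can ask for it explicitly.
latin1 = decode_bytestring(b"caf\xe9", value_encoding="latin-1")
```

Passing value_errors="strict" would instead raise UnicodeDecodeError on undecodable names, which suits clients that prefer to fail loudly.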