Text_view

A C++ Concepts based character encoding and code point enumeration library.

This project is the reference implementation for proposal P0244 for the C++ standard.

This port of Text_view requires a C++17 conforming compiler that implements ISO/IEC technical specification 19217:2015, C++ Extensions for concepts. A port of Text_view that builds with C++11 conforming compilers is available at Text_view for range-v3.

For discussion of this project, please post and/or subscribe to the text_view@googlegroups.com group hosted at https://groups.google.com/d/forum/text_view

Overview
Current features and limitations
Requirements
Build and installation
Usage
Supported Encodings
Terminology
References

Overview

C++11 added support for new character types (N2249) and Unicode string literals (N2442), but neither C++11 nor more recent standards have provided a means of efficiently and conveniently enumerating code points in Unicode or legacy encodings. While it is possible to implement such enumeration using interfaces provided in the standard <codecvt> library, doing so is awkward, requires that text be provided as pointers to contiguous memory, and is inefficient due to virtual function call overhead (examples and data required to back up these assertions).

Text_view provides iterator and range based interfaces for encoding and decoding strings in a variety of character encodings. The interface is intended to support all modern and legacy character encodings, though this library does not yet provide implementations for legacy encodings.

An example usage follows. Note that \u00F8 (LATIN SMALL LETTER O WITH STROKE) is encoded as UTF-8 using two code units (\xC3\xB8), but iterator based enumeration sees just the single code point.

using CT = utf8_encoding::character_type;
auto tv = make_text_view<utf8_encoding>(u8"J\u00F8erg is my friend");
auto it = tv.begin();
assert(*it++ == CT{0x004A}); // 'J'
assert(*it++ == CT{0x00F8}); // 'ø'
assert(*it++ == CT{0x0065}); // 'e'
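
To make the relationship between code units and code points concrete, the following standalone sketch decodes well-formed UTF-8 by hand. It does not use Text_view: decode_utf8 and decode_utf8_example are hypothetical helpers written only for illustration, and malformed input is not diagnosed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Text_view: decodes well-formed UTF-8
// into code points. Malformed input is NOT diagnosed in this sketch.
inline std::vector<char32_t> decode_utf8(const std::string& s) {
  std::vector<char32_t> out;
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    // Number of trailing code units implied by the leading code unit.
    const int trail = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    // Payload bits carried by the leading code unit.
    char32_t cp = trail == 0 ? lead : lead & (0x3F >> trail);
    // Each trailing code unit contributes six more payload bits.
    for (int k = 0; k < trail && i + 1 < s.size(); ++k) {
      ++i;
      cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    }
    ++i;
    out.push_back(cp);
  }
  return out;
}

// Usage: six code units decode to five code points.
inline void decode_utf8_example() {
  // The literal is split so that 'e' is not read as part of the \xB8 escape.
  const std::string text = "J\xC3\xB8" "erg";
  const std::vector<char32_t> cps = decode_utf8(text);
  assert(text.size() == 6);
  assert(cps.size() == 5);
  assert(cps[1] == 0x00F8);  // LATIN SMALL LETTER O WITH STROKE
}
```

Text_view performs this kind of decoding lazily, one code point per iterator increment, rather than materializing a container as this sketch does.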

The iterators and ranges that Text_view provides are compatible with the non-modifying sequence utilities provided by the standard C++ <algorithm> library. This enables use of standard algorithms to search encoded text.

it = std::find(tv.begin(), tv.end(), CT{0x00F8});
assert(it != tv.end());

The iterators provided by Text_view also provide access to the underlying code unit sequence.

auto base_it = it.base_range().begin();
assert(*base_it++ == '\xC3');
assert(*base_it++ == '\xB8');
assert(base_it == it.base_range().end());

Text_view ranges satisfy the requirements for use in range-based for statements, provided that the begin and end expressions are permitted to have different types, a relaxation proposed in P0184R0 and adopted for C++17.

for (const auto &ch : tv) {
  ...
}
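
The relaxation matters because a range may pair an iterator with a sentinel of a different type. The following standalone sketch, which needs a C++17 compiler, shows the idea with a NUL-terminated string. The types nul_sentinel, cstr_iterator, and cstr_range and the function count_code_units are hypothetical and are not part of Text_view.

```cpp
#include <cstddef>

// Hypothetical sentinel type: carries no data, only marks "end of string".
struct nul_sentinel {};

// Hypothetical iterator over the code units of a NUL-terminated string.
struct cstr_iterator {
  const char* p;
  constexpr char operator*() const { return *p; }
  constexpr cstr_iterator& operator++() { ++p; return *this; }
  // The sentinel is reached at the terminating NUL code unit.
  constexpr bool operator!=(nul_sentinel) const { return *p != '\0'; }
};

// Hypothetical range whose begin() and end() return different types.
struct cstr_range {
  const char* s;
  constexpr cstr_iterator begin() const { return cstr_iterator{s}; }
  constexpr nul_sentinel end() const { return nul_sentinel{}; }
};

constexpr std::size_t count_code_units(const char* s) {
  std::size_t n = 0;
  for (char cu : cstr_range{s}) {  // ill-formed before C++17
    (void)cu;
    ++n;
  }
  return n;
}

static_assert(count_code_units("J\xC3\xB8") == 3,
              "one code unit for 'J', two for U+00F8");
```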

Current features and limitations

Text_view provides interfaces for the following:

Encoding and decoding of text for the encodings listed in supported encodings.
Encoding text using C++11 compliant output iterators.
Decoding text using input, forward, bidirectional, and random access iterators that are compliant with standard iterator requirements as specified in the ranges proposal.
Constructing view adapters for encoded text stored in arrays, containers, or std::basic_string, or referenced by another range or view. These view adapters meet the requirements for views in the ranges proposal.

Text_view does not currently provide interfaces for the following:

Transcoding of code points from one character set to another.
Iterators for grapheme clusters or other boundary conditions.
Collation.
Localization.
Internationalization.
Unicode code point properties.
Unicode normalization.

Requirements

Text_view requires a C++ compiler that implements ISO/IEC technical specification 19217:2015, C++ Extensions for concepts. As of 2016-08-26, this specification is only supported by gcc release 6.2.0 or later. Additionally, Text_view depends on the cmcstl2 implementation of the ranges proposal for concept definitions.

Build and installation

This section provides instructions for building Text_view and suitable versions of its dependencies.

Building and installing gcc

Text_view requires gcc version 6.2.0 or later. If an installation of gcc 6.2.0 or later is already available, there is no need to build gcc yourself. Otherwise, the following commands can be used to perform a suitable build of the current in-development release of gcc on Linux.

$ svn co svn://gcc.gnu.org/svn/gcc/trunk gcc-trunk-src
$ curl -O ftp://ftp.gnu.org/gnu/gmp/gmp-5.1.1.tar.bz2
$ curl -O ftp://ftp.gnu.org/gnu/mpfr/mpfr-3.1.2.tar.bz2
$ curl -O ftp://ftp.gnu.org/gnu/mpc/mpc-1.0.1.tar.gz
$ cd gcc-trunk-src
$ svn update -r 234230  # Optional command to select a known good gcc version
$ bzip2 -d -c ../gmp-5.1.1.tar.bz2 | tar -xvf -
$ mv gmp-5.1.1 gmp
$ bzip2 -d -c ../mpfr-3.1.2.tar.bz2 | tar -xvf -
$ mv mpfr-3.1.2 mpfr
$ tar -zxvf ../mpc-1.0.1.tar.gz
$ mv mpc-1.0.1 mpc
$ cd ..
$ mkdir gcc-trunk-build
$ cd gcc-trunk-build
$ LIBRARY_PATH=/usr/lib/$(gcc -print-multiarch); export LIBRARY_PATH
$ CPATH=/usr/include/$(gcc -print-multiarch); export CPATH
$ ../gcc-trunk-src/configure \
  CC=gcc \
  CXX=g++ \
  --prefix $(pwd)/../gcc-trunk-install \
  --disable-multilib \
  --disable-bootstrap \
  --enable-languages=c,c++
$ make -j 4
$ make install
$ cd ..

When complete, the new gcc build will be present in the gcc-trunk-install directory.

Building and installing cmcstl2

Text_view depends only on headers provided by cmcstl2, so no build or installation of cmcstl2 is required. Text_view is known to build successfully with cmcstl2 git revision eb5ecdf79e22eb68c86cb62fd0912559593e5597. The following commands can be used to check out a known good revision.

$ git clone https://github.com/CaseyCarter/cmcstl2.git cmcstl2
$ cd cmcstl2
$ git checkout eb5ecdf79e22eb68c86cb62fd0912559593e5597

Building and installing Text_view

Text_view has a CMake based build system sufficient to build and run its tests, to validate example code, and to perform a minimal installation following established operating system conventions. By default, files will be installed under /usr/local on UNIX and UNIX-like systems, and under C:\Program Files on Windows. The installation location can be changed by invoking cmake with a -DCMAKE_INSTALL_PREFIX=<path> option. On UNIX and UNIX-like systems, header files will be installed in the include directory of the installation destination, and other files will be installed under share/text_view. On Windows, header files will be installed in the text_view\include directory of the installation destination, and other files will be installed under text_view.

Unless cmcstl2 is installed to a common location, it will be necessary to inform the build where it is installed. This is typically done by setting the CMCSTL2_INSTALL_PATH environment variable. As of this writing, cmcstl2 does not provide an installation option, so CMCSTL2_INSTALL_PATH should specify the location where the cmcstl2 source resides (the directory that contains the cmcstl2 include directory).

The following commands suffice to build and run tests and examples, and perform an installation. If the build succeeds, built test and example programs will be present in the test and examples subdirectories of the build directory (the built test and example programs are not installed), and header files, example code, cmake package configuration modules, and other miscellaneous files will be present in the installation directory.

$ vi setenv.sh  # Update GCC_INSTALL_PATH and CMCSTL2_INSTALL_PATH.
$ . ./setenv.sh
$ mkdir build
$ cd build
$ cmake .. [-DCMAKE_INSTALL_PREFIX=/path/to/install/to]
$ cmake --build . --target install
$ ctest

check and check-install CMake targets are also available for automating build and test. The check target performs a build without installation and then runs the tests. The check-install target performs a build, runs tests, installs to a location within the build directory, and then performs tests (verifying that example code builds) on the installation.

The installation includes a CMake based build system for building the example code. To build all of the examples, run cmake specifying the examples directory of the installation as the source directory. Alternatively, each example can be built independently by specifying its source directory as the source directory in a cmake invocation. If the installation was to a non-default installation location (-DCMAKE_INSTALL_PREFIX was specified), then it may be necessary to set CMAKE_PREFIX_PATH to the Text_view installation location (the location CMAKE_INSTALL_PREFIX was set to) or text_view_DIR to the directory containing the installed text_view-config.cmake file, so that the Text_view package configuration file is found. See the CMake documentation for more details.

The following commands suffice to build all of the installed examples.

$ cd /path/to/installation/text_view/examples
$ mkdir build
$ cd build
$ cmake .. [-DCMAKE_PREFIX_PATH=/path/to/installation]
$ cmake --build .
$ ctest

Usage

To use Text_view in your own code, perform a build and installation as described above, add include paths for the text_view/include and cmcstl2 installation locations, add a library search path for the text_view/lib directory, include the text_view header file in your sources, and link the text_view library with your executable.

#include <experimental/text_view>

Text_view installations include a CMake package configuration file suitable for use in CMake based projects. To use it, specify text_view as the <package> argument to find_package in your CMake file and add invocations of target_link_libraries for each relevant target with the <lib> argument set to text-view. This will automatically apply compiler and linker options required to use Text_view to each target. See the CMakeLists.txt files for the utilities under the examples directory for reference. If Text_view was installed to a non-default installation location (-DCMAKE_INSTALL_PREFIX was specified), then it may be necessary to set CMAKE_PREFIX_PATH to the Text_view installation location (the location CMAKE_INSTALL_PREFIX was set to) or text_view_DIR to the directory containing the installed text_view-config.cmake file, so that the Text_view package configuration file is found. It is also possible to use the build directory as a (non-relocatable) installation directory by setting the CMAKE_PREFIX_PATH or text_view_DIR variables appropriately. See the CMake documentation for more details. The CMakeLists.txt files provided with the installed examples exemplify a minimal CMake based build system for a downstream consumer of Text_view.

All interfaces intended for public use are declared in the std::experimental::text namespace. The text namespace is an inline namespace, so all entities are available from the std::experimental namespace itself.

The interface descriptions in the sections that follow use the concept names from the ranges proposal, are intended to be used as specification, and should be considered authoritative. Any differences in behavior as defined by these definitions as compared to the Text_view implementation are unintentional and should be considered indicative of a defect in either the specification or the implementation.

Header <experimental/text_view> synopsis

namespace std {
namespace experimental {
inline namespace text {

// concepts:
template<typename T> concept bool CodeUnit();
template<typename T> concept bool CodePoint();
template<typename T> concept bool CharacterSet();
template<typename T> concept bool Character();
template<typename T> concept bool CodeUnitIterator();
template<typename T, typename V> concept bool CodeUnitOutputIterator();
template<typename T> concept bool TextEncodingState();
template<typename T> concept bool TextEncodingStateTransition();
template<typename T> concept bool TextErrorPolicy();
template<typename T> concept bool TextEncoding();
template<typename T, typename I> concept bool TextEncoder();
template<typename T, typename I> concept bool TextForwardDecoder();
template<typename T, typename I> concept bool TextBidirectionalDecoder();
template<typename T, typename I> concept bool TextRandomAccessDecoder();
template<typename T> concept bool TextIterator();
template<typename T, typename I> concept bool TextSentinel();
template<typename T> concept bool TextOutputIterator();
template<typename T> concept bool TextInputIterator();
template<typename T> concept bool TextForwardIterator();
template<typename T> concept bool TextBidirectionalIterator();
template<typename T> concept bool TextRandomAccessIterator();
template<typename T> concept bool TextView();
template<typename T> concept bool TextInputView();
template<typename T> concept bool TextForwardView();
template<typename T> concept bool TextBidirectionalView();
template<typename T> concept bool TextRandomAccessView();

// error policies:
class text_error_policy;
class text_strict_error_policy;
class text_permissive_error_policy;
using text_default_error_policy = text_strict_error_policy;

// error handling:
enum class encode_status : int {
  no_error = /* implementation-defined */,
  invalid_character = /* implementation-defined */,
  invalid_state_transition = /* implementation-defined */
};
enum class decode_status : int {
  no_error = /* implementation-defined */,
  no_character = /* implementation-defined */,
  invalid_code_unit_sequence = /* implementation-defined */,
  underflow = /* implementation-defined */
};
constexpr inline bool status_ok(encode_status es) noexcept;
constexpr inline bool status_ok(decode_status ds) noexcept;
constexpr inline bool error_occurred(encode_status es) noexcept;
constexpr inline bool error_occurred(decode_status ds) noexcept;
const char* status_message(encode_status es) noexcept;
const char* status_message(decode_status ds) noexcept;

// exception classes:
class text_error;
class text_encode_error;
class text_decode_error;

// character sets:
class any_character_set;
class basic_execution_character_set;
class basic_execution_wide_character_set;
class unicode_character_set;

// implementation defined character set type aliases:
using execution_character_set = /* implementation-defined */ ;
using execution_wide_character_set = /* implementation-defined */ ;
using universal_character_set = /* implementation-defined */ ;

// character set identification:
class character_set_id;

template<CharacterSet CST>
  inline character_set_id get_character_set_id();

// character set information:
class character_set_info;

template<CharacterSet CST>
  inline const character_set_info& get_character_set_info();
const character_set_info& get_character_set_info(character_set_id id);

// character set and encoding traits:
template<typename T>
  using code_unit_type_t = /* implementation-defined */ ;
template<typename T>
  using code_point_type_t = /* implementation-defined */ ;
template<typename T>
  using character_set_type_t = /* implementation-defined */ ;
template<typename T>
  using character_type_t = /* implementation-defined */ ;
template<typename T>
  using encoding_type_t = /* implementation-defined */ ;
template<typename T>
  using default_encoding_type_t = /* implementation-defined */ ;

// characters:
template<CharacterSet CST> class character;
template <> class character<any_character_set>;

template<CharacterSet CST>
  bool operator==(const character<any_character_set> &lhs,
                  const character<CST> &rhs);
template<CharacterSet CST>
  bool operator==(const character<CST> &lhs,
                  const character<any_character_set> &rhs);
template<CharacterSet CST>
  bool operator!=(const character<any_character_set> &lhs,
                  const character<CST> &rhs);
template<CharacterSet CST>
  bool operator!=(const character<CST> &lhs,
                  const character<any_character_set> &rhs);

// encoding state and transition types:
class trivial_encoding_state;
class trivial_encoding_state_transition;
class utf8bom_encoding_state;
class utf8bom_encoding_state_transition;
class utf16bom_encoding_state;
class utf16bom_encoding_state_transition;
class utf32bom_encoding_state;
class utf32bom_encoding_state_transition;

// encodings:
class basic_execution_character_encoding;
class basic_execution_wide_character_encoding;
#if defined(__STDC_ISO_10646__)
class iso_10646_wide_character_encoding;
#endif // __STDC_ISO_10646__
class utf8_encoding;
class utf8bom_encoding;
class utf16_encoding;
class utf16be_encoding;
class utf16le_encoding;
class utf16bom_encoding;
class utf32_encoding;
class utf32be_encoding;
class utf32le_encoding;
class utf32bom_encoding;

// implementation defined encoding type aliases:
using execution_character_encoding = /* implementation-defined */ ;
using execution_wide_character_encoding = /* implementation-defined */ ;
using char8_character_encoding = /* implementation-defined */ ;
using char16_character_encoding = /* implementation-defined */ ;
using char32_character_encoding = /* implementation-defined */ ;

// itext_iterator:
template<TextEncoding ET,
         ranges::View VT,
         TextErrorPolicy TEP = text_default_error_policy>
  requires TextForwardDecoder<ET, /* implementation-defined */ >()
  class itext_iterator;

// itext_sentinel:
template<TextEncoding ET,
         ranges::View VT,
         TextErrorPolicy TEP = text_default_error_policy>
  class itext_sentinel;

// otext_iterator:
template<TextEncoding ET,
         CodeUnitOutputIterator<code_unit_type_t<ET>> CUIT,
         TextErrorPolicy TEP = text_default_error_policy>
  class otext_iterator;

// otext_iterator factory functions:
template<TextEncoding ET,
         TextErrorPolicy TEP,
         CodeUnitOutputIterator<code_unit_type_t<ET>> IT>
  auto make_otext_iterator(typename ET::state_type state, IT out)
  -> otext_iterator<ET, IT>;
template<TextEncoding ET,
         CodeUnitOutputIterator<code_unit_type_t<ET>> IT>
  auto make_otext_iterator(typename ET::state_type state, IT out)
  -> otext_iterator<ET, IT>;
template<TextEncoding ET,
         TextErrorPolicy TEP,
         CodeUnitOutputIterator<code_unit_type_t<ET>> IT>
  auto make_otext_iterator(IT out)
  -> otext_iterator<ET, IT>;
template<TextEncoding ET,
         CodeUnitOutputIterator<code_unit_type_t<ET>> IT>
  auto make_otext_iterator(IT out)
  -> otext_iterator<ET, IT>;

// basic_text_view:
template<TextEncoding ET,
         ranges::View VT,
         TextErrorPolicy TEP = text_default_error_policy>
  class basic_text_view;

// basic_text_view type aliases:
using text_view = basic_text_view<execution_character_encoding,
                                  /* implementation-defined */ >;
using wtext_view = basic_text_view<execution_wide_character_encoding,
                                   /* implementation-defined */ >;
using u8text_view = basic_text_view<char8_character_encoding,
                                    /* implementation-defined */ >;
using u16text_view = basic_text_view<char16_character_encoding,
                                     /* implementation-defined */ >;
using u32text_view = basic_text_view<char32_character_encoding,
                                     /* implementation-defined */ >;

// basic_text_view factory functions:
template<TextEncoding ET, ranges::InputIterator IT, ranges::Sentinel<IT> ST>
  auto make_text_view(typename ET::state_type state, IT first, ST last)
  -> basic_text_view<ET, /* implementation-defined */ >;
template<ranges::InputIterator IT, ranges::Sentinel<IT> ST>
  requires requires () {
    typename default_encoding_type_t<ranges::value_type_t<IT>>;
  }
  auto make_text_view(typename default_encoding_type_t<ranges::value_type_t<IT>>::state_type state,
                      IT first,
                      ST last)
  -> basic_text_view<default_encoding_type_t<ranges::value_type_t<IT>>, /* implementation-defined */ >;
template<TextEncoding ET, ranges::InputIterator IT, ranges::Sentinel<IT> ST>
  auto make_text_view(IT first, ST last)
  -> basic_text_view<ET, /* implementation-defined */ >;
template<ranges::InputIterator IT, ranges::Sentinel<IT> ST>
  requires requires () {
    typename default_encoding_type_t<ranges::value_type_t<IT>>;
  }
  auto make_text_view(IT first, ST last)
  -> basic_text_view<default_encoding_type_t<ranges::value_type_t<IT>>, /* implementation-defined */ >;
template<TextEncoding ET, ranges::ForwardIterator IT>
  auto make_text_view(typename ET::state_type state,
                      IT first,
                      typename std::make_unsigned<ranges::difference_type_t<IT>>::type n)
  -> basic_text_view<ET, /* implementation-defined */ >;
template<ranges::ForwardIterator IT>
  requires requires () {
    typename default_encoding_type_t<ranges::value_type_t<IT>>;
  }
  auto make_text_view(typename default_encoding_type_t<ranges::value_type_t<IT>>::state_type state,
                      IT first,
                      typename std::make_unsigned<ranges::difference_type_t<IT>>::type n)
  -> basic_text_view<default_encoding_type_t<ranges::value_type_t<IT>>, /* implementation-defined */ >;
template<TextEncoding ET, ranges::ForwardIterator IT>
  auto make_text_view(IT first,
                      typename std::make_unsigned<ranges::difference_type_t<IT>>::type n)
  -> basic_text_view<ET, /* implementation-defined */ >;
template<ranges::ForwardIterator IT>
  requires requires () {
    typename default_encoding_type_t<ranges::value_type_t<IT>>;
  }
  auto make_text_view(IT first,
                      typename std::make_unsigned<ranges::difference_type_t<IT>>::type n)
  -> basic_text_view<default_encoding_type_t<ranges::value_type_t<IT>>, /* implementation-defined */ >;
template<TextEncoding ET, ranges::InputRange Iterable>
  auto make_text_view(typename ET::state_type state,
                      const Iterable &iterable)
  -> basic_text_view<ET, /* implementation-defined */ >;
template<ranges::InputRange Iterable>
  requires requires () {
    typename default_encoding_type_t<ranges::value_type_t<ranges::iterator_t<Iterable>>>;
  }
  auto make_text_view(typename default_encoding_type_t<ranges::value_type_t<ranges::iterator_t<Iterable>>>::state_type state,
                      const Iterable &iterable)
  -> basic_text_view<default_encoding_type_t<ranges::value_type_t<ranges::iterator_t<Iterable>>>, /* implementation-defined */ >;
template<TextEncoding ET, ranges::InputRange Iterable>
  auto make_text_view(const Iterable &iterable)
  -> basic_text_view<ET, /* implementation-defined */ >;
template<ranges::InputRange Iterable>
  requires requires () {
    typename default_encoding_type_t<ranges::value_type_t<ranges::iterator_t<Iterable>>>;
  }
  auto make_text_view(const Iterable &iterable)
  -> basic_text_view<default_encoding_type_t<ranges::value_type_t<ranges::iterator_t<Iterable>>>, /* implementation-defined */ >;
template<TextInputIterator TIT, TextSentinel<TIT> TST>
  auto make_text_view(TIT first, TST last)
  -> basic_text_view<encoding_type_t<TIT>, /* implementation-defined */ >;
template<TextView TVT>
  TVT make_text_view(TVT tv);

} // inline namespace text
} // namespace experimental
} // namespace std

Concepts

Concept CodeUnit
Concept CodePoint
Concept CharacterSet
Concept Character
Concept CodeUnitIterator
Concept CodeUnitOutputIterator
Concept TextEncodingState
Concept TextEncodingStateTransition
Concept TextErrorPolicy
Concept TextEncoding
Concept TextEncoder
Concept TextForwardDecoder
Concept TextBidirectionalDecoder
Concept TextRandomAccessDecoder
Concept TextIterator
Concept TextSentinel
Concept TextOutputIterator
Concept TextInputIterator
Concept TextForwardIterator
Concept TextBidirectionalIterator
Concept TextRandomAccessIterator
Concept TextView
Concept TextInputView
Concept TextForwardView
Concept TextBidirectionalView
Concept TextRandomAccessView

Concept CodeUnit

The CodeUnit concept specifies requirements for a type usable as the code unit type of a string type.

template<typename T> concept bool CodeUnit() {
  return /* implementation-defined */ ;
}

CodeUnit<T>() is satisfied if and only if std::is_integral<T>::value is true and at least one of std::is_unsigned<T>::value is true, std::is_same<std::remove_cv_t<T>, char>::value is true, or std::is_same<std::remove_cv_t<T>, wchar_t>::value is true.
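
The predicate above can be approximated without Concepts TS support as an ordinary type trait. The following is a minimal sketch; is_code_unit is a hypothetical name and is not provided by Text_view.

```cpp
#include <type_traits>

// Hypothetical trait mirroring the CodeUnit predicate described above:
// an integral type that is unsigned, or is (possibly cv-qualified) char
// or wchar_t.
template<typename T>
struct is_code_unit : std::integral_constant<bool,
    std::is_integral<T>::value
    && (std::is_unsigned<T>::value
        || std::is_same<typename std::remove_cv<T>::type, char>::value
        || std::is_same<typename std::remove_cv<T>::type, wchar_t>::value)> {};

static_assert(is_code_unit<char>::value, "char is accepted by name");
static_assert(is_code_unit<const wchar_t>::value, "cv-qualifiers are ignored");
static_assert(is_code_unit<unsigned char>::value, "unsigned integral");
static_assert(is_code_unit<char16_t>::value, "char16_t is unsigned");
static_assert(!is_code_unit<signed char>::value, "signed and not char");
static_assert(!is_code_unit<int>::value, "signed integral");
static_assert(!is_code_unit<float>::value, "not integral");
```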

Concept CodePoint

The CodePoint concept specifies requirements for a type usable as the code point type of a character set type.

template<typename T> concept bool CodePoint() {
  return /* implementation-defined */ ;
}

CodePoint<T>() is satisfied if and only if std::is_integral<T>::value is true and at least one of std::is_unsigned<T>::value is true, std::is_same<std::remove_cv_t<T>, char>::value is true, or std::is_same<std::remove_cv_t<T>, wchar_t>::value is true.

Concept CharacterSet

The CharacterSet concept specifies requirements for a type that describes a character set. Such a type has a member typedef-name declaration for a type that satisfies CodePoint, a static member function that returns a name for the character set, and a static member function that returns a code point value to be used to construct a substitution character to stand in when errors occur during encoding and decoding operations when the permissive error policy is in effect.

template<typename T> concept bool CharacterSet() {
  return CodePoint<code_point_type_t<T>>()
      && requires () {
           { T::get_name() } noexcept -> const char *;
           { T::get_substitution_code_point() } noexcept -> code_point_type_t<T>;
         };
}
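
As an illustration, a type shaped as follows would be expected to provide what CharacterSet asks for. This is a standalone sketch: example_unicode_character_set is a hypothetical type, and it assumes that code_point_type_t<T> resolves to a member type named code_point_type. The functions are marked constexpr only so that the checks can run at compile time; the concept itself requires only noexcept.

```cpp
#include <type_traits>

// Hypothetical character set type, written for illustration only.
struct example_unicode_character_set {
  // Assumption: the code point type is exposed under this member name.
  using code_point_type = char32_t;

  static constexpr const char* get_name() noexcept {
    return "Unicode";
  }

  // U+FFFD REPLACEMENT CHARACTER stands in for undecodable input.
  static constexpr code_point_type get_substitution_code_point() noexcept {
    return 0xFFFD;
  }
};

using ECS = example_unicode_character_set;

// Compile-time checks corresponding to the requirements in the concept.
static_assert(std::is_same<decltype(ECS::get_name()), const char*>::value,
              "get_name() returns const char*");
static_assert(std::is_same<decltype(ECS::get_substitution_code_point()),
                           ECS::code_point_type>::value,
              "substitution code point has the code point type");
static_assert(noexcept(ECS::get_name())
              && noexcept(ECS::get_substitution_code_point()),
              "both functions are noexcept");
```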

Concept Character

The Character concept specifies requirements for a type that describes a character as defined by an associated character set. Non-static member functions provide access to the code point value of the described character. Types that satisfy Character are regular and copyable.

template<typename T> concept bool Character() {
  return ranges::Regular<T>()
      && ranges::Constructible<T, code_point_type_t<character_set_type_t<T>>>()
      && CharacterSet<character_set_type_t<T>>()
      && requires (T t,
                   const T ct,
                   code_point_type_t<character_set_type_t<T>> cp)
         {
           { t.set_code_point(cp) } noexcept;
           { ct.get_code_point() } noexcept
               -> code_point_type_t<character_set_type_t<T>>;
           { ct.get_character_set_id() }
               -> character_set_id;
         };
}

Concept CodeUnitIterator

The CodeUnitIterator concept specifies requirements of an iterator that has a value type that satisfies CodeUnit.

template<typename T> concept bool CodeUnitIterator() {
  return ranges::Iterator<T>()
      && CodeUnit<ranges::value_type_t<T>>();
}

Concept CodeUnitOutputIterator

The CodeUnitOutputIterator concept specifies requirements of an output iterator that can be assigned from a type that satisfies CodeUnit.

template<typename T, typename V> concept bool CodeUnitOutputIterator() {
  return ranges::OutputIterator<T, V>()
      && CodeUnit<V>();
}

Concept TextEncodingState

The TextEncodingState concept specifies requirements of types that hold encoding state. Such types are semiregular.

template<typename T> concept bool TextEncodingState() {
  return ranges::Semiregular<T>();
}

Concept TextEncodingStateTransition

The TextEncodingStateTransition concept specifies requirements of types that hold encoding state transitions. Such types are semiregular.

template<typename T> concept bool TextEncodingStateTransition() {
  return ranges::Semiregular<T>();
}

Concept TextErrorPolicy

The TextErrorPolicy concept specifies requirements of types used to specify error handling policies. Such types are semiregular class types that derive from class text_error_policy.

template<typename T> concept bool TextErrorPolicy() {
  return ranges::Semiregular<T>()
      && ranges::DerivedFrom<T, text_error_policy>()
      && !ranges::Same<std::remove_cv_t<T>, text_error_policy>();
}
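
The same check can be approximated with standard type traits. In the sketch below, the three policy classes are local stand-ins declared so that the example compiles on its own, and is_text_error_policy is a hypothetical trait. The standard traits used here only approximate ranges::Semiregular and ranges::DerivedFrom.

```cpp
#include <type_traits>

// Local stand-ins for the library's policy classes (illustration only).
struct text_error_policy {};
struct text_strict_error_policy : text_error_policy {};
struct text_permissive_error_policy : text_error_policy {};

// Hypothetical trait: default constructible, copyable, derived from
// text_error_policy, and not text_error_policy itself.
template<typename T>
struct is_text_error_policy : std::integral_constant<bool,
    std::is_default_constructible<T>::value
    && std::is_copy_constructible<T>::value
    && std::is_copy_assignable<T>::value
    && std::is_base_of<text_error_policy, T>::value
    && !std::is_same<typename std::remove_cv<T>::type,
                     text_error_policy>::value> {};

static_assert(is_text_error_policy<text_strict_error_policy>::value,
              "derived policies qualify");
static_assert(is_text_error_policy<text_permissive_error_policy>::value,
              "derived policies qualify");
static_assert(!is_text_error_policy<text_error_policy>::value,
              "the base class itself is excluded");
static_assert(!is_text_error_policy<int>::value,
              "unrelated types are excluded");
```

Excluding the base class means that a policy must always be named by one of its concrete derived types.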

Concept TextEncoding

The TextEncoding concept specifies requirements of types that define an encoding. Such types define member types that identify the code unit, character, encoding state, and encoding state transition types, a static member function that returns an initial encoding state object that defines the encoding state at the beginning of a sequence of encoded characters, and static data members that specify the minimum and maximum number of code units used to encode any single character.

template<typename T> concept bool TextEncoding() {
  return requires () {
           { T::min_code_units } noexcept -> int;
           { T::max_code_units } noexcept -> int;
         }
      && TextEncodingState<typename T::state_type>()
      && TextEncodingStateTransition<typename T::state_transition_type>()
      && CodeUnit<code_unit_type_t<T>>()
      && Character<character_type_t<T>>()
      && requires () {
           { T::initial_state() } noexcept
               -> const typename T::state_type&;
         };
}

Concept TextEncoder

The TextEncoder concept specifies requirements of types that are used to encode characters using a particular code unit iterator that satisfies OutputIterator. Such a type satisfies TextEncoding and defines static member functions used to encode state transitions and characters.

template<typename T, typename CUIT> concept bool TextEncoder() {
  return TextEncoding<T>()
      && ranges::OutputIterator<CUIT, code_unit_type_t<T>>()
      && requires (
           typename T::state_type &state,
           CUIT &out,
           typename T::state_transition_type stt,
           int &encoded_code_units)
         {
           { T::encode_state_transition(state, out, stt, encoded_code_units) }
             -> encode_status;
         }
      && requires (
           typename T::state_type &state,
           CUIT &out,
           character_type_t<T> c,
           int &encoded_code_units)
         {
           { T::encode(state, out, c, encoded_code_units) }
             -> encode_status;
         };
}
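
To show the shape of an encode function, the following standalone sketch encodes one code point as UTF-8 through an output iterator. It is not the library's implementation: encode_status and empty_state are local stand-ins, encode_utf8 and encode_utf8_example are hypothetical names, and a bare char32_t is used in place of a character type.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Local stand-ins for illustration; not the library's definitions.
enum class encode_status { no_error, invalid_character };
struct empty_state {};

// Hypothetical free function with the same parameter order as
// T::encode(state, out, c, encoded_code_units).
template<typename CUIT>
encode_status encode_utf8(empty_state&, CUIT& out, char32_t cp,
                          int& encoded_code_units) {
  encoded_code_units = 0;
  // Surrogates and values above U+10FFFF cannot be encoded.
  if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
    return encode_status::invalid_character;
  // Writes one code unit and counts it.
  auto put = [&](char32_t cu) {
    *out++ = static_cast<char>(cu);
    ++encoded_code_units;
  };
  if (cp < 0x80) {
    put(cp);
  } else if (cp < 0x800) {
    put(0xC0 | (cp >> 6));
    put(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    put(0xE0 | (cp >> 12));
    put(0x80 | ((cp >> 6) & 0x3F));
    put(0x80 | (cp & 0x3F));
  } else {
    put(0xF0 | (cp >> 18));
    put(0x80 | ((cp >> 12) & 0x3F));
    put(0x80 | ((cp >> 6) & 0x3F));
    put(0x80 | (cp & 0x3F));
  }
  return encode_status::no_error;
}

// Usage: U+00F8 is written as the two code units C3 B8.
inline void encode_utf8_example() {
  std::string code_units;
  auto out = std::back_inserter(code_units);
  empty_state state;
  int n = 0;
  const encode_status es = encode_utf8(state, out, 0x00F8, n);
  (void)es;
  assert(es == encode_status::no_error);
  assert(n == 2);
  assert(code_units == "\xC3\xB8");
}
```

Reporting the number of code units written through a reference parameter keeps the return value free for the status.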

Concept TextForwardDecoder

The TextForwardDecoder concept specifies requirements of types that are used to decode characters using a particular code unit iterator that satisfies ForwardIterator. Such a type satisfies TextEncoding and defines a static member function used to decode state transitions and characters.

template<typename T, typename CUIT> concept bool TextForwardDecoder() {
  return TextEncoding<T>()
      && ranges::ForwardIterator<CUIT>()
      && ranges::ConvertibleTo<ranges::value_type_t<CUIT>,
                               code_unit_type_t<T>>()
      && requires (
           typename T::state_type &state,
           CUIT &in_next,
           CUIT in_end,
           character_type_t<T> &c,
           int &decoded_code_units)
         {
           { T::decode(state, in_next, in_end, c, decoded_code_units) }
             -> decode_status;
         };
}
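
The decode signature can be illustrated with a standalone UTF-16 decoder that advances an iterator and reports a status. This is a sketch under stated assumptions, not the library's implementation: decode_status and empty_state are local stand-ins, decode_utf16 and decode_utf16_example are hypothetical names, a bare char32_t replaces the character type, and the choice of when to report underflow is an assumption of this example.

```cpp
#include <cassert>

// Local stand-ins for illustration; not the library's definitions.
enum class decode_status { no_error, invalid_code_unit_sequence, underflow };
struct empty_state {};

// Hypothetical free function with the same parameter order as
// T::decode(state, in_next, in_end, c, decoded_code_units).
template<typename CUIT>
decode_status decode_utf16(empty_state&, CUIT& in_next, CUIT in_end,
                           char32_t& c, int& decoded_code_units) {
  decoded_code_units = 0;
  if (in_next == in_end)
    return decode_status::underflow;  // assumption: no input is underflow
  const char32_t lead = static_cast<char16_t>(*in_next++);
  ++decoded_code_units;
  if (lead < 0xD800 || lead > 0xDFFF) {
    c = lead;  // a single code unit outside the surrogate range
    return decode_status::no_error;
  }
  if (lead >= 0xDC00)
    return decode_status::invalid_code_unit_sequence;  // lone low surrogate
  if (in_next == in_end)
    return decode_status::underflow;  // high surrogate with nothing after it
  const char32_t trail = static_cast<char16_t>(*in_next);
  if (trail < 0xDC00 || trail > 0xDFFF)
    return decode_status::invalid_code_unit_sequence;  // trail not consumed
  ++in_next;
  ++decoded_code_units;
  c = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
  return decode_status::no_error;
}

// Usage: the surrogate pair D83D DE00 decodes to U+1F600.
inline void decode_utf16_example() {
  const char16_t text[] = { 0xD83D, 0xDE00, 0x0041 };
  const char16_t* next = text;
  const char16_t* end = text + 3;
  empty_state state;
  char32_t c = 0;
  int n = 0;
  decode_status ds = decode_utf16(state, next, end, c, n);
  assert(ds == decode_status::no_error);
  assert(c == 0x1F600);
  assert(n == 2);
  ds = decode_utf16(state, next, end, c, n);
  assert(ds == decode_status::no_error);
  assert(c == 0x0041);
  assert(n == 1);
  (void)ds;
}
```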

Concept TextBidirectionalDecoder

The TextBidirectionalDecoder concept specifies requirements of types that are used to decode characters using a particular code unit iterator that satisfies BidirectionalIterator. Such a type satisfies TextForwardDecoder and defines a static member function used to decode state transitions and characters in the reverse order of their encoding.

template<typename T, typename CUIT> concept bool TextBidirectionalDecoder() {
  return TextForwardDecoder<T, CUIT>()
      && ranges::BidirectionalIterator<CUIT>()
      && requires (
           typename T::state_type &state,
           CUIT &in_next,
           CUIT in_end,
           character_type_t<T> &c,
           int &decoded_code_units)
         {
           { T::rdecode(state, in_next, in_end, c, decoded_code_units) }
             -> decode_status;
         };
}

Concept TextRandomAccessDecoder

The TextRandomAccessDecoder concept specifies requirements of types that are used to decode characters using a particular code unit iterator that satisfies RandomAccessIterator. Such a type satisfies TextBidirectionalDecoder, requires that the minimum and maximum number of code units used to encode any character have the same value, and that the encoding state be an empty type.

template<typename T, typename CUIT> concept bool TextRandomAccessDecoder() {
  return TextBidirectionalDecoder<T, CUIT>()
      && ranges::RandomAccessIterator<CUIT>()
      && T::min_code_units == T::max_code_units
      && std::is_empty<typename T::state_type>::value;
}
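
The last two conjuncts of this concept can be checked in isolation with standard C++ alone. The sketch below assumes three hypothetical encoding types and a hypothetical has_random_access_layout trait; none of these names are provided by Text_view, and the iterator and decoder requirements are deliberately left out.

```cpp
#include <type_traits>

namespace sketch {

struct empty_state {};
struct shift_state { int active_charset; };     // hypothetical non-empty state

// Three hypothetical encodings that differ only in the relevant members.
struct fixed_width_stateless {
  using state_type = empty_state;
  static constexpr int min_code_units = 1;
  static constexpr int max_code_units = 1;
};
struct variable_width_stateless {
  using state_type = empty_state;
  static constexpr int min_code_units = 1;
  static constexpr int max_code_units = 4;
};
struct fixed_width_stateful {
  using state_type = shift_state;
  static constexpr int min_code_units = 2;
  static constexpr int max_code_units = 2;
};

// Mirrors the fixed-width and empty-state conditions of the concept.
template<typename T>
struct has_random_access_layout
  : std::integral_constant<bool,
        T::min_code_units == T::max_code_units
        && std::is_empty<typename T::state_type>::value> {};

static_assert(has_random_access_layout<fixed_width_stateless>::value,
              "fixed width and stateless: eligible");
static_assert(!has_random_access_layout<variable_width_stateless>::value,
              "variable width: not eligible");
static_assert(!has_random_access_layout<fixed_width_stateful>::value,
              "non-empty state: not eligible");

} // namespace sketch
```

Both conditions matter for the same reason: jumping to the n-th character by arithmetic is only possible when every character occupies the same number of code units and no state has to be carried forward from earlier code units.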

Concept TextIterator

The TextIterator concept specifies requirements of iterator types that are used to encode and decode characters as an encoded sequence of code units. Encoding state and error indication are held in each iterator instance and are made accessible via non-static member functions.

template<typename T> concept bool TextIterator() {
  return ranges::Iterator<T>()
      && TextEncoding<encoding_type_t<T>>()
      && TextErrorPolicy<typename T::error_policy>()
      && TextEncodingState<typename T::state_type>()
      && requires (const T ct) {
           { ct.state() } noexcept
               -> const typename encoding_type_t<T>::state_type&;
           { ct.error_occurred() } noexcept
               -> bool;
         };
}

Concept TextSentinel

The TextSentinel concept specifies requirements of types that are used to mark the end of a range of encoded characters. A type T that satisfies TextIterator also satisfies TextSentinel<T>, thereby enabling TextIterator types to be used as sentinels.

template<typename T, typename I> concept bool TextSentinel() {
  return ranges::Sentinel<T, I>()
      && TextIterator<I>()
      && TextErrorPolicy<typename T::error_policy>();
}

Concept TextOutputIterator

The TextOutputIterator concept refines TextIterator with a requirement that the type also satisfy ranges::OutputIterator for the character type of the associated encoding and that a member function be provided for retrieving error information.

template<typename T> concept bool TextOutputIterator() {
  return TextIterator<T>()
      && ranges::OutputIterator<T, character_type_t<encoding_type_t<T>>>()
      && requires (const T ct) {
           { ct.get_error() } noexcept
               -> encode_status;
         };
}

Concept TextInputIterator

The TextInputIterator concept refines TextIterator with requirements that the type also satisfy ranges::InputIterator, that the iterator value type satisfy Character, and that a member function be provided for retrieving error information.

template<typename T> concept bool TextInputIterator() {
  return TextIterator<T>()
      && ranges::InputIterator<T>()
      && Character<ranges::value_type_t<T>>()
      && requires (const T ct) {
           { ct.get_error() } noexcept
               -> decode_status;
         };
}

Concept TextForwardIterator

The TextForwardIterator concept refines TextInputIterator with a requirement that the type also satisfy ranges::ForwardIterator.

template<typename T> concept bool TextForwardIterator() {
  return TextInputIterator<T>()
      && ranges::ForwardIterator<T>();
}

Concept TextBidirectionalIterator

The TextBidirectionalIterator concept refines TextForwardIterator with a requirement that the type also satisfy ranges::BidirectionalIterator.

template<typename T> concept bool TextBidirectionalIterator() {
  return TextForwardIterator<T>()
      && ranges::BidirectionalIterator<T>();
}

Concept TextRandomAccessIterator

The TextRandomAccessIterator concept refines TextBidirectionalIterator with a requirement that the type also satisfy ranges::RandomAccessIterator.

template<typename T> concept bool TextRandomAccessIterator() {
  return TextBidirectionalIterator<T>()
      && ranges::RandomAccessIterator<T>();
}

Concept TextView

The TextView concept specifies requirements of types that provide view access to an underlying code unit range. Such types satisfy ranges::View, provide iterators that satisfy TextIterator, and define member types that identify the encoding, encoding state, and underlying code unit range and iterator types. Non-static member functions are provided to access the underlying code unit range and initial encoding state.

Types that satisfy TextView do not own the underlying code unit range and are copyable in constant time. The lifetime of the underlying range must exceed the lifetime of referencing TextView objects.

template<typename T> concept bool TextView() {
  return ranges::View<T>()
      && TextIterator<ranges::iterator_t<T>>()
      && TextEncoding<encoding_type_t<T>>()
      && ranges::View<typename T::view_type>()
      && TextErrorPolicy<typename T::error_policy>()
      && TextEncodingState<typename T::state_type>()
      && CodeUnitIterator<code_unit_iterator_t<T>>()
      && requires (T t, const T ct) {
           { ct.base() } noexcept
               -> const typename T::view_type&;
           { ct.initial_state() } noexcept
               -> const typename T::state_type&;
         };
}

Concept TextInputView

The TextInputView concept refines TextView with a requirement that the view's iterator type also satisfy TextInputIterator.

template<typename T> concept bool TextInputView() {
  return TextView<T>()
      && TextInputIterator<ranges::iterator_t<T>>();
}

Concept TextForwardView

The TextForwardView concept refines TextInputView with a requirement that the view's iterator type also satisfy TextForwardIterator.

template<typename T> concept bool TextForwardView() {
  return TextInputView<T>()
      && TextForwardIterator<ranges::iterator_t<T>>();
}

Concept TextBidirectionalView

The TextBidirectionalView concept refines TextForwardView with a requirement that the view's iterator type also satisfy TextBidirectionalIterator.

template<typename T> concept bool TextBidirectionalView() {
  return TextForwardView<T>()
      && TextBidirectionalIterator<ranges::iterator_t<T>>();
}

Concept TextRandomAccessView

The TextRandomAccessView concept refines TextBidirectionalView with a requirement that the view's iterator type also satisfy TextRandomAccessIterator.

template<typename T> concept bool TextRandomAccessView() {
  return TextBidirectionalView<T>()
      && TextRandomAccessIterator<ranges::iterator_t<T>>();
}

Class text_error_policy

Class text_error_policy is a base class from which all text error policy classes must derive.

class text_error_policy {};

Class text_strict_error_policy

The text_strict_error_policy class is a policy class that specifies that exceptions be thrown for errors that occur during encoding and decoding operations initiated through text iterators. This class satisfies TextErrorPolicy.

class text_strict_error_policy : public text_error_policy {};

Class text_permissive_error_policy

The text_permissive_error_policy class is a policy class that specifies that substitution characters, such as the Unicode replacement character U+FFFD, be used in place of errors that occur during encoding and decoding operations initiated through text iterators. This class satisfies TextErrorPolicy.

class text_permissive_error_policy : public text_error_policy {};

Alias text_default_error_policy

The text_default_error_policy alias specifies the default text error policy. Conforming implementations must alias this to text_strict_error_policy, but may have options to select an alternative default policy for environments that do not support exceptions. The referred class shall satisfy TextErrorPolicy.

using text_default_error_policy = text_strict_error_policy;
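
One way empty policy classes like these can drive behavior is overload selection on the policy type. The following sketch assumes a hypothetical handle_decode_error helper, a hypothetical is_error_policy trait, and a hypothetical throws_on_error function; Text_view's text iterators apply the policy internally, and this is not a description of how they do so.

```cpp
#include <cassert>
#include <stdexcept>
#include <type_traits>

namespace sketch {

class text_error_policy {};
class text_strict_error_policy : public text_error_policy {};
class text_permissive_error_policy : public text_error_policy {};
using text_default_error_policy = text_strict_error_policy;

// Hypothetical trait: accepts classes derived from the policy base
// (std::is_base_of also accepts the base class itself).
template<typename P>
struct is_error_policy : std::is_base_of<text_error_policy, P> {};

// Hypothetical helper, selected by overload on the policy tag.
// Strict: report the error by throwing.
inline char32_t handle_decode_error(text_strict_error_policy) {
  throw std::runtime_error("invalid code unit sequence");
}
// Permissive: substitute the Unicode replacement character U+FFFD.
inline char32_t handle_decode_error(text_permissive_error_policy) noexcept {
  return 0xFFFD;
}

// Returns true if handling an error under policy P throws.
template<typename P>
bool throws_on_error() {
  static_assert(is_error_policy<P>::value,
                "P must derive from text_error_policy");
  try {
    handle_decode_error(P{});
  } catch (const std::runtime_error &) {
    return true;
  }
  return false;
}

// Usage: the default policy is strict, so errors surface as exceptions.
inline void usage() {
  assert(throws_on_error<text_default_error_policy>());
  assert(!throws_on_error<text_permissive_error_policy>());
}

} // namespace sketch
```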

Error Status

Enum encode_status
Enum decode_status
status_ok
error_occurred
status_message

Enum encode_status

The encode_status enumeration type defines enumerators used to report errors that occur during text encoding operations.

The no_error enumerator indicates that no error has occurred.

The invalid_character enumerator indicates that an attempt was made to encode a character that was not valid for the encoding.

The invalid_state_transition enumerator indicates that an attempt was made to encode a state transition that was not valid for the encoding.

enum class encode_status : int {
  no_error = /* implementation-defined */,
  invalid_character = /* implementation-defined */,
  invalid_state_transition = /* implementation-defined */
};

Enum decode_status

The decode_status enumeration type defines enumerators used to report errors that occur during text decoding operations.

The no_error enumerator indicates that no error has occurred.

The no_character enumerator indicates that no error has occurred, but that no character was decoded for a code unit sequence. This typically indicates that the code unit sequence represents an encoding state transition such as for an escape sequence or byte order marker.

The invalid_code_unit_sequence enumerator indicates that an attempt was made to decode an invalid code unit sequence.

The underflow enumerator indicates that the end of the input range was encountered before a complete code unit sequence was decoded.

enum class decode_status : int {
  no_error = /* implementation-defined */,
  no_character = /* implementation-defined */,
  invalid_code_unit_sequence = /* implementation-defined */,
  underflow = /* implementation-defined */
};

status_ok

The status_ok function returns true if the encode_status argument value is encode_status::no_error, or if the decode_status argument value is either decode_status::no_error or decode_status::no_character. It returns false for all other values.

constexpr inline bool status_ok(encode_status es) noexcept;
constexpr inline bool status_ok(decode_status ds) noexcept;

error_occurred

The error_occurred function returns false if the encode_status argument value is encode_status::no_error, or if the decode_status argument value is either decode_status::no_error or decode_status::no_character. It returns true for all other values.

constexpr inline bool error_occurred(encode_status es) noexcept;
constexpr inline bool error_occurred(decode_status ds) noexcept;
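
The relationship between status_ok and error_occurred can be sketched as follows. The enumerator values are placeholders (the real values are implementation-defined), and count_failures is a hypothetical caller added only for illustration.

```cpp
#include <cassert>

namespace sketch {

// Placeholder enumerator values; the library leaves them implementation-defined.
enum class encode_status : int {
  no_error, invalid_character, invalid_state_transition
};
enum class decode_status : int {
  no_error, no_character, invalid_code_unit_sequence, underflow
};

constexpr bool status_ok(encode_status es) noexcept {
  return es == encode_status::no_error;
}
constexpr bool status_ok(decode_status ds) noexcept {
  // no_character counts as success: a state transition was consumed.
  return ds == decode_status::no_error || ds == decode_status::no_character;
}

// error_occurred is the exact complement of status_ok.
constexpr bool error_occurred(encode_status es) noexcept {
  return !status_ok(es);
}
constexpr bool error_occurred(decode_status ds) noexcept {
  return !status_ok(ds);
}

// Hypothetical caller: count the failures in a batch of decode results.
inline int count_failures(const decode_status *first,
                          const decode_status *last) {
  int failures = 0;
  for (; first != last; ++first) {
    assert(error_occurred(*first) != status_ok(*first));
    if (error_occurred(*first))
      ++failures;
  }
  return failures;
}

} // namespace sketch
```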

status_message

The status_message function returns a pointer to a statically allocated string containing a short description of the value of the encode_status or decode_status argument.

const char* status_message(encode_status es) noexcept;
const char* status_message(decode_status ds) noexcept;

Exceptions

Class text_error
Class text_encode_error
Class text_decode_error

Class text_error

The text_error class defines the base class for the types of objects thrown as exceptions to report errors detected during text processing.

class text_error : public std::runtime_error
{
public:
  using std::runtime_error::runtime_error;
};

Class text_encode_error

The text_encode_error class defines the types of objects thrown as exceptions to report errors detected during encoding of a character. Objects of this type are generally thrown in response to an attempt to encode a character with an invalid code point value, or to encode an invalid state transition.

class text_encode_error : public text_error
{
public:
  explicit text_encode_error(encode_status es) noexcept;

  const encode_status& status_code() const noexcept;

private:
  encode_status es; // exposition only
};

Class text_decode_error

The text_decode_error class defines the types of objects thrown as exceptions to report errors detected during decoding of a code unit sequence. Objects of this type are generally thrown in response to an attempt to decode an ill-formed code unit sequence, a code unit sequence that specifies an invalid code point value, or a code unit sequence that specifies an invalid state transition.

class text_decode_error : public text_error
{
public:
  explicit text_decode_error(decode_status ds) noexcept;

  const decode_status& status_code() const noexcept;

private:
  decode_status ds; // exposition only
};
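
A minimal sketch of how such an exception might be thrown and inspected follows. The class definitions mirror the declarations above but are stand-ins: the message text and the helper functions decode_or_throw and try_decode are assumptions, the enumerator values are placeholders, and the constructor is not marked noexcept here.

```cpp
#include <cassert>
#include <stdexcept>

namespace sketch {

enum class decode_status : int {
  no_error, no_character, invalid_code_unit_sequence, underflow
};

class text_error : public std::runtime_error {
public:
  using std::runtime_error::runtime_error;
};

class text_decode_error : public text_error {
public:
  // The message text is a placeholder chosen for this sketch.
  explicit text_decode_error(decode_status status)
    : text_error("text decode error"), ds(status) {}

  const decode_status& status_code() const noexcept { return ds; }

private:
  decode_status ds;
};

// Hypothetical decoding step that fails on truncated input.
inline void decode_or_throw(bool truncated) {
  if (truncated)
    throw text_decode_error(decode_status::underflow);
}

// Returns the status carried by the exception, or no_error if none is thrown.
inline decode_status try_decode(bool truncated) {
  try {
    decode_or_throw(truncated);
  } catch (const text_decode_error &e) {
    return e.status_code();
  }
  return decode_status::no_error;
}

// Usage: the thrown object carries the status that caused it.
inline void usage() {
  assert(try_decode(false) == decode_status::no_error);
  assert(try_decode(true) == decode_status::underflow);
}

} // namespace sketch
```

Because text_decode_error derives from text_error, which in turn derives from std::runtime_error, a handler for either base class also catches it; catching the most derived type is only needed when status_code() is wanted.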

Type traits

code_unit_type_t
code_point_type_t
character_set_type_t
character_type_t
encoding_type_t
default_encoding_type_t

code_unit_type_t

The code_unit_type_t type alias template provides convenient means for selecting the associated code unit type of some other type, such as an encoding type that satisfies TextEncoding. The aliased type is the same as typename T::code_unit_type.

template<typename T>
  using code_unit_type_t = /* implementation-defined */ ;

code_point_type_t

The code_point_type_t type alias template provides convenient means for selecting the associated code point type of some other type, such as a type that satisfies CharacterSet or Character. The aliased type is the same as typename T::code_point_type.

template<typename T>
  using code_point_type_t = /* implementation-defined */ ;

character_set_type_t

The character_set_type_t type alias template provides convenient means for selecting the associated character set type of some other type, such as a type that satisfies Character. The aliased type is the same as typename T::character_set_type.

template<typename T>
  using character_set_type_t = /* implementation-defined */ ;

character_type_t

The character_type_t type alias template provides convenient means for selecting the associated character type of some other type, such as a type that satisfies TextEncoding. The aliased type is the same as typename T::character_type.

template<typename T>
  using character_type_t = /* implementation-defined */ ;

encoding_type_t

The encoding_type_t type alias template provides convenient means for selecting the associated encoding type of some other type, such as a type that satisfies TextIterator or TextView. The aliased type is the same as typename T::encoding_type.

template<typename T>
  using encoding_type_t = /* implementation-defined */ ;

default_encoding_type_t

The default_encoding_type_t type alias template resolves to the default encoding type, if any, for a given type, such as a type that satisfies CodeUnit. Specializations are provided for the following cv-unqualified and reference removed fundamental types. Otherwise, the alias will attempt to resolve against a default_encoding_type member type.

When `std::remove_cv_t<std::remove_reference_t<T>>` is ...	the default encoding is ...
`char`	`execution_character_encoding`
`wchar_t`	`execution_wide_character_encoding`
`char16_t`	`char16_character_encoding`
`char32_t`	`char32_character_encoding`

template<typename T>
  using default_encoding_type_t = /* implementation-defined */ ;
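
One possible way to implement a trait with this behavior is a class template with explicit specializations for the fundamental types and a primary template that falls back to the member type. The sketch below uses empty placeholder types in place of the library's encoding classes; the helper template default_encoding and the type my_code_unit are hypothetical, and the actual definition in Text_view is implementation-defined.

```cpp
#include <type_traits>

namespace sketch {

// Empty placeholders standing in for the library's encoding classes.
struct execution_character_encoding {};
struct execution_wide_character_encoding {};
struct char16_character_encoding {};
struct char32_character_encoding {};

// Primary template: fall back to a default_encoding_type member type.
template<typename T>
struct default_encoding {
  using type = typename T::default_encoding_type;
};

// Specializations for the fundamental types listed in the table.
template<> struct default_encoding<char> {
  using type = execution_character_encoding;
};
template<> struct default_encoding<wchar_t> {
  using type = execution_wide_character_encoding;
};
template<> struct default_encoding<char16_t> {
  using type = char16_character_encoding;
};
template<> struct default_encoding<char32_t> {
  using type = char32_character_encoding;
};

// Strip references and cv-qualifiers before the lookup.
template<typename T>
using default_encoding_type_t = typename default_encoding<
    typename std::remove_cv<
        typename std::remove_reference<T>::type>::type>::type;

// Hypothetical user-defined code unit type that names its own default.
struct my_code_unit {
  using default_encoding_type = char32_character_encoding;
};

static_assert(std::is_same<default_encoding_type_t<const char&>,
                           execution_character_encoding>::value,
              "cv-qualifiers and references are ignored");
static_assert(std::is_same<default_encoding_type_t<my_code_unit>,
                           char32_character_encoding>::value,
              "member type is used as the fallback");

} // namespace sketch
```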

Character sets

Class any_character_set
Class basic_execution_character_set
Class basic_execution_wide_character_set
Class unicode_character_set
Character set type aliases

Class any_character_set

The any_character_set class provides a generic character set type used when a specific character set type is unknown or when the ability to switch between specific character sets is required. This class satisfies the CharacterSet concept and has an implementation defined code_point_type that is able to represent code point values from all of the implementation provided character set types. The code point returned by get_substitution_code_point is implementation defined.

class any_character_set {
public:
  using code_point_type = /* implementation-defined */;

  static const char* get_name() noexcept {
    return "any_character_set";
  }

  static constexpr code_point_type get_substitution_code_point() noexcept;
};

Class basic_execution_character_set

The basic_execution_character_set class represents the basic execution character set specified in [lex.charset]p3 of the C++11 standard. This class satisfies the CharacterSet concept and has a code_point_type member type that aliases char. The code point returned by get_substitution_code_point is the code point for the '?' character.

class basic_execution_character_set {
public:
  using code_point_type = char;

  static const char* get_name() noexcept {
    return "basic_execution_character_set";
  }

  static constexpr code_point_type get_substitution_code_point() noexcept;
};

Class basic_execution_wide_character_set

The basic_execution_wide_character_set class represents the basic execution wide character set specified in [lex.charset]p3 of the C++11 standard. This class satisfies the CharacterSet concept and has a code_point_type member type that aliases wchar_t. The code point returned by get_substitution_code_point is the code point for the L'?' character.

class basic_execution_wide_character_set {
public:
  using code_point_type = wchar_t;

  static const char* get_name() noexcept {
    return "basic_execution_wide_character_set";
  }

  static constexpr code_point_type get_substitution_code_point() noexcept;
};

Class unicode_character_set

The unicode_character_set class represents the Unicode character sets. This class satisfies the CharacterSet concept and has a code_point_type member type that aliases char32_t. The code point returned by get_substitution_code_point is the U+FFFD Unicode replacement character.

class unicode_character_set {
public:
  using code_point_type = char32_t;

  static const char* get_name() noexcept {
    return "unicode_character_set";
  }

  static constexpr code_point_type get_substitution_code_point() noexcept;
};

Character set type aliases

The execution_character_set, execution_wide_character_set, and universal_character_set type aliases reflect the implementation defined execution, wide execution, and universal character sets specified in [lex.charset]p2-3 of the C++ standard.

The character set aliased by execution_character_set must be a superset of the basic_execution_character_set character set. This alias refers to the character set that the compiler assumes during translation; the character set that the compiler uses when translating characters specified by universal-character-name designators in ordinary string literals, not the locale sensitive run-time execution character set.

The character set aliased by execution_wide_character_set must be a superset of the basic_execution_wide_character_set character set. This alias refers to the character set that the compiler assumes during translation; the character set that the compiler uses when translating characters specified by universal-character-name designators in wide string literals, not the locale sensitive run-time execution wide character set.

The character set aliased by universal_character_set must be a superset of the unicode_character_set character set.

using execution_character_set = /* implementation-defined */ ;
using execution_wide_character_set = /* implementation-defined */ ;
using universal_character_set = /* implementation-defined */ ;

Character set identification

Class character_set_id
get_character_set_id

Class character_set_id

The character_set_id class provides unique, opaque values used to identify character sets at run-time. Values of this type are produced by get_character_set_id() and can be passed to get_character_set_info() to obtain character set information. Values of this type are copy constructible, copy assignable, equality comparable, and strictly totally ordered.

class character_set_id {
public:
  character_set_id() = delete;

  friend bool operator==(character_set_id lhs, character_set_id rhs) noexcept;
  friend bool operator!=(character_set_id lhs, character_set_id rhs) noexcept;

  friend bool operator<(character_set_id lhs, character_set_id rhs) noexcept;
  friend bool operator>(character_set_id lhs, character_set_id rhs) noexcept;
  friend bool operator<=(character_set_id lhs, character_set_id rhs) noexcept;
  friend bool operator>=(character_set_id lhs, character_set_id rhs) noexcept;
};

get_character_set_id

get_character_set_id() returns a unique, opaque value for the character set type specified by the template parameter.

template<CharacterSet CST>
  inline character_set_id get_character_set_id();
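
A common technique for producing a unique value per type is to take the address of a function-local static object inside a function template, since each instantiation has its own object. The sketch below assumes that technique; it is not a statement about how Text_view implements get_character_set_id(). It uses an unconstrained typename in place of the CharacterSet concept, and charset_a and charset_b are hypothetical character set types.

```cpp
#include <cassert>
#include <functional>

namespace sketch {

class character_set_id {
public:
  character_set_id() = delete;

  friend bool operator==(character_set_id lhs, character_set_id rhs) noexcept {
    return lhs.tag == rhs.tag;
  }
  friend bool operator!=(character_set_id lhs, character_set_id rhs) noexcept {
    return !(lhs == rhs);
  }
  // std::less yields a total order even for unrelated pointers.
  friend bool operator<(character_set_id lhs, character_set_id rhs) noexcept {
    return std::less<const void*>()(lhs.tag, rhs.tag);
  }

private:
  explicit character_set_id(const void *t) noexcept : tag(t) {}
  const void *tag;

  // Only the factory function may construct IDs.
  template<typename CST>
  friend character_set_id get_character_set_id();
};

template<typename CST>
inline character_set_id get_character_set_id() {
  static const char tag = 0;     // one distinct object per instantiation
  return character_set_id(&tag);
}

struct charset_a {};             // hypothetical character set types
struct charset_b {};

// Usage: same type gives equal IDs, different types give distinct IDs.
inline void usage() {
  const character_set_id a1 = get_character_set_id<charset_a>();
  const character_set_id a2 = get_character_set_id<charset_a>();
  const character_set_id b = get_character_set_id<charset_b>();
  assert(a1 == a2);
  assert(a1 != b);
  assert((a1 < b) != (b < a1));  // exactly one ordering holds
}

} // namespace sketch
```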

Character set information

Class character_set_info
get_character_set_info

Class character_set_info

The character_set_info class stores information about a character set. Values of this type are produced by the get_character_set_info() functions based on a character set type or ID.

class character_set_info {
public:
  character_set_info() = delete;

  character_set_id get_id() const noexcept;

  const char* get_name() const noexcept;

private:
  character_set_id id; // exposition only
};

get_character_set_info

The get_character_set_info() functions return a reference to a character_set_info object based on a character set type or ID.

const character_set_info& get_character_set_info(character_set_id id);

template<CharacterSet CST>
  inline const character_set_info& get_character_set_info();

Characters

Class template character

Class template character

Objects whose type is a specialization of the character class template define a character by associating a code point value with a character set. The specialization provided for the any_character_set type is used to maintain a dynamic character set association, while specializations for other character sets specify a static association. These types satisfy the Character concept and are default constructible, copy constructible, copy assignable, and equality comparable. Member functions provide access to the code point and character set ID values for the represented character. Default constructed objects represent a null character using a zero initialized code point value.

Objects with different character set types are not equality comparable, with the exception that objects with a static character set type of any_character_set are comparable with objects of any static character set type. In this case, objects compare equal if and only if their character set ID and code point values match. Equality comparison between objects with different static character set types is not implemented, to avoid potentially costly unintended implicit transcoding between character sets.

template<CharacterSet CST>
class character {
public:
  using character_set_type = CST;
  using code_point_type = code_point_type_t<character_set_type>;

  character() = default;
  explicit character(code_point_type code_point) noexcept;

  friend bool operator==(const character &lhs,
                         const character &rhs) noexcept;
  friend bool operator!=(const character &lhs,
                         const character &rhs) noexcept;

  void set_code_point(code_point_type code_point) noexcept;
  code_point_type get_code_point() const noexcept;

  static character_set_id get_character_set_id();

private:
  code_point_type code_point; // exposition only
};

template<>
class character<any_character_set> {
public:
  using character_set_type = any_character_set;
  using code_point_type = code_point_type_t<character_set_type>;

  character() = default;
  explicit character(code_point_type code_point) noexcept;
  character(character_set_id cs_id, code_point_type code_point) noexcept;

  friend bool operator==(const character &lhs,
                         const character &rhs) noexcept;
  friend bool operator!=(const character &lhs,
                         const character &rhs) noexcept;

  void set_code_point(code_point_type code_point) noexcept;
  code_point_type get_code_point() const noexcept;

  void set_character_set_id(character_set_id new_cs_id) noexcept;
  character_set_id get_character_set_id() const noexcept;

private:
  character_set_id cs_id;     // exposition only
  code_point_type code_point; // exposition only
};

template<CharacterSet CST>
  bool operator==(const character<any_character_set> &lhs,
                  const character<CST> &rhs);
template<CharacterSet CST>
  bool operator==(const character<CST> &lhs,
                  const character<any_character_set> &rhs);
template<CharacterSet CST>
  bool operator!=(const character<any_character_set> &lhs,
                  const character<CST> &rhs);
template<CharacterSet CST>
  bool operator!=(const character<CST> &lhs,
                  const character<any_character_set> &rhs);
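
The comparison rules can be illustrated with a reduced sketch. It is not the library's implementation: a plain int stands in for character_set_id, charset_a and charset_b are hypothetical character sets that expose a made-up id() function, only one direction of the mixed operator== is shown, and the CharacterSet constraint is replaced by an unconstrained typename.

```cpp
#include <cassert>

namespace sketch {

// An int stands in for character_set_id to keep the sketch short.
using character_set_id = int;

// Hypothetical character sets, each with a distinct ID.
struct charset_a {
  using code_point_type = char32_t;
  static constexpr character_set_id id() { return 1; }
};
struct charset_b {
  using code_point_type = char32_t;
  static constexpr character_set_id id() { return 2; }
};
struct any_character_set { using code_point_type = char32_t; };

// Static association: the character set is part of the type.
template<typename CST>
class character {
public:
  using code_point_type = typename CST::code_point_type;
  character() = default;
  explicit character(code_point_type cp) noexcept : code_point(cp) {}
  friend bool operator==(const character &lhs,
                         const character &rhs) noexcept {
    return lhs.code_point == rhs.code_point;
  }
  code_point_type get_code_point() const noexcept { return code_point; }
  static character_set_id get_character_set_id() { return CST::id(); }
private:
  code_point_type code_point = 0;   // null character by default
};

// Dynamic association: the character set ID is stored per object.
template<>
class character<any_character_set> {
public:
  using code_point_type = char32_t;
  character() = default;
  character(character_set_id id, code_point_type cp) noexcept
    : cs_id(id), code_point(cp) {}
  code_point_type get_code_point() const noexcept { return code_point; }
  character_set_id get_character_set_id() const noexcept { return cs_id; }
private:
  character_set_id cs_id = 0;
  code_point_type code_point = 0;
};

// Mixed comparison: equal only if both the ID and the code point match.
template<typename CST>
bool operator==(const character<any_character_set> &lhs,
                const character<CST> &rhs) {
  return lhs.get_character_set_id() == rhs.get_character_set_id()
      && lhs.get_code_point() == rhs.get_code_point();
}

// Usage: the same code point compares unequal across character sets.
inline void usage() {
  const character<charset_a> c(0x00F8);
  const character<any_character_set> same(charset_a::id(), 0x00F8);
  const character<any_character_set> other(charset_b::id(), 0x00F8);
  assert(same == c);
  assert(!(other == c));
}

} // namespace sketch
```

In this sketch, comparing a character<charset_a> with a character<charset_b> does not compile because no operator== accepts that pair, which matches the intent of avoiding implicit transcoding.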

Encodings

Class trivial_encoding_state
Class trivial_encoding_state_transition
Class basic_execution_character_encoding
Class basic_execution_wide_character_encoding
Class iso_10646_wide_character_encoding
Class utf8_encoding
Class utf8bom_encoding
Class utf16_encoding
Class utf16be_encoding
Class utf16le_encoding
Class utf16bom_encoding
Class utf32_encoding
Class utf32be_encoding
Class utf32le_encoding
Class utf32bom_encoding
Encoding type aliases

Class trivial_encoding_state

The trivial_encoding_state class is an empty class used by stateless encodings to implement the parts of the generic encoding interfaces necessary to support stateful encodings.

class trivial_encoding_state {};

Class trivial_encoding_state_transition

The trivial_encoding_state_transition class is an empty class used by stateless encodings to implement the parts of the generic encoding interfaces necessary to support stateful encodings that support non-code-point encoding code unit sequences.

class trivial_encoding_state_transition {};

Class basic_execution_character_encoding

The basic_execution_character_encoding class implements support for the encoding used for ordinary string literals limited to support for the basic execution character set as defined in [lex.charset]p3 of the C++ standard.

This encoding is trivial, stateless, fixed width, supports random access decoding, and has a code unit of type char.