Update libcudf developer guide for strings offsets column #15661

Merged
merged 11 commits into from
May 13, 2024
94 changes: 70 additions & 24 deletions cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -1,4 +1,4 @@
# libcudf C++ Developer Guide {#DEVELOPER_GUIDE}
# libcudf C++ Developer Guide

This document serves as a guide for contributors to libcudf C++ code. Developers should also refer
to these additional files for further documentation of libcudf best practices.
@@ -828,7 +828,7 @@ This iterator returns the validity of the underlying element (`true` or `false`)

The proliferation of data types supported by libcudf can result in long compile times. One area
where compile time was a problem is in types used to store indices, which can be any integer type.
The "Indexalator", or index-normalizing iterator (`include/cudf/detail/indexalator.cuh`), can be
The "indexalator", or index-normalizing iterator (`include/cudf/detail/indexalator.cuh`), can be
used for index types (integers) without requiring a type-specific instance. It can be used for any
iterator interface for reading an array of integer values of type `int8`, `int16`, `int32`,
`int64`, `uint8`, `uint16`, `uint32`, or `uint64`. Reading specific elements always returns a
@@ -856,6 +856,41 @@ thrust::lower_bound(rmm::exec_policy(stream),
thrust::less<Element>());
```

### Offset-normalizing iterators

Like the [indexalator](#index-normalizing-iterators),
the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be
used for offset column types (INT32 or INT64 only) without requiring a type-specific instance.
This is helpful when reading or building [strings columns](#strings-columns).
The normalized type is `int64`, which means an `input_offsetalator` returns `int64` values
for both INT32 and INT64 offsets columns.
Likewise, an `output_offsetalator` accepts `int64` values and stores them into either an
INT32 or INT64 output offsets column, whichever type the column was created with.

Use the `cudf::detail::offsetalator_factory` to create an appropriate input or output iterator from an offsets column_view.
Example input iterator usage:

```c++
// convert the sizes to offsets
auto [offsets, char_bytes] = cudf::strings::detail::make_offsets_child_column(
output_sizes.begin(), output_sizes.end(), stream, mr);
auto d_offsets =
cudf::detail::offsetalator_factory::make_input_iterator(offsets->view());
// use d_offsets to address the output row bytes
```

Example output iterator usage:

```c++
// create offsets column as either INT32 or INT64 depending on the number of bytes
auto offsets_column = cudf::strings::detail::create_offsets_child_column(total_bytes,
offsets_count,
stream, mr);
auto d_offsets =
cudf::detail::offsetalator_factory::make_output_iterator(offsets_column->mutable_view());
// write appropriate offset values to d_offsets
```
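
The normalization idea can be illustrated without any device code. The following is a minimal host-side sketch, not the libcudf implementation: `normalized_offsets_reader` and `row_size_bytes` are hypothetical names introduced here, and `std::vector` stands in for an offsets column. It only shows how one non-templated reader can return `int64` values for either storage width.

```c++
#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

// Hypothetical host-side stand-in for an input offsetalator (illustration only).
// It reads offsets stored as either int32_t or int64_t and always returns int64_t.
class normalized_offsets_reader {
 public:
  explicit normalized_offsets_reader(std::vector<std::int32_t> const& offsets)
    : data_(offsets.data()), size_(offsets.size())
  {
  }
  explicit normalized_offsets_reader(std::vector<std::int64_t> const& offsets)
    : data_(offsets.data()), size_(offsets.size())
  {
  }

  // Widen the stored value to the normalized type regardless of the storage width.
  std::int64_t operator[](std::size_t idx) const
  {
    return std::visit([idx](auto const* ptr) { return static_cast<std::int64_t>(ptr[idx]); },
                      data_);
  }

  std::size_t size() const { return size_; }

 private:
  std::variant<std::int32_t const*, std::int64_t const*> data_;
  std::size_t size_;
};

// Callers are written once against int64_t and never branch on the stored type.
inline std::int64_t row_size_bytes(normalized_offsets_reader const& offsets, std::size_t row)
{
  return offsets[row + 1] - offsets[row];
}
```

With offsets `{0, 3, 3, 8}` stored as either `int32_t` or `int64_t`, `row_size_bytes` returns the same `int64_t` results for both, which is the property the offsetalator provides for device code.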

## Namespaces

### External
@@ -1241,18 +1276,20 @@ This is related to [Arrow's "Variable-Size List" memory layout](https://arrow.ap

Strings are represented as a column with a data device buffer and a child offsets column.
The parent column's type is `STRING` and its data holds all the characters across all the strings packed together
but its size represents the number of strings in the column, and its null mask represents the
validity of each string. To summarize, the strings column children are:

1. A non-nullable column of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each
string in a dense data buffer of all characters.
but its size represents the number of strings in the column and its null mask represents the
validity of each string.

With this representation, `data[offsets[i]]` is the first character of string `i`, and the
size of string `i` is given by `offsets[i+1] - offsets[i]`. The following image shows an example of
this compound column representation of strings.
The strings column contains a single, non-nullable child column
of offset elements that indicates the byte position offset to the beginning of each
string in the dense data buffer of all characters. With this representation, `data[offsets[i]]` is the
first character of string `i`, and the size of string `i` is given by `offsets[i+1] - offsets[i]`.
The following image shows an example of this compound column representation of strings.

![strings](strings.png)

The type of the offsets column is either INT32 or INT64 depending on the number of bytes in the data buffer.
See [`cudf::string_view`](#cudfstrings_column_view-and-cudfstring_view) for more information on processing individual string rows.
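
The offsets arithmetic described above can be sketched on the host. This is an illustration only: `host_strings_column`, `make_host_strings_column`, and `row` are hypothetical names, host containers stand in for device buffers, the null mask is omitted, and offsets are always stored as `int64_t` for simplicity.

```c++
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical host-side model of the strings column layout (no null mask).
struct host_strings_column {
  std::string chars;                  // all characters of all rows packed together
  std::vector<std::int64_t> offsets;  // one entry per row plus a final end offset
};

// Pack the rows into one character buffer and record where each row begins.
inline host_strings_column make_host_strings_column(std::vector<std::string> const& rows)
{
  host_strings_column col;
  col.offsets.reserve(rows.size() + 1);
  col.offsets.push_back(0);
  for (auto const& s : rows) {
    col.chars += s;
    col.offsets.push_back(static_cast<std::int64_t>(col.chars.size()));
  }
  return col;
}

// Row i starts at chars[offsets[i]] and spans offsets[i + 1] - offsets[i] bytes.
inline std::string_view row(host_strings_column const& col, std::size_t i)
{
  auto const begin = col.offsets[i];
  auto const size  = col.offsets[i + 1] - begin;
  return std::string_view(col.chars.data() + begin, static_cast<std::size_t>(size));
}
```

In this sketch an empty row appears as two equal adjacent offsets. A null row would need the validity mask, which the sketch leaves out.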

## Structs columns

A struct is a nested data type with a set of child columns each representing an individual field
@@ -1351,46 +1388,55 @@ libcudf provides view types for nested column types as well as for the data elem

### cudf::strings_column_view and cudf::string_view

`cudf::strings_column_view` is a view of a strings column, like `cudf::column_view` is a view of
any `cudf::column`. `cudf::string_view` is a view of a single string, and therefore
`cudf::string_view` is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
data type for a `cudf::column` of type [`size_type`](#cudfsize_type). As its name implies, this is a
read-only object instance that points to device memory inside the strings column. It's lifespan is
the same (or less) as the column it views.
A `cudf::strings_column_view` wraps a strings column and contains a parent
`cudf::column_view` as a view of the strings column and an offsets `cudf::column_view`
which is a child of the parent.
The parent view contains the offset, size, and validity mask for the strings column.
The offsets view is non-nullable with `offset()==0` and its own size.
Because the offsets column type can be either INT32 or INT64, use the
offset-normalizing iterator ([offsetalator](#offset-normalizing-iterators)) to access individual offset values.

A `cudf::string_view` is a view of a single string and therefore
is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
data type for a `cudf::column` of type INT32. As its name implies, this is a
read-only object instance that points to device memory inside the strings column.
Its lifespan is the same (or less) as the column it views.
The size in bytes of an individual strings column row, and therefore of a `cudf::string_view`, must fit in a [`size_type`](#cudfsize_type).

Use the `column_device_view::element` method to access an individual row element. Like any other
column, do not call `element()` on a row that is null.

```c++
cudf::column_device_view d_strings;
cudf::strings_column_view scv;
auto d_strings = cudf::column_device_view::create(scv.parent(), stream);
...
if( d_strings.is_valid(row_index) ) {
string_view d_str = d_strings.element<string_view>(row_index);
...
}
```

A null string is not the same as an empty string. Use the `string_scalar` class if you need an
A null string is not the same as an empty string. Use the `cudf::string_scalar` class if you need an
instance of a class object to represent a null string.

The `string_view` contains comparison operators `<,>,==,<=,>=` that can be used in many cudf
functions like `sort` without string-specific code. The data for a `string_view` instance is
The `cudf::string_view` contains comparison operators `<,>,==,<=,>=` that can be used in many cudf
functions like `sort` without string-specific code. The data for a `cudf::string_view` instance is
required to be [UTF-8](#utf-8) and all operators and methods expect this encoding. Unless documented
otherwise, position and length parameters are specified in characters and not bytes. The class also
includes a `string_view::const_iterator` which can be used to navigate through individual characters
includes a `cudf::string_view::const_iterator` which can be used to navigate through individual characters
within the string.

`cudf::type_dispatcher` dispatches to the `string_view` data type when invoked on a `STRING` column.
`cudf::type_dispatcher` dispatches to the `cudf::string_view` data type when invoked on a `STRING` column.

#### UTF-8

The libcudf strings column only supports UTF-8 encoding for strings data.
[UTF-8](https://en.wikipedia.org/wiki/UTF-8) is a variable-length character encoding wherein each
character can be 1-4 bytes. This means the length of a string is not the same as its size in bytes.
For this reason, it is recommended to use the `string_view` class to access these characters for
For this reason, it is recommended to use the `cudf::string_view` class to access these characters for
most operations.

The `string_view.cuh` header also includes some utility methods for reading and writing
The `cudf/strings/detail/utf8.hpp` header also includes some utility methods for reading and writing
(`to_char_utf8/from_char_utf8`) individual UTF-8 characters to/from byte arrays.
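
To make the difference between characters and bytes concrete, here is a minimal host-side sketch. `count_utf8_characters` is a hypothetical helper written for this illustration, not a libcudf function, and it assumes the input is valid UTF-8.

```c++
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of libcudf): counts the characters in valid UTF-8
// input by counting every byte that is not a continuation byte (bit pattern 10xxxxxx).
inline std::size_t count_utf8_characters(std::string_view bytes)
{
  std::size_t count = 0;
  for (unsigned char const b : bytes) {
    if ((b & 0xC0) != 0x80) { ++count; }
  }
  return count;
}
```

For example, the string "héllo" encoded as `"h\xC3\xA9llo"` occupies 6 bytes but contains 5 characters, so character positions and byte positions diverge after the second character.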

### cudf::lists_column_view and cudf::lists_view