From 2dd924a9e9030bdaa287b9c23cd5bbed6a7dce68 Mon Sep 17 00:00:00 2001
From: David Wendt <dwendt@nvidia.com>
Date: Mon, 6 May 2024 11:05:20 -0400
Subject: [PATCH 1/5] Update libcudf developer guide for strings offsets column

---
 .../developer_guide/DEVELOPER_GUIDE.md        | 76 ++++++++++++++-----
 1 file changed, 57 insertions(+), 19 deletions(-)
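
The normalization described in the new "Offset-normalizing iterators" section
can be illustrated outside libcudf. What follows is a minimal host-side sketch
in standard C++17, not the offsetalator implementation. The names
`offsets_storage`, `read_offset`, and `row_bytes` are hypothetical, and a
`std::variant` stands in for an INT32 or INT64 offsets column. It only shows
the idea that callers always see 64-bit values regardless of the stored width.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

// Hypothetical stand-in for an offsets child column: 32-bit or 64-bit values.
using offsets_storage = std::variant<std::vector<std::int32_t>, std::vector<std::int64_t>>;

// Reads element `idx` and widens it to int64, whatever the stored width is.
inline std::int64_t read_offset(offsets_storage const& offsets, std::size_t idx)
{
  return std::visit(
    [idx](auto const& values) {
      assert(idx < values.size());
      return static_cast<std::int64_t>(values[idx]);
    },
    offsets);
}

// Size in bytes of row `row`, computed as offsets[row + 1] - offsets[row].
inline std::int64_t row_bytes(offsets_storage const& offsets, std::size_t row)
{
  return read_offset(offsets, row + 1) - read_offset(offsets, row);
}
```

Code written against `read_offset` does not need a separate instantiation per
offset width, which is the compile-time benefit the guide attributes to the
normalizing iterators.
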
diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
index 05f8e4585cc..650b288e20f 100644
--- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
+++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -828,7 +828,7 @@ This iterator returns the validity of the underlying element (`true` or `false`)
 
 The proliferation of data types supported by libcudf can result in long compile times. One area
 where compile time was a problem is in types used to store indices, which can be any integer type.
-The "Indexalator", or index-normalizing iterator (`include/cudf/detail/indexalator.cuh`), can be
+The "indexalator", or index-normalizing iterator (`include/cudf/detail/indexalator.cuh`), can be
 used for index types (integers) without requiring a type-specific instance. It can be used for any
 iterator interface for reading an array of integer values of type `int8`, `int16`, `int32`,
 `int64`, `uint8`, `uint16`, `uint32`, or `uint64`. Reading specific elements always returns a
@@ -856,6 +856,36 @@ thrust::lower_bound(rmm::exec_policy(stream),
                     thrust::less<Element>());
 ```
 
+### Offset-normalizing iterators
+
+Like the [indexalator](#index-normalizing_iterators),
+the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be
+used for offsets types (INT32 or INT64 only) without requiring a type-specific instance.
+This helpful when reading or building strings columns. The normalized type is int64 which means
+an `input_offsetsalator` will return int64 type values for both INT32 and INT64 offsets columns.
+Likewise, an `output_offselator` can accept int64 type values to store into either an
+INT32 or INT64 output offsets column created appropriately.
+
+Use the `offsetalator_factory` to create an appropriate input or output iterator from an offsets column_view.
+Example input iterator usage:
+
+```c++
+  // convert the sizes to offsets
+  auto [offsets, char_bytes] = cudf::strings::detail::make_offsets_child_column(
+    output_sizes.begin(), output_sizes.end(), stream, mr);
+  auto d_offsets =
+    cudf::detail::offsetalator_factory::make_input_iterator(offsets->view());
+  // use d_offsets to address the output row bytes
+```
+
+Example output iterator usage:
+
+```c++
+    auto d_offsets =
+      cudf::detail::offsetalator_factory::make_output_iterator(offsets_column->mutable_view());
+    // write appropriate offset values to d_offsets
+```
+
 ## Namespaces
 
 ### External
@@ -1242,17 +1272,18 @@ This is related to [Arrow's "Variable-Size List" memory layout](https://arrow.ap
 Strings are represented as a column with a data device buffer and a child offsets column.
 The parent column's type is `STRING` and its data holds all the characters across all the strings packed together
 but its size represents the number of strings in the column, and its null mask represents the
-validity of each string. To summarize, the strings column children are:
+validity of each string.
 
-1. A non-nullable column of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each
-   string in a dense data buffer of all characters.
-
-With this representation, `data[offsets[i]]` is the first character of string `i`, and the
-size of string `i` is given by `offsets[i+1] - offsets[i]`. The following image shows an example of
-this compound column representation of strings.
+The strings column contains a single child column which is a non-nullable column
+of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each
+string in a dense data buffer of all characters. With this representation, `data[offsets[i]]` is the
+first character of string `i`, and the size of string `i` is given by `offsets[i+1] - offsets[i]`.
+The following image shows an example of this compound column representation of strings.
 
 ![strings](strings.png)
 
+The type of the offsets column is either INT32 or INT64 depending on the number of bytes in the data buffer.
+
 ## Structs columns
 
 A struct is a nested data type with a set of child columns each representing an individual field
@@ -1351,12 +1382,19 @@ libcudf provides view types for nested column types as well as for the data elem
 
 ### cudf::strings_column_view and cudf::string_view
 
-`cudf::strings_column_view` is a view of a strings column, like `cudf::column_view` is a view of
-any `cudf::column`. `cudf::string_view` is a view of a single string, and therefore
-`cudf::string_view` is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
-data type for a `cudf::column` of type [`size_type`](#cudfsize_type). As its name implies, this is a
-read-only object instance that points to device memory inside the strings column. It's lifespan is
-the same (or less) as the column it views.
+A `cudf::strings_column_view` wraps a strings column which contains a parent
+`cudf::column_view` as a view the strings column and an offsets child `cudf::column_view`.
+The parent column contains the offset, size, and validity mask for the strings column.
+The offsets column is non-nullable with an `offset()==0` and its own size.
+Since the offset column type can be either INT32 or INT64 it is useful to use the
+offset normalizing iterators [offsetalator](#offset-normalizing_iterators) to access individual offset values.
+
+A `cudf::string_view` is a view of a single string and therefore
+is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
+data type for a `cudf::column` of type INT32. As its name implies, this is a
+read-only object instance that points to device memory inside the strings column.
+It's lifespan is the same (or less) as the column it views.
+An individual strings column row and a `cudf::string_view` is limited to [`size_type`](#cudfsize_type) bytes.
 
 Use the `column_device_view::element` method to access an individual row element. Like any other
 column, do not call `element()` on a row that is null.
@@ -1370,14 +1408,14 @@ column, do not call `element()` on a row that is null.
    }
 ```
 
-A null string is not the same as an empty string. Use the `string_scalar` class if you need an
+A null string is not the same as an empty string. Use the `cudf::string_scalar` class if you need an
 instance of a class object to represent a null string.
 
-The `string_view` contains comparison operators `<,>,==,<=,>=` that can be used in many cudf
-functions like `sort` without string-specific code. The data for a `string_view` instance is
+The `cudf::string_view` contains comparison operators `<,>,==,<=,>=` that can be used in many cudf
+functions like `sort` without string-specific code. The data for a `cudf::string_view` instance is
 required to be [UTF-8](#utf-8) and all operators and methods expect this encoding. Unless documented
 otherwise, position and length parameters are specified in characters and not bytes. The class also
-includes a `string_view::const_iterator` which can be used to navigate through individual characters
+includes a `cudf::string_view::const_iterator` which can be used to navigate through individual characters
 within the string.
 
 `cudf::type_dispatcher` dispatches to the `string_view` data type when invoked on a `STRING` column.
@@ -1387,7 +1425,7 @@ within the string.
 The libcudf strings column only supports UTF-8 encoding for strings data.
 [UTF-8](https://en.wikipedia.org/wiki/UTF-8) is a variable-length character encoding wherein each
 character can be 1-4 bytes. This means the length of a string is not the same as its size in bytes.
-For this reason, it is recommended to use the `string_view` class to access these characters for
+For this reason, it is recommended to use the `cudf::string_view` class to access these characters for
 most operations.
 
 The `string_view.cuh` header also includes some utility methods for reading and writing

From dd674582fb33baa6d6d300e98cd9d4566a6d4d6a Mon Sep 17 00:00:00 2001
From: David Wendt <dwendt@nvidia.com>
Date: Wed, 8 May 2024 09:28:16 -0400
Subject: [PATCH 2/5] Clarify offsetalator and strings_column_view wording

---
 .../developer_guide/DEVELOPER_GUIDE.md        | 34 ++++++++++---------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
index 650b288e20f..e7d359a7c43 100644
--- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
+++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -1,4 +1,4 @@
-# libcudf C++ Developer Guide {#DEVELOPER_GUIDE}
+# libcudf C++ Developer Guide
 
 This document serves as a guide for contributors to libcudf C++ code. Developers should also refer
 to these additional files for further documentation of libcudf best practices.
@@ -860,13 +860,13 @@ thrust::lower_bound(rmm::exec_policy(stream),
 
 Like the [indexalator](#index-normalizing_iterators),
 the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be
-used for offsets types (INT32 or INT64 only) without requiring a type-specific instance.
-This helpful when reading or building strings columns. The normalized type is int64 which means
-an `input_offsetsalator` will return int64 type values for both INT32 and INT64 offsets columns.
-Likewise, an `output_offselator` can accept int64 type values to store into either an
+used for offset column types (INT32 or INT64 only) without requiring a type-specific instance.
+This helpful when reading or building [strings columns](#strings-columns). The normalized type is `int64` which means
+an `input_offsetsalator` will return `int64` type values for both INT32 and INT64 offsets columns.
+Likewise, an `output_offsetalator` can accept `int64` type values to store into either an
 INT32 or INT64 output offsets column created appropriately.
 
-Use the `offsetalator_factory` to create an appropriate input or output iterator from an offsets column_view.
+Use the `cudf::detail::offsetalator_factory` to create an appropriate input or output iterator from an offsets column_view.
 Example input iterator usage:
 
 ```c++
@@ -1271,12 +1271,12 @@ This is related to [Arrow's "Variable-Size List" memory layout](https://arrow.ap
 
 Strings are represented as a column with a data device buffer and a child offsets column.
 The parent column's type is `STRING` and its data holds all the characters across all the strings packed together
-but its size represents the number of strings in the column, and its null mask represents the
+but its size represents the number of strings in the column and its null mask represents the
 validity of each string.
 
 The strings column contains a single child column which is a non-nullable column
 of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each
-string in a dense data buffer of all characters. With this representation, `data[offsets[i]]` is the
+string in the dense data buffer of all characters. With this representation, `data[offsets[i]]` is the
 first character of string `i`, and the size of string `i` is given by `offsets[i+1] - offsets[i]`.
 The following image shows an example of this compound column representation of strings.
 
@@ -1382,25 +1382,27 @@ libcudf provides view types for nested column types as well as for the data elem
 
 ### cudf::strings_column_view and cudf::string_view
 
-A `cudf::strings_column_view` wraps a strings column which contains a parent
-`cudf::column_view` as a view the strings column and an offsets child `cudf::column_view`.
-The parent column contains the offset, size, and validity mask for the strings column.
-The offsets column is non-nullable with an `offset()==0` and its own size.
+A `cudf::strings_column_view` wraps a strings column and contains a parent
+`cudf::column_view` as a view of the strings column and an offsets `cudf::column_view`
+which is a child of the parent.
+The parent view contains the offset, size, and validity mask for the strings column.
+The offsets view is non-nullable with an `offset()==0` and its own size.
 Since the offset column type can be either INT32 or INT64 it is useful to use the
-offset normalizing iterators [offsetalator](#offset-normalizing_iterators) to access individual offset values.
+offset normalizing iterators [offsetalator](#offset-normalizing-iterators) to access individual offset values.
 
 A `cudf::string_view` is a view of a single string and therefore
 is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
 data type for a `cudf::column` of type INT32. As its name implies, this is a
 read-only object instance that points to device memory inside the strings column.
-It's lifespan is the same (or less) as the column it views.
+Its lifespan is the same (or less) as the column it views.
 An individual strings column row and a `cudf::string_view` is limited to [`size_type`](#cudfsize_type) bytes.
 
 Use the `column_device_view::element` method to access an individual row element. Like any other
 column, do not call `element()` on a row that is null.
 
 ```c++
-   cudf::column_device_view d_strings;
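+   auto d_column = cudf::column_device_view::create(scv.parent(), stream);  // scv is a cudf::strings_column_view
+   auto const& d_strings = *d_column;  // create() returns an owning pointer; keep d_column alive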
    ...
    if( d_strings.is_valid(row_index) ) {
       string_view d_str = d_strings.element<string_view>(row_index);
@@ -1418,7 +1420,7 @@ otherwise, position and length parameters are specified in characters and not by
 includes a `cudf::string_view::const_iterator` which can be used to navigate through individual characters
 within the string.
 
-`cudf::type_dispatcher` dispatches to the `string_view` data type when invoked on a `STRING` column.
+`cudf::type_dispatcher` dispatches to the `cudf::string_view` data type when invoked on a `STRING` column.
 
 #### UTF-8
 

From 80759ddc9df633c6c7a0fe49e50af01c8557dcf3 Mon Sep 17 00:00:00 2001
From: David Wendt <dwendt@nvidia.com>
Date: Wed, 8 May 2024 10:42:06 -0400
Subject: [PATCH 3/5] fix broken links and wording

---
 cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
index e7d359a7c43..f0ad460be55 100644
--- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
+++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -858,7 +858,7 @@ thrust::lower_bound(rmm::exec_policy(stream),
 
 ### Offset-normalizing iterators
 
-Like the [indexalator](#index-normalizing_iterators),
+Like the [indexalator](#index-normalizing-iterators),
 the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be
 used for offset column types (INT32 or INT64 only) without requiring a type-specific instance.
 This helpful when reading or building [strings columns](#strings-columns). The normalized type is `int64` which means
@@ -1274,8 +1274,8 @@ The parent column's type is `STRING` and its data holds all the characters acros
 but its size represents the number of strings in the column and its null mask represents the
 validity of each string.
 
-The strings column contains a single child column which is a non-nullable column
-of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each
+The strings column contains a single, non-nullable child column
+of offset elements that indicates the byte position offset to the beginning of each
 string in the dense data buffer of all characters. With this representation, `data[offsets[i]]` is the
 first character of string `i`, and the size of string `i` is given by `offsets[i+1] - offsets[i]`.
 The following image shows an example of this compound column representation of strings.
@@ -1283,6 +1283,7 @@ The following image shows an example of this compound column representation of s
 ![strings](strings.png)
 
 The type of the offsets column is either INT32 or INT64 depending on the number of bytes in the data buffer.
+See [`cudf::string_view`](#cudfstrings_column_view-and-cudfstring_view) for more information on processing individual string rows.
 
 ## Structs columns
 
@@ -1430,7 +1431,7 @@ character can be 1-4 bytes. This means the length of a string is not the same as
 For this reason, it is recommended to use the `cudf::string_view` class to access these characters for
 most operations.
 
-The `string_view.cuh` header also includes some utility methods for reading and writing
+The `cudf/strings/detail/utf8.hpp` header also includes some utility methods for reading and writing
 (`to_char_utf8/from_char_utf8`) individual UTF-8 characters to/from byte arrays.
 
 ### cudf::lists_column_view and cudf::lists_view

From 39d127d9ffec26f3585a24eba27c07422ba0d550 Mon Sep 17 00:00:00 2001
From: David Wendt <dwendt@nvidia.com>
Date: Wed, 8 May 2024 13:28:24 -0400
Subject: [PATCH 4/5] add to output offsetalator example

---
 cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
index f0ad460be55..a4b58fd5aca 100644
--- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
+++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -861,8 +861,9 @@ thrust::lower_bound(rmm::exec_policy(stream),
 Like the [indexalator](#index-normalizing-iterators),
 the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be
 used for offset column types (INT32 or INT64 only) without requiring a type-specific instance.
-This helpful when reading or building [strings columns](#strings-columns). The normalized type is `int64` which means
-an `input_offsetsalator` will return `int64` type values for both INT32 and INT64 offsets columns.
+This is helpful when reading or building [strings columns](#strings-columns).
+The normalized type is `int64` which means an `input_offsetalator` will return `int64` type values
+for both INT32 and INT64 offsets columns.
 Likewise, an `output_offsetalator` can accept `int64` type values to store into either an
 INT32 or INT64 output offsets column created appropriately.
 
@@ -881,6 +882,10 @@ Example input iterator usage:
 Example output iterator usage:
 
 ```c++
+    // create offsets column as either INT32 or INT64 depending on the number of bytes
+    auto offsets_column = cudf::strings::detail::create_offsets_child_column(total_bytes,
+                                                                             offsets_count,
+                                                                             stream, mr);
     auto d_offsets =
       cudf::detail::offsetalator_factory::make_output_iterator(offsets_column->mutable_view());
     // write appropriate offset values to d_offsets
@@ -1387,7 +1392,7 @@ A `cudf::strings_column_view` wraps a strings column and contains a parent
 `cudf::column_view` as a view of the strings column and an offsets `cudf::column_view`
 which is a child of the parent.
 The parent view contains the offset, size, and validity mask for the strings column.
-The offsets view is non-nullable with an `offset()==0` and its own size.
+The offsets view is non-nullable with `offset()==0` and its own size.
 Since the offset column type can be either INT32 or INT64 it is useful to use the
 offset normalizing iterators [offsetalator](#offset-normalizing-iterators) to access individual offset values.
 

From 450ff6b7036e7101e5f3e42547bc87c3bdd49a10 Mon Sep 17 00:00:00 2001
From: David Wendt <dwendt@nvidia.com>
Date: Mon, 13 May 2024 09:51:03 -0400
Subject: [PATCH 5/5] put ticks around INT32 and INT64

---
 cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
index a4b58fd5aca..ff80c2daab8 100644
--- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
+++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -860,12 +860,12 @@ thrust::lower_bound(rmm::exec_policy(stream),
 
 Like the [indexalator](#index-normalizing-iterators),
 the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be
-used for offset column types (INT32 or INT64 only) without requiring a type-specific instance.
+used for offset column types (`INT32` or `INT64` only) without requiring a type-specific instance.
 This is helpful when reading or building [strings columns](#strings-columns).
 The normalized type is `int64` which means an `input_offsetalator` will return `int64` type values
-for both INT32 and INT64 offsets columns.
+for both `INT32` and `INT64` offsets columns.
 Likewise, an `output_offsetalator` can accept `int64` type values to store into either an
-INT32 or INT64 output offsets column created appropriately.
+`INT32` or `INT64` output offsets column created appropriately.
 
 Use the `cudf::detail::offsetalator_factory` to create an appropriate input or output iterator from an offsets column_view.
 Example input iterator usage:
@@ -1287,7 +1287,7 @@ The following image shows an example of this compound column representation of s
 
 ![strings](strings.png)
 
-The type of the offsets column is either INT32 or INT64 depending on the number of bytes in the data buffer.
+The type of the offsets column is either `INT32` or `INT64` depending on the number of bytes in the data buffer.
 See [`cudf::string_view`](#cudfstrings_column_view-and-cudfstring_view) for more information on processing individual string rows.
 
 ## Structs columns
@@ -1332,7 +1332,7 @@ struct column's layout is as follows. (Note that null masks should be read from
 }
 ```
 
-The last struct row (index 3) is not null, but has a null value in the INT32 field. Also, row 2 of
+The last struct row (index 3) is not null, but has a null value in the `INT32` field. Also, row 2 of
 the struct column is null, making its corresponding fields also null. Therefore, bit 2 is unset in
 the null masks of both struct fields.
 
@@ -1393,12 +1393,12 @@ A `cudf::strings_column_view` wraps a strings column and contains a parent
 which is a child of the parent.
 The parent view contains the offset, size, and validity mask for the strings column.
 The offsets view is non-nullable with `offset()==0` and its own size.
-Since the offset column type can be either INT32 or INT64 it is useful to use the
+Since the offset column type can be either `INT32` or `INT64` it is useful to use the
 offset normalizing iterators [offsetalator](#offset-normalizing-iterators) to access individual offset values.
 
 A `cudf::string_view` is a view of a single string and therefore
 is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
-data type for a `cudf::column` of type INT32. As its name implies, this is a
+data type for a `cudf::column` of type `INT32`. As its name implies, this is a
 read-only object instance that points to device memory inside the strings column.
 Its lifespan is the same (or less) as the column it views.
 An individual strings column row and a `cudf::string_view` is limited to [`size_type`](#cudfsize_type) bytes.