From 2dd924a9e9030bdaa287b9c23cd5bbed6a7dce68 Mon Sep 17 00:00:00 2001 From: David Wendt Date: Mon, 6 May 2024 11:05:20 -0400 Subject: [PATCH 1/5] Update libcudf developer guide for strings offsets column --- .../developer_guide/DEVELOPER_GUIDE.md | 76 ++++++++++++++----- 1 file changed, 57 insertions(+), 19 deletions(-) diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md index 05f8e4585cc..650b288e20f 100644 --- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md +++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md @@ -828,7 +828,7 @@ This iterator returns the validity of the underlying element (`true` or `false`) The proliferation of data types supported by libcudf can result in long compile times. One area where compile time was a problem is in types used to store indices, which can be any integer type. -The "Indexalator", or index-normalizing iterator (`include/cudf/detail/indexalator.cuh`), can be +The "indexalator", or index-normalizing iterator (`include/cudf/detail/indexalator.cuh`), can be used for index types (integers) without requiring a type-specific instance. It can be used for any iterator interface for reading an array of integer values of type `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, or `uint64`. Reading specific elements always returns a @@ -856,6 +856,36 @@ thrust::lower_bound(rmm::exec_policy(stream), thrust::less()); ``` +### Offset-normalizing iterators + +Like the [indexalator](#index-normalizing_iterators), +the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be +used for offsets types (INT32 or INT64 only) without requiring a type-specific instance. +This helpful when reading or building strings columns. The normalized type is int64 which means +an `input_offsetsalator` will return int64 type values for both INT32 and INT64 offsets columns. 
+Likewise, an `output_offselator` can accept int64 type values to store into either an +INT32 or INT64 output offsets column created appropriately. + +Use the `offsetalator_factory` to create an appropriate input or output iterator from an offsets column_view. +Example input iterator usage: + +```c++ + // convert the sizes to offsets + auto [offsets, char_bytes] = cudf::strings::detail::make_offsets_child_column( + output_sizes.begin(), output_sizes.end(), stream, mr); + auto d_offsets = + cudf::detail::offsetalator_factory::make_input_iterator(offsets->view()); + // use d_offsets to address the output row bytes +``` + +Example output iterator usage: + +```c++ + auto d_offsets = + cudf::detail::offsetalator_factory::make_output_iterator(offsets_column->mutable_view()); + // write appropriate offset values to d_offsets +``` + ## Namespaces ### External @@ -1242,17 +1272,18 @@ This is related to [Arrow's "Variable-Size List" memory layout](https://arrow.ap Strings are represented as a column with a data device buffer and a child offsets column. The parent column's type is `STRING` and its data holds all the characters across all the strings packed together but its size represents the number of strings in the column, and its null mask represents the -validity of each string. To summarize, the strings column children are: +validity of each string. -1. A non-nullable column of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each - string in a dense data buffer of all characters. - -With this representation, `data[offsets[i]]` is the first character of string `i`, and the -size of string `i` is given by `offsets[i+1] - offsets[i]`. The following image shows an example of -this compound column representation of strings. 
+The strings column contains a single child column which is a non-nullable column +of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each +string in a dense data buffer of all characters. With this representation, `data[offsets[i]]` is the +first character of string `i`, and the size of string `i` is given by `offsets[i+1] - offsets[i]`. +The following image shows an example of this compound column representation of strings. ![strings](strings.png) +The type of the offsets column is either INT32 or INT64 depending on the number of bytes in the data buffer. + ## Structs columns A struct is a nested data type with a set of child columns each representing an individual field @@ -1351,12 +1382,19 @@ libcudf provides view types for nested column types as well as for the data elem ### cudf::strings_column_view and cudf::string_view -`cudf::strings_column_view` is a view of a strings column, like `cudf::column_view` is a view of -any `cudf::column`. `cudf::string_view` is a view of a single string, and therefore -`cudf::string_view` is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the -data type for a `cudf::column` of type [`size_type`](#cudfsize_type). As its name implies, this is a -read-only object instance that points to device memory inside the strings column. It's lifespan is -the same (or less) as the column it views. +A `cudf::strings_column_view` wraps a strings column which contains a parent +`cudf::column_view` as a view the strings column and an offsets child `cudf::column_view`. +The parent column contains the offset, size, and validity mask for the strings column. +The offsets column is non-nullable with an `offset()==0` and its own size. +Since the offset column type can be either INT32 or INT64 it is useful to use the +offset normalizing iterators [offsetalator](#offset-normalizing_iterators) to access individual offset values. 
+ +A `cudf::string_view` is a view of a single string and therefore +is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the +data type for a `cudf::column` of type INT32. As its name implies, this is a +read-only object instance that points to device memory inside the strings column. +It's lifespan is the same (or less) as the column it views. +An individual strings column row and a `cudf::string_view` is limited to [`size_type`](#cudfsize_type) bytes. Use the `column_device_view::element` method to access an individual row element. Like any other column, do not call `element()` on a row that is null. @@ -1370,14 +1408,14 @@ column, do not call `element()` on a row that is null. } ``` -A null string is not the same as an empty string. Use the `string_scalar` class if you need an +A null string is not the same as an empty string. Use the `cudf::string_scalar` class if you need an instance of a class object to represent a null string. -The `string_view` contains comparison operators `<,>,==,<=,>=` that can be used in many cudf -functions like `sort` without string-specific code. The data for a `string_view` instance is +The `cudf::string_view` contains comparison operators `<,>,==,<=,>=` that can be used in many cudf +functions like `sort` without string-specific code. The data for a `cudf::string_view` instance is required to be [UTF-8](#utf-8) and all operators and methods expect this encoding. Unless documented otherwise, position and length parameters are specified in characters and not bytes. The class also -includes a `string_view::const_iterator` which can be used to navigate through individual characters +includes a `cudf::string_view::const_iterator` which can be used to navigate through individual characters within the string. `cudf::type_dispatcher` dispatches to the `string_view` data type when invoked on a `STRING` column. @@ -1387,7 +1425,7 @@ within the string. 
The libcudf strings column only supports UTF-8 encoding for strings data. [UTF-8](https://en.wikipedia.org/wiki/UTF-8) is a variable-length character encoding wherein each character can be 1-4 bytes. This means the length of a string is not the same as its size in bytes. -For this reason, it is recommended to use the `string_view` class to access these characters for +For this reason, it is recommended to use the `cudf::string_view` class to access these characters for most operations. The `string_view.cuh` header also includes some utility methods for reading and writing From dd674582fb33baa6d6d300e98cd9d4566a6d4d6a Mon Sep 17 00:00:00 2001 From: David Wendt Date: Wed, 8 May 2024 09:28:16 -0400 Subject: [PATCH 2/5] updates --- .../developer_guide/DEVELOPER_GUIDE.md | 34 ++++++++++--------- 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md index 650b288e20f..e7d359a7c43 100644 --- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md +++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md @@ -1,4 +1,4 @@ -# libcudf C++ Developer Guide {#DEVELOPER_GUIDE} +# libcudf C++ Developer Guide This document serves as a guide for contributors to libcudf C++ code. Developers should also refer to these additional files for further documentation of libcudf best practices. @@ -860,13 +860,13 @@ thrust::lower_bound(rmm::exec_policy(stream), Like the [indexalator](#index-normalizing_iterators), the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be -used for offsets types (INT32 or INT64 only) without requiring a type-specific instance. -This helpful when reading or building strings columns. The normalized type is int64 which means -an `input_offsetsalator` will return int64 type values for both INT32 and INT64 offsets columns. 
-Likewise, an `output_offselator` can accept int64 type values to store into either an +used for offset column types (INT32 or INT64 only) without requiring a type-specific instance. +This helpful when reading or building [strings columns](#strings-columns). The normalized type is `int64` which means +an `input_offsetsalator` will return `int64` type values for both INT32 and INT64 offsets columns. +Likewise, an `output_offselator` can accept `int64` type values to store into either an INT32 or INT64 output offsets column created appropriately. -Use the `offsetalator_factory` to create an appropriate input or output iterator from an offsets column_view. +Use the `cudf::detail::offsetalator_factory` to create an appropriate input or output iterator from an offsets column_view. Example input iterator usage: ```c++ @@ -1271,12 +1271,12 @@ This is related to [Arrow's "Variable-Size List" memory layout](https://arrow.ap Strings are represented as a column with a data device buffer and a child offsets column. The parent column's type is `STRING` and its data holds all the characters across all the strings packed together -but its size represents the number of strings in the column, and its null mask represents the +but its size represents the number of strings in the column and its null mask represents the validity of each string. The strings column contains a single child column which is a non-nullable column of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each -string in a dense data buffer of all characters. With this representation, `data[offsets[i]]` is the +string in the dense data buffer of all characters. With this representation, `data[offsets[i]]` is the first character of string `i`, and the size of string `i` is given by `offsets[i+1] - offsets[i]`. The following image shows an example of this compound column representation of strings. 
@@ -1382,25 +1382,27 @@ libcudf provides view types for nested column types as well as for the data elem ### cudf::strings_column_view and cudf::string_view -A `cudf::strings_column_view` wraps a strings column which contains a parent -`cudf::column_view` as a view the strings column and an offsets child `cudf::column_view`. -The parent column contains the offset, size, and validity mask for the strings column. -The offsets column is non-nullable with an `offset()==0` and its own size. +A `cudf::strings_column_view` wraps a strings column and contains a parent +`cudf::column_view` as a view of the strings column and an offsets `cudf::column_view` +which is a child of the parent. +The parent view contains the offset, size, and validity mask for the strings column. +The offsets view is non-nullable with an `offset()==0` and its own size. Since the offset column type can be either INT32 or INT64 it is useful to use the -offset normalizing iterators [offsetalator](#offset-normalizing_iterators) to access individual offset values. +offset normalizing iterators [offsetalator](#offset-normalizing-iterators) to access individual offset values. A `cudf::string_view` is a view of a single string and therefore is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the data type for a `cudf::column` of type INT32. As its name implies, this is a read-only object instance that points to device memory inside the strings column. -It's lifespan is the same (or less) as the column it views. +Its lifespan is the same (or less) as the column it views. An individual strings column row and a `cudf::string_view` is limited to [`size_type`](#cudfsize_type) bytes. Use the `column_device_view::element` method to access an individual row element. Like any other column, do not call `element()` on a row that is null. ```c++ - cudf::column_device_view d_strings; + cudf::strings_column_view scv; + auto d_strings = cudf::column_device_view::create(scv.parent(), stream); ... 
if( d_strings.is_valid(row_index) ) { string_view d_str = d_strings.element(row_index); @@ -1418,7 +1420,7 @@ otherwise, position and length parameters are specified in characters and not by includes a `cudf::string_view::const_iterator` which can be used to navigate through individual characters within the string. -`cudf::type_dispatcher` dispatches to the `string_view` data type when invoked on a `STRING` column. +`cudf::type_dispatcher` dispatches to the `cudf::string_view` data type when invoked on a `STRING` column. #### UTF-8 From 80759ddc9df633c6c7a0fe49e50af01c8557dcf3 Mon Sep 17 00:00:00 2001 From: David Wendt Date: Wed, 8 May 2024 10:42:06 -0400 Subject: [PATCH 3/5] fix broken links and wording --- cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md index e7d359a7c43..f0ad460be55 100644 --- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md +++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md @@ -858,7 +858,7 @@ thrust::lower_bound(rmm::exec_policy(stream), ### Offset-normalizing iterators -Like the [indexalator](#index-normalizing_iterators), +Like the [indexalator](#index-normalizing-iterators), the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be used for offset column types (INT32 or INT64 only) without requiring a type-specific instance. This helpful when reading or building [strings columns](#strings-columns). The normalized type is `int64` which means @@ -1274,8 +1274,8 @@ The parent column's type is `STRING` and its data holds all the characters acros but its size represents the number of strings in the column and its null mask represents the validity of each string. 
-The strings column contains a single child column which is a non-nullable column -of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each +The strings column contains a single, non-nullable child column +of offset elements that indicates the byte position offset to the beginning of each string in the dense data buffer of all characters. With this representation, `data[offsets[i]]` is the first character of string `i`, and the size of string `i` is given by `offsets[i+1] - offsets[i]`. The following image shows an example of this compound column representation of strings. @@ -1283,6 +1283,7 @@ The following image shows an example of this compound column representation of s ![strings](strings.png) The type of the offsets column is either INT32 or INT64 depending on the number of bytes in the data buffer. +See [`cudf::strings_view`](#cudfstrings_column_view-and-cudfstring_view) for more information on processing individual string rows. ## Structs columns @@ -1430,7 +1431,7 @@ character can be 1-4 bytes. This means the length of a string is not the same as For this reason, it is recommended to use the `cudf::string_view` class to access these characters for most operations. -The `string_view.cuh` header also includes some utility methods for reading and writing +The `cudf/strings/detail/utf8.hpp` header also includes some utility methods for reading and writing (`to_char_utf8/from_char_utf8`) individual UTF-8 characters to/from byte arrays. 
### cudf::lists_column_view and cudf::lists_view From 39d127d9ffec26f3585a24eba27c07422ba0d550 Mon Sep 17 00:00:00 2001 From: David Wendt Date: Wed, 8 May 2024 13:28:24 -0400 Subject: [PATCH 4/5] add to output offsetalator example --- cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md index f0ad460be55..a4b58fd5aca 100644 --- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md +++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md @@ -861,8 +861,9 @@ thrust::lower_bound(rmm::exec_policy(stream), Like the [indexalator](#index-normalizing-iterators), the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be used for offset column types (INT32 or INT64 only) without requiring a type-specific instance. -This helpful when reading or building [strings columns](#strings-columns). The normalized type is `int64` which means -an `input_offsetsalator` will return `int64` type values for both INT32 and INT64 offsets columns. +This is helpful when reading or building [strings columns](#strings-columns). +The normalized type is `int64` which means an `input_offsetsalator` will return `int64` type values +for both INT32 and INT64 offsets columns. Likewise, an `output_offselator` can accept `int64` type values to store into either an INT32 or INT64 output offsets column created appropriately. 
@@ -881,6 +882,10 @@ Example input iterator usage: Example output iterator usage: ```c++ + // create offsets column as either INT32 or INT64 depending on the number of bytes + auto offsets_column = cudf::strings::detail::create_offsets_child_column(total_bytes, + offsets_count, + stream, mr); auto d_offsets = cudf::detail::offsetalator_factory::make_output_iterator(offsets_column->mutable_view()); // write appropriate offset values to d_offsets @@ -1387,7 +1392,7 @@ A `cudf::strings_column_view` wraps a strings column and contains a parent `cudf::column_view` as a view of the strings column and an offsets `cudf::column_view` which is a child of the parent. The parent view contains the offset, size, and validity mask for the strings column. -The offsets view is non-nullable with an `offset()==0` and its own size. +The offsets view is non-nullable with `offset()==0` and its own size. Since the offset column type can be either INT32 or INT64 it is useful to use the offset normalizing iterators [offsetalator](#offset-normalizing-iterators) to access individual offset values. From 450ff6b7036e7101e5f3e42547bc87c3bdd49a10 Mon Sep 17 00:00:00 2001 From: David Wendt Date: Mon, 13 May 2024 09:51:03 -0400 Subject: [PATCH 5/5] put ticks around INT32 and INT64 --- cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md index a4b58fd5aca..ff80c2daab8 100644 --- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md +++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md @@ -860,12 +860,12 @@ thrust::lower_bound(rmm::exec_policy(stream), Like the [indexalator](#index-normalizing-iterators), the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be -used for offset column types (INT32 or INT64 only) without requiring a type-specific instance. 
+used for offset column types (`INT32` or `INT64` only) without requiring a type-specific instance. This is helpful when reading or building [strings columns](#strings-columns). The normalized type is `int64` which means an `input_offsetsalator` will return `int64` type values -for both INT32 and INT64 offsets columns. +for both `INT32` and `INT64` offsets columns. Likewise, an `output_offselator` can accept `int64` type values to store into either an -INT32 or INT64 output offsets column created appropriately. +`INT32` or `INT64` output offsets column created appropriately. Use the `cudf::detail::offsetalator_factory` to create an appropriate input or output iterator from an offsets column_view. Example input iterator usage: @@ -1287,7 +1287,7 @@ The following image shows an example of this compound column representation of s ![strings](strings.png) -The type of the offsets column is either INT32 or INT64 depending on the number of bytes in the data buffer. +The type of the offsets column is either `INT32` or `INT64` depending on the number of bytes in the data buffer. See [`cudf::strings_view`](#cudfstrings_column_view-and-cudfstring_view) for more information on processing individual string rows. ## Structs columns @@ -1332,7 +1332,7 @@ struct column's layout is as follows. (Note that null masks should be read from } ``` -The last struct row (index 3) is not null, but has a null value in the INT32 field. Also, row 2 of +The last struct row (index 3) is not null, but has a null value in the `INT32` field. Also, row 2 of the struct column is null, making its corresponding fields also null. Therefore, bit 2 is unset in the null masks of both struct fields. @@ -1393,12 +1393,12 @@ A `cudf::strings_column_view` wraps a strings column and contains a parent which is a child of the parent. The parent view contains the offset, size, and validity mask for the strings column. The offsets view is non-nullable with `offset()==0` and its own size. 
-Since the offset column type can be either INT32 or INT64 it is useful to use the +Since the offset column type can be either `INT32` or `INT64` it is useful to use the offset normalizing iterators [offsetalator](#offset-normalizing-iterators) to access individual offset values. A `cudf::string_view` is a view of a single string and therefore is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the -data type for a `cudf::column` of type INT32. As its name implies, this is a +data type for a `cudf::column` of type `INT32`. As its name implies, this is a read-only object instance that points to device memory inside the strings column. Its lifespan is the same (or less) as the column it views. An individual strings column row and a `cudf::string_view` is limited to [`size_type`](#cudfsize_type) bytes.
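
Reviewer note on the series: the offsets layout and the normalizing-iterator idea described in these hunks can be sketched on the host without any libcudf dependency. The sketch below is an illustration only. `normalized_offsets` and `row` are hypothetical names, not libcudf APIs, and the class is not how `cudf::detail::offsetalator_factory` is implemented. It assumes only what the guide text states: offsets are stored as 32-bit or 64-bit integers, they are read back as 64-bit values, and row `i` occupies bytes `offsets[i]` up to `offsets[i+1]` of the packed character buffer.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical host-side stand-in for an offset-normalizing reader.
// It only mimics the idea of returning 64-bit values regardless of the
// width the offsets were stored with; it is not a libcudf type.
class normalized_offsets {
 public:
  explicit normalized_offsets(std::vector<std::int32_t> const& offsets)
    : p32_{offsets.data()}, is32_{true}, size_{offsets.size()}
  {
  }
  explicit normalized_offsets(std::vector<std::int64_t> const& offsets)
    : p64_{offsets.data()}, is32_{false}, size_{offsets.size()}
  {
  }

  // Reads element i and widens it to 64 bits when it was stored as 32 bits.
  std::int64_t operator[](std::size_t i) const
  {
    assert(i < size_);
    return is32_ ? static_cast<std::int64_t>(p32_[i]) : p64_[i];
  }

  std::size_t size() const { return size_; }

 private:
  std::int32_t const* p32_{nullptr};
  std::int64_t const* p64_{nullptr};
  bool is32_{true};
  std::size_t size_{0};
};

// Row i spans bytes [offsets[i], offsets[i+1]) of the packed character buffer,
// so its size in bytes is offsets[i+1] - offsets[i].
inline std::string_view row(std::string const& chars,
                            normalized_offsets const& offsets,
                            std::size_t i)
{
  auto const begin = offsets[i];
  auto const bytes = offsets[i + 1] - begin;
  return std::string_view{chars.data() + begin, static_cast<std::size_t>(bytes)};
}
```

For the four rows `"ab"`, `"é"`, `""`, and `"xyz"`, the packed buffer holds 7 bytes and the offsets are `{0, 2, 4, 4, 7}`. Row 1 is one character but two bytes, which is the UTF-8 point made in the guide, and row 2 is an empty string (equal adjacent offsets), which is distinct from a null row because nullness lives in the parent column's null mask. The calling code is the same whichever width the offsets vector uses.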