
[FEA] Make cudf::size_type 64-bit #3958

Closed
harrism opened this issue Jan 28, 2020 · 52 comments
Labels: feature request, libcudf, wontfix

Comments

@harrism
Member

harrism commented Jan 28, 2020

Is your feature request related to a problem? Please describe.
cudf::size_type is currently an int32_t, which limits column size to about two billion elements (INT_MAX). Moreover, it limits child column size to the same. This causes problems, for example, for string columns, where there may be fewer than 2B strings, but the character data needed to represent them can easily exceed 2B characters.

A 32-bit size was originally chosen to ensure compatibility with Apache Arrow, which dictates that Arrow arrays have a 32-bit size, and that larger arrays are made by chunking into individual Arrays.

Describe the solution you'd like

  • Change size_type to be an int64_t.

  • Handle compatibility with Arrow by creating arrow chunked arrays in the libcudf to_arrow interface (not yet created), and combine arrow chunked arrays in the libcudf from_arrow interface. This can be dealt with when we create these APIs.

Describe alternatives you've considered

Chunked columns. This would be very challenging -- supporting chunked columns in every algorithm would result in complex distributed algorithms and implementations, where libcudf currently aims to be communication agnostic / ignorant. In other words, a higher level library handles distributed algorithms.

Additional context

A potential downside: @felipeblazing called us brave for considering supporting chunked columns. If we implement this feature request, perhaps he will not consider us quite so brave. :(

harrism added the feature request and libcudf labels on Jan 28, 2020
@kkraus14
Collaborator

A potential downside: @felipeblazing called us brave for considering supporting chunked columns. If we implement this feature request, perhaps he will not consider us quite so brave. :(

🤣

@kkraus14
Collaborator

Handle compatibility with Arrow by creating arrow chunked arrays in the libcudf to_arrow interface (not yet created), and combine arrow chunked arrays in the libcudf from_arrow interface. This can be dealt with when we create these APIs.

We have these APIs on the Python side already, so we'd ideally need these APIs sooner rather than later, though I do believe the PyArrow CPU side will handle splitting chunks for us.

@felipeblazing

This seems like a really reasonable way to ensure that we are compatible with larger columns almost for free. If we want to revisit later, that might make sense, but right now I can imagine that the discussions alone for moving to a chunked representation would be long.

@revans2
Contributor

revans2 commented Jan 28, 2020

My biggest concern is the amount of memory this will take and the performance impact it will have. In a lot of the data sets we encounter, the length of the strings is small, often very small. Doing this in many of these cases will effectively double the amount of memory used to store a string, and hence cut the performance of all string operations in half, assuming they are bound by memory access speed. This will also be true for arrays when they show up. Not to mention that we will use a lot more precious GPU memory to store offsets that we could have used to store actual data.

I just want to be sure that if we do this, we do some performance and capacity measurements before and after to see what impact it will actually have.

@wmalpica

@revans2 that is a valid concern. We could handle that by having a string datatype variant that stores offsets in int32_t. Similar to how we have different TIMESTAMP datatypes, we could implement STRING variants. This would be more of a future optimization that we don't have to implement right now, but we know that we could do it in the future and not necessarily paint ourselves into a corner, so to speak, by adopting a 64-bit size_type.

@felipeblazing

The reason for switching to 64-bit is more about what is possible than about performance. Right now there are ways to break it with data that is well within the limits of the GPU. I agree there will be a perf penalty, which we can evaluate, but the other alternatives are pretty bleak: either telling users "don't make your columns too big" or taking on another massive refactor project.

@ChuckHastings
Contributor

There are a number of cudf functions that use cudf::size_type to create intermediate data structures (e.g. build an array of row offsets in some kernel and then use gather/scatter to create a new table). Doubling the size of cudf::size_type would double the temporary memory used by these methods, essentially halving the size of the table you could effectively operate on with one GPU.

To get around that, you could either change these to use an int32 directly (although as GPU memory sizes increase that would be an artificial limit), or create 32- and 64-bit versions of these internal algorithms and call the 32-bit version when the number of rows is < 2 billion. That could make things a bit more complicated.

@nvdbaranec
Contributor

A quick scan of the code doesn't turn up much in the way of portability problems. Every *4 or * 4 seems legit. No incorrect usage of 0xffffffff or 0x80000000. The handful of places calling sizeof(size_type) seem legit as well.

Pure mismatches between size_type and int/int32_t are harder to detect without actually compiling and testing, of course, but at least at the surface level there don't appear to be any deep-rooted issues with implicit assumptions that size_type == 4 bytes.

@harrism
Member Author

harrism commented Jan 29, 2020

@ChuckHastings do you have a pointer to an example? I think that would be helpful.

@harrism
Member Author

harrism commented Jan 29, 2020

To @revans2's comment, we have actually already run into the case I mentioned of a string column with < 2B strings but > 2B characters. I'm not saying it's a common case, but it's an ugly situation.

@jrhemstad
Contributor

@ChuckHastings do you have a pointer to an example? I think that would be helpful.

Groupby/Join are an example of this. The hash map uses size_type for both the keys and values (since they are row offsets into tables). This would double the size of the hash map for both join and groupby.

@harrism
Member Author

harrism commented Jan 29, 2020

How big is the hash map?

@ChuckHastings
Contributor

ChuckHastings commented Jan 29, 2020

Join creates the following intermediate products:

  • Hash map, implemented in an array of type pair<cudf::size_type,cudf::size_type> with 2 * (number of rows in the smaller table) elements. The 2x size is the default - which is not currently overridden by the implementation.
  • VectorPair of join indices which are used as input into a call to gather to construct the final output columns. This is 2 * cudf::size_type elements for each row of output in the resulting join (technically we estimate the size, so it could be slightly larger than this).
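For a rough sense of scale, here is a back-of-the-envelope sketch of that memory arithmetic (the 500M-row build table is an assumed figure for illustration, not from this issue):

#include <cstdint>
#include <cstdio>

int main()
{
  std::int64_t const build_rows = 500'000'000;  // hypothetical smaller-table row count
  // Hash map: 2 * build_rows entries, each a pair of row indices.
  std::int64_t const map_bytes_int32 = 2 * build_rows * 2 * sizeof(std::int32_t);
  std::int64_t const map_bytes_int64 = 2 * build_rows * 2 * sizeof(std::int64_t);
  std::printf("hash map, 32-bit size_type: %lld GB\n", static_cast<long long>(map_bytes_int32 / 1'000'000'000));
  std::printf("hash map, 64-bit size_type: %lld GB\n", static_cast<long long>(map_bytes_int64 / 1'000'000'000));
  return 0;
}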

@harrism
Member Author

harrism commented Jan 29, 2020

For the intermediates, could we detect that the column size is < 2B and use int32 for the hash map and VectorPair?

@ChuckHastings
Contributor

That's what I was intending by my suggestion above. We could create a template parameter that identifies whether to use int32 or int64 for the hash map and vector pair and then call the correct version depending on the size of the tables.
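For illustration, a minimal sketch of that kind of dispatch, with hypothetical names (build_hash_map and its impl are not libcudf functions):

#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical internal algorithm templated on the index type used for
// intermediate structures (hash map keys/values, gather maps, ...).
template <typename IndexType>
void build_hash_map_impl(std::int64_t num_rows)
{
  std::vector<std::pair<IndexType, IndexType>> map(2 * num_rows);  // 2x occupancy default
  // ... populate and probe the map ...
}

// Dispatch on table size: 32-bit indices when the rows fit, 64-bit otherwise.
void build_hash_map(std::int64_t num_rows)
{
  if (num_rows <= std::numeric_limits<std::int32_t>::max()) {
    build_hash_map_impl<std::int32_t>(num_rows);
  } else {
    build_hash_map_impl<std::int64_t>(num_rows);
  }
}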

@jakirkham
Member

Just as a note, we will want to update any lines like this one that hard-code this type for higher-level code (Python and/or Java).

@revans2
Contributor

revans2 commented Feb 5, 2020

Yes, the Java code would have to be updated to handle this. It is not a big deal though, and I am happy to do it.

I have also run into situations where a string column cannot be stored because there are too many characters in it, so I know that it is a real issue.

But, just to play devil's advocate... When we first started work on Spark and mentioned that there were algorithms in Spark that would spill to disk so they could operate on data larger than what fits in host memory, we were told that was not really a priority because we could always partition the data into smaller pieces. Why is that not true here too? Why is this a question of capability when you have the option to partition the data into smaller pieces? Yes, I know that cudf itself is not going to be partitioning the data, a la chunked Arrow vectors, but Spark chunks the data itself without any cudf knowledge, and I know Dask can as well.

@jakirkham
Member

Yep agreed. Just making sure that point doesn't get lost in the shuffle (pun not intended 😉).

Yep that's true of both frameworks.

There's a tradeoff between using algorithms that operate on local data vs. distributed data. The increased size would let users tune this more appropriately for their workflows to get the desired performance.

@kkraus14
Collaborator

kkraus14 commented Feb 5, 2020

I have also run into situations where a string column cannot be stored because there are too many characters in it, so I know that it is a real issue.

I think this is the core issue. If we could have max(int32_t) strings I think everyone would be a happy camper, but having max(int32_t) bytes severely limits the number of elements you can have in a table before needing to partition it, which can adversely impact performance.

@harrism
Member Author

harrism commented Feb 6, 2020

The main thing is that libcudf does not want to have to write algorithms that operate on chunked columns. We prefer Dask/Spark to do the distributed algorithms, and libcudf to provide the primitives on which to build them.

If we can solve the strings column problem by having a different size_type for the size of character arrays, that might reduce the current pain. Note that this problem probably affects other nested types too (e.g. lists). @jrhemstad @davidwendt this suggestion goes against the grain of the current design of strings columns / nested columns, where the data array (characters, in the case of strings) is just a child column.

@jrhemstad
Contributor

jrhemstad commented Feb 6, 2020

If we can solve the strings column problem by having a different size_type for the size of character arrays, that might reduce the current pain.

It's not really possible to change the size_type for only character columns. It's an all or nothing deal.

@davidwendt
Contributor

I noticed a few places in the libcudf code where columns are created with INT32 type but then referenced using data<size_type>(). These, of course, will break if size_type is changed from int32_t to int64_t and would have to be fixed.

Regardless, it got me thinking that there is no reason the chars column needs to be created as INT8. There is nothing preventing creating the column as INT64 and then still referencing the data as data<char>(). This would require the offsets column to also be changed from INT32 to INT64 in order to address the larger chars buffer.

With size_type=32-bits, the strings column would still max out at 2B elements, but the maximum total byte size of the column would go from 2GB to 16GB (2^31 INT64 elements at 8 bytes each). Considering that strings are immutable and most strings operations require making a copy, I think a 16GB max strings column size is reasonable.

This change would only affect strings code. Also, I think the change would be minimal, affecting mostly the factories used to create strings columns and any code that accesses offsets and chars directly (e.g. create_offsets, concatenate, contiguous_split). The majority of the strings operations should be unaffected since they generally access the column's data using the element<string_view>(idx) method (which requires a minor change). The child columns' offset and size members are not used by design.

I cannot speak to how this would affect the Cython code interface, though.

@jrhemstad
Contributor

Regardless, it got me thinking that there is no reason the chars column needs to be created as INT8. There is nothing preventing creating the column as INT64 and then still referencing the data as data<char>(). This would require the offsets column to also be changed from INT32 to INT64 in order to address the larger chars buffer.

I've been mulling over the same idea. Some of the concerns I've had:

  • It's dishonest.
    • To an external observer, if a column's type is INT64, one would expect its elements to correspond to 8-byte integer values. But instead, by convention, we're saying this column's elements are actually 1-byte integers.
    • This is especially problematic if the chars child column were ever released from its strings column owner. There would no longer be any way to differentiate a "real" INT64 column from the type-punned chars child.
  • It requires type-punning.
  • Slack bits
    • If the number of characters isn't a multiple of 8, then we'd have undefined slack bits in the last 8-byte element. I can't think of a reason right now why this would actually cause problems, but it definitely should be kept in mind.

@davidwendt
Contributor

I had the same thoughts but here is my rationalization:

  • I think it is no more dishonest than saying the chars column is of type INT8, where the integers are likewise interpreted as char bytes.
  • If the column is released into the wild on its own, it is not much use without the offsets, nulls, etc. from the other components. And there is already no way to distinguish an INT8 integer column from an INT8 chars column.
  • I believe a device_buffer is always allocated on an 8-byte boundary, so there are already extra bits today on every column whose total byte size is not a multiple of 8.

@davidwendt
Contributor

From the discussions above, I think supporting large columns (strings as well as other types) will require converting cudf::size_type to int64_t. This (almost) automatically makes it possible to support large fixed-width columns (e.g. ints, floats, timestamps, fixed-point) but may require some special logic for groupby/join (as mentioned above) and perhaps partitioning. Overall, this conversion will not be a trivial task and affects not just Python cudf and Spark but upstream repos like clx and cuml. I noticed that the Apache Arrow 2.0 docs mention they support int64 for column sizes and null counts, so we would still be Arrow compatible.

I started experimenting with changing cudf::size_type to int64_t in libcudf to first see what compile issues it created. So far, I've had to modify over 70 source files (including gtests and gbenchmarks) where we used size_type and int32_t (and int) interchangeably. Many of these are trivial casts. Non-trivial changes are mostly in the strings code but many exist in other places. Along with the hash/groupby issue mentioned previously, I suspect there is a non-trivial impact on cuIO to create and store larger columns as well.

Finally, I agree with the concern about this change affecting column types that have offsets child columns. Changing the offsets columns to INT64 would create larger memory requirements for the same column data we support today. However, we do not have to fix the offsets column type to either INT32 or INT64. The dictionary column type currently supports indices columns of any integer type. It relies on the libcudf indexalator iterator to normalize access to the integer data in this child column. This is an approach we could take with the offsets column as well -- allowing it to be either INT32 or INT64 type as appropriate, and without needing to create any new types. This would also be a non-trivial change but would be mostly isolated to the specialization logic for strings, lists, and structs. Changes could be minimal in places where iterators are used instead of raw pointers.
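For illustration, a host-side sketch of that normalized-access idea (the indexalator does this with device-side iterators; the names below are made up, not libcudf APIs):

#include <cstdint>

// The offsets child column may hold either 32-bit or 64-bit values; readers
// normalize to int64 so the calling code does not care which width is stored.
enum class offset_width { INT32, INT64 };

std::int64_t offset_at(void const* offsets, offset_width width, std::int64_t i)
{
  if (width == offset_width::INT32) {
    return static_cast<std::int32_t const*>(offsets)[i];
  }
  return static_cast<std::int64_t const*>(offsets)[i];
}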

@jrhemstad
Contributor

jrhemstad commented Dec 15, 2020

This is an approach we could take with the offsets column as well -- allowing it to be either INT32 or INT64 type as appropriate, and without needing to create any new types. This would also be a non-trivial change but would be mostly isolated to the specialization logic for strings, lists, and structs. Changes could be minimal in places where iterators are used instead of raw pointers.

This seems like it would be very non-trivial, as you'd have to decide somewhere, when making a new dictionary column, whether you're going to use 32- or 64-bit indices. I don't know how you make that decision without just assuming you always use the largest size, or first introspecting the data to determine what size you need.

The same applies to nested types. At some point you need to decide what size indices to use. In general, I don't know how you do that without always using the biggest size, or running the computation twice and instantiating code paths for all possible required sizes.

@nvdbaranec
Contributor

nvdbaranec commented Dec 15, 2020

Fixed-point types have a similar issue. Knowing it's "fixed point" isn't enough; you need to know the scale value as well, so you can't generalize fixed-point types as arithmetic, and every place that uses fixed-point types needs to be specialized.

I wonder if it would make sense to have an OFFSET type_id that also comes with additional info in the data_type indicating its size. It would be a big undertaking to get everything that uses offsets to work this way, but once you do, you've insulated yourself well.

@davidwendt
Contributor

For strings, I've found that creating the offsets array by first building the array of sizes and then performing a scan over the sizes to get the offset values is faster than doing a single transform-scan using the size-calculation functor (mainly because of the aggressive inlining in thrust). This means that if we limit individual strings in a column to int32 sizes, we could first build the sizes column with int32 values and then add a reduce, using an int64 accumulator, to calculate the total size. So the process introduces an extra reduce step. If the total size is small enough, we can scan into the existing int32 array to build the offsets output column. Otherwise, we need to scan into a new int64 array, and the original int32 array is then discarded.

Limiting an individual string's size to max::int32 seems reasonable since we could not store many such strings anyway. But I don't think we could make the same restriction for other types like nested lists.
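A minimal sketch of that reduce-then-scan idea (illustration only, not libcudf internals; string_sizes is assumed to already hold per-row byte counts):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <cstdint>
#include <limits>

void build_offsets(thrust::device_vector<int32_t> const& string_sizes)
{
  // Extra reduce step: accumulate in int64 so the total cannot overflow.
  int64_t const total_bytes =
    thrust::reduce(string_sizes.begin(), string_sizes.end(), int64_t{0});

  if (total_bytes <= std::numeric_limits<int32_t>::max()) {
    // Fits in int32: scan the sizes into 32-bit offsets.
    thrust::device_vector<int32_t> offsets(string_sizes.size() + 1, 0);
    thrust::inclusive_scan(string_sizes.begin(), string_sizes.end(), offsets.begin() + 1);
  } else {
    // Too large: widen the sizes and scan into 64-bit offsets instead.
    thrust::device_vector<int64_t> sizes64(string_sizes.begin(), string_sizes.end());
    thrust::device_vector<int64_t> offsets(string_sizes.size() + 1, 0);
    thrust::inclusive_scan(sizes64.begin(), sizes64.end(), offsets.begin() + 1);
  }
}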

@jlowe
Member

jlowe commented Dec 15, 2020

But I don't think we could make the same restriction for other types like nested lists.

A list column's offsets are tracking the row offset of the immediate child column, and I think it's quite reasonable to have a limit of 2^31 child rows per parent list row, similar to the 2^31 limit on bytes for a single row of a string column. That's a huge improvement from where we are now.

This would enable a list-of-strings column where the list column offsets are 32-bit but the string column offsets are 64-bit. I don't know enough about the libcudf nested type offsets-of-offsets handling to see if that's going to be problematic in practice.

@nvdbaranec
Contributor

I don't know enough about the libcudf nested type offsets-of-offsets handling to see if that's going to be problematic in practice.

I wouldn't expect it to be too bad. Most things that process lists end up just fobbing off the final leaf work to the appropriate type handler (string, etc). So as long as strings inherently handled 32/64 bit offsets, it should be ok. There's oddball edge cases like contiguous_split which glom string and list offsets together as the same type, but that's relatively easy to fix.

@titericz

I hit the wall with this string limit (2^31 bytes) today when trying to load a large dataset with 185M rows and 2 string features. The dataset comes from the Kaggle TalkingData competition. In real life we can expect much larger datasets, so I believe it's very important to expand that limit in cudf.

@kkraus14
Collaborator

Hi @titericz, I would highly recommend using a framework built on top of cudf in this situation instead of cudf directly, e.g. dask-cudf will automatically handle partitioning your data to alleviate this.

@harrism
Member Author

harrism commented Mar 21, 2021

To update this issue, we've tentatively decided that we will NOT be changing cudf::size_type in the near term because of the excessive memory use cost this would create for a lot of cases (e.g. every offsets column would be twice as large).

@ttnghia
Contributor

ttnghia commented Jun 3, 2021

Is it possible to have a new big_column column type that uses big_offset_type and big_size_type (int64_t) for storing very large datasets? A typical column could then be converted into a big_column whenever needed. Of course there would be many technical difficulties, but at least it would give us a workaround for handling large datasets...

@harrism
Member Author

harrism commented Jun 7, 2021

So we double our library size and compile time?

@EvenOldridge

EvenOldridge commented Jun 7, 2021 via email

@jrhemstad
Contributor

where datasets are massive. This is a huge blocker for adoption of NVTabular by the bigger companies unless we find a solution.

Overall dataset size doesn't really matter so long as you partition it into small enough pieces.

@ttnghia
Contributor

ttnghia commented Jun 7, 2021

Overall dataset size doesn't really matter so long as you partition it into small enough pieces.

Yeah, this is a good (and general) solution. Given this idea, I think it would be a huge benefit to users to add a utility function that outputs a list of column_view from (const) raw buffers, similar to the column_view constructor that constructs a column_view from const pointers to data and bitmask:

column_view_base(data_type type,
                 size_type size,
                 void const* data,
                 bitmask_type const* null_mask = nullptr,
                 size_type null_count          = UNKNOWN_NULL_COUNT,
                 size_type offset              = 0);

The utility API (let's call it std::vector<column_view> column_views_from_large_buffer() or whatever) would output a list of column_view by segmenting the given buffers and attaching a column_view to each segment. Then for any operation we can use:

auto const column_views = column_views_from_large_buffer();
for(auto const& col : column_views) {
    do_whatever(col);
}

Of course, the users can implement this utility function very easily, but it's very helpful if we provide it.
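A minimal sketch of such a utility for fixed-width data, assuming the column_view constructor quoted above (column_views_from_large_buffer and its parameters are hypothetical, and null masks are omitted for brevity):

#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Split one large fixed-width device buffer into several column_views, each
// holding at most max(size_type) rows; every view aliases the same buffer.
std::vector<cudf::column_view> column_views_from_large_buffer(cudf::data_type type,
                                                              std::int64_t num_elements,
                                                              void const* data,
                                                              std::size_t element_size)
{
  constexpr std::int64_t max_rows = std::numeric_limits<cudf::size_type>::max();
  std::vector<cudf::column_view> views;
  auto const* bytes = static_cast<char const*>(data);
  for (std::int64_t start = 0; start < num_elements; start += max_rows) {
    auto const rows =
      static_cast<cudf::size_type>(std::min(max_rows, num_elements - start));
    views.emplace_back(type, rows, bytes + start * element_size);
  }
  return views;
}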

@harrism
Member Author

harrism commented Jun 8, 2021

@EvenOldridge We have discussed this pretty thoroughly and have pretty much decided that making cudf::size_type 64-bit is not the solution. The reason is that it would double memory usage for columns of offsets (which also use cudf::size_type), and those are common in nested data types (strings, lists, structs). Since anyone processing massive datasets with cuDF is likely using a distributed computing framework (e.g. Dask or Spark), we feel the solution is
a) to partition large datasets appropriately, and
b) for cuDF to provide the tools necessary for those frameworks to partition when necessary (e.g. compute the output size of a join before materializing the result).

We should probably open a new issue. I made the error of making the title of this issue a proposed solution rather than a description of the problem.

@EvenOldridge

If this can be solved through partitioning, that's awesome. We're working on examples that show the difference in perf and the errors that occur on data where the partitions are too large, and that will hopefully prevent a lot of errors on the customer side.

But we commonly run into issues with text data, and occasionally with output sets that are larger than the size_type limits, that I'd like to have a clear solution for. It doesn't have to be this one. @rjzamora was suggesting that it should be possible to overcome that on our end. Still, it might be worth thinking about at the cudf level too for those cases.

@antonyscerri

I would be interested in the examples for working with partitions and cugraph and cudf to work around the limits. It's a little unclear how much of this will be exposed to users at the Python level vs. having to drop down to C++.

I've been creating blocks in my algorithm using cudf, which works for the most part, but with 1.7B directed edges I also have to run the whole thing twice, inverting my columns, so I can treat the graph as undirected, because I couldn't simply have a single dataframe with all the edges in both directions. I've also struggled to get anything going with cudf and Dask.

@GregoryKimball
Contributor

I’ve decided to close it as “wontfix” for now. The string column size limit has a lot of salience with groups using libcudf because it’s a common first limit that teams encounter when scaling distributed workloads. For text-processing workloads, the 2 GB string column limit will be hit before general out-of-memory errors. However, the string column size limit can generally be worked around by adjusting the data partitioning.

Raising this limit in libcudf has been the subject of a LOT of discussion, only some of which is captured in this issue. For each option, the performance cost and engineering effort is too great to prioritize this issue. Here is a list of ideas that have been evaluated and disqualified:

  1. Change size_type to 64 bit universally. Doubles memory footprint for strings/lists and reduces performance by half for operations constrained by memory bandwidth. The performance cost is much greater than the convenience benefit.
  2. Make it optional for a column to use a 64 bit or 32 bit size_type and dispatch to the appropriate kernel. Doubles compile time and binary size. Compile time is about 14 hours for a single thread in gpuCI as of branch-22.04. We have a parallel effort to track compile time and reduce it whenever possible. The compile time impact and reduced engineering velocity is too great compared to the convenience benefit.
  3. Make it optional for a column to use a 64-bit or 32-bit size_type, but control the instantiation with preprocessor macros (see the sketch after this list). Users that primarily target large datasets may only want to instantiate the 64-bit indexed path, while users who exclusively deal with smaller data can safely restrict the instantiations to the 32-bit indexed path. This solution would either a) double our release artifacts and testing burden while maintaining similar compile time and binary size for each size_type, or b) leave an untested 64-bit libcudf for advanced users to build for themselves. The operations cost of maintaining two release artifacts is too great, and the difficulty of using an unsupported 64-bit version is much greater than using finer workload partitioning.
  4. Add a new "big string" type. This is a special case of (2) that would double compile time and binary size for string functions, and incrementally increase compile time and binary size for type dispatched functions. The compile time impact and complexity added to string functions and nested type operations is too great compared to the convenience benefit.
  5. Change the string type to use INT64 instead of INT8 to increase the child column size from 2GB to 16GB. Change the size_type for strings to 64-bit. Doubles memory footprint and reduces performance for string operations. The performance cost is much greater than the convenience benefit.
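For reference, a hypothetical sketch of the macro-controlled instantiation in option 3 (CUDF_USE_INT64_SIZE_TYPE is an assumed flag name, not an actual libcudf build option):

#include <cstdint>

namespace cudf {
// Select the index/offset width at build time. Each choice produces a separate
// binary, which is what drives the release-artifact and testing cost above.
#ifdef CUDF_USE_INT64_SIZE_TYPE
using size_type = std::int64_t;  // 64-bit indexed path for very large columns
#else
using size_type = std::int32_t;  // default 32-bit indexed path
#endif
}  // namespace cudf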
