
64-bit array sizes #127

Closed
JeffBezanson opened this issue Jul 16, 2011 · 46 comments

@JeffBezanson

Current plan:

  • Use Size for all array dims and numel(Array). Size will be Int64 on 64-bit platforms and Int32 on 32-bit.
  • Use platform word size integer for integer literals
  • Generally use Int instead of Index and allow any integer type.
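For reference, this plan is essentially what current Julia ships: `Int` aliases the platform word size and is used for dimensions, lengths, and integer literals. A minimal sketch, assuming a modern Julia where `length` plays the role of `numel`:

```julia
# `Int` is an alias for Int64 on 64-bit builds and Int32 on 32-bit builds.
A = zeros(3, 4)

typeof(length(A)) == Int   # the element count uses the word-size integer
typeof(size(A, 1)) == Int  # so does each dimension
typeof(2) == Int           # integer literals default to Int as well
```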
@ghost ghost assigned JeffBezanson Jul 16, 2011
@StefanKarpinski

Is size_t signed or unsigned? Signed for 64-bit platforms is great because it's intuitive and the size is still so huge that it's no limitation to give up half the range for having negatives. For 32-bit platforms it's a bit more of a tough call. What if you want an array that's bigger than 2147483647 items but still smaller than 4294967295 items? Eh, maybe that's actually not a real issue since unless it's an array of bytes, you can't even address that much memory with a 32-bit pointer. And if you're dealing with things that big, you should probably get a 64-bit machine.

Is the default integer type different than size_t? Are they always the same, or sometimes different? Is this a question of how big C's int/long type is? Maybe we should call it CInt instead then, since that's actually what it's useful for.

One alternative to find would be providing an iterator: take a tensor and iterate over all the non-zero (i,j,v) tuples. No memory allocation required, and no one is going to mind 64-bit indices since there are only two of them needed at a time.

@JeffBezanson

By "default" I actually meant the types of integer literals. On a 64-bit machine "2" should be an Int64, that's all.

@JeffBezanson

size_t is unsigned, but I don't think it matters. We should use signed. I agree that >2GB arrays on a 32-bit machine is not a real use case.

@StefanKarpinski

It's not even >2GB, it's more than 2^31 entries, which means 16GB arrays of 64-bit floats. Definitely not a real use case.

@ViralBShah

I prefer Int64 for array dims and numel. Why confuse a Julia user with size_t - they probably don't care what it is.

@StefanKarpinski

They don't have to care or know what it is. They just use it and it's the right native size. On the other hand, if it's Int64 everywhere, then on 32-bit machines every single indexing operation will have to convert to 32 bits. Seems wasteful and pointless. Then again, I don't care very much about 32-bit platforms, personally.

@StefanKarpinski

By "default" I actually meant the types of integer literals. On a 64-bit machine "2" should be an Int64, that's all.

While I see the advantage of having 32-bit machines default to 32-bit integer literals and 64-bit machines default to 64-bit ones, I'm quite concerned about having different semantics for basic numerical expressions on different machines:

# 64-bit machine:
julia> 2^32
4294967296

# 32-bit machine:
julia> 2^32
0

This seems very confusing and like a pretty big mistake. Array sizes and indexes are a different story, since if you have an array index or size that's big enough to overflow, you won't be able to address all of it on a 32-bit machine, so there's no situation where the difference would come up.
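The machine dependence can be reproduced on a single 64-bit machine by pinning the literal's width explicitly (an illustration added here, not part of the original exchange):

```julia
# Integer arithmetic wraps at the width of the operands' type,
# so the same expression differs between 32- and 64-bit integers.
Int64(2)^32  # 4294967296
Int32(2)^32  # 0: the result wraps around (2^32 mod 2^32)
```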

@StefanKarpinski

I would be ok with decimal integer literals being 64-bit everywhere though.

Another issue: I think that even if decimal integer literals are always 64-bit, there's a strong argument to be made for having hex integer literals be unsigned, with a bit-size indicated by the number of digits. For example, 0x00 would be a Uint8, while 0x0000 would be Uint16, 0x00000000 would be Uint32, and 0x0000000000000000 would be Uint64. I find that almost every time one writes an integer as a hex value, one not only wants it to be unsigned, but also wants it to be of a specific size — basically, hex is what you use when you have a specific pattern of bytes you want to encode as an (unsigned) integer.
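This proposal was ultimately adopted. In modern Julia (where `Uint` was later renamed `UInt`), the digit count of a hex literal selects its unsigned type:

```julia
typeof(0xff)                # UInt8
typeof(0xffff)              # UInt16
typeof(0xffffffff)          # UInt32
typeof(0xffffffffffffffff)  # UInt64
```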

@ViralBShah

Indices will continue to be 32-bit for the most part, even on 64-bit platforms, because otherwise it is an unnecessary waste of memory. I guess one possibility is that if numel(array) fits in 32 bits, we can go with 32-bit indices; otherwise they should be 64-bit.

Would it make sense to make indexType a parameter for the array?


@JeffBezanson

I like the hex literal idea; now that you mention it that's kind of something I always wanted.

The memory issue is one reason to allow any kind of integer as an index. That way you can use Int32 if you need it. It's also fine with me if find and other functions that return arrays of indexes return Int32 by default. They should throw an error if numel(a) > 2^31-1, then you have to use find64 instead.

@ViralBShah

Can't the return type of find be based on numel(A)? What about the suggestion of indexType as a type parameter for Array?


@ViralBShah

BTW, find64 is an incredibly bad idea.


@JeffBezanson

Why is it so bad?
If find's type is based on numel(a) then we can't infer its type.

@ViralBShah

It seems inelegant from a user's perspective. If I write code that uses arrays of different sizes, then at some point, I have to change find to find64, or sprinkle all kinds of case handling all over the place.

What if Array was Array{T,n,idxType}? When the array is constructed, we will know numel, and hence idxType can be instantiated to the right type. Then we can enforce the rule that all dimensions, numel, and indices are of idxType for that array.

If we can do this, then I think it is reasonable that 64-bit indexing has to be done when idxType is Int64, and it's OK to sacrifice the space savings of Int32 indexing in that case.

If this were possible, it would even allow us to create small arrays (16-bit idxType) more efficiently, with less space, although I am not sure whether that has a speed advantage.


@JeffBezanson

That pushes the dynamic behavior somewhere else --- we will never be able to infer the third type parameter of any array. It would be better to have find pick a return type based on numel.

The nasty thing about find is that if you have an array with billions of elements, find will generally need to return billions of elements and they will be 64-bit. Where you most need the space savings, you can't have it.
In fact, doesn't Matlab return double for this, which is also 64 bits per index?

@ViralBShah

Yes, Matlab returns doubles, but Matlab uses doubles everywhere. I guess we don't really have a good solution here.


@ViralBShah

How about making find take an optional argument that specifies the output type? Would that work with type inference? Without the return type specified, it would be 32-bit, and 64-bit if specified. It could even be 8-bit or 16-bit then.

This is also similar to how zeros works.


@StefanKarpinski

There are four cases:

  1. On a 32-bit machine, as long as we don't make indices 64-bit everywhere, find can just always return 32-bit indices.
  2. On a 64-bit machine, the array has dimensions less than 2^31, so we could in principle return 32-bit indices, but the array is very sparse, having relatively few non-zero values, so it's acceptable for us to return 64-bit indices.
  3. On a 64-bit machine, the array has dimensions less than 2^31, so we could in principle return 32-bit indices, and the array is actually quite large, having many non-zero values, so there would be great space savings from returning 32-bit index vectors (cannot happen for dense matrices).
  4. On a 64-bit machine, the array has dimensions greater than 2^31, so the indices have to be 64-bit.

If find always returns indices the size of pointers on the native architecture, then all but case 3 are handled. I would argue that case 3 is pretty rare because it cannot happen at all with dense matrices, and even if you're in the sparse case, if you're doing numerical computing on a 64-bit platform, you probably have a lot of memory. In the situation that one finds oneself in case 3 and has not enough memory, I propose that we introduce find32 so that one can deal with it. We can also introduce find8, find16, find32, and find64 which always return index vectors of the requested size, and throw an error if the requested size isn't big enough for the indices of the matrix. We could introduce a findx that always returns indices of the minimum type required. This findx has nasty inference behavior, but people are going to write something like that anyway.

Ultimately the better way to deal with this is to come up with a way of doing what people use find for that doesn't involve having index vectors in memory all at once. My most common use-case for find is to use it in conjunction with sparse, to use the intervening triplet form to do some kind of filtering or transformation of a sparse matrix. That could be done much more efficiently in a streaming fashion. We could even use coroutines: a producer of (i,j,v) triples, a consumer that filters or modifies them, and a final consumer that does sparse(i,j,v), accepting one triple at a time to construct the matrix. That way there's no need to ever have more than one triple in memory at once.
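The coroutine idea sketched above maps naturally onto modern Julia's `Channel` and the `SparseArrays` stdlib. A sketch under those assumptions (`each_triple` is a hypothetical helper, not an existing API):

```julia
using SparseArrays

# Produce (i, j, v) triples one at a time, instead of materializing
# three full index vectors the way find/findnz does.
function each_triple(S::SparseMatrixCSC)
    Channel() do ch
        rows, vals = rowvals(S), nonzeros(S)
        for j in 1:size(S, 2), k in nzrange(S, j)
            put!(ch, (rows[k], j, vals[k]))
        end
    end
end

# Stream-filter a sparse matrix, then rebuild it with sparse(i, j, v).
S = sparse([1, 2, 3], [1, 2, 3], [1.0, -2.0, 3.0])
kept = [t for t in each_triple(S) if t[3] > 0]
T = sparse(first.(kept), getindex.(kept, 2), last.(kept), size(S)...)
```

With an unbuffered `Channel`, only one triple is in flight at a time, so peak memory for the producer side is independent of the number of stored entries.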

@ViralBShah

It's not related to dimensions but to numel, because find can return linear indices for an N-d array. Sparseness just affects the memory usage, not the index type that needs to be used.

I really dislike findxx. I would much rather have find(Int32, ...) etc, the way we do zeros, ones, etc.


@StefanKarpinski

It's completely related to dimensions: if your dimensions are bigger than 2^31, then you can't use 32-bit indices; if they're smaller, you can. Thus, if find is to be able to work for all matrices — even ones with dimensions larger than 2^31 — and it's going to have nice type inference behavior, there are only two choices:

  1. find always returns 64-bit indices (sucks on 32-bit architectures),
  2. find always returns indices the size of native pointers.

Since those are the only two general and well-behaved choices without forcing the user to specify an index type, I think the latter is the best. Letting people indicate what type they want to use for indices is good too and we should provide that ability.

(I meant to put the note about "cannot happen for dense matrices" after case 2 above, but I still think that optimizing the whole thing for case 3, which is relatively rare, is silly.)

Btw, I'm not advocating findx — did you read what I wrote? I think findx is a bad idea, but people will inevitably write it. I'm advocating basically the same thing as you: find8, find16, find32, find64, and general find returns indices of the index type of the architecture. It would be fine for find(Int32, ...) to be the same as find32(...); the latter can just be a shorter way of writing it. I guess the find{T<:Int}(::Type{T}, M::Matrix) version is better because you can write it once fairly generically.

@StefanKarpinski

Basically, I think the best approach is this:

  • find(M::Matrix) returns indices that are the size of the native pointer type.
  • find{T<:Int}(::Type{T}, M::Matrix) returns indices of type T and throws an error if max(size(M)) > typemax(T).

We can also provide abbreviations like these:

find32(M::Matrix) = find(Int32,M)
findu32(M::Matrix) = find(Uint32,M)
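A minimal sketch of this interface in modern syntax (`find_idx` is a hypothetical name; today's Julia spells the default version `findall`, and `Int` in the fallback plays the role of the native pointer-size type):

```julia
# Return linear indices of nonzero entries as the requested integer type,
# erroring if that type cannot address every element of the array.
function find_idx(::Type{T}, M::AbstractMatrix) where {T<:Integer}
    length(M) > typemax(T) && error("array too large for $T indices")
    return T[i for i in eachindex(M) if M[i] != 0]
end

find_idx(M::AbstractMatrix) = find_idx(Int, M)  # default: native word size

find_idx(Int32, [0 1; 2 0])  # Int32[2, 3] (column-major linear indices)
```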

@ViralBShah

Native pointer size is wasteful on 64-bit, as discussed earlier. The default should be 32-bit.


@JeffBezanson

OK, passing the desired return type to find wins. I don't care what the default is.

@JeffBezanson

Using Int64 on 32-bit platforms may have a fairly high cost. I tried timing an empty loop with 1e8 iterations, and it takes 0.0632 seconds with 32-bit ints and 0.1682 seconds with 64-bit. Are we ok with that?

@StefanKarpinski

@ViralBShah wrote:

Native pointer size is wasteful on 64-bit, as discussed earlier. The default should be 32-bit.

This is not even a viable option. What does find(S) return if S is a sparse matrix with dimensions greater than 2^31 and find always returns 32-bit indices? It has to throw an error. Like I said before, the only viable choices for find(M::Matrix) are:

  1. Always return 64-bit indices.
  2. Return 32-bit indices on 32-bit platforms and 64-bit indices on 64-bit platforms.

Anything else either can't handle all possible sizes of matrices (as above), or returns results with types that depend on input data rather than just on its type. Both of these are unacceptable for the default find behavior.

Moreover, everyone agrees that using 64-bit indices on 32-bit systems sucks, so we are left with only one real choice: return indices of the native pointer size. This will work for everyone, everywhere, and has good type inference properties. For finer-grained control, people can use find(Int32,M) or whatever they need.

@StefanKarpinski

Using Int64 on 32-bit platforms may have a fairly high cost. I tried timing an empty loop with 1e8 iterations, and it takes 0.0632 seconds with 32-bit ints and 0.1682 seconds with 64-bit. Are we ok with that?

Hmm. That is pretty slow. Damn. Maybe having integer literals default to a different type does make sense. Tough call.

@JeffBezanson

Matlab already returns a 64-bit type for find, even on 32-bit systems I believe. So we will at least be doing something better on 32-bit, and if memory usage is a problem it's easy to tweak your program to fix it. This should be acceptable. Must beware of premature optimization.

Another way to describe our options is this: either use more memory by default and you need to "tweak" to be more efficient, or use Int32 by default and you need to "tweak" if find() gives an error due to a large array. The case for the latter is that an error message is much easier to track down and fix than a performance problem. So it is ok with me if find gives Int32 by default if it throws an error for arrays that are too big.

@StefanKarpinski

I favor doing something that will always work but possibly be slow in extreme cases and allow for easy tweaking.

@StefanKarpinski

What the hell is going on here?

julia> @time for i=1:int64(1e8); end;
elapsed time: 12.91784977912902832 sec

julia> @time for i=1:int32(1e8); end;
elapsed time: 0.04159307479858398 sec

How can a 32-bit loop counter be 300+ times faster than a 64-bit one? On a 64-bit machine.

@JeffBezanson

It's not calling promote. The 1 needs to be Int64 too. I should fix this.
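The fix is to promote both endpoints of the range to a common type before iterating, so the loop counter is type-stable; modern Julia does this automatically (an illustration only, not the original patch):

```julia
# Mixed-width endpoints are promoted to one concrete element type,
# so the loop counter is always a plain Int64 rather than a mix of types.
r = 1:Int64(10)
typeof(r)  # UnitRange{Int64}: the literal 1 was promoted to Int64
```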


@StefanKarpinski

Yeah, this makes much more sense:

julia> @time begin local r=1:int32(1e8); for i=r; end; end;
elapsed time: 0.04192805290222168 sec

julia> @time begin local r=1:int64(1e8); for i=r; end; end;
elapsed time: 0.0410611629486084 sec

Also, it looks like there's no longer any real advantage to handling for i=a:b specially:

julia> @time for i=1:int32(1e8); end;
elapsed time: 0.04133892059326172 sec

That's not any faster than the above cases where a Range1 object is created.

@StefanKarpinski

I guess it does still make a difference when the loop is an inner loop and the range object has to get made many times:

julia> @time for i=1:1000000; r=1:1000; for j=r; end; end;
elapsed time: 0.55828595161437988 sec

julia> @time for i=1:1000000; for j=1:1000; end; end;
elapsed time: 0.41384506225585938 sec

25% improvement here.

@ViralBShah

Ok, so find can take a return type as an argument, and the default should be 64-bit. Much as I prefer performance, I feel that there should be as few surprises as possible.


@StefanKarpinski

Phew! Glad we finally settled that :-)

@JeffBezanson

The number of changes has been insane so far. The assumption of Int32 had gotten everywhere. Glad we're dealing with it now.

@StefanKarpinski

Uh, oh. I'm making a biggish change myself. I'm hoping this doesn't result in a nightmare merge...

@JeffBezanson

What are you working on?


@StefanKarpinski

I started trying to optimize various arithmetic functions and then got into a bunch of promotion issues, which manifest as performance issues like why mod for unsigned ints sucks. So then I ended up cleaning up a bunch of promotion stuff.


@JeffBezanson

Go ahead and commit when you're ready. I will sit on my 2192-line diff for a bit longer.

@StefanKarpinski

Almost there. Was trying to track down a nasty circular definition that was causing type inference to stack overflow. Got it now, just writing up the commit.

@StefanKarpinski

Ok, pushed now.

JeffBezanson added a commit that referenced this issue Jul 18, 2011
changing array sizes and default integer type to Int64 on 64-bit platforms
tests now pass, but not everything works 100%
@JeffBezanson

A branch jb/int64 is now available.

@StefanKarpinski

I tried merging and it went fine. We managed to completely miss each other with those two rather large commits.

@StefanKarpinski

Shall I push the merged version?

@JeffBezanson

No, I want to do some more testing.

@JeffBezanson

Branch merged in commit e84691c.

StefanKarpinski pushed a commit that referenced this issue Feb 8, 2018
Add compatibility for base64 → base64encode rename
Keno pushed a commit that referenced this issue Oct 9, 2023
Debug commands: implement unwinding of stack on thrown exception
NHDaly added a commit that referenced this issue May 22, 2024
* Change to streaming out the heap snapshot data (#1)
