64-bit array sizes #127
Is size_t signed or unsigned? Signed for 64-bit platforms is great because it's intuitive, and the size is still so huge that giving up half the range for negatives is no limitation. For 32-bit platforms it's a bit more of a tough call. What if you want an array that's bigger than 2147483647 items but still smaller than 4294967295 items? Eh, maybe that's not actually a real issue, since unless it's an array of bytes you can't even address that much memory with a 32-bit pointer. And if you're dealing with things that big, you should probably get a 64-bit machine. Is the default integer type different than size_t? Are they always the same, or sometimes different? Is this a question of how big C's … One alternative to …
By "default" I actually meant the types of integer literals. On a 64-bit machine "2" should be an …
size_t is unsigned, but I don't think it matters. We should use signed. I agree that >2GB arrays on a 32-bit machine are not a real use case.
It's not even >2GB: it's more than 2^31 entries, which means 16GB arrays of 64-bit floats. Definitely not a real use case.
I prefer Int64 for array dims and numel. Why confuse a Julia user with size_t? They probably don't care what it is.
They don't have to care or know what it is. They just use it and it's the right native size. On the other hand, if it's Int64 everywhere, then on 32-bit machines every single indexing operation will have to convert to 32 bits. Seems wasteful and pointless. Then again, I don't care very much about 32-bit platforms, personally.
While I see the advantage of having 32-bit machines default to 32-bit integer literals while 64-bit machines default to 64-bit integer literals, I'm quite concerned that having different semantics for basic numerical expressions on different machines seems very confusing and like a pretty big mistake. Array sizes and indexes are a different story, since if you have an array index or size that's big enough to overflow, you won't be able to address all of it on a 32-bit machine, so there's no situation where the difference would come up.
I would be OK with decimal integer literals being 64-bit everywhere, though. Another issue: I think that even if decimal integer literals are always 64-bit, there's a strong argument to be made for hex integer literals being unsigned, with a bit size indicated by the number of digits. For example, …
Indices will continue to be 32-bit for the most part, even for 64-bit platforms, because otherwise, it is an unnecessary waste of memory. I guess one possibility is that if numel(array) fits in 32-bit, we can go with 32-bit, otherwise it should be 64-bit. Would it make sense to make indexType a parameter for the array? -viral On Jul 17, 2011, at 10:53 AM, StefanKarpinski wrote:
I like the hex literal idea; now that you mention it, that's kind of something I always wanted. The memory issue is one reason to allow any kind of integer as an index. That way you can use Int32 if you need it. It's also fine with me if …
Can't the return type of find be based on numel(A)? What about the suggestion of indexType as a type parameter for Array? -viral On Jul 17, 2011, at 12:37 PM, JeffBezanson wrote:
BTW, find64 is an incredibly bad idea. -viral On Jul 17, 2011, at 12:37 PM, JeffBezanson wrote:
Why is it so bad?
It seems inelegant from a user's perspective. If I write code that uses arrays of different sizes, then at some point I have to change find to find64, or sprinkle all kinds of case handling all over the place. What if Array were Array{T,n,idxType}? When the array is constructed, we will know numel, and hence idxType can be instantiated to the right type. Then we can enforce the rule that all dimensions, numel, and indices are of idxType for that array. If we can do this, then I think it is reasonable that 64-bit indexing has to be done when idxType is Int64, and it's OK to sacrifice the space savings of Int32 indexing in that case. If this were possible, it would even allow us to create small arrays (16-bit idxType) more efficiently, with less space, although I am not sure if that has a speed advantage. -viral On Jul 17, 2011, at 12:58 PM, JeffBezanson wrote:
That pushes the dynamic behavior somewhere else --- we will never be able to infer the third type parameter of any array. It would be better to have … The nasty thing about …
Yes, Matlab returns doubles, but Matlab uses doubles everywhere. I guess we don't really have a good solution here. -viral On Jul 17, 2011, at 1:43 PM, JeffBezanson wrote:
How about making find take an optional argument that specifies the output type? Would that work with type inference? Without the return type specified, it would be 32-bit, and 64-bit if specified. It could even be 8-bit or 16-bit then. This is also similar to how zeros works. -viral On Jul 17, 2011, at 1:43 PM, JeffBezanson wrote:
There are four cases: …
If … Ultimately the better way to deal with this is to come up with a way of doing what people use …
It's not related to dimensions but to numel, because find can return linear indices for an N-d array. Sparseness just affects the memory usage, but not the index type that needs to be used. I really dislike findxx. I would much rather have find(Int32, ...) etc., the way we do zeros, ones, etc. -viral On Jul 17, 2011, at 10:00 PM, StefanKarpinski wrote:
It's completely related to dimensions: if your dimensions are bigger than 2^31, then you can't use 32-bit indices; if they're smaller, you can. Thus, if …
Since those are the only two general and well-behaved choices without forcing the user to specify an index type, I think the latter is the best. Letting people indicate what type they want to use for indices is good too, and we should provide that ability. (I meant to put the note about "cannot happen for dense matrices" after case 2 above, but I still think that optimizing the whole thing for case 3, which is relatively rare, is silly.) Btw, I'm not advocating …
Basically, I think the best approach is this:
We can also provide abbreviations like these:
Native pointer size is wasteful on 64-bit, as discussed earlier. Default should be 32-bit. On 17-Jul-2011, at 10:54 PM, StefanKarpinski reply@reply.github.com wrote:
OK, passing the desired return type to find wins. I don't care what the default is.
Using Int64 on 32-bit platforms may have a fairly high cost. I tried timing an empty loop with 1e8 iterations, and it takes 0.0632 seconds with 32-bit ints and 0.1682 seconds with 64-bit. Are we ok with that?
@ViralBShah wrote:
This is not even a viable option. What does …
Anything else either can't handle all possible sizes of matrices (as above), or returns results with types that depend on input data rather than just on its type. Both of these are unacceptable for the default … Moreover, everyone agrees that using 64-bit indices on 32-bit systems sucks, so we are left with only one real choice: return indices of the native pointer size. This will work for everyone, everywhere, and has good type inference properties. For finer-grained control, people can use …
Hmm. That is pretty slow. Damn. Maybe having integer literals default to a different type does make sense. Tough call.
Matlab already returns a 64-bit type for find, even on 32-bit systems I believe. So we will at least be doing something better on 32-bit, and if memory usage is a problem it's easy to tweak your program to fix it. This should be acceptable. Must beware of premature optimization. Another way to describe our options is this: either use more memory by default and you need to "tweak" to be more efficient, or use Int32 by default and you need to "tweak" if find() gives an error due to a large array. The case for the latter is that an error message is much easier to track down and fix than a performance problem. So it is ok with me if find gives Int32 by default if it throws an error for arrays that are too big.
I favor doing something that will always work but possibly be slow in extreme cases and allow for easy tweaking.
What the hell is going on here?
How can a 32-bit loop counter be 300+ times faster than a 64-bit one? On a 64-bit machine.
It's not calling promote. The 1 needs to be Int64 too. I should fix this. On Jul 17, 2011 8:12 PM, "StefanKarpinski" reply@reply.github.com wrote:
Yeah, this makes much more sense:
Also, it looks like there's no longer any real advantage to handling …
That's not any faster than the above cases where a Range1 object is created.
I guess it does still make a difference when the loop is an inner loop and the range object has to get made many times:
25% improvement here.
Ok, so find can take a return type as an argument, and the default should be 64-bit. Much as I prefer performance, I feel that there should be as few surprises as possible. -viral On Jul 18, 2011, at 12:54 AM, JeffBezanson wrote:
Phew! Glad we finally settled that :-)
The number of changes has been insane so far. The assumption of Int32 had gotten everywhere. Glad we're dealing with it now.
Uh, oh. I'm making a biggish change myself. I'm hoping this doesn't result in a nightmare merge...
What are you working on? On Mon, Jul 18, 2011 at 1:52 AM, StefanKarpinski …
I started trying to optimize various arithmetic functions and then got into … On Mon, Jul 18, 2011 at 1:58 AM, JeffBezanson …
Go ahead and commit when you're ready. I will sit on my 2192-line diff for a bit longer.
Almost there. Was trying to track down a nasty circular definition that was causing type inference to stack overflow. Got it now, just writing up the commit.
Ok, pushed now.
Changing array sizes and default integer type to Int64 on 64-bit platforms. Tests now pass, but not everything works 100%.
A branch jb/int64 is now available.
I tried merging and it went fine. We managed to completely miss each other with those two rather large commits.
Shall I push the merged version?
No, I want to do some more testing.
Branch merged in commit e84691c.
Current plan:
- Size for all array dims and numel(Array). Size will be Int64 on 64-bit platforms and Int32 on 32-bit.
- Int instead of Index, and allow any integer type.