
Adding vectorized version of GetScenePrimPath to HdSceneDelegate #1744

Merged

Conversation

marktucker
Contributor

Adding a GetScenePrimPaths method to HdSceneDelegate, which allows a single
method call to fetch prim paths for a large number of instances. Especially
useful for native instances, which use an O(n) algorithm to discover the path
of a single instance.

Description of Change(s)

For applications which make extensive use of GetScenePrimPath (as Houdini does), the performance of this function for large numbers of native instances makes it a huge bottleneck. A vectorized version of this method allows the native instancer implementation to amortize the expensive parts of this operation (creating the vector of instance ids and iterating through that vector) across all the instances being queried at once.

The alternative approach to this would have been to make GetScenePrimPath on instanceAdapter much, much faster, but I wasn't able to figure out how to accomplish that without making a change to the GetScenePrimPath signature, which would have made any solution that much more intrusive. The downside of this approach is that any custom prim adapters that implement GetScenePrimPath will also have to explicitly implement GetScenePrimPaths (even if it just runs a simple loop as the pointInstancerAdapter implementation does).
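As a sketch (names and signatures are illustrative, not the actual USD API), the shape of the change is: the adapter interface gains a vectorized GetScenePrimPaths whose default implementation is just the simple loop an adapter like the point-instancer one would use, while adapters with an expensive per-instance lookup can override it and amortize the cost.

```cpp
// Hypothetical sketch, not the actual USD API: the adapter interface gains a
// vectorized GetScenePrimPaths; the fallback is a simple loop over the scalar
// method, so adapters with no cheaper strategy still satisfy the interface.
#include <cassert>
#include <string>
#include <vector>

class Adapter {
public:
    virtual ~Adapter() = default;

    // Existing scalar entry point (O(n) for native instances).
    virtual std::string GetScenePrimPath(int instanceIndex) const = 0;

    // New vectorized entry point; default implementation just loops.
    virtual std::vector<std::string>
    GetScenePrimPaths(const std::vector<int>& instanceIndices) const {
        std::vector<std::string> paths;
        paths.reserve(instanceIndices.size());
        for (int index : instanceIndices)
            paths.push_back(GetScenePrimPath(index));
        return paths;
    }
};

// Minimal concrete adapter for demonstration only.
class DemoAdapter : public Adapter {
public:
    std::string GetScenePrimPath(int instanceIndex) const override {
        return "/Instances/inst_" + std::to_string(instanceIndex);
    }
};
```

An adapter whose scalar lookup is linear in the instance count would override GetScenePrimPaths to walk its instances once, answering all queries in a single pass.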

@jilliene

Filed as internal issue #USD-7145

@tcauchois
Contributor

Hey Mark,

I have a few questions about this. Do you know where the time is going when GetScenePrimPath is running slowly for you? The things that jump to mind as possibilities are GetScenePrimPathFn and _ComputeInstanceMap, but I'm curious how it actually breaks down.

You also mentioned you had an intrusive change to speed up the single-path variant; how would that one work?

Thanks,
Tom

@marktucker
Contributor Author

Hey Tom. You are correct, the time is all being spent in GetScenePrimPathFn and _ComputeInstanceMap. These functions are both linear in the number of instances, so when you call GetScenePrimPath on every instance, you end up with O(N^2) performance in the number of instances.

My thought for speeding up GetScenePrimPath was to stash the results of _ComputeInstanceMap into a new input/output parameter of GetScenePrimPath so that it could be reused by subsequent calls to GetScenePrimPath. But this is pretty ugly, as the ownership and lifetime of that instance map object would be very unclear from the signature. And I'm not even sure if it would help enough because I think GetScenePrimPathFn would still be linear on N.
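The quadratic blow-up can be illustrated with a toy model (all names hypothetical): a stand-in for _ComputeInstanceMap that does linear work, invoked once per query versus once per batch.

```cpp
// Toy model of the cost structure: rebuilding a linear-cost instance map per
// query is O(N^2) over N queries; building it once per batch is O(N + k).
// Names are illustrative stand-ins, not the real usdImaging internals.
#include <cassert>
#include <cstddef>
#include <vector>

static std::size_t g_work = 0;  // counts simulated per-instance work

// Stand-in for _ComputeInstanceMap: linear in the number of instances.
std::vector<int> ComputeInstanceMap(std::size_t numInstances) {
    std::vector<int> map(numInstances);
    for (std::size_t i = 0; i < numInstances; ++i) {
        map[i] = static_cast<int>(i);
        ++g_work;
    }
    return map;
}

// Scalar pattern: rebuild the map for every query.
int QueryOne(std::size_t numInstances, std::size_t index) {
    return ComputeInstanceMap(numInstances)[index];
}

// Vectorized pattern: build once, answer all queries.
std::vector<int> QueryMany(std::size_t numInstances,
                           const std::vector<std::size_t>& indices) {
    std::vector<int> map = ComputeInstanceMap(numInstances);
    std::vector<int> out;
    out.reserve(indices.size());
    for (std::size_t i : indices)
        out.push_back(map[i]);
    return out;
}
```

Querying all 100 instances one at a time costs 10,000 units of simulated work; the batched query costs 100.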

@tcauchois
Contributor

tcauchois commented Mar 2, 2022

A few more questions: for your workflows, do you know if you're hitting _ComputeInstanceMap or GetScenePrimPathFn? If you're interpreting picking results, I'd expect them all to be the former. If you're spending a lot of time in the latter, I'm curious to hear your use case.

Passing the instance map out of the adapter is a bit problematic. We used to cache it in the instancer data: https://github.com/PixarAnimationStudios/USD/blob/release/pxr/usdImaging/usdImaging/instanceAdapter.h#L518 ... which you already have a pointer to in GetScenePrimPath. We ran into issues invalidating it and making it threadsafe to access, but those are pretty solvable.

@marktucker
Contributor Author

We use this for selection highlighting and interpreting pick results (including box picking large areas of a scene). But the worst case is obviously selection highlighting. We do not pass the selection to the render delegate, because we don't want the render delegate to have to restart its render when the selection changes, nor do we want the render delegate to decide on the "look" for a selected prim. Houdini does all selection highlighting for all render delegates simply by looking at the pick id AOVs generated by the render delegate. But this means that we need to know the SdfPath that corresponds to every pixel in the scene. So if you have 20K native instances, we make this query 20K times (or one time with the new API).

I'm sure you're right that the majority of the time is spent in _ComputeInstanceMap. But GetScenePrimPathFn is called, if I'm not mistaken, in a linear search through the instances by _RunForAllInstancesToDraw. So even if _ComputeInstanceMap was free, that would just expose GetScenePrimPathFn as the new bottleneck. I can try to do some accurate profiling if that might make the difference between accepting this PR and not accepting it.

Would it make any difference if I re-implemented the GetScenePrimPath (single query) method to be non-virtual, and instead implement it in UsdImagingPrimAdapter purely using GetScenePrimPaths? Then there is only one code path to deal with, and no complications of multithreading and caching. I thought about doing this originally in this PR, but didn't because I was trying to not be too disruptive (though I know it is still an API change for anyone who has implemented this function in a custom prim adapter)...
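The idea floated here, a non-virtual scalar method expressed purely in terms of the vectorized one, could look roughly like this (illustrative names only, not the real UsdImagingPrimAdapter API):

```cpp
// Hypothetical sketch of "one code path": the scalar GetScenePrimPath is
// non-virtual and simply forwards to the vectorized GetScenePrimPaths, so
// custom adapters only ever implement the vectorized entry point.
#include <cassert>
#include <string>
#include <vector>

class Adapter {
public:
    virtual ~Adapter() = default;

    // The only virtual: adapters implement the vectorized query.
    virtual std::vector<std::string>
    GetScenePrimPaths(const std::vector<int>& indices) const = 0;

    // Non-virtual convenience wrapper: a one-element batch.
    std::string GetScenePrimPath(int index) const {
        return GetScenePrimPaths({index}).front();
    }
};

// Minimal concrete adapter for demonstration only.
class DemoAdapter : public Adapter {
public:
    std::vector<std::string>
    GetScenePrimPaths(const std::vector<int>& indices) const override {
        std::vector<std::string> paths;
        paths.reserve(indices.size());
        for (int i : indices)
            paths.push_back("/World/inst_" + std::to_string(i));
        return paths;
    }
};
```

The trade-off discussed in the thread: this removes the duplicated code path, but it is an API break for anyone who overrode the scalar method in a custom prim adapter.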

@tcauchois
Contributor

Definitely the duplication between the single and vectorized versions is a concern, and the right answer might be adding the vectorized version and deprecating the single version. (And only ever adding vectorized APIs in the future :).

We're discussing this right now and I'll hopefully have an update soon. In the meantime, a slightly off-topic question. You mentioned that one use case for this is selection highlighting: you're using this call to loop over pixels in the scene, translating from hydra ID to usd ID so you can determine if they're selected. We use a similar (but opposite) approach for hdPrman selection highlighting: usdview keeps a set of USD objects that are selected, and the app uses UsdImagingDelegate->PopulateSelection() to turn those into a set of hydra IDs whenever the selection changes. Prman uses HdxColorizeSelectionTask to loop over pixels, match their hydra IDs with the results of PopulateSelection, and apply a tint if there's a match. Would that approach speed things up for you? (I know it doesn't address the box selection side of things).

@marktucker
Contributor Author

That seems like a very on-topic question... I must admit I'm not terribly familiar with how all the different pieces fit together when usdview does selection (using the "proper" hydra APIs that exist for this purpose). The Houdini viewport basically doesn't use HdxTasks, because we want to have full control over the look and rendering approach used in the viewport, regardless of the render delegate, and not rely on Hydra code for displaying AOVs or selections. So we just grab raw AOV data from the delegate and do our own processing and rendering into the viewport. This ensures consistency of looks and allows sharing more code between the Solaris/USD viewport and the standard Houdini/OBJ/SOP viewport. So changing to PopulateSelection/HdxColorizeSelectionTask would be a major headache. And I suspect we'd have to write custom tasks anyway to get the look we want (we use an outline for selection, and a filled look only when box picking - see attached image).
[attached image: dual_selection — outlined selection vs. filled box-pick highlight]

But then I couldn't figure out how PopulateSelection could be any more efficient, so I did some testing in usdview and in fact it isn't any more efficient than going the other direction. UsdImagingInstanceAdapter::PopulateSelection (which is also not vectorized) makes a call to _ComputeInstanceMap and _RunForAllInstancesToDraw for each primitive added to the selection. So if you have a scene with 10K instanceable references, and you select 1K of them in the scene graph tree (which results in 1K calls to PopulateSelection) it's just as slow as box picking 1K instances in the viewport and resolving them to SdfPaths using GetScenePrimPath. So you might want to consider vectorizing PopulateSelection too :)

@tcauchois
Contributor

Left some comments. We're up for taking this; we'll probably immediately deprecate the non-vectorized version. For safety, though, it would be great to write one of the instance adapter implementations in terms of the other.

@tcauchois
Contributor

Re: PopulateSelection: it has its own scalability issues and could probably use a vectorized rewrite. Note that you don't need to use it with HdxColorizeSelectionTask; it gives you an HdSelection object with a set of hydra-namespace selections, and you can have your own drawing code query that object. You pay for the # of objects selected, but only when the selection set changes, and you don't pay for the # of pixels you're processing, so it's just a totally different performance profile; maybe not useful.

Another thing: GetScenePrimPath (despite not being marked const) should be threadsafe. PopulateSelection would be, if not for the ApplyPendingUpdates at the beginning, which I've been meaning to move forever. I'm not sure if you're doing this already, but sticking a WorkParallelForN (or equivalent) around the GetScenePrimPath calls might help.
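The WorkParallelForN suggestion splits an index range across worker threads. The sketch below uses plain std::thread chunking so it stands alone without pxr; the stand-in GetScenePrimPath and the helper name are hypothetical, and the real USD utility takes a count plus a (begin, end) range callback.

```cpp
// Sketch of parallelizing per-instance path queries, assuming the scalar
// query is threadsafe (as the thread discusses). Plain std::thread chunking
// stands in for WorkParallelForN; all names here are illustrative.
#include <algorithm>
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Hypothetical threadsafe scalar query.
std::string GetScenePrimPath(int instanceIndex) {
    return "/inst_" + std::to_string(instanceIndex);
}

std::vector<std::string> ParallelGetPaths(const std::vector<int>& indices,
                                          unsigned numThreads = 4) {
    std::vector<std::string> out(indices.size());
    std::vector<std::thread> workers;
    std::size_t chunk = (indices.size() + numThreads - 1) / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(indices.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&out, &indices, begin, end] {
            // Each worker writes only its own disjoint slots: no locking needed.
            for (std::size_t i = begin; i < end; ++i)
                out[i] = GetScenePrimPath(indices[i]);
        });
    }
    for (std::thread& w : workers) w.join();
    return out;
}
```

As noted later in the thread, this caps the speedup at the core count, whereas the vectorized query changes the algorithmic complexity.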

@marktucker
Contributor Author

So I should amend this pull request with a change to implement GetScenePrimPath in terms of GetScenePrimPaths?

I know GetScenePrimPath is supposed to be thread safe. I believe many moons ago I tried multithreading our use of that function but at that time it wasn't actually thread safe when called simultaneously for two instances of the same prototype. Maybe it is now. But at best that gets me a 16x speedup on my 16 core machine. GetScenePrimPaths gets me multiple orders of magnitude speedup (minutes down to fractions of a second at the scale of instancing I've been testing with). And that's without multithreading calls to GetScenePrimPaths (which I could still do, but haven't yet because it would complicate the calling code significantly, and it's already "fast enough" for now).

@tcauchois
Contributor

Mark: after this gets pulled in I'd like to drop GetScenePrimPath in favor of GetScenePrimPaths, so any efforts in that direction would be appreciated. In particular, the InstanceAdapter GetScenePrimPath being duplicated as GetScenePrimPaths (both the function and the helper struct) worries me, because it's such an intricate bit of code, and when we duplicate code like that the copies inevitably drift. Could you update that function in particular?

@marktucker
Contributor Author

I hope this is what you meant. GetScenePrimPath now just calls GetScenePrimPaths in UsdImagingInstanceAdapter.

@tcauchois
Contributor

tcauchois commented Mar 11, 2022 via email

@asluk
Collaborator

asluk commented Mar 16, 2022

Hi @marktucker and @tcauchois -- we've been seeing GetScenePrimPath as a hotspot for the picking and selection highlighting backend for Omniverse RTX as well, and one of our developers will share additional findings in this PR, so that we can align further on optimization strategies and tradeoffs. Thank you!


remappedIndices.reserve(instanceIndices.size());
for (size_t i = 0; i < instanceIndices.size(); i++)
remappedIndices.push_back(indices[instanceIndices[i]]);
Contributor

Instead of creating a list of remapped indices, it could be beneficial to have a bitset (std::vector<bool>) where each set bit marks an index to fetch. This would change the O(N) std::vector<int>::find(instanceIndex) operation into an O(1) lookup. There are worst-case scenarios where the path of every instance is queried, which would run in O(N^2) time with this implementation.

To avoid costly memory reallocations and copies, it would be good to run over the list of instanceIndices twice: once to find the maximum index, which determines the required size of the vector, and once to set the bits for each indices[instanceIndices[i]].

This solution has the downside of allocating too much memory when a single index is very large. To counter this, one could determine the minimum remapped index in addition to the maximum, and offset instanceIndex by it in the operator.
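The two-pass bitset scheme described above might look like the following sketch (hypothetical helper names; it assumes at least one requested index):

```cpp
// Sketch of the reviewer's bitset idea, with illustrative names.
// Pass 1 finds the min/max requested index so the bitset is sized once;
// pass 2 marks each requested index. Offsetting by the minimum index keeps
// one very large index from forcing a huge allocation.
#include <cassert>
#include <vector>

std::vector<bool> BuildRequestedBitset(const std::vector<int>& requested,
                                       int& minIndex) {
    // Pass 1: find the range so the vector is allocated exactly once.
    int maxIndex = requested.front();
    minIndex = requested.front();
    for (int i : requested) {
        if (i > maxIndex) maxIndex = i;
        if (i < minIndex) minIndex = i;
    }
    std::vector<bool> bits(maxIndex - minIndex + 1, false);
    // Pass 2: mark each requested index, offset by the minimum.
    for (int i : requested)
        bits[i - minIndex] = true;
    return bits;
}

// O(1) membership test replacing a per-instance set/vector search.
bool IsRequested(const std::vector<bool>& bits, int minIndex, int index) {
    int off = index - minIndex;
    return off >= 0 && off < static_cast<int>(bits.size()) && bits[off];
}
```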

Contributor Author

Sorry, what std::vector<int>::find(instanceIndex) operation are you trying to avoid? I really don't understand what you're suggesting here... Could you provide some code that shows exactly what you want to change?

Contributor

I'll prepare a patch with the suggestions.

Contributor

@marktucker I've pushed a change with my ideas to my USD fork, based on your branch, here: https://github.com/mtavenrath/USD_sideeffects/commit/48b77edc80d90b0d0e014bb20e475632a913a344. There's no std::set/std::map anymore and everything runs in linear time. I've tested the change with a scene of 1M instances, and querying all of their paths takes only a few seconds.

The idea is that requestedIndicesMap specifies whether an index is requested at all (entry != INT_MAX), and the value of the entry specifies the location for that index in the resulting vector. Thus there's no need to construct and iterate a result map anymore.

There is one 'worst case' scenario: if someone queries only the first and last instance, requestedIndicesMap will be as large as the number of instances. Given that it's a single allocation with sizeof(int) == 4, the cost of that allocation is negligible compared to the memory consumed by the whole scene.

std::map/std::set quickly consume far more memory for far fewer elements, given that each entry in those data structures is at least 16 bytes and is a separate allocation, with allocator overhead on top.
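A self-contained sketch of that scheme, following the requestedIndicesMap description above (the surrounding function and data names are illustrative):

```cpp
// Sketch of the linear-time gather: requestedIndicesMap[i] is INT_MAX when
// instance i was not requested, otherwise the slot in the output vector where
// its path belongs. One sweep over all instances answers every query.
#include <cassert>
#include <climits>
#include <string>
#include <vector>

std::vector<std::string> GatherPaths(const std::vector<std::string>& allPaths,
                                     const std::vector<int>& requested) {
    // Build the sentinel map: one allocation of sizeof(int) per instance.
    std::vector<int> requestedIndicesMap(allPaths.size(), INT_MAX);
    for (std::size_t slot = 0; slot < requested.size(); ++slot)
        requestedIndicesMap[requested[slot]] = static_cast<int>(slot);

    // Single linear sweep: O(1) membership test and direct placement, so the
    // results land in the same order the indices were requested.
    std::vector<std::string> result(requested.size());
    for (std::size_t i = 0; i < allPaths.size(); ++i)
        if (requestedIndicesMap[i] != INT_MAX)
            result[requestedIndicesMap[i]] = allPaths[i];
    return result;
}
```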

Collaborator

Hi @tcauchois , I'll add @mtavenrath to NVIDIA's CLA. Thanks!

Contributor Author

Thanks @mtavenrath! Just to defend my implementation a little bit, it was in fact only ever doing a std::set<int>::find, not std::vector<int>::find as you claimed. So things were never as bad as you feared. That said, your implementation is definitely going to be faster, and I suspect also smaller in memory footprint in the large-number cases that we're concerned about here. I haven't had a chance to test it out, but I definitely like the concept.

For purely organization and attribution purposes, I think @tcauchois 's plan of accepting my PR, then your PR on top makes the most sense (assuming these PRs are both acceptable).

Contributor

@marktucker I realised while implementing my idea that std::set::find, rather than std::vector::find, is being used, and I have to apologize for that rash comment. Your approach to the picking is actually better than my first attempt, which provided a function returning a std::vector with all instance paths; indexing had to be done on the application side.

I agree that it makes sense to have two separate PRs for the two stages of optimization.

@tcauchois
Contributor

Markus: just to double check, is your CLA in order? It would be great to get your changes as a separate PR. I'd also like to hear Mark's thoughts on these changes.

Thanks all!

@pixar-oss pixar-oss merged commit 3ed874c into PixarAnimationStudios:dev Mar 29, 2022