
Adding vectorized version of GetScenePrimPath to HdSceneDelegate #1744

Merged

Conversation

marktucker
Contributor

Adding a GetScenePrimPaths method to HdSceneDelegate, which allows a single
method call to fetch prim paths for a large number of instances. Especially
useful for native instances, which use an O(n) algorithm to discover the path
of a single instance.

Description of Change(s)

For applications which make extensive use of GetScenePrimPath (as Houdini does), the performance of this function for large numbers of native instances makes it a huge bottleneck. A vectorized version of this method allows the native instancer implementation to amortize the expensive parts of this operation (creating the vector of instance ids and iterating through that vector) across all the instances being queried at once.

The alternative approach to this would have been to make GetScenePrimPath on instanceAdapter much, much faster, but I wasn't able to figure out how to accomplish that without making a change to the GetScenePrimPath signature, which would have made any solution that much more intrusive. The downside of this approach is that any custom prim adapters that implement GetScenePrimPath will also have to explicitly implement GetScenePrimPaths (even if it just runs a simple loop as the pointInstancerAdapter implementation does).
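As a sketch (names and signatures are illustrative, not the actual USD API), the shape of the change is: the adapter interface gains a vectorized GetScenePrimPaths whose default implementation is just the simple loop an adapter like the point-instancer one would use, while adapters with an expensive per-instance lookup can override it and amortize the cost.

```cpp
// Hypothetical sketch, not the actual USD API: the adapter interface gains a
// vectorized GetScenePrimPaths; the fallback is a simple loop over the scalar
// method, so adapters with no cheaper strategy still satisfy the interface.
#include <cassert>
#include <string>
#include <vector>

class Adapter {
public:
    virtual ~Adapter() = default;

    // Existing scalar entry point (O(n) for native instances).
    virtual std::string GetScenePrimPath(int instanceIndex) const = 0;

    // New vectorized entry point; default implementation just loops.
    virtual std::vector<std::string>
    GetScenePrimPaths(const std::vector<int>& instanceIndices) const {
        std::vector<std::string> paths;
        paths.reserve(instanceIndices.size());
        for (int index : instanceIndices)
            paths.push_back(GetScenePrimPath(index));
        return paths;
    }
};

// Minimal concrete adapter for demonstration only.
class DemoAdapter : public Adapter {
public:
    std::string GetScenePrimPath(int instanceIndex) const override {
        return "/Instances/inst_" + std::to_string(instanceIndex);
    }
};
```

An adapter whose scalar lookup is linear in the instance count would override GetScenePrimPaths to walk its instances once, answering all queries in a single pass.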

@jilliene

Filed as internal issue #USD-7145

@tcauchois
Contributor

Hey Mark,

I have a few questions about this. Do you know where the time is going when GetScenePrimPath is running slowly for you? The things that jump to mind as possibilities are GetScenePrimPathFn and _ComputeInstanceMap, but I'm curious how it actually breaks down.

You also mentioned you had an intrusive change to speed up the single-path variant; how would that one work?

Thanks,
Tom

@marktucker
Contributor Author

Hey Tom. You are correct, the time is all being spent in GetScenePrimPathFn and _ComputeInstanceMap. These functions are both linear in the number of instances, so when you call GetScenePrimPath on every instance, you end up with O(N^2) performance in the number of instances.

My thought for speeding up GetScenePrimPath was to stash the results of _ComputeInstanceMap into a new input/output parameter of GetScenePrimPath so that it could be reused by subsequent calls to GetScenePrimPath. But this is pretty ugly, as the ownership and lifetime of that instance map object would be very unclear from the signature. And I'm not even sure if it would help enough because I think GetScenePrimPathFn would still be linear on N.
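The quadratic blow-up can be illustrated with a toy model (all names hypothetical): a stand-in for _ComputeInstanceMap that does linear work, invoked once per query versus once per batch.

```cpp
// Toy model of the cost structure: rebuilding a linear-cost instance map per
// query is O(N^2) over N queries; building it once per batch is O(N + k).
// Names are illustrative stand-ins, not the real usdImaging internals.
#include <cassert>
#include <cstddef>
#include <vector>

static std::size_t g_work = 0;  // counts simulated per-instance work

// Stand-in for _ComputeInstanceMap: linear in the number of instances.
std::vector<int> ComputeInstanceMap(std::size_t numInstances) {
    std::vector<int> map(numInstances);
    for (std::size_t i = 0; i < numInstances; ++i) {
        map[i] = static_cast<int>(i);
        ++g_work;
    }
    return map;
}

// Scalar pattern: rebuild the map for every query.
int QueryOne(std::size_t numInstances, std::size_t index) {
    return ComputeInstanceMap(numInstances)[index];
}

// Vectorized pattern: build once, answer all queries.
std::vector<int> QueryMany(std::size_t numInstances,
                           const std::vector<std::size_t>& indices) {
    std::vector<int> map = ComputeInstanceMap(numInstances);
    std::vector<int> out;
    out.reserve(indices.size());
    for (std::size_t i : indices)
        out.push_back(map[i]);
    return out;
}
```

Querying all 100 instances one at a time costs 10,000 units of simulated work; the batched query costs 100.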

@tcauchois
Contributor

tcauchois commented Mar 2, 2022

A few more questions: for your workflows, do you know if you're hitting _ComputeInstanceMap or GetScenePrimPathFn? If you're interpreting picking results, I'd expect them all to be the former. If you're spending a lot of time in the latter, I'm curious to hear your use case.

Passing the instance map out of the adapter is a bit problematic. We used to cache it in the instancer data: https://github.com/PixarAnimationStudios/USD/blob/release/pxr/usdImaging/usdImaging/instanceAdapter.h#L518 ... which you already have a pointer to in GetScenePrimPath. We ran into issues invalidating it and making it threadsafe to access, but those are pretty solvable.

@marktucker
Contributor Author

We use this for selection highlighting and interpreting pick results (including box picking large areas of a scene). But the worst case is obviously selection highlighting. We do not pass the selection to the render delegate, because we don't want the render delegate to have to restart its render when the selection changes, nor do we want the render delegate to decide on the "look" for a selected prim. Houdini does all selection highlighting for all render delegates simply by looking at the pick id AOVs generated by the render delegate. But this means that we need to know the SdfPath that corresponds to every pixel in the scene. So if you have 20K native instances, we make this query 20K times (or one time with the new API).

I'm sure you're right that the majority of the time is spent in _ComputeInstanceMap. But GetScenePrimPathFn is called, if I'm not mistaken, in a linear search through the instances by _RunForAllInstancesToDraw. So even if _ComputeInstanceMap was free, that would just expose GetScenePrimPathFn as the new bottleneck. I can try to do some accurate profiling if that might make the difference between accepting this PR and not accepting it.

Would it make any difference if I re-implemented the GetScenePrimPath (single query) method to be non-virtual, and instead implement it in UsdImagingPrimAdapter purely using GetScenePrimPaths? Then there is only one code path to deal with, and no complications of multithreading and caching. I thought about doing this originally in this PR, but didn't because I was trying to not be too disruptive (though I know it is still an API change for anyone who has implemented this function in a custom prim adapter)...
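The idea floated here, a non-virtual scalar method expressed purely in terms of the vectorized one, could look roughly like this (illustrative names only, not the real UsdImagingPrimAdapter API):

```cpp
// Hypothetical sketch of "one code path": the scalar GetScenePrimPath is
// non-virtual and simply forwards to the vectorized GetScenePrimPaths, so
// custom adapters only ever implement the vectorized entry point.
#include <cassert>
#include <string>
#include <vector>

class Adapter {
public:
    virtual ~Adapter() = default;

    // The only virtual: adapters implement the vectorized query.
    virtual std::vector<std::string>
    GetScenePrimPaths(const std::vector<int>& indices) const = 0;

    // Non-virtual convenience wrapper: a one-element batch.
    std::string GetScenePrimPath(int index) const {
        return GetScenePrimPaths({index}).front();
    }
};

// Minimal concrete adapter for demonstration only.
class DemoAdapter : public Adapter {
public:
    std::vector<std::string>
    GetScenePrimPaths(const std::vector<int>& indices) const override {
        std::vector<std::string> paths;
        paths.reserve(indices.size());
        for (int i : indices)
            paths.push_back("/World/inst_" + std::to_string(i));
        return paths;
    }
};
```

The trade-off discussed in the thread: this removes the duplicated code path, but it is an API break for anyone who overrode the scalar method in a custom prim adapter.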

@tcauchois
Contributor

Definitely the duplication between the single and vectorized versions is a concern, and the right answer might be adding the vectorized version and deprecating the single version. (And only ever adding vectorized APIs in the future :).

We're discussing this right now and I'll hopefully have an update soon. In the meantime, a slightly off-topic question. You mentioned that one use case for this is selection highlighting: you're using this call to loop over pixels in the scene, translating from hydra ID to usd ID so you can determine if they're selected. We use a similar (but opposite) approach for hdPrman selection highlighting: usdview keeps a set of USD objects that are selected, and the app uses UsdImagingDelegate->PopulateSelection() to turn those into a set of hydra IDs whenever the selection changes. Prman uses HdxColorizeSelectionTask to loop over pixels, match their hydra IDs with the results of PopulateSelection, and apply a tint if there's a match. Would that approach speed things up for you? (I know it doesn't address the box selection side of things).

@marktucker
Contributor Author

That seems like a very on-topic question... I must admit I'm not terribly familiar with how all the different pieces fit together when usdview does selection (using the "proper" hydra APIs that exist for this purpose). The Houdini viewport basically doesn't use HdxTasks, because we want to have full control over the look and rendering approach used in the viewport, regardless of the render delegate, and not rely on Hydra code for displaying AOVs or selections. So we just grab raw AOV data from the delegate and do our own processing and rendering into the viewport. This ensures consistency of looks and allows sharing more code between the Solaris/USD viewport and the standard Houdini/OBJ/SOP viewport. So changing to PopulateSelection/HdxColorizeSelectionTask would be a major headache. And I suspect we'd have to write custom tasks anyway to get the look we want (we use an outline for selection, and a filled look only when box picking - see attached image).
[attached image: dual_selection — outlined selection vs. filled box-pick highlight]

But then I couldn't figure out how PopulateSelection could be any more efficient, so I did some testing in usdview and in fact it isn't any more efficient than going the other direction. UsdImagingInstanceAdapter::PopulateSelection (which is also not vectorized) makes a call to _ComputeInstanceMap and _RunForAllInstancesToDraw for each primitive added to the selection. So if you have a scene with 10K instanceable references, and you select 1K of them in the scene graph tree (which results in 1K calls to PopulateSelection) it's just as slow as box picking 1K instances in the viewport and resolving them to SdfPaths using GetScenePrimPath. So you might want to consider vectorizing PopulateSelection too :)

@tcauchois
Contributor

Left some comments. We're up for taking this; we'll probably immediately deprecate the non-vectorized version. For safety, though, it would be great to write one of the instance adapter implementations in terms of the other.

@tcauchois
Contributor

Re: PopulateSelection: it has its own scalability issues and could probably use a vectorized rewrite. Note that you don't need to use it with HdxColorizeSelectionTask; it gives you an HdSelection object with a set of hydra-namespace selections, and you can have your own drawing code query that object. You pay for the # of objects selected, but only when the selection set changes, and you don't pay for the # of pixels you're processing, so it's just a totally different performance profile; maybe not useful.

Another thing: GetScenePrimPath (despite not being marked const) should be threadsafe. PopulateSelection would be, if not for the ApplyPendingUpdates at the beginning, which I've been meaning to move forever. I'm not sure if you're doing this already, but sticking a WorkParallelForN (or equivalent) around the GetScenePrimPath calls might help.
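The WorkParallelForN suggestion splits an index range across worker threads. The sketch below uses plain std::thread chunking so it stands alone without pxr; the stand-in GetScenePrimPath and the helper name are hypothetical, and the real USD utility takes a count plus a (begin, end) range callback.

```cpp
// Sketch of parallelizing per-instance path queries, assuming the scalar
// query is threadsafe (as the thread discusses). Plain std::thread chunking
// stands in for WorkParallelForN; all names here are illustrative.
#include <algorithm>
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Hypothetical threadsafe scalar query.
std::string GetScenePrimPath(int instanceIndex) {
    return "/inst_" + std::to_string(instanceIndex);
}

std::vector<std::string> ParallelGetPaths(const std::vector<int>& indices,
                                          unsigned numThreads = 4) {
    std::vector<std::string> out(indices.size());
    std::vector<std::thread> workers;
    std::size_t chunk = (indices.size() + numThreads - 1) / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(indices.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&out, &indices, begin, end] {
            // Each worker writes only its own disjoint slots: no locking needed.
            for (std::size_t i = begin; i < end; ++i)
                out[i] = GetScenePrimPath(indices[i]);
        });
    }
    for (std::thread& w : workers) w.join();
    return out;
}
```

As noted later in the thread, this caps the speedup at the core count, whereas the vectorized query changes the algorithmic complexity.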

@marktucker
Contributor Author

So I should amend this pull request with a change to implement GetScenePrimPath in terms of GetScenePrimPaths?

I know GetScenePrimPath is supposed to be thread safe. I believe many moons ago I tried multithreading our use of that function but at that time it wasn't actually thread safe when called simultaneously for two instances of the same prototype. Maybe it is now. But at best that gets me a 16x speedup on my 16 core machine. GetScenePrimPaths gets me multiple orders of magnitude speedup (minutes down to fractions of a second at the scale of instancing I've been testing with). And that's without multithreading calls to GetScenePrimPaths (which I could still do, but haven't yet because it would complicate the calling code significantly, and it's already "fast enough" for now).

@tcauchois
Contributor

Mark: after this gets pulled in I'd like to drop GetScenePrimPath in favor of GetScenePrimPaths, so any efforts in that direction would be appreciated. In particular, the InstanceAdapter GetScenePrimPath being duplicated as GetScenePrimPaths (both the function and the helper struct) worries me, because it's such an intricate bit of code, and when we duplicate code like that the copies inevitably drift. Could you update that function in particular?

@marktucker
Contributor Author

I hope this is what you meant. GetScenePrimPath now just calls GetScenePrimPaths in UsdImagingInstanceAdapter.

@tcauchois
Contributor

tcauchois commented Mar 11, 2022 via email

@asluk
Collaborator

asluk commented Mar 16, 2022

Hi @marktucker and @tcauchois -- we've been seeing GetScenePrimPath as a hotspot for the picking and selection highlighting backend for Omniverse RTX as well, and one of our developers will share additional findings in this PR, so that we can align further on optimization strategies and tradeoffs. Thank you!


remappedIndices.reserve(instanceIndices.size());
for (size_t i = 0; i < instanceIndices.size(); i++)
remappedIndices.push_back(indices[instanceIndices[i]]);
Contributor

Instead of creating a list of remapped indices, it could be beneficial to have a bitset (std::vector<bool>) where each set bit marks an index to fetch. This would change the O(N) std::vector<int>::find(instanceIndex) operation into an O(1) lookup. There are worst-case scenarios where the path of every instance is queried, which would run in O(N^2) time with this implementation.

To avoid costly memory reallocations and copies, it would be good to run over the list of instanceIndices twice: once to find the maximum index, which determines the required size of the vector, and once to set the bits for each indices[instanceIndices[i]].

This solution has the downside of allocating too much memory when a single index is very large. To counter this, one could determine the minimum remapped index in addition to the maximum, and offset instanceIndex by it in the operator.
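The two-pass bitset scheme described above might look like the following sketch (hypothetical helper names; it assumes at least one requested index):

```cpp
// Sketch of the reviewer's bitset idea, with illustrative names.
// Pass 1 finds the min/max requested index so the bitset is sized once;
// pass 2 marks each requested index. Offsetting by the minimum index keeps
// one very large index from forcing a huge allocation.
#include <cassert>
#include <vector>

std::vector<bool> BuildRequestedBitset(const std::vector<int>& requested,
                                       int& minIndex) {
    // Pass 1: find the range so the vector is allocated exactly once.
    int maxIndex = requested.front();
    minIndex = requested.front();
    for (int i : requested) {
        if (i > maxIndex) maxIndex = i;
        if (i < minIndex) minIndex = i;
    }
    std::vector<bool> bits(maxIndex - minIndex + 1, false);
    // Pass 2: mark each requested index, offset by the minimum.
    for (int i : requested)
        bits[i - minIndex] = true;
    return bits;
}

// O(1) membership test replacing a per-instance set/vector search.
bool IsRequested(const std::vector<bool>& bits, int minIndex, int index) {
    int off = index - minIndex;
    return off >= 0 && off < static_cast<int>(bits.size()) && bits[off];
}
```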

Contributor Author

Sorry, what std::vector<int>::find(instanceIndex) operation are you trying to avoid? I really don't understand what you're suggesting here... Could you provide some code that shows exactly what you want to change?

Contributor

I'll prepare a patch with the suggestions.

Contributor

@marktucker I've pushed a change with my ideas to my USD fork, based on your branch, here: https://github.com/mtavenrath/USD_sideeffects/commit/48b77edc80d90b0d0e014bb20e475632a913a344. There's no std::set/std::map anymore and everything runs in linear time. I've tested the change with a scene of 1M instances, and querying all of their paths takes only a few seconds.

The idea is that requestedIndicesMap specifies whether an index is requested at all (entry != INT_MAX), and the value of the entry specifies the location for that index in the resulting vector. Thus there's no need to construct and iterate a result map anymore.

There is one 'worst case' scenario: if someone queries only the first and last instance, requestedIndicesMap will be as large as the number of instances. Given that it's a single allocation with sizeof(int) == 4, the cost of that allocation is negligible compared to the memory consumed by the whole scene.

std::map/std::set quickly consume far more memory for far fewer elements, given that each entry in those data structures is at least 16 bytes and is a separate allocation, with allocator overhead on top.
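A self-contained sketch of that scheme, following the requestedIndicesMap description above (the surrounding function and data names are illustrative):

```cpp
// Sketch of the linear-time gather: requestedIndicesMap[i] is INT_MAX when
// instance i was not requested, otherwise the slot in the output vector where
// its path belongs. One sweep over all instances answers every query.
#include <cassert>
#include <climits>
#include <string>
#include <vector>

std::vector<std::string> GatherPaths(const std::vector<std::string>& allPaths,
                                     const std::vector<int>& requested) {
    // Build the sentinel map: one allocation of sizeof(int) per instance.
    std::vector<int> requestedIndicesMap(allPaths.size(), INT_MAX);
    for (std::size_t slot = 0; slot < requested.size(); ++slot)
        requestedIndicesMap[requested[slot]] = static_cast<int>(slot);

    // Single linear sweep: O(1) membership test and direct placement, so the
    // results land in the same order the indices were requested.
    std::vector<std::string> result(requested.size());
    for (std::size_t i = 0; i < allPaths.size(); ++i)
        if (requestedIndicesMap[i] != INT_MAX)
            result[requestedIndicesMap[i]] = allPaths[i];
    return result;
}
```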

Collaborator

Hi @tcauchois , I'll add @mtavenrath to NVIDIA's CLA. Thanks!

Contributor Author

Thanks @mtavenrath! Just to defend my implementation a little bit, it was in fact only ever doing a std::set<int>::find, not std::vector<int>::find as you claimed. So things were never as bad as you feared. That said, your implementation is definitely going to be faster, and I suspect also smaller in memory footprint in the large-number cases that we're concerned about here. I haven't had a chance to test it out, but I definitely like the concept.

For purely organization and attribution purposes, I think @tcauchois 's plan of accepting my PR, then your PR on top makes the most sense (assuming these PRs are both acceptable).

Contributor

@marktucker I realised while implementing my idea that std::set::find, rather than std::vector::find, is being used, and I have to apologize for that rash comment. Your approach to the picking is actually better than my first attempt, which provided a function returning a std::vector with all instance paths; indexing had to be done on the application side.

I agree that it makes sense to have two separate PRs for the two stages of optimization.

@tcauchois
Contributor

Markus: just to double check, is your CLA in order? It would be great to get your changes as a separate PR. I'd also like to hear Mark's thoughts on these changes.

Thanks all!

@pixar-oss pixar-oss merged commit 3ed874c into PixarAnimationStudios:dev Mar 29, 2022