
Add TypedArray support to jsi #248

Closed
wants to merge 1 commit

Conversation

wkozyra95

Closes #182

The current API allows us to work with TypedArrays and ArrayBuffers, but there are cases where it is either inefficient or very cumbersome.

The main use case for me is a WebGL implementation, which makes many per-frame calls that pass typed arrays between JS and C++. In many cases these are very small typed arrays (3 or 4 elements).

A few examples with the current API:

  • to detect the TypedArray kind, I would need to:
    auto constructorName = this->getProperty(runtime, "__proto__")
                               .asObject(runtime)
                               .getProperty(runtime, "constructor")
                               .asObject(runtime)
                               .getProperty(runtime, "name")
                               .asString(runtime)
                               .utf8(runtime);
  • to create a TypedArray:
runtime.global()
        .getProperty(
            runtime, jsi::String::createFromUtf8(runtime, getTypedArrayConstructorName(kind)))
        .asObject(runtime)
        .asFunction(runtime)
        .callAsConstructor(runtime, {static_cast<double>(size)})
        .asObject(runtime);
  • to access data, I would need to handle all offset and size bookkeeping manually

@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label May 20, 2020
@dulinriley dulinriley added the enhancement New feature or request label May 21, 2020
@mhorowitz
Contributor

mhorowitz commented May 21, 2020

@wkozyra95 thanks for putting together this PR. It sounds like there are two separate concerns rolled into one PR here, so let me pull them apart.

  1. TypedArrays are cumbersome to use with JSI
  2. TypedArrays are inefficient to use with JSI

The philosophy around JSI is to provide a lightweight interface to access engine functionality, but also to keep it simple. It's possible, as you observe, to use TypedArrays by only using the ArrayBuffer class. So the question is, how do we best address issues of usability and performance, without unnecessarily adding methods to jsi::Runtime?

I agree that the current mechanism is cumbersome. But this could be fixed by creating an API layered on top of the existing JSI, without adding anything to the Runtime at all. So, how would this perform? I hacked up a test case implementing a use case similar to the one you described: I wrote HostFunctions which create a small TypedArray and fill in its elements, one using the existing APIs and one using the API in the PR:

class FillFloat64Array_arraybuffer {
public:
  FillFloat64Array_arraybuffer(jsi::Runtime& runtime)
    : f64a_(std::make_shared<jsi::Function>(
              runtime.global().getPropertyAsFunction(runtime, "Float64Array")))
    , b_(std::make_shared<jsi::PropNameID>(jsi::PropNameID::forAscii(runtime, "buffer"))) {}

  jsi::Value operator()(
      jsi::Runtime& runtime, const jsi::Value&, const jsi::Value* args, size_t count) {
    if (count < 1) {
      throw std::runtime_error("Expected 1 argument");
    }

    jsi::Object array = f64a_->callAsConstructor(runtime, args[0]).getObject(runtime);
    jsi::ArrayBuffer buffer = array
      .getProperty(runtime, *b_)
      .asObject(runtime)
      .getArrayBuffer(runtime);
    size_t len = static_cast<size_t>(args[0].asNumber());
    double* data = reinterpret_cast<double*>(buffer.data(runtime));
    for (size_t i = 0; i < len; ++i) {
      data[i] = static_cast<double>(i);
    }

    return std::move(array);
  }

private:
  std::shared_ptr<jsi::Function> f64a_;
  std::shared_ptr<jsi::PropNameID> b_;
};

jsi::Value fillFloat64Array_typedarray(
    jsi::Runtime& runtime, const jsi::Value&, const jsi::Value* args, size_t count) {
  if (count < 1) {
    throw std::runtime_error("Expected 1 argument");
  }
  size_t len = static_cast<size_t>(args[0].asNumber());
  std::vector<double> v;
  v.reserve(len);
  for (size_t i = 0; i < len; ++i) {
    v.push_back(i);
  }
  return jsi::TypedArray<jsi::TypedArrayKind::Float64Array>(runtime, v);
}

The example using the current APIs is a bit more verbose because it takes care to cache some lookups to avoid repeating them on every call, but if this were abstracted into a library, it would only need to be done once and shouldn't be difficult to use.

Then, I timed how long it took to run each function 100,000 times, and took the average. The arraybuffer version takes about 7.5µs per call, and the typedarray version takes about 5.3µs per call. I then explored why arraybuffer was taking longer. The answer turns out to be here: https://github.com/facebook/hermes/blob/master/API/hermes/hermes.cpp#L1877-L1881. If I just hack that code out of my test, now arraybuffer is only 4.5µs per call. If I add a similar bit of measurement to createTypedArray, it increases to about 8.3µs. So, this RAIITimer takes about 3µs to run.

This seems to me like a lot of overhead. The person who understands this best is out of the office right now, so it will take some time to understand what our options here are to mitigate the cost.

However, if we can do that, I think a better approach to this would be to implement TypedArray functions as a separate layer on top of JSI, instead of adding seven new methods to Runtime.

All that said, I have another question. I took a quick look at https://github.com/expo/expo/tree/master/packages/expo-gl-cpp and I couldn't quickly see how the small TypedArrays or instanceof calls you refer to in #182 are used. I ask because it would help me understand whether 3µs per creation really impacts the final result; it seems plausible that GL rendering would dominate here. Do you have code which would show me how expo-gl-cpp would use JSI with the existing APIs, and have you profiled it to see how much time is spent in TypedArray creation?

In part, I wonder whether a TypedArray is actually preferable for this use case. I added one more test case to my hack, where I create a plain JS array containing a fixed number of numbers:

jsi::Value fillSmallArray(
    jsi::Runtime& runtime, const jsi::Value&, const jsi::Value* args, size_t count) {
  return jsi::Array::createWithElements(runtime, 1.2, 3.4, 5.6, 7.8);
}

This is only 4.3µs per call, and is not at all cumbersome. If it still makes sense to use TypedArray, I think that this is a nice pattern to use. It would not be hard to write a createFloat64WithElements() function similar to this, on top of the existing JSI, which efficiently creates and populates a new TypedArray.
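A sketch of what such a helper might look like on top of the existing JSI (not part of this PR; createFloat64WithElements is the hypothetical name from the paragraph above, and only existing jsi calls are used):

```cpp
// Hypothetical helper layered on the existing JSI: construct a Float64Array
// and fill it through its ArrayBuffer, so the per-element writes never
// cross back into JS. Requires C++17 for the fold expression.
template <typename... T>
jsi::Object createFloat64WithElements(jsi::Runtime& rt, T... elements) {
  constexpr size_t n = sizeof...(elements);
  jsi::Function ctor = rt.global().getPropertyAsFunction(rt, "Float64Array");
  jsi::Object array =
      ctor.callAsConstructor(rt, static_cast<double>(n)).getObject(rt);
  jsi::ArrayBuffer buf =
      array.getProperty(rt, "buffer").asObject(rt).getArrayBuffer(rt);
  double* data = reinterpret_cast<double*>(buf.data(rt));
  size_t i = 0;
  ((data[i++] = static_cast<double>(elements)), ...);
  return array;
}
```

In a real library the Function and property lookups would be cached per runtime, as in the FillFloat64Array_arraybuffer example above.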

If it does turn out to be necessary to add complexity to JSI to achieve reasonable performance, then I'm not against it. But I would like to see the due diligence above to make sure the added complexity is justified.

@wkozyra95
Author

wkozyra95 commented May 22, 2020

Hi @mhorowitz , thanks for looking into this so quickly.

  1. TypedArrays are cumbersome to use with JSI

I agree that all issues related to that can be solved in a separate library on top of JSI, but the same can be said about every class in jsi; jsi::Runtime already gives access to everything.

The philosophy around JSI is to provide a lightweight interface to access engine functionality, but also to keep it simple.

I didn't realize that keeping this API as small as possible is a goal here. My original assumption was that if TypedArrays are implemented as separate entities inside the VM, the API should provide direct access to them. We have direct access to raw memory to read or modify data, but all other operations are essentially JS calls. When you say "lightweight interface", do you mean jsi::Runtime or the entire JSI API?

because it takes care to cache some lookups to avoid running them every call

A lookup like this is possible for constructors, but what about cases where the array is passed in from JS? For my use case that is more common. To read the data I need to know the TypedArray type, a pointer to the array buffer, the size, and the byteOffset. The type is the most problematic of these.

All that said, I have another question. I took a quick look at https://github.com/expo/expo/tree/master/packages/expo-gl-cpp and I couldn't quickly see how the small TypedArrays or instanceof calls you refer to in #182 are used.

The code on master uses JSC directly; this branch contains a JSI implementation using the current API, and the last commit (expo/expo@02f4e2f) implements a helper that is essentially the layer on top of JSI you mentioned.

There are two ways we interact with TypedArrays

  • functions that return one to JS. This part is not very problematic, because those cases are blocking calls for GL, so they are slow either way in comparison to other GL calls, e.g. getParameters, getActiveUniforms.
  • functions that take a typed array as an argument. This is the main problem, especially for functions called per frame, e.g. passing a list of vertex positions to the shader with gl.bufferData. The actual operation on the data is very fast, but accessing size/byteOffset/type is slower than it would be with direct access.

In part, I wonder if TypedArray actually preferable for this use case?

In a lot of cases it isn't, but that is what the WebGL standard specifies.

If it does turn out to be necessary to add complexity to JSI to achieve reasonable performance, then I'm not against it. But I would like to see the due diligence above to make sure the added complexity is justified.

Originally I started working on this because I assumed it was the only way, but after reading @tmikov's answer in #182 I realized that there are ways to do all of this with the current API.

The performance of the implementation using the current API is good enough to support our use case. On the other hand, most users don't use WebGL directly; there is usually some JS library that takes a WebGLRenderer as an argument, and in those cases there are many more unnecessary GL calls that might affect performance. Improvements from direct access to TypedArrays could offset that.

have you profiled it to see how much time is spent in TypedArray creation

No, the only benchmarking I have done was from JS. I know that is not a very strong argument for my point here, but I wanted to verify whether any changes are noticeable from JS.

I compared the JSC implementation with Hermes (with my changes):

  • when passing small typed arrays, the JSC implementation was faster
  • when using larger ones (e.g. >1000 elements) Hermes was faster

I ran the same tests to compare Hermes with and without my changes, but I don't remember those results; I'll rerun them and get back to you. One problem with the benchmark for those two cases is that I'm not familiar enough with the Hermes code to be 100% sure my implementation is efficient; for all I know, some of the mechanisms I'm using are quite expensive and could be replaced with something better.

The JS code I used for testing was just a loop that called a gl function and passed the same TypedArray object to it as an argument.

@mhorowitz
Contributor

mhorowitz commented May 27, 2020

When you say "lightweight interface" do you mean jsi::Runtime or entire JSI API?

Mainly, I mean jsi::Runtime, because VM-dependent code has to be written and tested multiple times, and modifies an ABI which we are not versioning well now, and the more we add, the harder it will be later. I am ok with adding VM-independent layers on top of JSI, as these are easier to iterate and manage. I think we can also take an incremental approach here, where an app or library can provide such a layer themselves, and once we have some experience with performance and developer benefit, it would be easier to import it later.

Lookup like this is possible for constructors, but what about cases where array is passed from js, for my use case it's more common. To read data I need to know TypedArray type, a pointer to array buffer, size and byteOffset. Type here is most problematic.

I don't know the WebGL standard. Is it legal for the caller to pass any kind of TypedArray, and the library is expected to consume it?

One problem with the benchmark for those two cases is that I'm not familiar with the Hermes code to be 100% sure that my implementation is an efficient way to do that, for all I know some of the mechanisms I'm using are quite expensive and that could be replaced with sth better.

I'd rather make suggestions for code improvements than make unnecessary changes to JSI. If you have profiling which shows hot spots, we can discuss ways to optimize them. It may be that we can be creative here.

For example, your WebGL implementation could be a hybrid of JavaScript and C++, in a way which is invisible to the caller. If you have a function glFoo which takes two TypedArrays, you could have glFoo be a JS function which queries the type, buffer, size, and byteOffset props for each, and passes all of them to a glFooHost HostFunction, which would just need to make a single call to get the native data pointer, and then do whatever C++ pointer arithmetic is needed. This lets the JS side do what JS is good at, and the Native side do what Native is good at. If you have APIs which take very short arrays, you could even fetch the array elements in JS and pass them as individual parameters to a HostFunction. Of course, the more GL-specific these optimizations are, the less sense it would make to add them to JSI itself.

In any case, I think the place to start is with some measurements about where the bottlenecks (if any) are.

@wkozyra95
Author

wkozyra95 commented Jun 9, 2020

Sorry it took me so long to answer.

I don't know the WebGL standard. Is it legal for the caller to pass any kind of TypedArray, and the library is expected to consume it?

It depends on the function; some accept only a specific type and some can accept anything, e.g.:

gl.bufferData(gl.ARRAY_BUFFER, new Float32Array([1,2,3,4]), gl.STATIC_DRAW);
gl.vertexAttribPointer(position_attrib_index, 4, gl.FLOAT, false, 0, 0)

bufferData can take any typed array, but vertexAttribPointer specifies the format and size in which the data is expected.
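To illustrate what "can take any typed array" costs under the current API: after detecting the kind (e.g. via the constructor-name trick from the PR description), the consumer still has to dispatch on it by hand. A plain C++ sketch (elementSize is a hypothetical helper; the BigInt64 variants are omitted):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical dispatch table: map a detected TypedArray constructor name to
// its element size in bytes, as a bufferData-style consumer would need to.
inline size_t elementSize(const std::string& kind) {
  if (kind == "Int8Array" || kind == "Uint8Array" ||
      kind == "Uint8ClampedArray") {
    return 1;
  }
  if (kind == "Int16Array" || kind == "Uint16Array") {
    return 2;
  }
  if (kind == "Int32Array" || kind == "Uint32Array" ||
      kind == "Float32Array") {
    return 4;
  }
  if (kind == "Float64Array") {
    return 8;
  }
  throw std::runtime_error("not a typed array: " + kind);
}
```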

In any case, I think the place to start is with some measurements about where the bottlenecks (if any) are.

I prepared an example that passes a small TypedArray from JS to C++.

https://gist.github.com/wkozyra95/e368933acf1a7d63e30fee34c6ef9b06

  • implementation using the typed array API: 11.254s
  • implementation using the current API (assuming offset = 0): 14.917s
  • implementation using the current API: 18.314s

For the current API I intentionally prepared as minimal an example as possible; normally I would also need to check the specific type, or at least verify it is the correct one with instanceof.

@wkozyra95
Author

JS function which queries the type, buffer, size, and byteOffset props for each, and passes all of them to a glFooHost

It's faster than using the JSI API, but still slower than the implementation exposing typed arrays.
For the same example as above it took 14.767s:

    rt->evaluateJavaScript(std::make_shared<jsi::StringBuffer>("\
        const array = Float32Array.from([1,2,3,4,5,6,7,8]); \
        for (var i = 0; i < 10000000; i++) {\
            testFunc(array, array.length, array.byteOffset);\
        }\
    "), "");
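For reference, the C++ side of this hybrid only needs a single property access to reach the data. A sketch (this testFunc matches the benchmark loop above but is not the gist's exact code, and it assumes a Float32Array):

```cpp
// Hybrid-pattern HostFunction: JS passes (array, length, byteOffset), so the
// native side only has to fetch the backing ArrayBuffer and do pointer math.
jsi::Value testFunc(jsi::Runtime& rt, const jsi::Value&,
                    const jsi::Value* args, size_t count) {
  if (count < 3) {
    throw jsi::JSError(rt, "Expected (array, length, byteOffset)");
  }
  size_t len = static_cast<size_t>(args[1].asNumber());
  size_t byteOffset = static_cast<size_t>(args[2].asNumber());
  jsi::ArrayBuffer buf = args[0].asObject(rt)
                             .getProperty(rt, "buffer")
                             .asObject(rt)
                             .getArrayBuffer(rt);
  const float* data =
      reinterpret_cast<const float*>(buf.data(rt) + byteOffset);
  double sum = 0;
  for (size_t i = 0; i < len; ++i) {
    sum += data[i];
  }
  return jsi::Value(sum);
}
```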

@mhorowitz
Contributor

mhorowitz commented Jun 9, 2020

I prepared example with passing small TypedArray from js to c++.

https://gist.github.com/wkozyra95/e368933acf1a7d63e30fee34c6ef9b06

Thanks for preparing this example! A few observations:

  1. As a general point, we have discovered that microbenchmarks like this one are not always good predictors of application performance. Your ArrayBuffer impl is about 1/3 slower than the TypedArray one, but it's not clear how this affects app performance, because I don't know how much overall time is spent getting the pointer to typed array data. My intuition is that GL calls are more time consuming than setting up their arguments, but I don't know how much more.
  2. Your ArrayBuffer implementation is not very efficient. Looking up property names from strings can be expensive when done repeatedly in a tight loop like this. PropNameID exists in order to cache this lookup. I modified your example to cache the ids: https://gist.github.com/mhorowitz/de72984cb5691510da24364dc079b27b
    With my version, the two implementations perform about the same; there's more variance across runs of the same binary than between runs with the different implementations.
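The caching pattern in question looks roughly like this (a sketch of the idea, not the gist's exact code):

```cpp
// Resolve property names to PropNameIDs once, then reuse them on every call
// instead of re-interning the strings inside the hot loop.
class TypedArrayProps {
 public:
  explicit TypedArrayProps(jsi::Runtime& rt)
      : buffer_(jsi::PropNameID::forAscii(rt, "buffer")),
        byteOffset_(jsi::PropNameID::forAscii(rt, "byteOffset")) {}

  uint8_t* data(jsi::Runtime& rt, const jsi::Object& typedArray) const {
    auto offset = static_cast<size_t>(
        typedArray.getProperty(rt, byteOffset_).asNumber());
    jsi::ArrayBuffer buf = typedArray.getProperty(rt, buffer_)
                               .asObject(rt)
                               .getArrayBuffer(rt);
    return buf.data(rt) + offset;
  }

 private:
  jsi::PropNameID buffer_;
  jsi::PropNameID byteOffset_;
};
```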

@wkozyra95
Author

  1. A GL call (if not blocking) usually only makes a copy and adds a call to the queue; the actual work is done on a different thread, so creating a vector and calculating the sum is a more or less realistic example.

  2. I didn't think that just caching PropNameIDs would make such a difference. There is still some performance regression, but I no longer believe it is worth extending the API.

Thank you for your help, I appreciate you taking the time.
