Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to the Emscripten / WebAssembly API #7649

Closed
rianhunter opened this issue Dec 11, 2018 · 12 comments
Closed

Improvements to the Emscripten / WebAssembly API #7649

rianhunter opened this issue Dec 11, 2018 · 12 comments
Labels

Comments

@rianhunter
Copy link
Contributor

rianhunter commented Dec 11, 2018

Proposed Emscripten / WebAssembly API Changes

Since late September 2018, I've been implementing a straightforward translation layer from host calls from Emscripten-generated WebAssembly to the C library of the underlying host. The result today is a WebAssembly runtime that fully implements all the required functionality to run an unmodified version of Nginx compiled with Emscripten. While it works, there were a few modifications I would make to the existing Emscripten<->WebAssembly API to correct some inadequacies, simplify the translation logic, and optimize the translation logic.

These changes are proposed as minor edits to the existing API but long term it would be good if we could fully document the existing API with the intention of publishing it as a POSIX API standard for WebAssembly. I'm looking for feedback on these proposals before I potentially start working on a patch for Emscripten.

1. Include memory layout information in WASM binary (Critical) (DONE 0d83546)

The WASM module produced by Emscripten requires a few exports from the
host to run successfully:

  • memoryBase / __memory_base
  • __table_base
  • tempDouble
  • DYNAMICTOP_PTR
  • STACKTOP
  • STACK_MAX

The value of these exports depend on the target STATIC_BUMP value. This value is not encoded in the generated WASM file itself, and must be parsed out from the accompanying .js file.

I propose the STATIC_BUMP value be encoded in a custom "Emscripten Metadata" section in the generated WASM file.

2. Compiling with -O3 shouldn't minify WebAssembly import names (critical) (DONE 2852283)

This was recently changed in Emscripten. -O3 now by default mangles import names. I propose the default -O3 doesn't mangle the import names, and an extra option be added to do that extra minification if desired. Alternatively maintaining the current behavior and adding an option to disable the name mangling is another options.

3. Compile straight to .wasm file (recommended) (DONE 2852283)

It should be possible to specify a .wasm file as the output file. Right now you must specify a file ending in .js, when the .js file isn't always necessary.

4. Include table layout information in WASM binary (Critical) (DONE 0d83546)

The WASM module generated by Emscripten imports its index-zero table from the host. The expected size of this table is not generic and must be parsed out from the accompanying .js file as well.

I propose either:

  • that the import parameters for the table allow a generically sized table
  • the table size is encoded in a custom "Emscripten Metadata" similar section in the
    generated WASM file.

5. off_t and ino_t should be 64-bits (Critical) (DONE 4e4a794)

Most file systems support 64-bit off_t and ino_t. The WASM generated by Emscripten expects a 32-bit off_t and ino_t. This is a problem when delegating the host functions required by the generated WASM file directly to the host's C library since it's not possible to return the full value produced by the host to the hosted WASM.

I propose we convert off_t and ino_t to 64-bits.

6. More POSIX types should be 64-bits (Recommended)

Emscripten ends up using 32-bit versions of many POSIX structures, whereas the host implementation usually uses 64-bit values in its corresponding structures. While pointers might always necessarily be
32-bits in value and don't make sense in host context without conversion to host pointers first anyway, there are a few POSIX structures that in theory could be directly forwarded to the host
implementation. Examples include struct timespec, struct stat, struct tm, and more. If Emscripten encoded these using 64-bit integer values, then no conversion step would be required on the most popular data model on POSIX systems for the foreseeable future, LP64.

7. POSIX API, not System Call API (Recommended)

At first, it seemed like the API between Emscripten-generated WebAssembly and the Emscripten host should be based on system calls implied by POSIX. This doesn't work in the case of higher-level APIs that have implementation-defined behavior. Examples include: readdir() and friends, getgrent and friends, getpwent and friends, sem_post() and friends, and more.

Emscripten currently uses musl library code to implement readdir() and requires the host to implement the getdents64() system call. Only Linux provides an API that has getdents64() semantics, other POSIX systems like OpenBSD and macOS provide slightly similar but incompatible APIs and furthermore those APIs are unsupported. POSIX only requires that implementations provide readdir() and friends. It would be better for Emscripten-generated WebAssembly to generate calls to host-defined readdir() and friends instead of getdents64() to more easily implement the API on all POSIX systems.

This requires some modification to the Emscripten's version of musl which adds extra complexity but I think is overall worth it.

8. Better function signatures for system calls (Recommended)

This is related to the previous recommendation: right now the system call API names the system call by number (e.g. _syscall220). Also the arguments to the system calls are passed in a buffer instead of as WebAssembly function arguments, this requires extra work on the host implement to decode the arguments. I propose all APIs use descriptive names and pass arguments normally instead of in an indirect buffer.

9. API function calls are responsible for setting Errno (Recommended)

Right now some Emscripten APIs expect the system call to return a negative errno number to signal error, and some return the error in the errno value. I think all APIs should behave similarly for consistency.

@kripken
Copy link
Member

kripken commented Dec 12, 2018

Thanks for writing this up, @rianhunter! Also, your project is very exciting - incredible you can run nginx that way :)

The WASM module produced by Emscripten requires a few exports from the
host to run successfully:

Minor terminology note, those are wasm imports. I guess we can say the host exports them to the wasm, but it may confuse people.

tempDouble

Side note, that should not be needed any more (it helps emulate i64s, but wasm has them natively).

I propose the STATIC_BUMP value be encoded in a custom "Emscripten Metadata" section in the generated WASM file.

Agreed we should make this easy. I'm fine with a new metadata segment for this.

Another option is that the dynamic linking section already has something similar, and we support that already when you build with -s SIDE_MODULE=1,

https://github.com/WebAssembly/tool-conventions/blob/master/DynamicLinking.md#the-dylink-section

In other words, if you use a dynamic library this should already be emitted. However, dynamic libraries have some downsides, like they are relocatable - they don't know where their static memory will be (which adds overhead; a dynamic library without relocation may be possible, though). Another issue is dynamic libraries do not link in system libraries (but we can add an option for that too in principle).

But on the other hand dynamic libraries solve several of the other issues you raise, see comments below, so I do wonder if maybe that isn't the right direction (to improve dynamic linking towards what you want).

Compiling with -O3 shouldn't minify WebAssembly import names

Interesting. I think the deeper issue here is that emscripten minifying the import names is a valid optimization only because it knows the JS + wasm combination it emits will be used together. So perhaps we should add an option that says the wasm will be loaded without the JS. If that flag is on, we'd disable optimizations like this one (and metadce).

But that is already true for dynamic libraries, which we only emit wasm for.

Compile straight to .wasm file

Also something already handled by dynamic libraries, as just mentioned - we emit only wasm for them.

Include table layout information in WASM binary (Critical)

Also present in dynamic libraries (see link above), the table size is in the dylink section.

off_t and ino_t should be 64-bits (Critical)

Agreed, we should fix this. We used 32 because in the early JS days of emscripten 64-bit ints were very slow, but today there is no reason not to.

More POSIX types should be 64-bits (Recommended) [..] Examples include struct timespec, struct stat, struct tm, and more. If Emscripten encoded these using 64-bit integer values, then no conversion step would be required on the most popular data model on POSIX systems for the foreseeable future, LP64.

Sounds fine to me.

POSIX API, not System Call API (Recommended)

This is a complex question. We picked musl as our libc for convenience, and it uses linux syscalls since that's all it supports, but it definitely is not the optimal API for what you're doing here, where you do want portability to other OSes. I think what you want is maybe to not link in musl at all - then you'll get libc and POSIX calls as the interface you need to implement, instead of syscalls. (Note that the headers may cause portability issues - if you use musl's, stuff may not match the local system, etc.)

Dynamic libraries actually do that for you right now, as they don't link in system libraries at all.

Better function signatures for system calls (Recommended)

Seems ok to me, but we'd need to make sure a new ABI is efficient. We did optimize musl's a bit, mainly by making syscalls static, but sounds like what you propose would not interfere with that, so this might be fine.

API function calls are responsible for setting Errno (Recommended) Right now some Emscripten APIs expect the system call to return a negative errno number to signal error, and some return the error in the errno value. I think all APIs should behave similarly for consistency.

Do we not follow the POSIX or other relevant APIs when doing so? Maybe we have some bugs there.

@rianhunter
Copy link
Contributor Author

Thanks for writing this up, @rianhunter! Also, your project is very exciting - incredible you can run nginx that way :)

Emscripten is a huge part of that story. Thanks for WebAssembly and Emscripten. Just doing my part :)

The WASM module produced by Emscripten requires a few exports from the
host to run successfully:

Minor terminology note, those are wasm imports. I guess we can say the host exports them to the wasm, but it may confuse people.

Yes yes very true. I was being loose with my language. The "env" host module exports those objects and the generated "asm" module names those imports.

I propose the STATIC_BUMP value be encoded in a custom "Emscripten Metadata" section in the generated WASM file.

Agreed we should make this easy. I'm fine with a new metadata segment for this.

Cool, do you have a preference of serialization format for something like that? I would default to JSON but maybe it should be a in binary encoding similar to how WASM is encoded itself.

Another option is that the dynamic linking section already has something similar, and we support that already when you build with -s SIDE_MODULE=1,

I see. So dynamic modules are intended to be compiled as standalone WASM files (with embedded metadata) that Emscripten knows how to load? Do you think extending the existing dynamic library generation code to support standalone WASM makes the most sense?

Compiling with -O3 shouldn't minify WebAssembly import names
Compile straight to .wasm file

Yeah maybe the right thing to do here is not minify the import names when the output file ends in .wasm, i.e. Emscripten is generating a standalone WASM file.

More POSIX types should be 64-bits (Recommended) [..] Examples include struct timespec, struct stat, struct tm, and more. If Emscripten encoded these using 64-bit integer values, then no conversion step would be required on the most popular data model on POSIX systems for the foreseeable future, LP64.

Sounds fine to me.

Okay, when I get around to implementing this, I'll do it on a struct by struct basis and whichever ones are appropriate can be determined at code review time.

POSIX API, not System Call API (Recommended)

This is a complex question. We picked musl as our libc for convenience, and it uses linux syscalls since that's all it supports, but it definitely is not the optimal API for what you're doing here, where you do want portability to other OSes. I think what you want is maybe to not link in musl at all - then you'll get libc and POSIX calls as the interface you need to implement, instead of syscalls. (Note that the headers may cause portability issues - if you use musl's, stuff may not match the local system, etc.)

Better function signatures for system calls (Recommended)

My guess is that some musl code will still be necessary, even if most of the POSIX functions are delegated to the host implementation. How comfortable are you with diverging from upstream musl? Do you often pull in code from upstream musl? Or do you consider it mostly done?

API function calls are responsible for setting Errno (Recommended) Right now some Emscripten APIs expect the system call to return a negative errno number to signal error, and some return the error in the errno value. I think all APIs should behave similarly for consistency.

Do we not follow the POSIX or other relevant APIs when doing so? Maybe we have some bugs there.

I don't think it's a bug. For certain syscalls like read() a ___syscallxxx() style import is generated, and those aren't expected to modify errno. For others like _waitpid(), it's expected to modify errno. This probably has something to do with how musl is implemented.


Seems like you're mostly okay with these changes. How should I go about starting the process? Should I just go ahead and submit PRs for these things (along with tests). What about documenting the API, should I just host that anywhere for now until the implementation is more firmed up? I can also submit doc changes to Emscripten.

@kripken
Copy link
Member

kripken commented Dec 17, 2018

I see. So dynamic modules are intended to be compiled as standalone WASM files (with embedded metadata) that Emscripten knows how to load? Do you think extending the existing dynamic library generation code to support standalone WASM makes the most sense?

I think that might be best, yes. Then we are using an existing standard, and don't need to define anything new. It's a good starting point since it should work out of the box right now, actually, if you build with -s SIDE_MODULE=1:

emcc tests/hello_world.c -o hello.wasm -s SIDE_MODULE=1

Note how it emits just a .wasm file. But it can be optimized and improved. In particular, some tasks might be:

  • Allow non-relocatable dynamic libraries (otherwise all your static globals have an offset to access them).
  • Allow linking in some system libraries statically (if you want musl or part of it).

My guess is that some musl code will still be necessary, even if most of the POSIX functions are delegated to the host implementation. How comfortable are you with diverging from upstream musl? Do you often pull in code from upstream musl? Or do you consider it mostly done?

We don't pull often, but would like to keep the possibility of doing so. However, if it's a matter of using some musl functions and not others, we already do that quite a bit. The thing we should avoid is modifying musl code, but I hope that's not necessary here? The errno issues you mention are the one thing that sounds concerning there. Might be a good idea to invesigate why musl uses it the way it does (I'd hope if it's inconsistent then it's for a good reason).

Seems like you're mostly okay with these changes. How should I go about starting the process? Should I just go ahead and submit PRs for these things (along with tests).

Sure!

What about documenting the API, should I just host that anywhere for now until the implementation is more firmed up? I can also submit doc changes to Emscripten.

Yeah, would be good to document this. In general we want to document the wasm/JS "ABI" that we currently have, so those two might be the same document or related documents. docs/ might be the place for them.

@rianhunter
Copy link
Contributor Author

rianhunter commented Dec 24, 2018

Note how it emits just a .wasm file. But it can be optimized and improved. In particular, some tasks might be:

  • Allow non-relocatable dynamic libraries (otherwise all your static globals have an offset to access them).
  • Allow linking in some system libraries statically (if you want musl or part of it).

After looking into this a bit more, I see very clearly what you're saying. What I'm looking for does share a lot with the dynamic loading use-case. Ultimately my WebAssembly runtime will have to support dynamic linking anyway as well. At the same time, a minimal amount of code should be implemented by the runtime, e.g. I don't think malloc() or printf() should be host-level code. Your two example dynlib extensions make a lot of sense given the constraints I mentioned.

I think the right thing to do is add the dylink section to the main module, and potentially extend it to indicate that the module isn't relocatable. I would expect that using RUNTIME_LINKED_LIBS would already cause the main module to contain a dylink section but that doesn't seem to be the case, so I'll have to keep digging.

My guess is that some musl code will still be necessary, even if most of the POSIX functions are delegated to the host implementation. How comfortable are you with diverging from upstream musl? Do you often pull in code from upstream musl? Or do you consider it mostly done?

We don't pull often, but would like to keep the possibility of doing so. However, if it's a matter of using some musl functions and not others, we already do that quite a bit. The thing we should avoid is modifying musl code, but I hope that's not necessary here? The errno issues you mention are the one thing that sounds concerning there. Might be a good idea to invesigate why musl uses it the way it does (I'd hope if it's inconsistent then it's for a good reason).

Nope, no musl edits should be necessary. It seems to mostly be a matter of not linking certain function implementations (e.g. readdir() and friends, syscall wrappers).

@kripken
Copy link
Member

kripken commented Dec 26, 2018

I opened #7733 which adds an option to emit a wasm by itself (no JS) that isn't a side module. It will prevent minification of the imports and exports in that case, as discussed earlier.

I think the right thing to do is add the dylink section to the main module

Interesting idea. Might be worth opening an issue here to discuss that: https://github.com/WebAssembly/tool-conventions I'm curious what other people think - maybe it should be called something else when not in a dynamic library, in particular. But it does seem to contain what we want for non-dynamic libraries too, yeah.

@rianhunter
Copy link
Contributor Author

I think the right thing to do is add the dylink section to the main module

Interesting idea. Might be worth opening an issue here to discuss that: https://github.com/WebAssembly/tool-conventions I'm curious what other people think - maybe it should be called something else when not in a dynamic library, in particular. But it does seem to contain what we want for non-dynamic libraries too, yeah.

I thought about it a bit more, and I think dylink section alone is not sufficient. I think there should also be version metadata so the runtime knows which ABI to use (e.g. in cases like 32bit vs 64bit off_t).

@kripken
Copy link
Member

kripken commented Jan 2, 2019

Yeah, some ABI metadata may make sense.

Perhaps we should define an emscripten metadata section that includes whatever we need, and have an option to emit it (it would increase code size by default). As an emscripten internal thing it wouldn't need to be discussed on tool-conventions, unless we can figure out a way to generalize it, but I'm not sure that's practical - thoughts?

@rianhunter
Copy link
Contributor Author

rianhunter commented Jan 2, 2019 via email

@sunfishcode
Copy link
Collaborator

When we get further along with other tools that want to be ABI-compatible, I recommend just making off_t 64-bit, and not having metadata to describe the possibility of it being 32-bit. Catching incompatibilities is better than silent breakage, but the incompatibilities are still there and bubble up through the ecosystem, requiring library authors to provide multiple versions of their libraries to cover the ABI variants their users need.

This is an area that I'm hoping the reference-sysroot can help bridge. It currently specifies a 64-bit off_t, for example.

@rianhunter
Copy link
Contributor Author

rianhunter commented Jan 2, 2019

When we get further along with other tools that want to be ABI-compatible, I recommend just making off_t 64-bit, and not having metadata to describe the possibility of it being 32-bit.

A lot of these changes will probably happen incrementally, so versioning would allow seamless compatibility along the way. I don't expect every ABI version to be considered stable, and I would only expect third party implementors to support stable ABI versions. The ABI versioning should probably conform to semantic versioning conventions, where major version 0 promises no compatibility with other versions one way or another. I agree that the first stable version ("1.0") should address everything critical here, like off_t being 64-bit.

@rianhunter
Copy link
Contributor Author

rianhunter commented Jan 7, 2019

Update on PRs covering these issues

1,4 is being covered by #7815
2,3 is being covered by #7733
5 is being covered by #7799

Only what I considered non-critical is left to do, I plan to put out PRs on those in the coming weeks.

kripken added a commit that referenced this issue Jan 9, 2019
Until now we allowed either emitting JS+wasm, or for shared modules only, a wasm by itself (a side module, which would be loaded by other code that is responsible for any JS support code). There are other use cases than shared modules, though:

* #7649 - running wasm in non-JS environments
* Users that prefer to write their own JS support code, see https://github.com/kripken/emscripten/wiki/WebAssembly-Standalone

This PR is a step in helping those use cases, by allowing emcc input.cpp -o output.wasm, which was previously an error. What it does is

* Not emit the JS file.
* Note that we are not emitting the JS file, which prevents some optimizations that only make sense when emcc emits both and can coordinate them, like minifying names between the two (as came up in the discussion in #7649).
rianhunter added a commit to rianhunter/emscripten that referenced this issue Apr 16, 2019
This is related to issue emscripten-core#7649 (specifically point 7). The
getdents64() API is not easily portable across different platforms due
to the fact that there is no standardized way to read dirents from a
raw file descriptor or seek a raw file descriptor to a specific
dirent, and in practice each major platform does it differently with
different contraints. Platforms tend to only support an
opendir()/readdir()-based directory traversal API.

To make it easier to run Emscripten-generated WASM files in
non-browser contexts, this change makes generated WASM files import
opendir()/readdir() functions from the host environment instead
of the Linuxy getdents64() syscall. This makes the host functions
that support those calls a lot less hacky.

Note that this does not change the dirent ABI, which continues to have
the same structure in memory as before. This allows us to continue
using higher level musl code that interacts with directories, like
scandir().
kripken pushed a commit that referenced this issue Apr 24, 2019
Related to #7649 (point 5) and subset of #7799

* Update python.bc

* Update output wasm sizes

* Update wasm backend tests
VirtualTim pushed a commit to VirtualTim/emscripten that referenced this issue May 21, 2019
Related to emscripten-core#7649 (point 5) and subset of emscripten-core#7799

* Update python.bc

* Update output wasm sizes

* Update wasm backend tests
@stale
Copy link

stale bot commented Apr 24, 2020

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 7 days. Feel free to re-open at any time if this issue is still relevant.

@stale stale bot added the wontfix label Apr 24, 2020
@stale stale bot closed this as completed May 1, 2020
belraquib pushed a commit to belraquib/emscripten that referenced this issue Dec 23, 2020
Related to emscripten-core#7649 (point 5) and subset of emscripten-core#7799

* Update python.bc

* Update output wasm sizes

* Update wasm backend tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants