Strip symbols from produced .wasm #6904

Closed
vedadkajtaz opened this issue Jul 25, 2018 · 24 comments

Comments

@vedadkajtaz

Hello,
not sure whether this is actually an emscripten or clang issue, but here it goes.

I cannot manage to find a consistent way of stripping the symbols from the produced .wasm file. Basically, we end up with all the namespace, class and function names publicly visible.

I've created a mini project to reproduce the issue (3 classes w/virtual methods).
Basically, providing -Os, -O2 or higher on one hand, and --llvm-lto 2 on the other, strips the symbols away.

However, it does not work for my actual project. Varying the -O option makes the number of visible symbols vary by a couple of hundred, but most of them still end up in the .wasm.

I have also tried -g0, --llvm-opts "['-strip-debug']", and postprocessing with wasm-opt, but haven't found a way to strip those symbols.

Any ideas?

Thanks!

@vedadkajtaz
Author

These are the c++ mangled symbols I'm referring to:

$ strings test.wasm | grep namespace
N13somenamespace9SomeClassE
N13somenamespace17SomeExtendedClassE

@vedadkajtaz
Author

vedadkajtaz commented Jul 25, 2018

Note that I also have the global constructors issue mentioned in this post:

https://groups.google.com/forum/#!topic/emscripten-discuss/8jXT3-vQWUQ

In my (-Os optimized) build, the mangled strings account for over 300 KB of the 7.3 MB .wasm, and the __GLOBAL__* names for an additional 15 KB (plus roughly three times as much in the generated .js).

Thanks,

@yurydelendik
Collaborator

yurydelendik commented Jul 25, 2018

@vedadkajtaz Can you use code explorer, wabt, or a hex viewer to tell which sections contain these strings/symbols? It can be debug/DWARF sections, names, linking, or just something in data section strings.
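One way to answer that question without extra tools is to walk the section headers of the binary. The following is a minimal sketch, assuming a version 1 wasm binary already loaded into memory; `SectionInfo` and `listSections` are hypothetical names, not part of wabt or binaryen. Custom sections (id 0) carry a name such as "name" or ".debug_info", exports live in section 7, and data segments in section 11.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

struct SectionInfo {
    std::uint8_t id;         // 0 = custom, 7 = export, 11 = data, ...
    std::string customName;  // only filled in for custom sections (id 0)
    std::uint32_t size;      // payload size in bytes
};

// Decodes an unsigned LEB128 integer starting at `pos` and advances `pos`.
static std::uint32_t readU32Leb(const std::vector<std::uint8_t>& bytes, std::size_t& pos) {
    std::uint32_t result = 0;
    for (int shift = 0; shift < 35; shift += 7) {
        if (pos >= bytes.size()) throw std::runtime_error("truncated LEB128");
        const std::uint8_t byte = bytes[pos++];
        result |= static_cast<std::uint32_t>(byte & 0x7f) << shift;
        if ((byte & 0x80) == 0) return result;
    }
    throw std::runtime_error("LEB128 too long");
}

// Returns one entry per section, in file order, without decoding payloads.
std::vector<SectionInfo> listSections(const std::vector<std::uint8_t>& wasm) {
    static const std::uint8_t header[8] = {0x00, 'a', 's', 'm', 0x01, 0x00, 0x00, 0x00};
    if (wasm.size() < 8 || !std::equal(header, header + 8, wasm.begin())) {
        throw std::runtime_error("not a wasm version 1 binary");
    }
    std::vector<SectionInfo> sections;
    std::size_t pos = 8;
    while (pos < wasm.size()) {
        SectionInfo info{wasm[pos++], "", 0};
        info.size = readU32Leb(wasm, pos);
        if (info.size > wasm.size() - pos) throw std::runtime_error("truncated section");
        const std::size_t sectionEnd = pos + info.size;
        if (info.id == 0) {
            // A custom section payload starts with a length-prefixed name.
            std::size_t namePos = pos;
            const std::uint32_t nameLen = readU32Leb(wasm, namePos);
            if (namePos > sectionEnd || nameLen > sectionEnd - namePos) {
                throw std::runtime_error("bad custom section name");
            }
            info.customName.assign(wasm.begin() + namePos, wasm.begin() + namePos + nameLen);
        }
        pos = sectionEnd;
        sections.push_back(info);
    }
    return sections;
}
```

Comparing the reported sizes with where `strings` finds the names shows whether they sit in a custom section, which a strip step can drop, or in the data section, which it cannot.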

@kripken
Member

kripken commented Jul 25, 2018

+1 to @yurydelendik's comment. Strings like that can come from multiple areas:

  • Debug builds: If built with -g or -profiling or such, emcc will keep function names in the wasm
  • Functions exported from the wasm are exported by their name. So main will be a visible symbol in a regular C program. If you add stuff to EXPORTED_FUNCTIONS (or use EMSCRIPTEN_KEEPALIVE in the source) then those will be visible too. Emscripten may also auto-export things from the wasm if it detects they are needed from JS, for example, if you use malloc in JS then it is exported from wasm.
    • Global constructors are currently called from JS, which means they are exported. That means that optimizations may alter the names you see, as LLVM or the ctor-evaller may eliminate a ctor. (We could avoid this by calling ctors from wasm - there just hasn't been an urgent need for this yet.)
  • Things imported to wasm also show up as names in the binary. If the wasm needs to call SDL_Init for example, that will show up. (But these are just their names, there is no content for them.)

To see what's going on in your case, aside from exploring the wasm, you can look in the JS (maybe with -g1 so whitespace is there) to see how those strings are used.

@vedadkajtaz
Author

vedadkajtaz commented Jul 26, 2018

Hello,

Thanks for the quick replies.

@yurydelendik

I've transformed the binary into a .wat file (my intent was to parse it, get rid of and/or obfuscate the symbols, and transform it back into a .wasm, until a better solution is found).

The global constructors are in the export section, eg:

  (export "__GLOBAL__sub_I_some_filename_cpp" (func 402))
  (export "__GLOBAL__sub_I_some_other_filename_cpp" (func 402))

Btw notice that 2/3rds of these point to the same function number (whose signature seems not to expect any arguments to distinguish the callers), so I guess there is room for optimization here as well.

The mangled symbols are in the data section, unfortunately in the middle of segments, making it hard to parse, eg:

(data (i32.const 925750) "<binary data><mangled symbol>..."

I've researched the issue, and if I understand correctly, these are RTTI symbols. We do need the RTTI feature (for dynamic_cast), yet I'm surprised that the compiler really needs to retain all of those symbols. There are a couple of typeid() calls in the code; I'll try to get rid of them all and see if this still happens.
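To illustrate where such strings come from, here is a minimal sketch, assuming a compiler that follows the Itanium C++ ABI (Clang and GCC do). Each polymorphic class gets a type_info record holding its mangled name as a string, and `dynamic_cast` relies on those same records. The classes mirror the ones in the `strings` output above; `rttiNameOf` and `isExtended` are hypothetical helpers.

```cpp
#include <cassert>
#include <string>
#include <typeinfo>

namespace somenamespace {
class SomeClass {
public:
    virtual ~SomeClass() = default;  // a virtual member makes the class polymorphic
};
class SomeExtendedClass : public SomeClass {};
}  // namespace somenamespace

// Returns the implementation-defined name stored in the RTTI record of the
// object's dynamic type. On Itanium-ABI compilers this is the mangled name.
std::string rttiNameOf(const somenamespace::SomeClass& object) {
    return typeid(object).name();
}

// dynamic_cast on a polymorphic type consults the same RTTI records,
// even if the program never calls typeid() itself.
bool isExtended(const somenamespace::SomeClass& object) {
    return dynamic_cast<const somenamespace::SomeExtendedClass*>(&object) != nullptr;
}
```

Building with RTTI disabled (`-fno-rtti`) would drop the records, but Clang then rejects `dynamic_cast` on these types, so that is not an option for code that depends on it.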

@kripken

No -g (though I did try -g0 as well), no -profiling. They aren't exported either. There are no occurrences of EMSCRIPTEN_KEEPALIVE in our code, yet we do use Embind. That being said, the list of symbols in the .wasm file is way beyond the few classes we do export through Embind.

Also, I did try -s EVAL_CTORS=1, but haven't seen any difference in the generated code.

Regarding the JS, only the global constructors appear in the generated file, no mangled RTTI symbols.

Basically, this is what it looks like:

Module.asm=asm;var __GLOBAL__I_000101=Module.__GLOBAL__I_000101=function(){return Module.asm.__GLOBAL__I_000101.apply(null,arguments)},__GLOBAL__sub_I_some_filename_cpp=Module.__GLOBAL__sub_I_some_filename_cpp=function(){return Module.asm.__GLOBAL__sub_I_some_filename_cpp.apply(null,arguments)},
...

followed by hundreds of others. Then, further on:

__ATINIT__.push({func:function(){__GLOBAL__I_000101()}},{func:function(){__GLOBAL__sub_I_some_filename_cpp()}}...

again, followed by hundreds of others.

Thanks,

@kripken
Member

kripken commented Jul 26, 2018

> RTTI symbols.

Yeah, names in the data section are likely RTTI. They could also be something like an assert message, although I think that can only generate a string for the filename, not the function.

> Btw notice that 2/3rds of these point to the same function number (whose signature seems not to expect any arguments to distinguish the callers), so I guess there is room for optimization here as well.

Interesting. I think what's going on there is that the duplicate function eliminator pass has merged those functions' implementations.

How important is it for you to not have global constructor function names? We can remove those (by exporting a single "runGlobalConstructors" which calls them in wasm, then the only string would be the name of that singleton). It hasn't been a priority til now, but it shouldn't be too much work to do.

@vedadkajtaz
Author

vedadkajtaz commented Jul 26, 2018

Regarding the RTTI: the production build of the iOS version (which shares over 95% of the code) exposes roughly the same mangled symbols (slightly more, actually), so we may dismiss this as not an emscripten-specific issue.

My attempt at removing all the typeid() calls didn't work out. I guess the symbols remain due to dynamic_cast(), which we won't be able to remove (although, as far as I understand, dynamic_cast() should only need the inheritance graph, not names).

I'll probably figure out a solution to obfuscate those in the binaries for all platforms (or decide that we don't care, but I doubt it).

> Interesting. I think what's going on there is that the duplicate function eliminator pass has merged those functions' implementations.

Possible. This is the (obfuscated) excerpt from the .wat:

  (export "__GLOBAL__sub_I_aaaaaaaaaaaaaaaaaaaaaaa" (func 402))
  (export "__GLOBAL__sub_I_bbbbbbbbbbbbbbbbbbbbbbbb" (func 402))
  (export "__GLOBAL__sub_I_ccccccccccccccccccccccccc" (func 402))
  (export "__GLOBAL__sub_I_dddddddddddddddddddddddddddd" (func 402))
  (export "__GLOBAL__sub_I_eeeeeeeeeeeeeeeeeeeeeeee" (func 402))
...

And the function definition:

  (func (;402;) (type 34)
    call 4995)

So, the actual job seems to be done in 4995.

> How important is it for you to not have global constructor function names?

Well, it has double importance for this project: the size (both download and wasm compiling, which seems to grow exponentially with the .wasm file size on some browsers), and internals exposure (unless we decide that it doesn't matter, as stated above).

Perhaps you could point me to the place in the code where this process happens, and where your "runGlobalConstructors" suggestion would go; I might be able to help (unless it's Python code, which I have no experience with).

Thanks, your help is much appreciated.

@kripken
Member

kripken commented Jul 27, 2018

My current thought is that this could be done in binaryen's wasm-ctor-eval tool. That receives a wasm and a list of the global constructors, sees which it can eliminate, and removes them, so it feels like the natural place for this additional optimization. That is in src/tools/wasm-ctor-eval.cpp in binaryen. So all we'd need there is to see if ctors remain, and if so create a __post_instantiate function which calls them, and export that instead. (__post_instantiate is used in dynamic linking, and seems like the right convention here. We'd need to handle the case where it already exists.)

It would involve a little Python code, though, in emscripten's tools/ctor_evaller.py. That calls the binaryen tool and gets the number of successful ctors removed - we'd need to change that to get whether any ctors remain, in which case we have just the __post_instantiate one, or none.

@vedadkajtaz
Author

Thanks for the feedback, I'll take a look.

FYI I've successfully implemented an rtti obfuscator for the .wasm and the asmjs .mem files (and will try applying it to the iOS and Android binaries as well).
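The obfuscator itself is not shown in this thread. As a rough sketch of how such a tool could work (an assumption, not the author's implementation), the known mangled names can be overwritten in place with placeholders of exactly the same length, so no offset or segment size in the binary changes. The placeholders stay distinct from one another because type_info comparison may fall back to comparing name strings. `placeholderName` and `obfuscateNames` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Builds a replacement of exactly `length` characters that encodes `index`,
// so that distinct names remain distinct after obfuscation.
std::string placeholderName(std::size_t index, std::size_t length) {
    const std::string digits = std::to_string(index);
    assert(digits.size() <= length && "name too short to hold a unique id");
    return std::string(length - digits.size(), '_') + digits;
}

// Overwrites every occurrence of each (ASCII) name in `image` in place and
// returns the number of occurrences replaced. The image size never changes.
std::size_t obfuscateNames(std::vector<std::uint8_t>& image,
                           const std::vector<std::string>& names) {
    std::size_t replaced = 0;
    for (std::size_t i = 0; i < names.size(); ++i) {
        const std::string& name = names[i];
        if (name.empty()) continue;
        const std::string placeholder = placeholderName(i, name.size());
        auto it = image.begin();
        while ((it = std::search(it, image.end(), name.begin(), name.end())) != image.end()) {
            std::copy(placeholder.begin(), placeholder.end(), it);
            it += static_cast<std::ptrdiff_t>(name.size());
            ++replaced;
        }
    }
    return replaced;
}
```

The list of names would come from something like the `strings` output shown earlier. A real tool would also need to make sure a name is not matched inside unrelated binary data.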

@vedadkajtaz
Author

Hmm, is binaryen involved when building the asmjs target? The generated asmjs .js exposes exactly the same issue:

/* global initializers */  __ATINIT__.push({ func: function() { __GLOBAL__I_000101() } }, { func: function() { __GLOBAL__sub_I_aaaaaaaaa_cpp() } }, { func: function() { __GLOBAL__sub_I_bbbbbbbbbb_cpp() } }

...

var real___GLOBAL__sub_I_aaaaaaaaa_cpp = asm["__GLOBAL__sub_I_aaaaaaaaa_cpp"]; asm["__GLOBAL__sub_I_aaaaaaaaa_cpp"] = function() {
  assert(runtimeInitialized, 'you need to wait for the runtime to be ready (e.g. wait for main() to be called)');
  assert(!runtimeExited, 'the runtime was exited (use NO_EXIT_RUNTIME to keep it alive after main() exits)');
  return real___GLOBAL__sub_I_aaaaaaaaa_cpp.apply(null, arguments);
};

...

var __GLOBAL__sub_I_aaaaaaaaa_cpp = Module["__GLOBAL__sub_I_aaaaaaaaa_cpp"] = function() {
  assert(runtimeInitialized, 'you need to wait for the runtime to be ready (e.g. wait for main() to be called)');
  assert(!runtimeExited, 'the runtime was exited (use NO_EXIT_RUNTIME to keep it alive after main() exits)');
  return Module["asm"]["__GLOBAL__sub_I_aaaaaaaaa_cpp"].apply(null, arguments) };

@kripken
Member

kripken commented Aug 1, 2018

Binaryen is not used for asm.js, but the backend emits the same constructor list for both. But yeah, that means that if we optimize this in binaryen it would not help asm.js. For asm.js though, doing a text replacement to obfuscated names should be pretty easy.
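A minimal sketch of such a text replacement, assuming the constructor names always match `__GLOBAL__I_…` or `__GLOBAL__sub_I_…` and that the short replacement names (`__c0`, `__c1`, ...) do not collide with anything else in the output. `renameGlobalCtors` is a hypothetical helper, not part of emscripten. The mapping is passed in by reference so the same renaming can be applied to every generated file that mentions the names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Replaces each global-constructor name in `js` with a short generated name.
// `mapping` accumulates original -> replacement pairs across calls.
std::string renameGlobalCtors(const std::string& js,
                              std::map<std::string, std::string>& mapping) {
    static const std::regex ctorName("__GLOBAL__(sub_)?I_[A-Za-z0-9_]+");
    std::string out;
    std::size_t last = 0;
    const auto end = std::sregex_iterator();
    for (auto it = std::sregex_iterator(js.begin(), js.end(), ctorName); it != end; ++it) {
        const std::smatch& match = *it;
        const std::size_t start = static_cast<std::size_t>(match.position());
        out.append(js, last, start - last);  // copy the text before the match
        auto found = mapping.find(match.str());
        if (found == mapping.end()) {
            const std::string shortName = "__c" + std::to_string(mapping.size());
            found = mapping.emplace(match.str(), shortName).first;
        }
        out += found->second;
        last = start + static_cast<std::size_t>(match.length());
    }
    out.append(js, last, std::string::npos);  // copy the remaining tail
    return out;
}
```

Because the names appear both in the `__ATINIT__` list and as properties of `asm`, every occurrence has to be renamed consistently, which is why the sketch keeps a mapping instead of generating a fresh name per match.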

@vedadkajtaz
Author

Indeed.

@vedadkajtaz
Author

> but the backend emits the same constructor list for both

Where does this take place?

@kripken
Member

kripken commented Aug 7, 2018

In asm2wasm, that's the GlobalInitializers data structure in lib/Target/JSBackend/JSBackend.cpp. I'm not sure where that happens in the wasm backend. Both backends emit it in the initializers field that emscripten.py receives.

Looking forward, I think it's more important to support asm2wasm + the wasm backend as opposed to asm2wasm + asm.js, so doing this once in binaryen seems simplest (+ some other solution for asm.js if needed in the meantime).

Another reason for doing it in binaryen is that the wasm backend may add some complexity here - we probably can't just collapse all the ctors into a singleton when using wasm object files, in particular, as the ctors may need to be linked and reordered etc. later. So this can only happen in the very final linking stage (where binaryen runs).

@vedadkajtaz
Author

Hello,

FYI I'm getting this error while running unmodified wasm-ctor-eval (triggered by -s EVAL_CTORS=1):

trying to eval __GLOBAL__I_000101
  ...stopping since could not eval: call import: env.segfault

@kripken
Member

kripken commented Sep 7, 2018

That's expected - evalling of ctors has to stop if it sees an import may be called, like env.segfault. I think that import arrives because of the SAFE_HEAP option - makes sense ctor evalling would not work well in that mode, as it instruments with a lot of import calls.

@vedadkajtaz
Author

Thanks, will try without SAFE_HEAP and will let you know.

@vedadkajtaz
Author

After disabling SAFE_HEAP, ASSERTIONS and STACK_OVERFLOW_CHECK, I get:

trying to eval __GLOBAL__I_000101
...stopping since could not eval: call import: env.invoke_i

@kripken
Member

kripken commented Sep 7, 2018

Invokes could be due to exceptions or setjmp. Very hard to optimize with those around.

@vedadkajtaz
Author

I see.

I played a bit with wasm-ctor-eval.cpp, replacing the returns that follow a failure with continues, and figured out that every single constructor evaluation failed: the first one due to env.invoke_i as stated above, all others due to env._emscripten_asm_const_i.

However, most (if not all) of the constructors lack any emscripten-specific code (hence no EM_ASM() constructs). Might there be something wrong with the tool?

@vedadkajtaz
Author

vedadkajtaz commented Sep 10, 2018

Actually, my debugging output was somehow truncated, hence the previous statement was not accurate.
Moreover, playing with wasm-ctor-eval helped me discover hundreds of avoidable global constructors that got into the project due to the use of a "nifty counter" pattern.

We're now down to 34 global constructors.
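For readers unfamiliar with the pattern, here is a minimal single-file sketch of why a nifty (Schwarz) counter multiplies global constructors, and of one common alternative. The `Logger` class and both namespaces are hypothetical and not taken from the project discussed here. In a real code base the `static Initializer` object lives in a header, so every .cpp file that includes it gets its own copy and therefore its own `__GLOBAL__sub_I_…` entry.

```cpp
#include <cassert>
#include <string>

class Logger {
public:
    std::string prefix = "log";
};

namespace nifty {
int counter = 0;             // in a real project: defined once in a .cpp file
Logger* instance = nullptr;  // likewise

struct Initializer {
    // The first initializer to run creates the shared object.
    Initializer() {
        if (counter++ == 0) instance = new Logger();
    }
    // The last one to be destroyed releases it.
    ~Initializer() {
        assert(counter > 0);
        if (--counter == 0) {
            delete instance;
            instance = nullptr;
        }
    }
};

// In a header, this line is repeated in every including translation unit,
// and each copy needs a dynamic initializer, i.e. a global constructor.
static Initializer initializer;
}  // namespace nifty

namespace lazy {
// Alternative: a function-local static is constructed on first use,
// so it adds no global constructor at all.
Logger& logger() {
    static Logger instance;
    return instance;
}
}  // namespace lazy
```

The lazy variant trades the startup constructor for a guard check on each call, and it changes when the object is destroyed, so it is not a drop-in replacement in every code base.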

Out of these 34, eval of:

  • 20 succeeded
  • 3 failed due to env.invoke_i...
  • 7 failed due to env.__embind_register_class, env.__embind_register_value_object or env.__embind_register_void (which roughly matches our use of embind)
  • 4 failed due to env._emscripten_asm_const_i (yet to investigate these)

I wonder whether the tool can safely recover from a FailToEvalException thrown from instance.callExport()? If so, we'd still get the benefit of getting rid of most of the global constructors in my case. The others could then be merged into a single function call.
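The control flow being asked about can be sketched as follows. This is only an illustration under stated assumptions: `FailToEvalException` here is a stand-in for the class of the same name in binaryen, and `Ctor` and `evalAll` are hypothetical. It does not model the interpreter state, which a real implementation would have to roll back after a failed evaluation.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the exception named above; the real class lives in binaryen.
struct FailToEvalException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct Ctor {
    std::string name;
    std::function<void()> eval;  // throws FailToEvalException when it cannot be evaluated
};

// Tries every constructor instead of stopping at the first failure.
// Returns the names that still have to run at startup, in their original
// order, e.g. from one merged entry point.
std::vector<std::string> evalAll(const std::vector<Ctor>& ctors) {
    std::vector<std::string> remaining;
    for (const Ctor& ctor : ctors) {
        try {
            ctor.eval();
        } catch (const FailToEvalException&) {
            remaining.push_back(ctor.name);
        }
    }
    return remaining;
}
```

Whether continuing is actually safe depends on ordering: evaluating a later constructor ahead of an earlier one that is deferred to runtime is only valid if the two are independent, which the sketch does not check.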

@vedadkajtaz
Author

> Actually, my debugging output was somehow truncated

It appears it was timing out. The timeout=10 setting in ctor_evaller.py was too low for my output.

@stale

stale bot commented Sep 18, 2019

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 7 days. Feel free to re-open at any time if this issue is still relevant.

@stale stale bot added the wontfix label Sep 18, 2019
@stale stale bot closed this as completed Sep 25, 2019
@erikziyunchi

If twiggy's output looks something like this:
[image: twiggy output]
then you can run wasm-opt as wasm-opt --strip-debug from.wasm -o to.wasm. It worked in my case.
