Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Returning an owned string #2

Closed
shepmaster opened this issue Jun 27, 2015 · 14 comments
Closed

Returning an owned string #2

shepmaster opened this issue Jun 27, 2015 · 14 comments

Comments

@shepmaster
Copy link
Owner

fn concat_things(a: &str, b:&str) -> String {
    format!("{} - {}", a, b)
}
@flo-l
Copy link

flo-l commented Jun 28, 2015

Thank you for starting this project, it's awesome!! Here's my proposal regarding owned String ffi.

rust -> ruby owned String

Requirements

  • memory-safe
  • as few allocations as possible
  • as few memory copied as possible
  • unicode support

My Proposal

This exploits the fact that rust strings are always Unicode strings. Ruby can handle unicode strings directly, so the additional costs of creating a null terminated C string are omitted.

lib.rs

#[no_mangle]
pub extern fn return_string_from_rust() -> (*const u8, usize) {
    let mut s: String = "I ♥ Unicode".into();
    s.shrink_to_fit(); // is this needed?

    let tuple = (s.as_ptr(), s.len());
    std::mem::forget(s); // is this needed?
    tuple
}

main.rb

require 'ffi'

# a rust fat pointer
class Slice < FFI::Struct
  layout :ptr, :pointer,
         :len, :uint

  def read_string
    self[:ptr].read_string(self[:len]).force_encoding("UTF-8") #does this copy the string?
  end
end

module StringArguments
  extend FFI::Library
  ffi_lib 'libstring_arguments'
  attach_function :return_string_from_rust, [], Slice.by_value

  def self.get_string
    return_string_from_rust.read_string
  end
end

puts StringArguments.get_string

Open Questions

Is the call to shrink_to_fit() needed/enough?

If str.len() != str.capacity() the unused memory is leaked.
As it seems shrink_to_fit() does not guarantee str.len() == str.capacity(), so how should one deal with that?

Is it necessary to call mem::forget(str) ?

I think yes because otherwise str would be deallocated. But isn't rust smart enough to detect that the tuple borrows str and is returned?

Does FFI::Pointer#read_string copy the buffer?

If yes how to avoid that?

@flo-l
Copy link

flo-l commented Jun 28, 2015

I did some research.

Firstly one could deallocate the unused capacity like so:

if s.len() != s.capacity() {
    unsafe {
        let ptr = s.as_mut_vec().as_mut_ptr();
        std::rt::heap::deallocate(
            ptr.offset(s.len() as isize),
            s.capacity() - s.len(),
            std::mem::min_align_of::<u8>()
        );
    }
}

Secondly I'm now sure that mem::forget has to be called.

From the docs of String.as_ptr(): The caller must ensure that the string outlives this pointer, and that it is not reallocated (e.g. by pushing to the string).

Thirdly I think I found out that FFI::Pointer#read_string copies the string.

FFI::Pointer#read_string calls #get_bytes, this calls VALUE rb_tainted_str_new, this calls VALUE str_new0, which copies the string memory here.

So FFI::Pointer#read_string copies the string, at least in mri ruby :(

@shepmaster
Copy link
Owner Author

@flo-l I think that we will need to use CString::into_raw and CString::from_ptr. I added those for the express purposes of being able to return an allocated string as a standalone concept. :-)

Your research is spot-on though - shrinking to fit is required because we need to be able to use the length of the string to rebuild the capacity of the string when passed back to Rust to be freed. mem:forget is also required as otherwise the memory will be freed by Rust and a dangling pointer will be returned to the caller.

A few more points:

This exploits the fact that rust strings are always Unicode strings

It pays to be pedantic - while Rust strings always contain Unicode data, "Unicode" isn't an encoding - Rust has UTF-8 encoded strings.

shrink_to_fit() does not guarantee str.len() == str.capacity()

Similar concerns were brought up before, but evidently it's OK.

You really need to use CStr or CString when crossing FFI boundaries. Rust strings are not NUL-terminated, and may contain inner NUL bytes. Only these structs make sure that's not the case when crossing the boundary.

I'd be happy to provide more knowledge. Let me know what you think would be useful to be added to the site!

@alubbe
Copy link

alubbe commented Apr 30, 2016

I'm also trying to solve this and have tried to combine the feedback posted here so far.

extern crate libc;
use libc::c_char;
use std::ffi::CStr;
use std::str;

#[no_mangle]
pub extern fn greet_from_rust(subject_c_ptr: *const c_char) -> *const u8 {
  let subject_c = unsafe {
      assert!(!subject_c_ptr.is_null());

      CStr::from_ptr(subject_c_ptr)
  };
  let subject = str::from_utf8(subject_c.to_bytes()).unwrap();

  let mut new_string: String = "hey, ".to_owned();
  new_string.push_str(subject);

  new_string.shrink_to_fit();
  let new_string_ptr = new_string.as_ptr();
  std::mem::forget(new_string);
  return new_string_ptr;
}

However, I was not able to replace the return type *const u8 with *const c_char.

Also, who now owns new_string? Does Ruby and its GC dispose of it later?

Can we make this code any faster by not copying everything, but working on the C string/char* directly?

@flo-l
Copy link

flo-l commented May 4, 2016

I haven't checked with valgrind, but I think we're leaking memory with mem::forget, because the ruby code doesn't free the memory (at least I couldn't find any place where it does it).

Also, as I mentioned in my previous post, ruby (mri) copies the string contents. This is unfortunate, because it hurts performance badly I guess.

Possible solutions:

Ad mem::free(): As far as I understand the problem is that ruby and rust don't use the same allocator (ruby: system and rust: jemalloc), so memory allocated at one side can't be deallocated at the other. To mitigate that we could supply a callback fn pointer that frees the string in rust. Ruby code must call the fn after it copied the string.

Speaking of copying: I guess the allocator problem explains why ruby has to copy the string. Although ruby is garbage collected, it needs to free the string at some time too. But it can't free a string allocated with a different allocator.

I would like to mention that I am by no means an expert, so I could be totally wrong and somebody should confirm that what I'm saying is correct.

To sum up we can't make string ffi with ruby more efficient because of problems in ruby world. But maybe we could add some mechanism/function to ruby that lets foreign code supply a pointer to some memory, which ruby treats as string without copying!

@flo-l
Copy link

flo-l commented May 4, 2016

@shepmaster

ad UTF-8 strings in ffi: aren't ruby strings UTF-8 encoded internally? Or at least isn't there the option to force a ruby strings encoding to be UTF-8? I don't understand why we need to go through CStr if we have an already perfectly valid UTF-8 string ready!

@shepmaster
Copy link
Owner Author

the ruby code doesn't free the memory

That's correct.

ruby and rust don't use the same allocator

Mostly true. You can now choose which allocator you use, which means that one could theoretically export the Ruby allocator into Rust world and then use it. That might be an interesting experiment.

a callback fn pointer that frees the string in rust

Yep!

But maybe we could add some mechanism/function to ruby that lets foreign code supply a pointer to some memory, which ruby treats as string without copying!

That's certainly possible. One could create a RustString class in Ruby that holds on to the Rust-allocated memory (and presumably frees it when the GC runs). However, as soon as you want to use it for just about anything, you'd have to convert it anyway (or maybe have to re-write all of the default String, which would be sad times).

aren't ruby strings UTF-8 encoded internally?

My experience was mostly around JRuby encodings, but IIRC there's multiple underlying representations for ASCII-only strings, UTF-8 strings, and "other" strings.

go through CStr if we have an already perfectly valid UTF-8 string

There may be more efficient ways of translating strings between Rust and Ruby; most of my thinking so far has been around going from Rust to C (a pointer to 8-bit values that is terminated with a NUL character). A Ruby string may be different enough from a C string to make an alternate path more beneficial.

@shepmaster
Copy link
Owner Author

Maybe I'll take a crack at all this tomorrow...

@flo-l
Copy link

flo-l commented May 4, 2016

Thanks for the review!

I think you misunderstood my plan regarding the mechanism for string creation without copying. I don't want to write a Ruby class, I wan't to add a function to the ruby ffi API which takes a pointer to some string data and a size_t representing the length of the allocation (or valid part of the allocation, whatever), which constructs a Ruby String object, that uses the supplied string data directly without copying.

It feels wasteful to transform a perfectly valid UTF-8 string to a CString and then transform it back to UTF-8 in ruby world again.

Maybe we should open an issue at ruby/ruby to get some feedback? I could do some more research and then open one.

@shepmaster
Copy link
Owner Author

add a function to the ruby ffi API which takes a pointer to some string data and a size_t representing the length of the allocation

If such a function didn't already exist, I'd be half-surprised. I'd expect a function that accepts a NUL-terminated string though.

that uses the supplied string data directly without copying

That will be trickier, as it will hit the issue of using the same allocator, but that's at least possible.

the ruby ffi API

Which API in particular are you thinking? The C-Ruby level, or something like the FFI gem?

@flo-l
Copy link

flo-l commented May 4, 2016

I'm thinking of adding it to Ruby.h, aka C-Ruby level. I couldn't find such a thing here. Once it's in Ruby.h I could add it to the ffi gem too.

All the str_new_* functions call str_new0() internally, which copies here.

Oh and btw, the equivalent functionality for arrays would be even more useful. I'm thinking of doing heavy number crunching with rust, eg. But that's another topic!

@shepmaster
Copy link
Owner Author

Ok, that all makes sense. Using Rust terminology, you'd want a str_new_foo that takes ownership of the pointer. Good luck with any changes you propose to Ruby core; I've heard it can be an interesting process! 🌴

I'm thinking of doing heavy number crunching with rust

I'd generally suggest minimizing the surface area between the two languages. If you can have the Ruby code give the Rust code the data once, then tell it to do a bunch of operations all in Rust-land, then get the results, you only have to round-trip the data once. You can have wrapper Ruby classes for each Rust type you need, but you'd basically just be holding onto a pointer.

@flo-l
Copy link

flo-l commented May 4, 2016

yeah I heard that, sometimes when you're dealing just with the rust community you forget how tough human interaction can be..

You're right with minimizing the surface area between the two languages, but it bugs me that it's necessary.

@shepmaster
Copy link
Owner Author

Eventually released!

Closed via #36

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants