Avoid non-API calls when truncating overallocated vectors #362

Merged (6 commits) on Aug 7, 2024

DavisVaughan
Member

@DavisVaughan DavisVaughan commented Jul 26, 2024

Closes #355
Closes #358

Description

In cpp11 we used 3 non-API calls:

  • SETLENGTH()
  • SET_TRUELENGTH()
  • SET_GROWABLE_BIT()

The combination of these 3 allowed us to implement efficient growable r_vectors. See:

inline SEXP truncate(SEXP x, R_xlen_t length, R_xlen_t capacity) {
#if R_VERSION >= R_Version(3, 4, 0)
  SETLENGTH(x, length);
  SET_TRUELENGTH(x, capacity);
  SET_GROWABLE_BIT(x);
#else
  x = safe[Rf_lengthgets](x, length);
#endif
  return x;
}

template <typename T>
inline r_vector<T>::operator SEXP() const {
  auto* p = const_cast<r_vector<T>*>(this);
  if (data_ == R_NilValue) {
    p->resize(0);
    return data_;
  }
  if (length_ < capacity_) {
    p->data_ = truncate(p->data_, length_, capacity_);
    SEXP nms = names();
    auto nms_size = Rf_xlength(nms);
    if ((nms_size > 0) && (length_ < nms_size)) {
      nms = truncate(nms, length_, capacity_);
      names() = nms;
    }
  }
  return data_;
}

We absolutely cannot use these functions anymore. However, growable vectors are still pretty nice. The only viable alternative for cpp11 (as a header only package) seems to be to use Rf_xlengthgets() instead, so that is what this PR does.

Side note: In rlang, we are probably going to work around this by implementing ALTREP views. "Truncation" is just a contiguous view into x (with size capacity) from 1:length. However, this requires registering an ALTREP class, which requires a compilation unit. So, not easy to do for cpp11. r-lib/rlang#1725

Using Rf_xlengthgets() means we can still use growable vectors as normal, but when we need to return a SEXP back to R, if:

  • We have a writable:: vector
  • And we have called push_back() on it (thereby "growing" it and overallocating for it)

Then we are going to have to pay the allocation and copy cost of truncating that vector down to its length_.
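As a rough illustration of when that cost is paid, here is a self-contained toy model in plain C++ (no R or cpp11 headers). ToyGrowable, as_exact(), and the truncation counter are hypothetical names invented for this sketch and are not part of cpp11. The only behavior taken from the description above is that push_back() overallocates by doubling, and that handing the result back needs an allocate-and-copy whenever the length is below the capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Toy stand-in for a writable vector; NOT cpp11's real implementation.
class ToyGrowable {
 public:
  explicit ToyGrowable(std::size_t n = 0)
      : data_(n ? new double[n]() : nullptr), length_(n), capacity_(n) {}

  void push_back(double value) {
    if (length_ == capacity_) {
      reallocate(capacity_ == 0 ? 1 : capacity_ * 2);  // overallocate
    }
    data_[length_++] = value;
  }

  // Analogue of "returning the SEXP": the storage handed out must be
  // exactly length_ elements long.
  const double* as_exact() {
    if (length_ < capacity_) {
      reallocate(length_);  // the truncation cost: allocate + copy
      ++truncations_;
    }
    return data_.get();
  }

  std::size_t length() const { return length_; }
  std::size_t capacity() const { return capacity_; }
  std::size_t truncations() const { return truncations_; }

 private:
  void reallocate(std::size_t new_capacity) {
    assert(new_capacity >= length_);
    std::unique_ptr<double[]> fresh(new double[new_capacity]());
    if (length_ > 0) {
      std::memcpy(fresh.get(), data_.get(), length_ * sizeof(double));
    }
    data_ = std::move(fresh);
    capacity_ = new_capacity;
  }

  std::unique_ptr<double[]> data_;
  std::size_t length_;
  std::size_t capacity_;
  std::size_t truncations_ = 0;
};
```

In this model a vector constructed at its final size never truncates, while one built with five push_back() calls ends up with a capacity of 8 and pays exactly one truncation copy when as_exact() is called.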

The implicit conversion to SEXP is a pretty "hot" code path, so I've added some benchmarks below. In particular this operator is called:

  • When as_sexp() is called on an r_vector. When we return a cpp11::writable::doubles to the R side, the auto generated code in cpp11.cpp calls this.
  • Whenever a user tries to get the underlying SEXP out of an r_vector.

The benchmarks show a noticeable performance penalty, but remember that it only applies to writable vectors where push_back() has been used. In practice, for many (but not all) algorithms that I've written in C / C++, you can determine the required vector size up front, so you never need push_back(). We may end up recommending that people pre-allocate their vectors if they run into issues. Note that we are still much more efficient than repeated push_back() calls in Rcpp though: using growable vectors and then paying a one-time truncation cost at the end is typically far more performant than reallocating on every push.
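To put a rough number on that last claim, the sketch below counts how many elements get copied under each strategy. It is a back-of-the-envelope model, not a measurement: both function names are made up for illustration, the doubling strategy is assumed to start from a capacity of 1, and the per-push strategy is assumed to allocate a vector exactly one element longer on every push.

```cpp
#include <cstdint>

// Elements copied when capacity doubles on demand and the result is
// truncated once at the end (assumed model of the growable strategy).
inline std::uint64_t copies_with_doubling(std::uint64_t n) {
  std::uint64_t copied = 0, length = 0, capacity = 0;
  for (std::uint64_t i = 0; i < n; ++i) {
    if (length == capacity) {
      copied += length;  // existing elements move to the bigger buffer
      capacity = (capacity == 0) ? 1 : capacity * 2;
    }
    ++length;
  }
  if (length < capacity) copied += length;  // one-time truncation copy
  return copied;
}

// Elements copied when every push allocates a vector one element longer
// (assumed model of reallocation on every push).
inline std::uint64_t copies_with_realloc_per_push(std::uint64_t n) {
  std::uint64_t copied = 0;
  for (std::uint64_t length = 0; length < n; ++length) copied += length;
  return copied;
}
```

For 10,000 pushes this model gives 26,383 copied elements with doubling plus a final truncation, against 49,995,000 with a reallocation per push. The first grows linearly in n and the second quadratically, so the one-time truncation is small by comparison.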

Implementation notes and gotchas

In theory we could just switch to the code path covered by R_VERSION < R_Version(3, 4, 0), i.e. just call Rf_xlengthgets(), but I think there are actually a few bugs and safety issues with this approach that people just haven't seen because everyone is on a newer version of R.

Issue 1 - Protection

Using SETLENGTH() and friends is nice because there is no new allocation, meaning no new protection is required. That is not the case with Rf_xlengthgets(). If we blindly switch to

p->data_ = safe[Rf_lengthgets](x, length);

then no one is protecting data_ and we have a serious protection issue. In theory this was happening on R < 3.4.0. If someone calls the SEXP conversion operator after a push_back() and then tries to continue to use their cpp11 writable vector, it will likely get gc'd out from under them.

My solution is to switch to resize(), which does the preserved.insert() / preserved.release() dance that is required to protect the newly allocated truncated data.

Issue 2 - Internal state updates

I think there is another bug in the current cpp11 approach. After truncation, the following internal variables need to be updated and currently are not:

  • capacity_
  • is_altrep_
  • data_p_
  • protect_
  • (length_ stays the same and data_ is updated)

This seems pretty bad? If you request the SEXP and that truncates and allocates on old R versions, and then you try and use the original writable::doubles object, then I think you can get into a pretty bad state if you try and pull elements out of it or push to it because these internal variables aren't right.

My solution is the same as above, to switch to resize(). In addition to performing protection, it also updates these internal variables (with the exception of is_altrep_, but we can fix that in a follow up PR).

Note 3 - Names handling

If we unconditionally use Rf_xlengthgets(), then names attribute truncation is automatically handled for us, so that's nice. I've removed the code that was trying to specially truncate names after truncating the original vector, because it is not required now. There is a list test that tests for this specifically.

Note 4 - On updating data_ in the conversion operator

The current behavior of the implicit conversion to SEXP operator is to actually modify the original object even though it is labeled as const. This is really cheating a bit, and feels kind of wrong. It is the reason for the const_cast<> call there to throw away constness.

We assume that the idea is that the SEXP returned to the caller should correspond to the internal data_ stored by the writable:: vector (at least until the next push_back() when a reallocation could change data_). People are probably relying on this fact (it seems like it would be easy to do), so we probably can't change this at this point. It seems like it enables behavior like this:

cpp11::writable::doubles x;
Rf_setAttrib(x, Rf_install("fancy"), value);

The implicit conversion operator is called and returns a SEXP. That SEXP is immediately modified, and the modification is reflected in x too. With the current API of cpp11, I imagine a lot of people are doing this.

Note that this would pay the truncation cost if it was something like

cpp11::writable::doubles x;
x.push_back(0);
Rf_setAttrib(x, Rf_install("fancy"), value); // extracts the SEXP, so forcibly truncates

This is certainly a tricky part of cpp11.
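The pattern discussed in this note can be shown without R at all. The following is a minimal sketch in plain C++, where ToyVector and its members are hypothetical and only mimic the shape of the real operator: a conversion declared const that still shrinks the storage through const_cast, and a returned handle that aliases the object's own data.

```cpp
#include <cstddef>
#include <vector>

// Toy model only; storage_.size() plays the role of capacity_.
class ToyVector {
 public:
  void push_back(double value) {
    if (length_ == storage_.size()) {
      storage_.resize(storage_.empty() ? 1 : storage_.size() * 2);
    }
    storage_[length_++] = value;
  }

  // Declared const, yet it replaces the storage so that the handle it
  // returns refers to exactly length_ elements.
  operator double*() const {
    auto* self = const_cast<ToyVector*>(this);
    if (self->length_ < self->storage_.size()) {
      std::vector<double> exact(
          self->storage_.begin(),
          self->storage_.begin() + static_cast<std::ptrdiff_t>(self->length_));
      self->storage_.swap(exact);  // allocate + copy, then drop the old buffer
    }
    return self->storage_.data();
  }

  double at(std::size_t i) const { return storage_[i]; }
  std::size_t capacity() const { return storage_.size(); }

 private:
  std::vector<double> storage_;
  std::size_t length_ = 0;
};
```

After three push_back() calls the toy has a capacity of 4. Writing `double* handle = x;` shrinks it to 3, and a later `handle[0] = 42.0` is visible through `x.at(0)`, which mirrors the Rf_setAttrib() example above. One C++ caveat: modifying an object through const_cast is undefined behavior if the object itself was defined const, so the sketch only makes sense for non-const objects.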

Benchmarks

First, let's look at the motivations.Rmd benchmark for push_back(). This is the most relevant benchmark here, because it calls push_back() many times and then has to convert to SEXP to return the result to R. A pretty typical use case.

This is the benchmark, which runs the grow_() and rcpp_grow_() functions in cpp11test:

grid <- expand.grid(len = 10 ^ (0:7), pkg = "cpp11", stringsAsFactors = FALSE)
grid <- rbind(
  grid,
  expand.grid(len = 10 ^ (0:4), pkg = "rcpp", stringsAsFactors = FALSE)
)
b_grow <- bench::press(.grid = grid,
  {
    fun = match.fun(sprintf("%sgrow_", ifelse(pkg == "cpp11", "", paste0(pkg, "_"))))
    bench::mark(
      fun(len)
    )
  }
)[c("len", "pkg", "min", "mem_alloc", "n_itr", "n_gc")]

// Rcpp version looks similar

[[cpp11::register]] cpp11::writable::doubles grow_(R_xlen_t n) {
  cpp11::writable::doubles x;
  R_xlen_t i = 0;
  while (i < n) {
    x.push_back(i++);
  }

  return x;
}

Comparing Main to This PR, the 10,000,000 push_back() case goes from 49.02ms -> 62.99ms, with memory allocations going from 256MB -> 332.3MB. This is the extra one-time truncation we now pay at the end, after all the pushes. I personally think this is okay.

# Existing saved RDS file

> readRDS("vignettes/growth.Rds")
# A tibble: 13 × 6
        len pkg        min mem_alloc n_itr  n_gc
      <dbl> <chr> <bch:tm> <bch:byt> <int> <dbl>
 1        1 cpp11    3.3µs        0B 10000     0
 2       10 cpp11   6.05µs        0B  9999     1
 3      100 cpp11   8.49µs    1.89KB 10000     0
 4     1000 cpp11  14.18µs   16.03KB  9999     1
 5    10000 cpp11  63.77µs  256.22KB  3477     2
 6   100000 cpp11 443.32µs       2MB   404     5
 7  1000000 cpp11   3.99ms      16MB    70     3
 8 10000000 cpp11 105.51ms     256MB     1     5
 9        1 rcpp    2.64µs        0B 10000     0
10       10 rcpp    3.13µs        0B  9999     1
11      100 rcpp   13.87µs   42.33KB  9997     3
12     1000 rcpp  440.77µs    3.86MB   319     1
13    10000 rcpp   54.13ms  381.96MB     2     2

# Main - regenerating RDS file contents on a faster computer

> b_grow
# A tibble: 13 × 6
        len pkg        min mem_alloc n_itr  n_gc
      <dbl> <chr> <bch:tm> <bch:byt> <int> <dbl>
 1        1 cpp11 410.01ns        0B 10000     0
 2       10 cpp11   1.44µs        0B 10000     0
 3      100 cpp11   2.58µs    1.89KB 10000     0
 4     1000 cpp11   5.04µs   16.03KB  9996     4
 5    10000 cpp11  33.87µs  256.22KB  4971    16
 6   100000 cpp11 253.18µs       2MB   861    28
 7  1000000 cpp11   2.44ms      16MB    74    26
 8 10000000 cpp11  49.02ms     256MB   100   126
 9        1 rcpp  368.98ns        0B 10000     0
10       10 rcpp    1.19µs        0B  9998     2
11      100 rcpp   11.85µs   42.33KB  9999     1
12     1000 rcpp  217.91µs    3.86MB   606    10
13    10000 rcpp   31.61ms  381.96MB    10    98

# This PR

> b_grow
# A tibble: 13 × 6
        len pkg        min mem_alloc n_itr  n_gc
      <dbl> <chr> <bch:tm> <bch:byt> <int> <dbl>
 1        1 cpp11 573.99ns        0B 10000     0
 2       10 cpp11    1.6µs        0B 10000     0
 3      100 cpp11   2.54µs    2.72KB  9999     1
 4     1000 cpp11   6.36µs   23.89KB  9996     4
 5    10000 cpp11  42.56µs  334.39KB  4932    21
 6   100000 cpp11 356.21µs    2.76MB   622    27
 7  1000000 cpp11   3.32ms   23.63MB    68    32
 8 10000000 cpp11  62.99ms   332.3MB   100   155
 9        1 rcpp  410.01ns        0B 10000     0
10       10 rcpp     1.8µs        0B 10000     0
11      100 rcpp   18.74µs   42.33KB  9999     1
12     1000 rcpp  282.29µs    3.86MB   569     4
13    10000 rcpp   32.08ms  381.96MB    28    72
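The mem_alloc jump for the 10,000,000 case can be roughly reproduced by hand. The sketch below is an approximation under stated assumptions rather than a description of how bench measures memory: doubles are 8 bytes, capacity starts at 1 and doubles, R's per-vector header and any small-vector pooling are ignored, and the reported units are treated as 1024-based. GrowthBytes, growth_bytes(), and to_mib() are hypothetical helpers for this illustration.

```cpp
#include <cstdint>

struct GrowthBytes {
  std::uint64_t growing;     // every buffer allocated while pushing
  std::uint64_t truncation;  // exact-length copy made at the end, if any
};

// Payload bytes requested when n doubles are pushed one at a time.
inline GrowthBytes growth_bytes(std::uint64_t n) {
  const std::uint64_t bytes_per_double = 8;
  std::uint64_t total = 0, capacity = 0;
  while (capacity < n) {
    capacity = (capacity == 0) ? 1 : capacity * 2;
    total += capacity * bytes_per_double;
  }
  // Truncation only allocates when the vector ended up overallocated.
  const std::uint64_t truncation = (n < capacity) ? n * bytes_per_double : 0;
  return {total, truncation};
}

inline double to_mib(std::uint64_t bytes) {
  return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

For n = 10,000,000 the growth phase comes to 268,435,448 bytes (about 256 MiB) and the truncation adds 80,000,000 bytes, for about 332.3 MiB in total. For n = 1,000,000 the same arithmetic gives about 16 MiB and 23.63 MiB. Both line up with the Main and This PR rows above.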

The absolute "worst case scenario" that I can think of is that a user allocates a vector of size size, then pushes exactly 1 item to the back of it (doubling the capacity), and then returns the SEXP from that (truncating it and throwing away the extra capacity).

I came up with a benchmark for that:

[[cpp11::register]] SEXP cpp11_push_and_truncate_(SEXP size_sexp) {
  R_xlen_t size = INTEGER(size_sexp)[0];

  // Allocate `size` worth of doubles (filled with garbage data)
  cpp11::writable::doubles out(size);

  // Push 1 more past the existing length/capacity,
  // doubling the capacity for cpp11 vectors
  out.push_back(0);

  // Truncate back to `size + 1` size and return result.
  return SEXP(out);
}

[[cpp11::register]] SEXP rcpp_push_and_truncate_(SEXP size_sxp) {
  R_xlen_t size = INTEGER(size_sxp)[0];

  // Allocate `size` worth of doubles (filled with garbage data)
  Rcpp::NumericVector out(size);

  // Push 1 more past the existing capacity
  out.push_back(0);

  return out;
}

This shows that Rcpp is faster in this specific case because we only do 1 push. If we do repeated pushes (like above) we see that change quickly.

In the Main vs This PR cases for cpp11, both cases double the capacity after that one push_back(), but in this PR we also have to pay the truncation allocation, making it slower. If this is the worst case scenario then I think I'm okay with it.

bench::press(len = as.integer(10 ^ (0:6)),
  {
    bench::mark(
      cpp11 = cpp11_push_and_truncate_(len),
      rcpp = rcpp_push_and_truncate_(len),
      check = FALSE,
      min_iterations = 1000
    )
  }
)[c("expression", "len", "min", "mem_alloc", "n_itr", "n_gc")]

# Main

# A tibble: 14 × 6
   expression     len      min mem_alloc n_itr  n_gc
   <bch:expr>   <int> <bch:tm> <bch:byt> <int> <dbl>
 1 cpp11            1 615.02ns        0B 10000     0
 2 rcpp             1 369.04ns        0B 10000     0
 3 cpp11           10 861.01ns      208B 10000     0
 4 rcpp            10 368.98ns        0B 10000     0
 5 cpp11          100 984.06ns    2.44KB 10000     0
 6 rcpp           100 450.99ns    1.66KB  9999     1
 7 cpp11         1000    2.5µs   23.53KB  9996     4
 8 rcpp          1000 778.99ns   15.73KB  9998     2
 9 cpp11        10000  15.25µs  234.47KB  9970    30
10 rcpp         10000   3.61µs  156.35KB  9981    19
11 cpp11       100000 141.74µs    2.29MB  1262    50
12 rcpp        100000  30.91µs    1.53MB  1994    43
13 cpp11      1000000   1.72ms   22.89MB   626   374
14 rcpp       1000000  365.8µs   15.26MB   825   175


# This PR

# A tibble: 14 × 6
   expression     len      min mem_alloc n_itr  n_gc
   <bch:expr>   <int> <bch:tm> <bch:byt> <int> <dbl>
 1 cpp11            1 820.03ns        0B 10000     0
 2 rcpp             1 410.01ns        0B 10000     0
 3 cpp11           10 901.99ns      208B 10000     0
 4 rcpp            10 368.98ns        0B 10000     0
 5 cpp11          100   1.15µs    3.27KB 10000     0
 6 rcpp           100 410.01ns    1.66KB  9999     1
 7 cpp11         1000   3.44µs    31.4KB 10000     0
 8 rcpp          1000 942.96ns   15.73KB  9999     1
 9 cpp11        10000  23.82µs  312.65KB  6028     5
10 rcpp         10000   3.65µs  156.35KB  9997     3
11 cpp11       100000 226.73µs    3.05MB  1091     6
12 rcpp        100000  33.62µs    1.53MB  3623    10
13 cpp11      1000000   2.61ms   30.52MB   941    59
14 rcpp       1000000 612.58µs   15.26MB   970    30

Same benchmark as above, but with a very large vector (10,000,000 elements). Here you really see the memory allocation differences between the two cpp11 approaches.

# Longer benchmark, lots of gc
len <- as.integer(10 ^ 7)

bench::mark(
  cpp11 = cpp11_push_and_truncate_(len),
  rcpp = rcpp_push_and_truncate_(len),
  min_iterations = 200
)[c("expression", "min", "mem_alloc", "n_itr", "n_gc")]

# Main

# A tibble: 2 × 5
  expression      min mem_alloc n_itr  n_gc
  <bch:expr> <bch:tm> <bch:byt> <int> <dbl>
1 cpp11        24.6ms     229MB    53   147
2 rcpp          7.6ms     153MB    97   103

# This PR

# A tibble: 2 × 5
  expression      min mem_alloc n_itr  n_gc
  <bch:expr> <bch:tm> <bch:byt> <int> <dbl>
1 cpp11        40.5ms     305MB   200   200
2 rcpp          7.6ms     153MB   200   100
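The mem_alloc column in these worst-case tables also follows from simple arithmetic. This sketch assumes 8-byte doubles, ignores R's per-vector header, treats the reported units as 1024-based, and assumes Rcpp's push_back() allocates a vector exactly one element longer. WorstCaseBytes and worst_case_bytes() are hypothetical names used only here.

```cpp
#include <cstdint>

struct WorstCaseBytes {
  std::uint64_t cpp11_main;  // initial buffer + doubled buffer
  std::uint64_t cpp11_pr;    // initial + doubled + exact-length copy
  std::uint64_t rcpp;        // initial buffer + one buffer of size + 1
};

// Payload bytes requested when a vector of `size` doubles is allocated,
// one element is pushed, and the result is returned.
inline WorstCaseBytes worst_case_bytes(std::uint64_t size) {
  const std::uint64_t bytes_per_double = 8;
  const std::uint64_t initial = size * bytes_per_double;
  const std::uint64_t doubled = 2 * size * bytes_per_double;
  const std::uint64_t exact = (size + 1) * bytes_per_double;
  return {initial + doubled, initial + doubled + exact, initial + exact};
}
```

With size = 10,000,000 this gives 240,000,000 bytes (about 229 MiB) for Main, 320,000,008 bytes (about 305 MiB) for this PR, and 160,000,008 bytes (about 153 MiB) for Rcpp, in line with the two tables above.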

@DavisVaughan DavisVaughan merged commit 6189417 into r-lib:main Aug 7, 2024
14 checks passed
@DavisVaughan DavisVaughan deleted the fix/r-devel-compat branch August 7, 2024 21:25
@pachadotdev
Contributor

excellent!

here is some meta text to make this issue easier to find with Google, Bing, etc

If CRAN checks reveal messages such as

gcc.exe (GCC) 13.2.0
File 'mypkg/libs/x64/mypkg.dll':
  Found non-API calls to R: 'SETLENGTH', 'SET_GROWABLE_BIT',
    'SET_TRUELENGTH'

Or:

Debian clang version 18.1.8 (9)
File 'mypkg/libs/x64/mypkg.so':
  Found non-API calls to R: 'SETLENGTH', 'SET_GROWABLE_BIT',
    'SET_TRUELENGTH'

this is the fix :)

@pachadotdev
Contributor

Hi @DavisVaughan
Do you have plans to send this to CRAN soon? I cannot update a pkg I have hosted there because of these warnings. If the update should take long, I can vendor cpp11 headers from this PR and solve it.

@DavisVaughan
Member Author

I cannot update a pkg I have hosted there because of these warnings

You can. CRAN knows about them and will ignore them if you tell them they come from cpp11. They have done that for many of us.

I do not have an exact release date yet. Still working on some fixes and minor features. But soon ish, probably a 1 month timeline depending on breakages

@pachadotdev
Contributor

@DavisVaughan hi, is there a way I can help with this?

Successfully merging this pull request may close these issues.

Found non-API calls to R: SETLENGTH, SET_TRUELENGTH