Update based on some recent developments (#379)

* bookkeeping * removed unnecessary call * marking as const * adding more consts * additional safety measures * parallelization working (#355) * parallelization working * updating tests * styling * updating test * added info about parallelization * correcting an error in docs * update cran-comments * updating comments * updating news * deleting submission file * Adding support for sampling from prior (#360) * added prior sampling * updating news * deleting submission file * fixing the prior sampling * styling * added unit test for SMC starting from prior * added test for sample_prior * updating test after merging with TBB * styling * updating news * fixed error in update_mallows.SMCMallows * updated failing test and added namespace qualifier * added long-running SMC test from prior * Take care of item names properly (#363) * fixing item name issue * updating a unit test * styling * Can now deal with a single vector of input data (#364) * added handling of vector data #361 * styling * updating news * updated set_priors function * closes #370 * Added a gamma prior (#371) * updated set_priors function * closes #370 * Had forgot to implement the change... (#372) * updated set_priors function * closes #370 * fixed forgotten implementation * Added lag option (#373) * updated documentation * adding references section * more documentation updates * updated set_priors function * closes #370 * implemented user-defined lag * added lag, closes #369 * styling * ready for the change * Resampling issue 365 (#376) * updated set_priors function * closes #370 * ready for the change * implemented the resampler in cpp * added various resampling options. tests are missing * fixed numerical overflow problem and forgotten mcmc loop * added a simple test for the resampler * styling * added missing memory header * removed two tests where my M1 mac gives different results * removed git conflict marker * added a line shift * removing const-ref from built-in types * refactoring limits functions for pairwise augmentation * increasing test strictness * added some more work * moving distance code into implementation file * moved partition function code into cpp files * added code for reproducing Liu et al 2019 review * changed updating frequency for pkgdown. closes #380 * adding ignore to codecov * fixing #381 (#382) * Heatplot issue 381 (#383) * fixing #381 * added unit test for #381 * updated news
ocbe-uio · Feb 16, 2024 · 8850531 · 8850531
1 parent a1d3ee8
commit 8850531
Show file tree

Hide file tree

Showing 85 changed files with 1,784 additions and 478 deletions.
diff --git a/.github/workflows/pkgdown.yaml b/.github/workflows/pkgdown.yaml
@@ -1,10 +1,6 @@
 # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
 # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
 on:
-  push:
-    branches: [master]
-  pull_request:
-    branches: [master]
   release:
     types: [published]
   workflow_dispatch:

diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: BayesMallows
 Type: Package
 Title: Bayesian Preference Learning with the Mallows Rank Model
-Version: 2.0.1
+Version: 2.0.1.9003
 Authors@R: c(person("Oystein", "Sorensen",
                     email = "oystein.sorensen.1985@gmail.com",
                     role = c("aut", "cre"),
@@ -46,7 +46,7 @@ BugReports: https://github.com/ocbe-uio/BayesMallows/issues
 License: GPL-3
 Encoding: UTF-8
 LazyData: true
-RoxygenNote: 7.2.3
+RoxygenNote: 7.3.1
 Depends: R (>= 3.5.0)
 Imports: Rcpp (>= 1.0.0),
   ggplot2 (>= 3.1.0),

diff --git a/NAMESPACE b/NAMESPACE
@@ -6,12 +6,15 @@ S3method(compute_consensus,BayesMallows)
 S3method(compute_consensus,SMCMallows)
 S3method(compute_posterior_intervals,BayesMallows)
 S3method(compute_posterior_intervals,SMCMallows)
+S3method(generate_initial_ranking,BayesMallowsIntransitive)
+S3method(generate_initial_ranking,BayesMallowsTransitiveClosure)
 S3method(plot,BayesMallows)
 S3method(plot,SMCMallows)
 S3method(print,BayesMallows)
 S3method(print,BayesMallowsMixtures)
 S3method(print,SMCMallows)
 S3method(update_mallows,BayesMallows)
+S3method(update_mallows,BayesMallowsPriorSamples)
 S3method(update_mallows,SMCMallows)
 export(assess_convergence)
 export(assign_cluster)
@@ -33,6 +36,7 @@ export(plot_elbow)
 export(plot_top_k)
 export(predict_top_k)
 export(sample_mallows)
+export(sample_prior)
 export(set_compute_options)
 export(set_initial_values)
 export(set_model_options)

diff --git a/NEWS.md b/NEWS.md
@@ -1,5 +1,26 @@
+# BayesMallows (development versions)
+
+* Fixed a bug in heat_plot() when the model has been estimated with 
+  rho_thinning > 1, causing the probabilities to be unnormalized. Issue #381. 
+  Thanks to Marta Crispino for discovering the bug.
+* Added stratified, systematic, and residual resampling to the sequential 
+  Monte Carlo algorithm. These distributions should in general be preferred to
+  multinomial resampling, which was the only available option until now.
+* The move step of the SMC algorithm now allows a user-defined lag for the 
+  sampling of latent ranks, specified in the "latent_sampling_lag" argument
+  to set_smc_options().
+* Prior for precision parameter alpha is now a gamma distribution. Until now
+  an exponential distribution has been assumed. Since the exponential is a special
+  case of the gamma with shape parameter equal to 1 (the default), this is not
+  a breaking change. However, it adds flexibility when it comes to specifying the prior.
+* setup_rank_data() now accepts a single vector of rankings, silently converting a to matrix with a single row.
+* Sequential Monte Carlo algorithm can now start from a sample from the prior
+  distribution, see the sample_prior() function for an example.
+* Added support for parallelism under-the-hood with oneTBB.
+
 # BayesMallows 2.0.1
 
+* Edits to C++ code fixing memory leaks.
 * Edits to unit tests which caused issues on CRAN.
 
 # BayesMallows 2.0.0

diff --git a/R/compute_mallows.R b/R/compute_mallows.R
@@ -79,21 +79,10 @@ compute_mallows <- function(
   validate_class(initial_values, "BayesMallowsInitialValues")
   validate_rankings(data)
   validate_preferences(data, model_options)
+  validate_rankings(data)
   validate_initial_values(initial_values, data)
 
-  pfun_values <- tryCatch(
-    prepare_partition_function(model_options$metric, data$n_items),
-    error = function(e) {
-      if (is.null(pfun_estimate)) {
-        stop(
-          "Exact partition function not known. Please provide an ",
-          "estimate in argument pfun_estimate."
-        )
-      } else {
-        return(NULL)
-      }
-    }
-  )
+  pfun_values <- extract_pfun_values(model_options, data, pfun_estimate)
 
   if (is.null(cl)) {
     lapplyfun <- lapply

diff --git a/R/estimate_partition_function.R b/R/estimate_partition_function.R
@@ -112,6 +112,22 @@ estimate_partition_function <- function(
   matrix(c(power, stats::lm(form, data = estimate)$coefficients), ncol = 2)
 }
 
+extract_pfun_values <- function(model_options, data, pfun_estimate) {
+  tryCatch(
+    prepare_partition_function(model_options$metric, data$n_items),
+    error = function(e) {
+      if (is.null(pfun_estimate)) {
+        stop(
+          "Exact partition function not known. Please provide an ",
+          "estimate in argument pfun_estimate."
+        )
+      } else {
+        return(NULL)
+      }
+    }
+  )
+}
+
 prepare_partition_function <- function(metric, n_items) {
   if (metric %in% c("cayley", "hamming", "kendall")) {
     return(NULL)

diff --git a/R/generate_initial_ranking.R b/R/generate_initial_ranking.R
@@ -4,6 +4,7 @@ generate_initial_ranking <- function(
   UseMethod("generate_initial_ranking")
 }
 
+#' @export
 generate_initial_ranking.BayesMallowsTransitiveClosure <- function(
     preferences, cl = NULL, shuffle_unranked = FALSE, random = FALSE,
     random_limit = 8L) {
@@ -34,12 +35,7 @@ generate_initial_ranking.BayesMallowsTransitiveClosure <- function(
   }
 }
 
-
-.S3method(
-  "generate_initial_ranking", "BayesMallowsTransitiveClosure",
-  generate_initial_ranking.BayesMallowsTransitiveClosure
-)
-
+#' @export
 generate_initial_ranking.BayesMallowsIntransitive <- function(
     preferences, cl = NULL, shuffle_unranked = FALSE,
     random = FALSE, random_limit = 8L) {

diff --git a/R/heat_plot.R b/R/heat_plot.R
@@ -36,14 +36,14 @@ heat_plot <- function(model_fit, burnin = model_fit$burnin, ...) {
     drop = FALSE
   ]
   posterior_ranks$probability <- 1
-  posterior_ranks$iteration <- NULL
+
   heatplot_data <- aggregate(posterior_ranks[, "probability", drop = FALSE],
     by = list(
       cluster = posterior_ranks$cluster,
       item = posterior_ranks$item,
       value = posterior_ranks$value
     ),
-    FUN = function(x) sum(x) / (model_fit$nmc - burnin)
+    FUN = function(x) sum(x) / length(unique(posterior_ranks$iteration))
   )
 
   heatplot_data$item <- factor(heatplot_data$item, levels = item_order)

diff --git a/R/sample_prior.R b/R/sample_prior.R
@@ -0,0 +1,30 @@
+#' Sample from prior distribution
+#'
+#' Function to obtain samples from the prior distributions of the Bayesian
+#' Mallows model. Intended to be given to [update_mallows()].
+#'
+#' @param n An integer specifying the number of samples to take.
+#' @param n_items An integer specifying the number of items to be ranked.
+#' @param priors An object of class "BayesMallowsPriors" returned from
+#'   [set_priors()].
+#'
+#' @return An object of class "BayesMallowsPriorSample", containing `n`
+#' independent samples of \eqn{\alpha} and \eqn{\rho}.
+#'
+#' @export
+#'
+#' @family modeling
+#' @example /inst/examples/sample_prior_example.R
+sample_prior <- function(n, n_items, priors = set_priors()) {
+  validate_positive(n)
+  validate_positive(n_items)
+  ret <- list(
+    alpha = stats::rgamma(n, shape = priors$gamma, rate = priors$lambda),
+    rho = replicate(n, sample(n_items, n_items)),
+    priors = priors,
+    n_items = n_items,
+    items = seq_len(n_items)
+  )
+  class(ret) <- "BayesMallowsPriorSamples"
+  ret
+}
diff --git a/R/set_priors.R b/R/set_priors.R
@@ -3,8 +3,13 @@
 #' @description Set values related to the prior distributions for the Bayesian
 #'   Mallows model.
 #'
+#' @param gamma Strictly positive numeric value specifying the shape parameter
+#'   of the gamma prior distribution of \eqn{\alpha}. Defaults to `1`, thus
+#'   recovering the exponential prior distribution used by
+#'   \insertCite{vitelli2018}{BayesMallows}.
+#'
 #' @param lambda Strictly positive numeric value specifying the rate parameter
-#'   of the truncated exponential prior distribution of \eqn{\alpha}. Defaults
+#'   of the gamma prior distribution of \eqn{\alpha}. Defaults
 #'   to `0.001`. When `n_cluster > 1`, each mixture component \eqn{\alpha_{c}}
 #'   has the same prior distribution.
 #'
@@ -26,10 +31,13 @@
 #'   [update_mallows()].
 #' @export
 #'
+#' @references \insertAllCited{}
+#'
 #' @family preprocessing
 #'
-set_priors <- function(lambda = 0.001, psi = 10, kappa = c(1, 3)) {
+set_priors <- function(gamma = 1, lambda = 0.001, psi = 10, kappa = c(1, 3)) {
   stopifnot(length(kappa) == 2)
+  validate_positive(gamma)
   validate_positive(lambda)
   validate_integer(psi)
   validate_positive(psi)

diff --git a/R/set_smc_options.R b/R/set_smc_options.R
@@ -1,20 +1,73 @@
 #' @title Set SMC compute options
 #'
-#' @description Sets the SMC compute options to be used in [update_mallows.BayesMallows()].
+#' @description Sets the SMC compute options to be used in
+#'   [update_mallows.BayesMallows()].
 #'
 #' @param n_particles Integer specifying the number of particles.
 #' @param mcmc_steps Number of MCMC steps to be applied in the resample-move
 #'   step.
+#' @param resampler Character string defining the resampling method to use. One
+#'   of "stratified", "systematic", "residual", and "multinomial". Defaults to
+#'   "stratified". While multinomial resampling was used in
+#'   \insertCite{steinSequentialInferenceMallows2023;textual}{BayesMallows},
+#'   stratified, systematic, or residual resampling typically give lower Monte
+#'   Carlo error \insertCite{Douc2005,Hol2006,Naesseth2019}{BayesMallows}.
+#' @param latent_sampling_lag Parameter specifying the number of timesteps to go
+#'   back when resampling the latent ranks in the move step. See Section 6.2.3
+#'   of \insertCite{Kantas2015}{BayesMallows} for details. The \eqn{L} in their
+#'   notation corresponds to `latent_sampling_lag`. See more under Details.
+#'   Defaults to `NA`, which means that all latent ranks from previous timesteps
+#'   are resampled. If set to `0`, no move step is applied to the latent ranks.
 #'
 #' @return An object of class "SMCOptions".
+#'
+#'
+#' @section Lag parameter in move step:
+#'
+#'   The parameter `latent_sampling_lag` corresponds to \eqn{L} in
+#'   \insertCite{Kantas2015}{BayesMallows}. Its use in this package is can be
+#'   explained in terms of Algorithm 12 in
+#'   \insertCite{steinSequentialInferenceMallows2023}{BayesMallows}. The
+#'   relevant line of the algorithm is:
+#'
+#'   **for** \eqn{j = 1 : M_{t}} **do** \cr
+#'   **M-H step:** update \eqn{\tilde{\mathbf{R}}_{j}^{(i)}} with proposal
+#'   \eqn{\tilde{\mathbf{R}}_{j}' \sim q(\tilde{\mathbf{R}}_{j}^{(i)} |
+#'   \mathbf{R}_{j}, \boldsymbol{\rho}_{t}^{(i)}, \alpha_{t}^{(i)})}.\cr
+#'   **end**
+#'
+#'   Let \eqn{L} denote the value of `latent_sampling_lag`. With this parameter,
+#'   we modify for algorithm so it becomes
+#'
+#'   **for** \eqn{j = M_{t-L+1} : M_{t}} **do** \cr
+#'   **M-H step:** update \eqn{\tilde{\mathbf{R}}_{j}^{(i)}} with proposal
+#'   \eqn{\tilde{\mathbf{R}}_{j}' \sim q(\tilde{\mathbf{R}}_{j}^{(i)} |
+#'   \mathbf{R}_{j}, \boldsymbol{\rho}_{t}^{(i)}, \alpha_{t}^{(i)})}.\cr
+#'   **end**
+#'
+#'   This means that with \eqn{L=0} no move step is performed on any latent
+#'   ranks, whereas \eqn{L=1} means that the move step is only applied to the
+#'   parameters entering at the given timestep. The default,
+#'   `latent_sampling_lag = NA` means that \eqn{L=t} at each timestep, and hence
+#'   all latent ranks are part of the move step at each timestep.
+#'
+#'
 #' @export
+#' @references \insertAllCited{}
 #'
 #' @family preprocessing
 set_smc_options <- function(
     n_particles = 1000,
-    mcmc_steps = 5) {
+    mcmc_steps = 5,
+    resampler = c("stratified", "systematic", "residual", "multinomial"),
+    latent_sampling_lag = NA_integer_) {
   validate_integer(n_particles)
   validate_integer(mcmc_steps)
+  if (!is.na(latent_sampling_lag)) validate_integer(latent_sampling_lag)
+  resampler <- match.arg(
+    resampler,
+    c("stratified", "systematic", "residual", "multinomial")
+  )
 
   ret <- as.list(environment())
   class(ret) <- "SMCOptions"

diff --git a/R/setup_rank_data.R b/R/setup_rank_data.R
@@ -8,7 +8,9 @@
 #'   optional initial value of the rankings. If `rankings` has column names,
 #'   these are assumed to be the names of the items. `NA` values in rankings are
 #'   treated as missing data and automatically augmented; to change this
-#'   behavior, see the `na_action` argument to [set_model_options()].
+#'   behavior, see the `na_action` argument to [set_model_options()]. A vector
+#'   length `n_items` is silently converted to a matrix of length `1 x n_items`,
+#'   and names (if any), are used as column names.
 #'
 #' @param preferences A data frame with one row per pairwise comparison, and
 #'   columns `assessor`, `top_item`, and `bottom_item`. Each column contains the
@@ -80,6 +82,12 @@
 #'   when all possible orderings are computed, i.e., when `random=TRUE`.
 #'   Defaults to `8L`.
 #'
+#' @param timepoint Integer vector specifying the timepoint. Defaults to `NULL`,
+#'   which means that a vector of ones, one for each observation, is generated.
+#'   Used by [update_mallows()] to identify data with a given iteration of the
+#'   sequential Monte Carlo algorithm. If not `NULL`, must contain one integer
+#'   for each row in `rankings`.
+#'
 #' @note Setting `random=TRUE` means that all possible orderings of each
 #'   assessor's preferences are generated, and one of them is picked at random.
 #'   This can be useful when experiencing convergence issues, e.g., if the MCMC
@@ -126,7 +134,8 @@ setup_rank_data <- function(
     cl = NULL,
     shuffle_unranked = FALSE,
     random = FALSE,
-    random_limit = 8L) {
+    random_limit = 8L,
+    timepoint = NULL) {
   na_action <- match.arg(na_action, c("augment", "fail", "omit"))
 
   if (is.null(rankings) && is.null(preferences)) {
@@ -138,10 +147,19 @@ setup_rank_data <- function(
     if (na_action == "fail" && any(is.na(rankings))) {
       stop("rankings matrix contains NA values")
     }
+    if (!is.matrix(rankings)) {
+      rankings <- matrix(rankings,
+        nrow = 1,
+        dimnames = list(NULL, names(rankings))
+      )
+    }
 
     if (na_action == "omit" && any(is.na(rankings))) {
       keeps <- apply(rankings, 1, function(x) !any(is.na(x)))
-      print(paste("Omitting", sum(!keeps), "row(s) from rankings due to NA values"))
+      print(paste(
+        "Omitting", sum(!keeps),
+        "row(s) from rankings due to NA values"
+      ))
       rankings <- rankings[keeps, , drop = FALSE]
     }
   } else {
@@ -153,7 +171,10 @@ setup_rank_data <- function(
   if (!is.null(observation_frequency)) {
     validate_positive_vector(observation_frequency)
     if (nrow(rankings) != length(observation_frequency)) {
-      stop("observation_frequency must be of same length as the number of rows in rankings")
+      stop(
+        "observation_frequency must be of same ",
+        "length as the number of rows in rankings"
+      )
     }
   } else {
     observation_frequency <- rep(1, nrow(rankings))
@@ -164,6 +185,19 @@ setup_rank_data <- function(
     stop("invalid permutations provided in rankings matrix")
   }
   n_items <- ncol(rankings)
+
+  if (!is.null(colnames(rankings))) {
+    items <- colnames(rankings)
+  } else {
+    items <- paste("Item", seq(from = 1, to = n_items, by = 1))
+  }
+
+  if (is.null(timepoint)) timepoint <- rep(1, nrow(rankings))
+  if (length(timepoint) != nrow(rankings)) {
+    stop("must have one timepoint per row")
+  }
+
+
   constraints <- generate_constraints(preferences, n_items, cl)
   consistent <- matrix(integer(0))