* [Python and R] `adjusted_asymmetric_accuracy` now accepts confusion matrices with fewer columns than rows. Such "missing" columns are now treated as if they were filled with 0s.
* [Python and R] `pair_sets_index`, and `normalized_accuracy` return the same results for nonsymmetric confusion matrices and transposes thereof.
gagolews · Sep 17, 2022 · commit f215b33 · 1 parent c0e6eac
Showing 12 changed files with 116 additions and 90 deletions.
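
The commit message states that `normalized_accuracy` now gives the same result for a confusion matrix and its transpose. The documentation changed in this commit defines the index as $(Accuracy(C_\sigma)-1/\max(K,L))/(1-1/\max(K,L))$. The following is a minimal brute-force sketch of that formula, assuming that $Accuracy(C_\sigma)$ is the matched count divided by the total count and that a non-square matrix can be padded with zeros to a square one. The name `normalized_accuracy_sketch` is hypothetical, and this is not the package's implementation, which is documented as solving a linear sum assignment problem rather than enumerating permutations.

```python
from itertools import permutations


def normalized_accuracy_sketch(C):
    """Brute-force normalized accuracy for a small confusion matrix C
    (a list of K rows, each with L nonnegative counts)."""
    K, L = len(C), len(C[0])
    m = max(K, L)
    # Assumption: pad with zeros so that the matrix becomes m-by-m.
    square = [
        [C[i][j] if i < K and j < L else 0 for j in range(m)]
        for i in range(m)
    ]
    n = sum(sum(row) for row in C)  # total number of points
    # Enumerate all one-to-one matchings of rows to columns
    # (only feasible for small m).
    best = max(
        sum(square[i][sigma[i]] for i in range(m))
        for sigma in permutations(range(m))
    )
    accuracy = best / n
    return (accuracy - 1.0 / m) / (1.0 - 1.0 / m)


# A 4-by-3 confusion matrix (more rows than columns) and its transpose.
C = [[3, 0, 0], [0, 3, 0], [0, 0, 1], [1, 1, 0]]
Ct = [list(col) for col in zip(*C)]

print(normalized_accuracy_sketch(C))
print(normalized_accuracy_sketch(Ct))  # same value as for C
```

Padding to a square matrix makes the symmetry easy to see: every matching of the padded matrix corresponds to a matching of its transpose with the same total.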
diff --git a/.devel/sphinx/news.md b/.devel/sphinx/news.md
@@ -1,6 +1,18 @@
 # What Is New in *genieclust*
 
 
+## 1.1.1.9001 (under development)
+
+*  [Python and R] `adjusted_asymmetric_accuracy`
+   now accepts confusion matrices with fewer columns than rows.
+   Such "missing" columns are now treated as if they were filled with 0s.
+
+*  [Python and R] `pair_sets_index`, and `normalized_accuracy` return
+   the same results for nonsymmetric confusion matrices and transposes thereof.
+
+*  ...
+
+
 ## 1.1.1 (2022-09-15)
 
 *  [Python] #75: `nmslib` is now optional.

diff --git a/.devel/sphinx/rapi/compare_partitions.md b/.devel/sphinx/rapi/compare_partitions.md
@@ -48,11 +48,11 @@ Each index except `adjusted_asymmetric_accuracy()` can act as a pairwise partiti
 
 Each index except `mi_score()` (which computes the mutual information score) outputs 1 given two identical partitions. Note that partitions are always defined up to a bijection of the set of possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4) represent the same 2-partition.
 
-`adjusted_asymmetric_accuracy()` (Gagolewski, 2022) only accepts $K = L$. It is an external cluster validity measure which assumes that the label vector `x` (or rows in the confusion matrix) represents the reference (ground truth) partition. It is a corrected-for-chance summary of the proportion of correctly classified points in each cluster (with cluster matching based on the solution to the maximal linear sum assignment problem; see [`normalized_confusion_matrix`](compare_partitions.md)), given by: $(\max_\sigma \sum_{i=1}^K (c_{i, \sigma(i)}/(c_{i, 1}+...+c_{i, K})) - 1)/(K - 1)$, where $C$ is the confusion matrix.
+`adjusted_asymmetric_accuracy()` (Gagolewski, 2022) is an external cluster validity measure which assumes that the label vector `x` (or rows in the confusion matrix) represents the reference (ground truth) partition. It is a corrected-for-chance summary of the proportion of correctly classified points in each cluster (with cluster matching based on the solution to the maximal linear sum assignment problem; see [`normalized_confusion_matrix`](compare_partitions.md)), given by: $(\max_\sigma \sum_{i=1}^K (c_{i, \sigma(i)}/(c_{i, 1}+...+c_{i, K})) - 1)/(K - 1)$, where $C$ is the confusion matrix.
 
-`normalized_accuracy()` is defined as $(Accuracy(C_\sigma)-1/L)/(1-1/L)$, where $C_\sigma$ is a version of the confusion matrix for given `x` and `y`, $K \leq L$, with columns permuted based on the solution to the maximal linear sum assignment problem. The $Accuracy(C_\sigma)$ part is sometimes referred to as set-matching classification rate or pivoted accuracy.
+`normalized_accuracy()` is defined as $(Accuracy(C_\sigma)-1/max(K,L))/(1-1/max(K,L))$, where $C_\sigma$ is a version of the confusion matrix for given `x` and `y` with columns permuted based on the solution to the maximal linear sum assignment problem. The $Accuracy(C_\sigma)$ part is sometimes referred to as set-matching classification rate or pivoted accuracy.
 
-`pair_sets_index()` gives the Pair Sets Index (PSI) adjusted for chance (Rezaei, Franti, 2016), $K \leq L$. Pairing is based on the solution to the linear sum assignment problem of a transformed version of the confusion matrix. Its simplified version assumes E=1 in the definition of the index, i.e., uses Eq. (20) instead of (18).
+`pair_sets_index()` gives the Pair Sets Index (PSI) adjusted for chance (Rezaei, Franti, 2016). Pairing is based on the solution to the linear sum assignment problem of a transformed version of the confusion matrix. Its simplified version assumes E=1 in the definition of the index, i.e., uses Eq. (20) instead of (18).
 
 `rand_score()` gives the Rand score (the \"probability\" of agreement between the two partitions) and `adjusted_rand_score()` is its version corrected for chance, see (Hubert, Arabie, 1985), its expected value is 0.0 given two independent partitions. Due to the adjustment, the resulting index might also be negative for some inputs.
 

diff --git a/.devel/tinytest/test-compare-partitions.R b/.devel/tinytest/test-compare-partitions.R
@@ -47,11 +47,3 @@ for (score in scores) {
         }
     }
 }
-
-x <- c(1, 1, 1, 2, 2, 2, 3, 2, 1)
-y <- c(1, 1, 1, 2, 2, 2, 3, 4, 4)
-expect_error(normalized_accuracy(y, x))
-expect_error(pair_sets_index(y, x))
-expect_error(pair_sets_index(y, x, TRUE))
-expect_error(adjusted_asymmetric_accuracy(y, x))
-expect_true(mi_score(x, y) >= 0)
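
The removed assertions expected errors for the argument order `(y, x)`, whose confusion matrix has four rows and three columns. As a rough illustration of what `adjusted_asymmetric_accuracy` is documented to compute for such input, the sketch below pads the matrix with zero columns and maximizes over assignments by enumeration. The helpers `confusion_matrix_sketch` and `aaa_sketch` are hypothetical names introduced here. This is not the package's implementation, which is documented as relying on the solution to a linear sum assignment problem.

```python
from itertools import permutations


def confusion_matrix_sketch(x, y):
    """Count co-occurrences: rows follow sorted labels of x, columns those of y."""
    xs, ys = sorted(set(x)), sorted(set(y))
    C = [[0] * len(ys) for _ in xs]
    for a, b in zip(x, y):
        C[xs.index(a)][ys.index(b)] += 1
    return C


def aaa_sketch(C):
    """Brute-force (max_sigma sum_i c[i][sigma(i)] / rowsum_i - 1) / (K - 1).

    Assumes K >= 2 rows, each with a positive sum (rows = reference partition).
    """
    K, L = len(C), len(C[0])
    width = max(K, L)
    # "Missing" columns are treated as if they were filled with 0s.
    padded = [list(row) + [0] * (width - L) for row in C]
    row_sums = [sum(row) for row in padded]
    # Each row is matched to a distinct column; enumerate every such matching.
    best = max(
        sum(padded[i][sigma[i]] / row_sums[i] for i in range(K))
        for sigma in permutations(range(width), K)
    )
    return (best - 1.0) / (K - 1.0)


# The label vectors from the removed tests (in Python tuples instead of R vectors).
x = (1, 1, 1, 2, 2, 2, 3, 2, 1)
y = (1, 1, 1, 2, 2, 2, 3, 4, 4)

C = confusion_matrix_sketch(y, x)  # 4 rows (labels of y), 3 columns (labels of x)
print(C)
print(aaa_sketch(C))  # approximately 0.667
```

Zero padding leaves the row sums unchanged, so a row matched to a padded column simply contributes 0 to the sum.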
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,8 +1,8 @@
 Package: genieclust
 Type: Package
 Title: Fast and Robust Hierarchical Clustering with Noise Points Detection
-Version: 1.1.1
-Date: 2022-09-15
+Version: 1.1.1.9001
+Date: 2022-09-17
 Authors@R: c(
     person("Marek", "Gagolewski",
         role = c("aut", "cre", "cph"),

diff --git a/NEWS b/NEWS
@@ -1,6 +1,18 @@
 # What Is New in *genieclust*
 
 
+## 1.1.1.9001 (under development)
+
+*  [Python and R] `adjusted_asymmetric_accuracy`
+   now accepts confusion matrices with fewer columns than rows.
+   Such "missing" columns are now treated as if they were filled with 0s.
+
+*  [Python and R] `pair_sets_index`, and `normalized_accuracy` return
+   the same results for nonsymmetric confusion matrices and transposes thereof.
+
+*  ...
+
+
 ## 1.1.1 (2022-09-15)
 
 *  [Python] #75: `nmslib` is now optional.

diff --git a/R/RcppExports.R b/R/RcppExports.R
@@ -28,7 +28,7 @@
 #'
 #'
 #' \code{adjusted_asymmetric_accuracy()} (Gagolewski, 2022)
-#' only accepts \eqn{K = L}. It is an external cluster validity measure
+#' is an external cluster validity measure
 #' which assumes that the label vector \code{x} (or rows in the confusion
 #' matrix) represents the reference (ground truth) partition.
 #' It is a corrected-for-chance summary of the proportion of correctly
@@ -39,15 +39,15 @@
 #' where \eqn{C} is the confusion matrix.
 #'
 #' \code{normalized_accuracy()} is defined as
-#' \eqn{(Accuracy(C_\sigma)-1/L)/(1-1/L)}, where \eqn{C_\sigma} is a version
-#' of the confusion matrix for given \code{x} and \code{y},
-#' \eqn{K \leq L}, with columns permuted based on the solution to the
+#' \eqn{(Accuracy(C_\sigma)-1/max(K,L))/(1-1/max(K,L))}, where \eqn{C_\sigma} is a version
+#' of the confusion matrix for given \code{x} and \code{y}
+#' with columns permuted based on the solution to the
 #' maximal linear sum assignment problem.
 #' The \eqn{Accuracy(C_\sigma)} part is sometimes referred to as
 #' set-matching classification rate or pivoted accuracy.
 #'
 #' \code{pair_sets_index()} gives the Pair Sets Index (PSI)
-#' adjusted for chance (Rezaei, Franti, 2016), \eqn{K \leq L}.
+#' adjusted for chance (Rezaei, Franti, 2016).
 #' Pairing is based on the solution to the linear sum assignment problem
 #' of a transformed version of the confusion matrix.
 #' Its simplified version assumes E=1 in the definition of the index,

diff --git a/genieclust/__init__.py b/genieclust/__init__.py
@@ -20,12 +20,13 @@
 #                                                                              #
 # ############################################################################ #
 
+# version string, e.g., "1.0.0.9001" or "1.1.1"
+__version__ = "1.1.1.9001"
+
 
 from . import plots
 from . import inequity
 from . import tools
 from . import compare_partitions
 from . import internal
 from .genie import Genie, GIc
-
-__version__ = "1.1.1"  # see also ../setup.py; e.g., "1.0.0.9001"
diff --git a/genieclust/compare_partitions.pyx b/genieclust/compare_partitions.pyx
@@ -325,7 +325,7 @@ cpdef dict compare_partitions(Py_ssize_t[:,::1] C):
 
     C : ndarray
         A ``c_contiguous`` confusion matrix (contingency table)
-        with :math:`K` rows and :math:`L` columns, where :math:`K \\le L`.
+        with :math:`K` rows and :math:`L` columns.
 
 
     Returns
@@ -355,7 +355,7 @@ cpdef dict compare_partitions(Py_ssize_t[:,::1] C):
         ``'spsi'``
             Simplified pair sets index
         ``'aaa'``
-            Adjusted asymmetric accuracy (or ``nan`` if :math:`K \\neq L`);
+            Adjusted asymmetric accuracy;
             it is assumed that rows in `C` represent the ground-truth
             partition
 
@@ -397,8 +397,8 @@ cpdef dict compare_partitions(Py_ssize_t[:,::1] C):
     nonempty and pairwise disjoint subsets.
     For instance, these can be two clusterings of a dataset with :math:`n`
     observations specified as vectors of labels. Moreover, let `C` be the
-    confusion matrix (with :math:`K` rows and :math:`L` columns,
-    :math:`K \\leq L`) corresponding to `x` and `y`; see also
+    confusion matrix with :math:`K` rows and :math:`L` columns,
+    corresponding to `x` and `y`; see also
     :func:`confusion_matrix`.
 
     This function implements a few scores that aim to quantify
@@ -418,8 +418,7 @@ cpdef dict compare_partitions(Py_ssize_t[:,::1] C):
     possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4)
     represent the same 2-partition.
 
-    `adjusted_asymmetric_accuracy` [2]_
-    only accepts :math:`K = L`. It is an external cluster validity measure
+    `adjusted_asymmetric_accuracy` [2]_ is an external cluster validity measure
     which assumes that the label vector `x` (or rows in the confusion
     matrix) represents the reference (ground truth) partition.
     It is a corrected-for-chance summary of the proportion of correctly
@@ -430,16 +429,16 @@ cpdef dict compare_partitions(Py_ssize_t[:,::1] C):
     where :math:`C` is the confusion matrix.
 
     `normalized_accuracy` is a measure defined as
-    :math:`(\\mathrm{Accuracy}(C_\\sigma)-1/L)/(1-1/L)`,
+    :math:`(\\mathrm{Accuracy}(C_\\sigma)-1/\\max(K,L))/(1-1/\\max(K,L))`,
     where :math:`C_\\sigma` is a version of the confusion matrix
-    for given `x` and `y`, :math:`K \\leq L`, with columns permuted
+    for given `x` and `y` with columns permuted
     based on the solution to the maximal linear sum assignment problem.
     Note that the :math:`\\mathrm{Accuracy}(C_\\sigma)` part
     is sometimes referred to as set-matching classification
     rate or pivoted accuracy.
 
     `pair_sets_index` gives the Pair Sets Index (PSI)
-    adjusted for chance [3]_, :math:`K \\leq L`.
+    adjusted for chance [3]_.
     Pairing is based on the solution to the linear sum assignment problem
     of a transformed version of the confusion matrix.
     Its simplified version assumes E=1 in the definition of the index,
@@ -515,9 +514,6 @@ cpdef dict compare_partitions(Py_ssize_t[:,::1] C):
     """
     cdef Py_ssize_t xc = C.shape[0]
     cdef Py_ssize_t yc = C.shape[1]
-    if xc > yc:
-        raise ValueError("number of rows in the confusion matrix \
-            must be less than or equal to the number of columns")
 
     cdef dict res1 = c_compare_partitions.Ccompare_partitions_pairs(&C[0,0], xc, yc)
 
@@ -956,9 +952,6 @@ cpdef double normalized_accuracy(x, y):
     cdef np.ndarray[Py_ssize_t,ndim=2] C = confusion_matrix(x, y)
     cdef Py_ssize_t xc = C.shape[0]
     cdef Py_ssize_t yc = C.shape[1]
-    if xc > yc:
-        raise ValueError("Number of rows in the confusion matrix "
-            "must be less than or equal to the number of columns.")
     return c_compare_partitions.Ccompare_partitions_nacc(&C[0,0], xc, yc)
 
 
@@ -1007,13 +1000,14 @@ cpdef double adjusted_asymmetric_accuracy(x, y):
     -----
 
     Let :math:`C` be a confusion matrix with :math:`K` rows
-    and :math:`K` columns.
+    and :math:`L` columns.
     AAA is an external cluster validity measure.
     It is a corrected-for-chance summary of the proportion of correctly
     classified points in each cluster (with cluster matching based on the
     solution to the maximal linear sum assignment problem; see
     :func:`normalize_confusion_matrix`), given by:
     :math:`(\\max_\\sigma \\sum_{i=1}^K (c_{i, \\sigma(i)}/(c_{i, 1}+...+c_{i, K})) - 1)/(K - 1)`.
+    Missing columns are treated as if they were filled with 0s.
 
     Note that this measure is not symmetric, i.e., ``index(x, y)`` does not
     have to be equal to ``index(y, x)``.
@@ -1034,9 +1028,6 @@ cpdef double adjusted_asymmetric_accuracy(x, y):
     cdef np.ndarray[Py_ssize_t,ndim=2] C = confusion_matrix(x, y)
     cdef Py_ssize_t xc = C.shape[0]
     cdef Py_ssize_t yc = C.shape[1]
-    if xc != yc:
-        raise ValueError("Number of rows in the confusion matrix "
-            "must be equal to the number of columns.")
     return c_compare_partitions.Ccompare_partitions_aaa(&C[0,0], xc, yc)
 
 
@@ -1099,9 +1090,6 @@ cpdef double pair_sets_index(x, y, bint simplified=False):
     cdef np.ndarray[Py_ssize_t,ndim=2] C = confusion_matrix(x, y)
     cdef Py_ssize_t xc = C.shape[0]
     cdef Py_ssize_t yc = C.shape[1]
-    if xc > yc:
-        raise ValueError("Number of rows in the confusion matrix "
-            "must be less than or equal to the number of columns.")
 
     if simplified:
         return c_compare_partitions.Ccompare_partitions_psi(&C[0,0], xc, yc).spsi

diff --git a/man/compare_partitions.Rd b/man/compare_partitions.Rd
diff --git a/setup.py b/setup.py
@@ -30,6 +30,7 @@
 import glob
 import os
 import sys
+import re
 
 
 cython_modules = {
@@ -136,10 +137,15 @@ def build_extensions(self):
 with open("README.md", "r") as fh:
     long_description = fh.read()
 
+with open("genieclust/__init__.py", "r") as fh:
+    __version__ = re.search("(?m)^\\s*__version__\\s*=\\s*[\"']([0-9.]+)[\"']", fh.read())
+    if __version__ is None:
+        raise ValueError("the package version could not be read")
+    __version__ = __version__.group(1)
 
 setuptools.setup(
     name="genieclust",
-    version="1.1.1",  # see also genieclust/__init__.py; e.g., 1.0.0.9001
+    version=__version__,
     description="Genie: Fast and Robust Hierarchical Clustering with Noise Points Detection",
     long_description=long_description,
     long_description_content_type="text/markdown",
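
The version lookup added to `setup.py` in this diff can be exercised in isolation. The following is a minimal sketch, assuming the same regular expression as in the diff. The helper `read_version` and the sample strings are hypothetical and not part of the package.

```python
import re

# Same pattern as in the setup.py hunk, written as a raw string.
VERSION_PATTERN = r"(?m)^\s*__version__\s*=\s*[\"']([0-9.]+)[\"']"


def read_version(source):
    """Return the version string assigned to __version__ in the given source text."""
    match = re.search(VERSION_PATTERN, source)
    if match is None:
        raise ValueError("the package version could not be read")
    return match.group(1)


sample = '# version string\n__version__ = "1.1.1.9001"\n\nfrom . import plots\n'
print(read_version(sample))  # 1.1.1.9001
```

Because the character class `[0-9.]+` admits only digits and dots, a version such as `"1.2.0rc1"` would not match and the sketch would raise `ValueError`, which appears to be the intended behaviour of the check in the diff.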