* [Python and R] adjusted_asymmetric_accuracy
   now accepts confusion matrices with fewer columns than rows.
   Such "missing" columns are now treated as if they were filled with 0s.

*  [Python and R] `pair_sets_index`, and `normalized_accuracy` return
   the same results for nonsymmetric confusion matrices and transposes thereof.
gagolews committed Sep 17, 2022
1 parent c0e6eac commit f215b33
Showing 12 changed files with 116 additions and 90 deletions.
12 changes: 12 additions & 0 deletions .devel/sphinx/news.md
@@ -1,6 +1,18 @@
# What Is New in *genieclust*


+## 1.1.1.9001 (under development)
+
+* [Python and R] `adjusted_asymmetric_accuracy`
+now accepts confusion matrices with fewer columns than rows.
+Such "missing" columns are now treated as if they were filled with 0s.
+
+* [Python and R] `pair_sets_index`, and `normalized_accuracy` return
+the same results for nonsymmetric confusion matrices and transposes thereof.
+
+* ...
+
+
## 1.1.1 (2022-09-15)

* [Python] #75: `nmslib` is now optional.
6 changes: 3 additions & 3 deletions .devel/sphinx/rapi/compare_partitions.md
@@ -48,11 +48,11 @@ Each index except `adjusted_asymmetric_accuracy()` can act as a pairwise partiti

Each index except `mi_score()` (which computes the mutual information score) outputs 1 given two identical partitions. Note that partitions are always defined up to a bijection of the set of possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4) represent the same 2-partition.

-`adjusted_asymmetric_accuracy()` (Gagolewski, 2022) only accepts $K = L$. It is an external cluster validity measure which assumes that the label vector `x` (or rows in the confusion matrix) represents the reference (ground truth) partition. It is a corrected-for-chance summary of the proportion of correctly classified points in each cluster (with cluster matching based on the solution to the maximal linear sum assignment problem; see [`normalized_confusion_matrix`](compare_partitions.md)), given by: $(\max_\sigma \sum_{i=1}^K (c_{i, \sigma(i)}/(c_{i, 1}+...+c_{i, K})) - 1)/(K - 1)$, where $C$ is the confusion matrix.
+`adjusted_asymmetric_accuracy()` (Gagolewski, 2022) is an external cluster validity measure which assumes that the label vector `x` (or rows in the confusion matrix) represents the reference (ground truth) partition. It is a corrected-for-chance summary of the proportion of correctly classified points in each cluster (with cluster matching based on the solution to the maximal linear sum assignment problem; see [`normalized_confusion_matrix`](compare_partitions.md)), given by: $(\max_\sigma \sum_{i=1}^K (c_{i, \sigma(i)}/(c_{i, 1}+...+c_{i, K})) - 1)/(K - 1)$, where $C$ is the confusion matrix.

-`normalized_accuracy()` is defined as $(Accuracy(C_\sigma)-1/L)/(1-1/L)$, where $C_\sigma$ is a version of the confusion matrix for given `x` and `y`, $K \leq L$, with columns permuted based on the solution to the maximal linear sum assignment problem. The $Accuracy(C_\sigma)$ part is sometimes referred to as set-matching classification rate or pivoted accuracy.
+`normalized_accuracy()` is defined as $(Accuracy(C_\sigma)-1/max(K,L))/(1-1/max(K,L))$, where $C_\sigma$ is a version of the confusion matrix for given `x` and `y` with columns permuted based on the solution to the maximal linear sum assignment problem. The $Accuracy(C_\sigma)$ part is sometimes referred to as set-matching classification rate or pivoted accuracy.

-`pair_sets_index()` gives the Pair Sets Index (PSI) adjusted for chance (Rezaei, Franti, 2016), $K \leq L$. Pairing is based on the solution to the linear sum assignment problem of a transformed version of the confusion matrix. Its simplified version assumes E=1 in the definition of the index, i.e., uses Eq. (20) instead of (18).
+`pair_sets_index()` gives the Pair Sets Index (PSI) adjusted for chance (Rezaei, Franti, 2016). Pairing is based on the solution to the linear sum assignment problem of a transformed version of the confusion matrix. Its simplified version assumes E=1 in the definition of the index, i.e., uses Eq. (20) instead of (18).

`rand_score()` gives the Rand score (the \"probability\" of agreement between the two partitions) and `adjusted_rand_score()` is its version corrected for chance, see (Hubert, Arabie, 1985), its expected value is 0.0 given two independent partitions. Due to the adjustment, the resulting index might also be negative for some inputs.
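
The revised `normalized_accuracy()` formula, with `max(K, L)` in place of `L`, can be checked by hand on a small matrix. The following is only a sketch and not the package's implementation: `normalized_accuracy_sketch` is a hypothetical helper, the matching is found by exhaustive search rather than by a linear sum assignment solver, and padding the matrix with zeros to make it square is an assumption made for illustration.

```python
from itertools import permutations


def normalized_accuracy_sketch(C):
    """Brute-force normalized accuracy of a small confusion matrix.

    C is a list of rows. Hypothetical helper for illustration only;
    it is not part of genieclust. Requires max(K, L) >= 2.
    """
    K, L = len(C), len(C[0])
    m = max(K, L)
    # Assumption: pad with zero rows/columns so the matrix is m-by-m.
    P = [[C[i][j] if i < K and j < L else 0 for j in range(m)]
         for i in range(m)]
    n = sum(map(sum, C))
    # Best one-to-one matching of rows to columns (exhaustive search).
    best = max(sum(P[i][s[i]] for i in range(m))
               for s in permutations(range(m)))
    accuracy = best / n
    return (accuracy - 1 / m) / (1 - 1 / m)


C = [[3, 0, 0, 1],
     [0, 3, 0, 1],
     [0, 0, 1, 0]]
Ct = [list(col) for col in zip(*C)]  # transpose: 4 rows, 3 columns

print(normalized_accuracy_sketch(C))
print(normalized_accuracy_sketch(Ct))
```

Both calls give the same value, since the best matching and `max(K, L)` do not change under transposition.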

8 changes: 0 additions & 8 deletions .devel/tinytest/test-compare-partitions.R
@@ -47,11 +47,3 @@ for (score in scores) {
}
}
}
-
-x <- c(1, 1, 1, 2, 2, 2, 3, 2, 1)
-y <- c(1, 1, 1, 2, 2, 2, 3, 4, 4)
-expect_error(normalized_accuracy(y, x))
-expect_error(pair_sets_index(y, x))
-expect_error(pair_sets_index(y, x, TRUE))
-expect_error(adjusted_asymmetric_accuracy(y, x))
-expect_true(mi_score(x, y) >= 0)
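
The removed tests expected an error because calling the indices as `(y, x)` yields a confusion matrix with more rows than columns. A minimal Python sketch of that shape difference, using the same label vectors, is given below. `toy_confusion_matrix` is a hypothetical stand-in written for this note and is not the package's `confusion_matrix`.

```python
def toy_confusion_matrix(x, y):
    """Contingency table: rows follow the labels in x, columns those in y.

    Hypothetical helper for illustration; not part of genieclust.
    """
    x_labels = sorted(set(x))
    y_labels = sorted(set(y))
    C = [[0] * len(y_labels) for _ in x_labels]
    for xi, yi in zip(x, y):
        C[x_labels.index(xi)][y_labels.index(yi)] += 1
    return C


x = [1, 1, 1, 2, 2, 2, 3, 2, 1]
y = [1, 1, 1, 2, 2, 2, 3, 4, 4]

C_xy = toy_confusion_matrix(x, y)  # 3 rows, 4 columns (K <= L)
C_yx = toy_confusion_matrix(y, x)  # 4 rows, 3 columns (K > L)
print(len(C_xy), len(C_xy[0]))
print(len(C_yx), len(C_yx[0]))
```

With the `K <= L` checks dropped in this commit, the second shape is no longer rejected, which is consistent with the `expect_error` calls being deleted.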
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,8 +1,8 @@
Package: genieclust
Type: Package
Title: Fast and Robust Hierarchical Clustering with Noise Points Detection
-Version: 1.1.1
-Date: 2022-09-15
+Version: 1.1.1.9001
+Date: 2022-09-17
Authors@R: c(
person("Marek", "Gagolewski",
role = c("aut", "cre", "cph"),
12 changes: 12 additions & 0 deletions NEWS
@@ -1,6 +1,18 @@
# What Is New in *genieclust*


+## 1.1.1.9001 (under development)
+
+* [Python and R] `adjusted_asymmetric_accuracy`
+now accepts confusion matrices with fewer columns than rows.
+Such "missing" columns are now treated as if they were filled with 0s.
+
+* [Python and R] `pair_sets_index`, and `normalized_accuracy` return
+the same results for nonsymmetric confusion matrices and transposes thereof.
+
+* ...
+
+
## 1.1.1 (2022-09-15)

* [Python] #75: `nmslib` is now optional.
10 changes: 5 additions & 5 deletions R/RcppExports.R
@@ -28,7 +28,7 @@
#'
#'
#' \code{adjusted_asymmetric_accuracy()} (Gagolewski, 2022)
-#' only accepts \eqn{K = L}. It is an external cluster validity measure
+#' is an external cluster validity measure
#' which assumes that the label vector \code{x} (or rows in the confusion
#' matrix) represents the reference (ground truth) partition.
#' It is a corrected-for-chance summary of the proportion of correctly
@@ -39,15 +39,15 @@
#' where \eqn{C} is the confusion matrix.
#'
#' \code{normalized_accuracy()} is defined as
-#' \eqn{(Accuracy(C_\sigma)-1/L)/(1-1/L)}, where \eqn{C_\sigma} is a version
-#' of the confusion matrix for given \code{x} and \code{y},
-#' \eqn{K \leq L}, with columns permuted based on the solution to the
+#' \eqn{(Accuracy(C_\sigma)-1/max(K,L))/(1-1/max(K,L))}, where \eqn{C_\sigma} is a version
+#' of the confusion matrix for given \code{x} and \code{y}
+#' with columns permuted based on the solution to the
#' maximal linear sum assignment problem.
#' The \eqn{Accuracy(C_\sigma)} part is sometimes referred to as
#' set-matching classification rate or pivoted accuracy.
#'
#' \code{pair_sets_index()} gives the Pair Sets Index (PSI)
-#' adjusted for chance (Rezaei, Franti, 2016), \eqn{K \leq L}.
+#' adjusted for chance (Rezaei, Franti, 2016).
#' Pairing is based on the solution to the linear sum assignment problem
#' of a transformed version of the confusion matrix.
#' Its simplified version assumes E=1 in the definition of the index,
5 changes: 3 additions & 2 deletions genieclust/__init__.py
@@ -20,12 +20,13 @@
# #
# ############################################################################ #

+# version string, e.g., "1.0.0.9001" or "1.1.1"
+__version__ = "1.1.1.9001"
+

from . import plots
from . import inequity
from . import tools
from . import compare_partitions
from . import internal
from .genie import Genie, GIc
-
-__version__ = "1.1.1" # see also ../setup.py; e.g., "1.0.0.9001"
32 changes: 10 additions & 22 deletions genieclust/compare_partitions.pyx
@@ -325,7 +325,7 @@ cpdef dict compare_partitions(Py_ssize_t[:,::1] C):
C : ndarray
A ``c_contiguous`` confusion matrix (contingency table)
-with :math:`K` rows and :math:`L` columns, where :math:`K \\le L`.
+with :math:`K` rows and :math:`L` columns.
Returns
@@ -355,7 +355,7 @@ cpdef dict compare_partitions(Py_ssize_t[:,::1] C):
``'spsi'``
Simplified pair sets index
``'aaa'``
-Adjusted asymmetric accuracy (or ``nan`` if :math:`K \\neq L`);
+Adjusted asymmetric accuracy;
it is assumed that rows in `C` represent the ground-truth
partition
@@ -397,8 +397,8 @@ cpdef dict compare_partitions(Py_ssize_t[:,::1] C):
nonempty and pairwise disjoint subsets.
For instance, these can be two clusterings of a dataset with :math:`n`
observations specified as vectors of labels. Moreover, let `C` be the
-confusion matrix (with :math:`K` rows and :math:`L` columns,
-:math:`K \\leq L`) corresponding to `x` and `y`; see also
+confusion matrix with :math:`K` rows and :math:`L` columns,
+corresponding to `x` and `y`; see also
:func:`confusion_matrix`.
This function implements a few scores that aim to quantify
@@ -418,8 +418,7 @@ cpdef dict compare_partitions(Py_ssize_t[:,::1] C):
possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4)
represent the same 2-partition.
-`adjusted_asymmetric_accuracy` [2]_
-only accepts :math:`K = L`. It is an external cluster validity measure
+`adjusted_asymmetric_accuracy` [2]_ is an external cluster validity measure
which assumes that the label vector `x` (or rows in the confusion
matrix) represents the reference (ground truth) partition.
It is a corrected-for-chance summary of the proportion of correctly
@@ -430,16 +429,16 @@ cpdef dict compare_partitions(Py_ssize_t[:,::1] C):
where :math:`C` is the confusion matrix.
`normalized_accuracy` is a measure defined as
-:math:`(\\mathrm{Accuracy}(C_\\sigma)-1/L)/(1-1/L)`,
+:math:`(\\mathrm{Accuracy}(C_\\sigma)-1/\\max(K,L))/(1-1/\\max(K,L))`,
where :math:`C_\\sigma` is a version of the confusion matrix
-for given `x` and `y`, :math:`K \\leq L`, with columns permuted
+for given `x` and `y` with columns permuted
based on the solution to the maximal linear sum assignment problem.
Note that the :math:`\\mathrm{Accuracy}(C_\\sigma)` part
is sometimes referred to as set-matching classification
rate or pivoted accuracy.
`pair_sets_index` gives the Pair Sets Index (PSI)
-adjusted for chance [3]_, :math:`K \\leq L`.
+adjusted for chance [3]_.
Pairing is based on the solution to the linear sum assignment problem
of a transformed version of the confusion matrix.
Its simplified version assumes E=1 in the definition of the index,
@@ -515,9 +514,6 @@ cpdef dict compare_partitions(Py_ssize_t[:,::1] C):
"""
cdef Py_ssize_t xc = C.shape[0]
cdef Py_ssize_t yc = C.shape[1]
-if xc > yc:
-raise ValueError("number of rows in the confusion matrix \
-must be less than or equal to the number of columns")

cdef dict res1 = c_compare_partitions.Ccompare_partitions_pairs(&C[0,0], xc, yc)

@@ -956,9 +952,6 @@ cpdef double normalized_accuracy(x, y):
cdef np.ndarray[Py_ssize_t,ndim=2] C = confusion_matrix(x, y)
cdef Py_ssize_t xc = C.shape[0]
cdef Py_ssize_t yc = C.shape[1]
-if xc > yc:
-raise ValueError("Number of rows in the confusion matrix "
-"must be less than or equal to the number of columns.")
return c_compare_partitions.Ccompare_partitions_nacc(&C[0,0], xc, yc)


@@ -1007,13 +1000,14 @@ cpdef double adjusted_asymmetric_accuracy(x, y):
-----
Let :math:`C` be a confusion matrix with :math:`K` rows
-and :math:`K` columns.
+and :math:`L` columns.
AAA is an external cluster validity measure.
It is a corrected-for-chance summary of the proportion of correctly
classified points in each cluster (with cluster matching based on the
solution to the maximal linear sum assignment problem; see
:func:`normalize_confusion_matrix`), given by:
:math:`(\\max_\\sigma \\sum_{i=1}^K (c_{i, \sigma(i)}/(c_{i, 1}+...+c_{i, K})) - 1)/(K - 1)`.
+Missing columns are treated as if they were filled with 0s.
Note that this measure is not symmetric, i.e., ``index(x, y)`` does not
have to be equal to ``index(y, x)``.
@@ -1034,9 +1028,6 @@ cpdef double adjusted_asymmetric_accuracy(x, y):
cdef np.ndarray[Py_ssize_t,ndim=2] C = confusion_matrix(x, y)
cdef Py_ssize_t xc = C.shape[0]
cdef Py_ssize_t yc = C.shape[1]
-if xc != yc:
-raise ValueError("Number of rows in the confusion matrix "
-"must be equal to the number of columns.")
return c_compare_partitions.Ccompare_partitions_aaa(&C[0,0], xc, yc)
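
The documented formula for adjusted asymmetric accuracy, together with the new rule that missing columns count as zeros, can be sketched in plain Python as follows. This is an illustration under stated assumptions and not the code behind `Ccompare_partitions_aaa`: `aaa_sketch` is a hypothetical name, the matching is found by exhaustive search, and only the case of at most as many columns as rows is covered, because the docstring above does not say how extra columns are handled.

```python
from itertools import permutations


def aaa_sketch(C):
    """Brute-force adjusted asymmetric accuracy for a small matrix.

    C is a list of rows (reference partition in rows), K >= 2 rows,
    no empty rows. Hypothetical helper; not part of genieclust.
    """
    K, L = len(C), len(C[0])
    if L > K:
        raise ValueError("this sketch only covers L <= K")
    # Treat "missing" columns as if they were filled with 0s.
    P = [list(row) + [0] * (K - L) for row in C]
    # Proportion of each reference cluster falling into each column.
    R = [[c / sum(row) for c in row] for row in P]
    # max over permutations sigma of sum_i R[i][sigma(i)]
    best = max(sum(R[i][s[i]] for i in range(K))
               for s in permutations(range(K)))
    return (best - 1) / (K - 1)


# 4 reference clusters, 3 predicted ones (fewer columns than rows)
C = [[3, 0, 0],
     [0, 3, 0],
     [0, 0, 1],
     [1, 1, 0]]
print(aaa_sketch(C))
```

Here three of the four reference clusters are recovered exactly and the fourth is matched to the zero-filled column, so the maximised sum is 3 and the score is (3 - 1)/(4 - 1).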


@@ -1099,9 +1090,6 @@ cpdef double pair_sets_index(x, y, bint simplified=False):
cdef np.ndarray[Py_ssize_t,ndim=2] C = confusion_matrix(x, y)
cdef Py_ssize_t xc = C.shape[0]
cdef Py_ssize_t yc = C.shape[1]
-if xc > yc:
-raise ValueError("Number of rows in the confusion matrix "
-"must be less than or equal to the number of columns.")

if simplified:
return c_compare_partitions.Ccompare_partitions_psi(&C[0,0], xc, yc).spsi
10 changes: 5 additions & 5 deletions man/compare_partitions.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 7 additions & 1 deletion setup.py
@@ -30,6 +30,7 @@
import glob
import os
import sys
+import re


cython_modules = {
@@ -136,10 +137,15 @@ def build_extensions(self):
with open("README.md", "r") as fh:
long_description = fh.read()

+with open("genieclust/__init__.py", "r") as fh:
+__version__ = re.search("(?m)^\\s*__version__\\s*=\\s*[\"']([0-9.]+)[\"']", fh.read())
+if __version__ is None:
+raise ValueError("the package version could not be read")
+__version__ = __version__.group(1)
+
setuptools.setup(
name="genieclust",
-version="1.1.1", # see also genieclust/__init__.py; e.g., 1.0.0.9001
+version=__version__,
description="Genie: Fast and Robust Hierarchical Clustering with Noise Points Detection",
long_description=long_description,
long_description_content_type="text/markdown",
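
The version lookup added to `setup.py` can be tried in isolation on a string instead of a file. The sketch below reuses the same regular expression and error message; `read_version` and `sample` are hypothetical names introduced for this example.

```python
import re

# Same pattern as in the setup.py change: a line that starts with
# optional whitespace, then __version__ = "<digits and dots>".
VERSION_RE = "(?m)^\\s*__version__\\s*=\\s*[\"']([0-9.]+)[\"']"


def read_version(text):
    """Return the version string found in text, or raise ValueError."""
    match = re.search(VERSION_RE, text)
    if match is None:
        raise ValueError("the package version could not be read")
    return match.group(1)


sample = 'from . import tools\n__version__ = "1.1.1.9001"\n'
print(read_version(sample))
```

Because the character class is `[0-9.]+`, a version string containing letters (for instance a release-candidate suffix) would not match and would trigger the `ValueError`.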