
Commit

clust-gen >> mixed types; db-dbt >> dbt-score; diag-reg >> {olsrr}; reg-surv >> paper
ercbk committed Dec 16, 2024
1 parent fc52fd4 commit 5c4af39
Showing 22 changed files with 1,073 additions and 725 deletions.
4 changes: 3 additions & 1 deletion _book/qmd/algorithms-ml.html
@@ -2501,6 +2501,8 @@ <h3 data-number="1.1.1" class="anchored" data-anchor-id="sec-alg-ml-trees-dt"><s
</ol></li>
<li>Packages
<ul>
<li><span style="color: #990000">{</span><a href="https://bonsai.tidymodels.org/" style="color: #990000">bonsai</a><span style="color: #990000">}</span> - tidymodels extension that fits <span style="color: #990000">{partykit::ctree}</span> models</li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/partykit/index.html" style="color: #990000">partykit</a><span style="color: #990000">}</span> - model-based trees</li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/glmertree/index.html" style="color: #990000">glmertree</a><span style="color: #990000">}</span> - Generalized Linear Mixed Model Trees
<ul>
<li>Combines <code>lmer</code>/<code>glmer</code> from <span style="color: #990000">{lme4}</span> and <code>lmtree</code>/<code>glmtree</code> from <span style="color: #990000">{partykit}</span></li>
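A minimal sketch of a GLMM tree fit with {glmertree}, using the DepressionDemo example data shipped with the package; the three-part formula form (regressors | random effects | partitioning variables) follows the package documentation, and the variable choices here are assumptions based on its example data.

```r
# Minimal sketch: linear mixed-effects model tree with {glmertree}
library(glmertree)

data("DepressionDemo", package = "glmertree")

# response ~ regressors | random effects | partitioning variables
lt <- lmertree(depression ~ treatment | cluster | age + duration + anxiety,
               data = DepressionDemo)

plot(lt)   # tree with node-specific treatment effects
coef(lt)   # fixed-effect estimates per terminal node
```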
@@ -3574,7 +3576,7 @@ <h3 data-number="1.2.3" class="anchored" data-anchor-id="sec-alg-ml-boost-xgb"><
</div>
</div>
</footer>
<script>var lightboxQuarto = GLightbox({"loop":false,"selector":".lightbox","descPosition":"bottom","openEffect":"zoom","closeEffect":"zoom"});
<script>var lightboxQuarto = GLightbox({"closeEffect":"zoom","descPosition":"bottom","selector":".lightbox","loop":false,"openEffect":"zoom"});
(function() {
let previousOnload = window.onload;
window.onload = () => {
321 changes: 182 additions & 139 deletions _book/qmd/classification.html

Large diffs are not rendered by default.

64 changes: 59 additions & 5 deletions _book/qmd/clustering-general.html
@@ -1291,7 +1291,7 @@
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../qmd/loss-functions.html" class="sidebar-item-text sidebar-link">
<span class="menu-text"><span class="chapter-number">5</span>&nbsp; <span class="chapter-title">Loss Functions</span></span></a>
<span class="menu-text">Loss Functions</span></a>
</div>
</li>
<li class="sidebar-item sidebar-item-section">
@@ -1324,7 +1324,7 @@
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../qmd/mathematics-probability.html" class="sidebar-item-text sidebar-link">
<span class="menu-text"><span class="chapter-number">6</span>&nbsp; <span class="chapter-title">Probability</span></span></a>
<span class="menu-text">Probability</span></a>
</div>
</li>
<li class="sidebar-item">
@@ -1606,7 +1606,7 @@
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../qmd/production-ml-monitoring.html" class="sidebar-item-text sidebar-link">
<span class="menu-text"><span class="chapter-number">7</span>&nbsp; <span class="chapter-title">ML Monitoring</span></span></a>
<span class="menu-text"><span class="chapter-number">5</span>&nbsp; <span class="chapter-title">ML Monitoring</span></span></a>
</div>
</li>
<li class="sidebar-item">
@@ -2084,6 +2084,7 @@ <h2 id="toc-title">Table of contents</h2>
<li><a href="#sec-clust-gen-umap" id="toc-sec-clust-gen-umap" class="nav-link" data-scroll-target="#sec-clust-gen-umap">UMAP</a></li>
<li><a href="#sec-clust-gen-kmeans" id="toc-sec-clust-gen-kmeans" class="nav-link" data-scroll-target="#sec-clust-gen-kmeans">K-Means</a></li>
<li><a href="#sec-clust-gen-dbscan" id="toc-sec-clust-gen-dbscan" class="nav-link" data-scroll-target="#sec-clust-gen-dbscan">DBSCAN</a></li>
<li><a href="#mixed-variable-types" id="toc-mixed-variable-types" class="nav-link" data-scroll-target="#mixed-variable-types">Mixed Variable Types</a></li>
</ul>
</nav>
</div>
@@ -2129,9 +2130,10 @@ <h2 class="unnumbered anchored" data-anchor-id="sec-clust-gen-misc">Misc</h2>
<li>Biclustering (aka Two-Mode Clustering)
<ul>
<li>Simultaneously clusters the rows and columns of an (<span class="math inline">\(n×p\)</span>)-dimensional data matrix <span class="math inline">\(Y\)</span>, with rows associated to <span class="math inline">\(n\)</span> statistical units and columns consisting of <span class="math inline">\(p\)</span> ordinal outcomes.</li>
<li>Packags
<li>Packages
<ul>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/biclustermd/index.html" style="color: #990000">biclustermd</a><span style="color: #990000">}</span> - Biclustering with Missing Data</li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/MixSim/index.html" style="color: #990000">MixSim</a><span style="color: #990000">}</span> - Simulating Data to Study Performance of Clustering Algorithms</li>
</ul></li>
<li>Papers
<ul>
@@ -2627,6 +2629,58 @@ <h2 class="unnumbered anchored" data-anchor-id="sec-clust-gen-dbscan">DBSCAN</h2
<li>Clusters are not totally reproducible. Clusters are defined sequentially, so depending on which group of core points the algorithm starts with and on the hyperparameter values, non-core points that fall within the eps area of multiple clusters may be assigned to different clusters on different runs of the algorithm (see the sketch after this section).</li>
</ul></li>
</ul>
</section>
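To make the reproducibility caveat above concrete, a minimal sketch (not from the notebook) assuming the {dbscan} package; the toy data and the eps/minPts values are arbitrary choices. Refitting the same points in a different row order can change which cluster a border point is assigned to when it falls within eps of more than one cluster.

```r
# Minimal sketch: border-point assignment in DBSCAN can depend on point order
library(dbscan)

set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.35), ncol = 2),
  matrix(rnorm(100, mean = 1, sd = 0.35), ncol = 2)
)

fit1 <- dbscan(x, eps = 0.4, minPts = 5)

# Refit on the same points in a shuffled order, then map labels back
ord  <- sample(nrow(x))
fit2 <- dbscan(x[ord, ], eps = 0.4, minPts = 5)
relabeled <- integer(nrow(x))
relabeled[ord] <- fit2$cluster

# Cluster ids may be permuted between runs, so compare with a cross-tab;
# rows/columns that split across labels are points whose assignment changed
table(run1 = fit1$cluster, run2 = relabeled)
```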
<section id="mixed-variable-types" class="level2">
<h2 class="anchored" data-anchor-id="mixed-variable-types">Mixed Variable Types</h2>
<ul>
<li>Packages
<ul>
<li><span style="color: #990000">{</span><a href="https://github.com/amarkos/DIBclust" style="color: #990000">DIBclust</a><span style="color: #990000">}</span> (see paper below) - Deterministic Information Bottleneck (DIB) clustering
<ul>
<li>Preserves the most relevant information while forming concise and interpretable clusters, guided by principles from information theory</li>
</ul></li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/kamila/" style="color: #990000">kamila</a><span style="color: #990000">}</span> - KAMILA clustering, a novel method for clustering mixed-type data in the spirit of k-means clustering.
<ul>
<li>It does not require dummy coding of variables, and is efficient enough to scale to rather large data sets.</li>
</ul></li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/FactoMineR/index.html" style="color: #990000">FactoMineR</a><span style="color: #990000">}</span> - Multivariate Exploratory Data Analysis and Data Mining
<ul>
<li>Has function for FAMD with K-Means approach</li>
</ul></li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/clustMixType/index.html" style="color: #990000">clustMixType</a><span style="color: #990000">}</span> (<a href="https://journal.r-project.org/archive/2018/RJ-2018-048/index.html">Vignette</a>) - Performs k-prototypes partitioning clustering for mixed variable-type data</li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/cluster/index.html" style="color: #990000">cluster</a><span style="color: #990000">}</span> - Can use for PAM w/Gower’s Dissimilarity mixed type clustering
<ul>
<li>First compute the dissimilarity matrix using <code>daisy(df, metric = "gower")</code>. Then use that dissimilarity matrix as input for <code>pam(matrix, k, diss = TRUE)</code>. (See the sketch after this list, which also covers <span style="color: #990000">{kamila}</span>, <span style="color: #990000">{clustMixType}</span>, and the FAMD + K-Means approach.)</li>
</ul></li>
</ul></li>
<li>Papers
<ul>
<li><a href="https://arxiv.org/abs/2407.03389">A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data</a>
<ul>
<li>Introduces <span style="color: #990000">{DIBclust}</span></li>
<li>Compares the Deterministic Information Bottleneck (DIB) method to KAMILA, K-Prototypes, Factor Analysis for Mixed Data (FAMD) with K-Means, and PAM using Gower’s dissimilarity
<ul>
<li>Used the Adjusted Rand Index (ARI) to compare clustering results with the ground truth (see the ARI sketch after this section).</li>
</ul></li>
<li>Simulation Data
<ul>
<li>DIBmix outperforms the other methods in a majority of scenarios with the only exception being when the proportion of categorical variables is high. Alternative bandwidth choices could potentially enhance performance in these cases.</li>
<li>FAMD performed best with a high proportion of categorical variables probably due to its data reduction step.
<ul>
<li>It was only slightly better than DIBmix. The other methods were substantially worse.</li>
</ul></li>
<li>Unlike the other methods, DIBmix effectively handles datasets with unbalanced cluster sizes.
<ul>
<li>This robustness arises from the entropy term in its objective function, which is minimised for imbalanced clusters, thus allowing exploration of diverse partition structures.</li>
</ul></li>
</ul></li>
<li>Real Data Sets
<ul>
<li>Out of 10 datasets from the UCI Repository, DIBmix and K-Prototypes each had the best score on 4, while KAMILA and PAM each had the best score on 1. Given the characteristics of the datasets, the results were in line with those from the simulated data.</li>
</ul></li>
</ul></li>
</ul></li>
</ul>
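The package list above maps onto a short workflow. A minimal sketch on a toy mixed-type data frame follows; the data, k = 3, and the other parameter values are assumptions, and {DIBclust} is omitted because its interface is not described here.

```r
# Minimal sketch: four mixed-type clustering approaches on the same toy data
library(cluster)       # daisy(), pam()
library(clustMixType)  # kproto()
library(kamila)        # kamila()
library(FactoMineR)    # FAMD()

set.seed(123)
n <- 150
dat <- data.frame(
  x1   = c(rnorm(n/3, 0), rnorm(n/3, 3), rnorm(n/3, 6)),
  x2   = rnorm(n),
  cat1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  cat2 = factor(sample(c("yes", "no"), n, replace = TRUE))
)

# PAM on Gower dissimilarities
gower_d <- daisy(dat, metric = "gower")
pam_fit <- pam(gower_d, k = 3, diss = TRUE)

# K-Prototypes (joint numeric + categorical prototypes)
kp_fit <- kproto(dat, k = 3)

# KAMILA (continuous and categorical variables passed separately)
kam_fit <- kamila(conVar = dat[, c("x1", "x2")],
                  catFactor = dat[, c("cat1", "cat2")],
                  numClust = 3, numInit = 10)

# FAMD + K-Means (cluster on the FAMD component scores)
famd_fit <- FAMD(dat, ncp = 3, graph = FALSE)
km_fit   <- kmeans(famd_fit$ind$coord, centers = 3, nstart = 25)

# Cluster assignments from each method
head(pam_fit$clustering)
head(kp_fit$cluster)
head(kam_fit$finalMemb)
head(km_fit$cluster)
```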


</section>
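For the Adjusted Rand Index comparisons described above, a minimal sketch assuming {mclust}'s `adjustedRandIndex()`; the notebook does not name a particular ARI implementation, and the labels here are made up for illustration.

```r
# Minimal sketch: ARI between an estimated partition and ground-truth labels
library(mclust)

truth <- rep(1:3, each = 50)                          # known cluster labels
est   <- sample(1:3, length(truth), replace = TRUE)   # e.g. output of a clustering method

adjustedRandIndex(truth, est)  # ~0 for a random partition, 1 for perfect recovery
```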
@@ -3084,7 +3138,7 @@ <h2 class="unnumbered anchored" data-anchor-id="sec-clust-gen-dbscan">DBSCAN</h2
</div>
</div>
</footer>
<script>var lightboxQuarto = GLightbox({"descPosition":"bottom","selector":".lightbox","openEffect":"zoom","loop":false,"closeEffect":"zoom"});
<script>var lightboxQuarto = GLightbox({"openEffect":"zoom","selector":".lightbox","closeEffect":"zoom","loop":false,"descPosition":"bottom"});
(function() {
let previousOnload = window.onload;
window.onload = () => {
57 changes: 3 additions & 54 deletions _book/qmd/clustering-time-series.html
@@ -1254,7 +1254,7 @@
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../qmd/loss-functions.html" class="sidebar-item-text sidebar-link">
<span class="menu-text"><span class="chapter-number">5</span>&nbsp; <span class="chapter-title">Loss Functions</span></span></a>
<span class="menu-text">Loss Functions</span></a>
</div>
</li>
<li class="sidebar-item sidebar-item-section">
@@ -1287,7 +1287,7 @@
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../qmd/mathematics-probability.html" class="sidebar-item-text sidebar-link">
<span class="menu-text"><span class="chapter-number">6</span>&nbsp; <span class="chapter-title">Probability</span></span></a>
<span class="menu-text">Probability</span></a>
</div>
</li>
<li class="sidebar-item">
@@ -1569,7 +1569,7 @@
<li class="sidebar-item">
<div class="sidebar-item-container">
<a href="../qmd/production-ml-monitoring.html" class="sidebar-item-text sidebar-link">
<span class="menu-text"><span class="chapter-number">7</span>&nbsp; <span class="chapter-title">ML Monitoring</span></span></a>
<span class="menu-text"><span class="chapter-number">5</span>&nbsp; <span class="chapter-title">ML Monitoring</span></span></a>
</div>
</li>
<li class="sidebar-item">
@@ -2040,7 +2040,6 @@ <h2 id="toc-title">Table of contents</h2>
<ul>
<li><a href="#sec-clust-ts-misc" id="toc-sec-clust-ts-misc" class="nav-link active" data-scroll-target="#sec-clust-ts-misc">Misc</a></li>
<li><a href="#sec-clust-ts-preproc" id="toc-sec-clust-ts-preproc" class="nav-link" data-scroll-target="#sec-clust-ts-preproc">Preprocessing</a></li>
<li><a href="#sec-clust-ts-cvi" id="toc-sec-clust-ts-cvi" class="nav-link" data-scroll-target="#sec-clust-ts-cvi">Cluster Validity Indices (CVI)</a></li>
<li><a href="#sec-clust-ts-dtw" id="toc-sec-clust-ts-dtw" class="nav-link" data-scroll-target="#sec-clust-ts-dtw">{dtwclust}</a>
<ul>
<li><a href="#sec-clust-ts-dtw-wkflw" id="toc-sec-clust-ts-dtw-wkflw" class="nav-link" data-scroll-target="#sec-clust-ts-dtw-wkflw">Workflow</a></li>
@@ -2103,56 +2102,6 @@ <h2 class="unnumbered anchored" data-anchor-id="sec-clust-ts-preproc">Preprocess
<li>Standardization or min-max normalization if your data is bounded</li>
</ul>
</section>
<section id="sec-clust-ts-cvi" class="level2 unnumbered">
<h2 class="unnumbered anchored" data-anchor-id="sec-clust-ts-cvi">Cluster Validity Indices (CVI)</h2>
<ul>
<li>Notes from <span style="color: #990000">{dtwclust}</span> vignette</li>
<li>Types
<ul>
<li>For either Hard (aka Crisp) or Soft (aka Fuzzy) Partitioning</li>
<li><u>Internal (IVI)</u> - Tries to define a measure of cluster purity</li>
<li><u>External (EVI)</u> - Compares the obtained partition to the correct one. Thus, external CVIs can only be used if the ground truth is known
<ul>
<li>Issues
<ul>
<li>Associated ground-truth class labels do not necessarily correspond to recoverable, natural clusters</li>
<li>While high EVI scores reliably indicate good label recovery, low EVI scores do not necessarily indicate poor clustering structure in a partition, given the possibility of legitimate alternatives</li>
</ul></li>
<li>These concerns can be mitigated by instead utilizing datasets where cluster labels have been curated to encompass those formal and informal cluster concepts deemed important for the problem and domain according to the relevant experts.
<ul>
<li>Manually labelling such datasets can be tedious and expensive. Synthetic datasets, where the generation process is informed by the literature and expert opinion, are a more efficient alternative.</li>
</ul></li>
</ul></li>
<li><u>Relative (RVI)</u> - Attempt to quantify the extent to which a partition exemplifies one or more generally desirable cluster concepts, such as small within-group dissimilarities, large between group dissimilarities or effective representation by cluster prototypes
<ul>
<li>e.g.&nbsp;Silhouette Width Criterion and Davies-Bouldin Index</li>
<li>Issues (<a href="https://arxiv.org/abs/2412.02026">source</a>)
<ul>
<li>Comparisons based on RVIs will be biased towards those partitions and methods that align with the primarily domain-agnostic cluster concepts inherently preferred by different RVIs</li>
<li>RVIs show a tendency to insufficiently penalize large, noisy clusters, and favor partitions that isolate outliers</li>
<li>The use of RVIs for any comparisons of normalization procedures, representation methods and distance measures are highly dubious due to the influence of these components on pairwise dissimilarities</li>
</ul></li>
</ul></li>
</ul></li>
<li>Scores<br>
<img src="./_resources/Clustering,_Time_Series.resources/Screenshot (53).png" class="img-fluid">
<ul>
<li>Available through <code>dtwclust::cvi</code></li>
<li>Global centroid is one that’s computed using the whole dataset
<ul>
<li>dtwclust uses whichever distance method was originally used in the clustering computation</li>
</ul></li>
<li>Some CVIs require symmetric distance functions (distance from a to b = distance b to a)
<ul>
<li>A warning is printed if an asymmetric distance method was used</li>
</ul></li>
<li><span style="color: #990000">{clue}</span> - Compare repetitions of non-deterministic clustering methods (e.g.&nbsp;partitional) where random element means you get a different result each time
<ul>
<li>It uses a measure called “dissimilarities using minimal Euclidean membership distance” to compare different runs of a cluster method</li>
</ul></li>
</ul></li>
</ul>
</section>
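A minimal sketch of the `dtwclust::cvi` usage described above, using the CharTraj series and CharTrajLabels bundled with {dtwclust}; the choice of k, distance, and centroid here is an assumption, not the notebook's example.

```r
# Minimal sketch: internal and external CVIs for a partitional clustering
library(dtwclust)

pc <- tsclust(CharTraj, type = "partitional", k = 20L,
              distance = "dtw_basic", centroid = "pam", seed = 8)

cvi(pc, type = "internal")                       # internal/relative indices (Sil, DB, ...)
cvi(pc, b = CharTrajLabels, type = "external")   # external indices vs. the known labels
```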
<section id="sec-clust-ts-dtw" class="level2 unnumbered">
<h2 class="unnumbered anchored" data-anchor-id="sec-clust-ts-dtw">{dtwclust}</h2>
<ul>
