
NeuralNetworkNMDS

The goal of NeuralNetworkNMDS is to demonstrate how neural networks can be used to achieve a process similar to Non-metric Multidimensional Scaling (NMDS): taking an n-dimensional data set and reducing it to a small number of dimensions, typically two or three, for visualization. This approach leverages the power of autoencoders, a type of neural network, to capture complex, non-linear relationships in the data.

Example 1: Iris Dataset

In this example, we will use the famous Iris dataset. We will preprocess the data by scaling the numeric features and then use an autoencoder to reduce its dimensions. Finally, we will visualize the results using ggplot2.

Preparing the Dataset

First, we load the necessary libraries and prepare the Iris dataset by selecting all columns except for the species, scaling the numeric features, and converting them to numeric types:

# h2o fits the autoencoder; vegan (NMDS), cluster (silhouettes), Rtsne (t-SNE),
# and patchwork (plot layout) are used for the comparisons and diagnostics below
library(h2o)
library(vegan)
library(dplyr)
library(ggplot2)
library(cluster)
library(Rtsne)
library(patchwork)

# Drop the species label, scale the numeric features, and keep a plain numeric data frame
Data <- iris |> 
  dplyr::select(-Species) |> 
  dplyr::mutate_if(is.numeric, scale) |> 
  as.data.frame() |> 
  dplyr::mutate_all(as.numeric)

Building the Autoencoder

We use an autoencoder to reduce the dimensions of the dataset. Autoencoders are neural networks designed to learn efficient codings of the input data. Here, we will use an autoencoder with three hidden layers, with the middle layer (the bottleneck) reducing the data to two dimensions.

# Initialize the H2O cluster
h2o.init()
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         9 minutes 53 seconds 
#>     H2O cluster timezone:       Europe/Copenhagen 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.44.0.3 
#>     H2O cluster version age:    5 months and 30 days 
#>     H2O cluster name:           H2O_started_from_R_au687614_iyl317 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   3.83 GB 
#>     H2O cluster total cores:    8 
#>     H2O cluster allowed cores:  8 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     R Version:                  R version 4.4.0 (2024-04-24)
# Convert to H2O frame
data_h2o <- as.h2o(Data)
# Define the autoencoder model
autoencoder <- h2o.deeplearning(
  x = names(data_h2o),
  training_frame = data_h2o,
  autoencoder = TRUE,
  hidden = c(10, 2, 10),  # hidden layers, with 2 as the bottleneck layer
  activation = "Tanh",
  epochs = 50
)
# Get the lower-dimensional representation (the bottleneck layer)
encoded_data_h2o <- h2o.deepfeatures(autoencoder, data_h2o, layer = 2)
# Convert to a data frame for plotting
encoded_data <- as.data.frame(encoded_data_h2o)
colnames(encoded_data) <- c("Dim1", "Dim2")
encoded_data$species <- iris$Species

Visualizing the Results

We visualize the reduced dimensions using ggplot2. Each point represents a data sample, colored by its species.

AutoEnc <- ggplot(encoded_data, aes(x = Dim1, y = Dim2, color = species)) + 
  geom_point() + 
  theme_bw() + 
  ggtitle("Autoencoder 2D")

AutoEnc

Advantages of Using Autoencoders over NMDS

Using an autoencoder (or a neural network-based approach) for dimensionality reduction has several advantages over Non-metric Multidimensional Scaling (NMDS), particularly for certain types of data and applications:

  1. Handling Non-linear Relationships:
    • Autoencoder: Neural networks are powerful in capturing complex, non-linear relationships within the data. They can model intricate patterns that linear or simpler non-linear techniques might miss.
    • NMDS: NMDS preserves the rank order of distances between points in the reduced dimensionality space, which works well for capturing the general structure of data but may struggle with highly non-linear relationships.
  2. Scalability:
    • Autoencoder: Neural networks can handle large datasets efficiently using modern hardware acceleration such as GPUs. This makes autoencoders suitable for high-dimensional and large-scale data.
    • NMDS: NMDS can become computationally expensive and slow as the number of data points and dimensions increases, making it less suitable for very large datasets.
  3. Feature Learning:
    • Autoencoder: Autoencoders can learn meaningful features directly from the data without explicit feature engineering. The hidden layers capture various levels of abstraction, leading to a more informative lower-dimensional representation.
    • NMDS: NMDS focuses solely on preserving the distance (dissimilarity) structure and does not learn new features or representations of the data.
  4. Flexibility and Customization:
    • Autoencoder: Neural networks offer flexibility in terms of architecture (number of layers, types of layers, activation functions, etc.), which can be tailored to the specific characteristics of the data and the requirements of the problem.
    • NMDS: NMDS has fewer parameters to tune and is less flexible in adapting to specific data characteristics beyond choosing the number of dimensions.
  5. Integration with Deep Learning Frameworks:
    • Autoencoder: Autoencoders can be easily integrated with other deep learning frameworks and methods. This integration allows for further extensions, such as combining with supervised learning tasks, anomaly detection, or generating new data samples.
    • NMDS: NMDS is primarily used as a standalone technique for dimensionality reduction and visualization and is not typically integrated into broader machine learning pipelines.
  6. Data Reconstruction:
    • Autoencoder: An autoencoder can be used not only for dimensionality reduction but also for data reconstruction. The decoder part of the autoencoder reconstructs the input data from its lower-dimensional representation, which can be useful for tasks like denoising or anomaly detection (see the sketch after this list).
    • NMDS: NMDS does not provide a mechanism for reconstructing the original data from the reduced dimensions.
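
To illustrate point 6 concretely, h2o exposes the reconstruction side of a deep-learning autoencoder directly: h2o.predict() returns the reconstructed features and h2o.anomaly() returns the per-row reconstruction MSE, which can serve as a simple anomaly score. The following is a minimal sketch, assuming the autoencoder model and data_h2o frame defined in the example above:

# Reconstruct the inputs from the 2-dimensional bottleneck
reconstruction <- h2o.predict(autoencoder, data_h2o)

# Per-row reconstruction MSE; large values flag rows the model represents poorly
anomaly_scores <- as.data.frame(h2o.anomaly(autoencoder, data_h2o))
summary(anomaly_scores)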

Recommendations for Neural Network Architecture

When choosing the architecture of the neural network for different datasets, consider the following guidelines based on the dimensions of the data:

  • Small-scale data (features < 10):
    • Hidden layers: 2-3 layers
    • Neurons per layer: 5-10
    • Example: hidden = c(5, 2, 5)
  • Medium-scale data (features 10-50):
    • Hidden layers: 3-4 layers
    • Neurons per layer: 10-50
    • Example: hidden = c(20, 10, 2, 10, 20)
  • Large-scale data (features > 50):
    • Hidden layers: 4-5 layers
    • Neurons per layer: 50-200
    • Example: hidden = c(100, 50, 10, 2, 10, 50, 100)

Adjust the number of neurons and layers according to the complexity and size of your dataset. More complex data may require deeper networks with more neurons, while simpler data can often be effectively modeled with shallower networks.
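
As an illustration of the medium-scale recommendation, the call from the Iris example only needs a wider hidden argument and a different bottleneck index. This is a sketch under the assumption of a hypothetical scaled H2O frame named medium_h2o with a few dozen features (not part of this repository):

# Hypothetical medium-scale dataset: same pattern as the Iris example,
# with a deeper symmetric architecture and a 2-neuron bottleneck
autoencoder_medium <- h2o.deeplearning(
  x = names(medium_h2o),          # medium_h2o is a placeholder H2O frame
  training_frame = medium_h2o,
  autoencoder = TRUE,
  hidden = c(20, 10, 2, 10, 20),  # bottleneck is the third hidden layer
  activation = "Tanh",
  epochs = 50
)

# Extract the 2-D codes from the bottleneck (layer = 3 here, not 2)
encoded_medium <- h2o.deepfeatures(autoencoder_medium, medium_h2o, layer = 3)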

Summary

While NMDS is a powerful and interpretable method for dimensionality reduction, particularly suited for preserving the rank order of dissimilarities, autoencoders offer a more flexible, scalable, and powerful approach, especially for complex, high-dimensional, and non-linear data. Autoencoders are also more adaptable to integration within larger machine learning and deep learning frameworks. However, the choice between NMDS and autoencoders should be guided by the specific requirements of the task, the nature of the data, and the computational resources available.


Evaluating the Autoencoder

To diagnose how well the autoencoder represents the data, we can use several methods:

1. Reconstruction Error

Reconstruction error measures the difference between the original input data and the data reconstructed by the autoencoder. A lower reconstruction error indicates better performance.

# Calculate reconstruction error
original_data <- as.data.frame(data_h2o)
reconstructed_data <- as.data.frame(h2o.predict(autoencoder, data_h2o))
mse <- colMeans((original_data - reconstructed_data)^2)
mean_mse <- mean(mse)

The mean squared error is 0.057.

2. Visualization

Visualizing the lower-dimensional representation can provide a qualitative assessment of how well the structure of the data is preserved.

ggplot(encoded_data, aes(x = Dim1, y = Dim2, color = species)) + 
  geom_point() + 
  labs(title = "2D Representation of Iris Data using Autoencoder")

3. Silhouette Score

The silhouette score measures how similar an object is to its own cluster compared to other clusters, which can be used to assess the quality of clustering in the reduced dimensional space.

silhouette_score <- silhouette(as.integer(factor(encoded_data$species)), dist(encoded_data[, 1:2]))
avg_silhouette_width <- mean(silhouette_score[, 3])  # Average silhouette width

The average silhouette width is 0.5.

4. Explained Variance

Explained variance assesses how much information (variance) is retained in the lower-dimensional representation.

explained_variance <- var(encoded_data$Dim1) + var(encoded_data$Dim2)
total_variance <- sum(apply(Data, 2, var))
explained_variance_ratio <- explained_variance / total_variance

The explained variance ratio is 0.045.

5. Comparison with Other Methods

Comparing the autoencoder’s performance with other dimensionality reduction methods, such as t-SNE, can provide additional insights.

# t-SNE on the scaled data; iris contains duplicate rows, so duplicate checking is disabled
tsne_data <- Rtsne(Data, dims = 2, check_duplicates = FALSE)$Y
tsne_df <- data.frame(Dim1 = tsne_data[, 1], Dim2 = tsne_data[, 2], species = iris$Species)

TSNE <- ggplot(tsne_df, aes(x = Dim1, y = Dim2, color = species)) + 
  geom_point() + 
  labs(title = "2D Representation of Iris Data using t-SNE") + 
  theme_bw()
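
Since vegan and patchwork are loaded at the top but not exercised above, an NMDS baseline can also be added and the embeddings placed side by side. The object names nmds_fit, nmds_df, and NMDSplot are introduced here for illustration only, and the sketch assumes the AutoEnc and TSNE plots created in the chunks above:

# NMDS on Euclidean distances of the scaled data (vegan)
nmds_fit <- metaMDS(dist(Data), k = 2, trace = FALSE)
nmds_df <- data.frame(scores(nmds_fit, display = "sites"), species = iris$Species)

NMDSplot <- ggplot(nmds_df, aes(x = NMDS1, y = NMDS2, color = species)) + 
  geom_point() + 
  theme_bw() + 
  ggtitle("NMDS 2D")

# patchwork combines the ggplot objects into one figure
AutoEnc + TSNE + NMDSplot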

Summary

By using these methods, you can evaluate how well the autoencoder captures the underlying structure of the data:

  • Reconstruction Error: Quantifies how well the autoencoder can recreate the input data.
  • Visualization: Provides a visual assessment of the clustering and structure.
  • Silhouette Score: Measures the quality of clustering in the reduced space.
  • Explained Variance: Assesses how much of the data’s variance is retained in the reduced dimensions.
  • Comparison with Other Methods: Provides context by comparing the autoencoder’s performance with other dimensionality reduction techniques.

Each method provides a different perspective, and using a combination of them will give a comprehensive understanding of the autoencoder’s effectiveness in representing the data.

At the end of the script, shut down the H2O connection:

h2o.shutdown(prompt = FALSE)
