seqClustR

seqClustR is a sequence clustering package: it provides access to several clustering algorithms that all operate on one common data format.

Related paper: seqClustR: An R Package for Sequence Clustering

Overview

Sequence clustering is a data mining technique that groups sequences into clusters based on their similarity. It is useful when an unknown number of similar sequences need to be identified to gain insight into the data.

The seqClustR package lets you run different clustering algorithms on sequence data without preparing the data differently for each algorithm. Once the sequence data is converted into an event log, you can run multiple clustering algorithms and compare them.

Installation

Install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("PlaypowerLabs/seqClustR")

Usage

The package takes an event log as input. Event logs are commonly used to store user behavior data: they record the sequence of actions a user takes over time, along with additional metadata about each event.
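
To make the format concrete, here is a minimal base R sketch of data shaped like an event log. The column names learnerID, Observable, and EventTime mirror the ones used in the usage example further down; the values themselves are made up for illustration and are not taken from any real dataset.

```r
# Hypothetical toy data: one row per event, several rows per learner.
sequence_data <- data.frame(
  learnerID  = c("a", "a", "a", "b", "b"),
  Observable = c("login", "hint", "submit", "login", "submit"),
  EventTime  = as.POSIXct(
    c("2020-01-01 10:00:00", "2020-01-01 10:01:00", "2020-01-01 10:03:00",
      "2020-01-01 11:00:00", "2020-01-01 11:02:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Order events in time, then recover each learner's sequence of actions.
sequence_data <- sequence_data[order(sequence_data$EventTime), ]
sequences <- split(sequence_data$Observable, sequence_data$learnerID)
print(sequences)
```

Each learner's ordered vector of actions is the sequence that the clustering algorithms compare.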

Once the data is prepared in the event log format, we can run one of the following sequence clustering algorithms:

Edit Distance Clustering - seq_edit_distance_clustering
Markov Model Based Clustering - seq_markov_clustering
Dynamic Time Warping - seq_dtw_clustering
K-Means Clustering - seq_kmeans_clustering
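
As a rough illustration of the idea behind the first option, the sketch below clusters a few toy sequences by edit distance using only base R. It is not the implementation of seq_edit_distance_clustering: the single-letter activity encoding, the use of utils::adist, the average linkage, and the choice of two clusters are all assumptions made for this example.

```r
# Hypothetical sequences, one string per case; each letter is one activity.
sequences <- c(case1 = "ABCD", case2 = "ABCE", case3 = "XYZ", case4 = "XYZZ")

# Pairwise Levenshtein (edit) distances between the sequences.
dist_matrix <- utils::adist(sequences)
dimnames(dist_matrix) <- list(names(sequences), names(sequences))

# Agglomerative clustering on the distances, cut into two clusters.
tree <- stats::hclust(stats::as.dist(dist_matrix), method = "average")
cluster_id <- stats::cutree(tree, k = 2)

# Case-to-cluster mapping as a data frame (column names are illustrative).
cluster_assignment <- data.frame(
  case_id = names(cluster_id),
  cluster = unname(cluster_id)
)
print(cluster_assignment)
```

Sequences that differ by a single edit (case1 and case2, case3 and case4) end up in the same cluster, while the two pairs are separated.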

Each function returns a list containing the fitted model and a data frame that maps each case to its assigned cluster. For further analysis of individual clusters, you need a separate event log for each cluster. The split_event_log function takes the event log and the cluster assignment data frame as inputs and returns a list of event logs, one per cluster.

library(tidyverse)
library(bupaR)
library(seqClustR)

event_log <- sequence_data %>% 
 arrange(EventTime) %>% 
 mutate(lifecycle_id = 'complete',
        resource = NA,
        row_num = 1:nrow(.)) %>% 
 eventlog(case_id = "learnerID", 
          activity_id = "Observable", 
          activity_instance_id = "row_num",
          lifecycle_id = "lifecycle_id",
          timestamp = "EventTime", 
          resource_id = "resource")

cluster <- seq_edit_distance_clustering(event_log)

# Get event log by cluster as a list.
# Pass the event_log object created above, not the bupaR eventlog() function.

event_log_2 <- split_event_log(event_log,
                               cluster$cluster_assignment)
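
The splitting step can be pictured with plain data frames. The sketch below is only an approximation of what split_event_log does conceptually, not its actual implementation: the column names case_id, activity, and cluster are hypothetical, and it works on ordinary data frames rather than bupaR event log objects.

```r
# Hypothetical events and case-to-cluster mapping (column names are assumptions).
events <- data.frame(
  case_id  = c("a", "a", "b", "c", "c"),
  activity = c("login", "submit", "login", "login", "hint"),
  stringsAsFactors = FALSE
)
cluster_assignment <- data.frame(
  case_id = c("a", "b", "c"),
  cluster = c(1, 2, 1),
  stringsAsFactors = FALSE
)

# Look up the cluster of each event's case, then split the events by cluster.
cluster_of_event <- cluster_assignment$cluster[
  match(events$case_id, cluster_assignment$case_id)
]
events_by_cluster <- split(events, cluster_of_event)

# List elements are named by cluster, so a cluster is accessed as [["1"]].
print(names(events_by_cluster))
```

Naming the list elements by cluster is what makes an access such as event_log_2[["1"]] work in the visualization step.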

You can visualize the clusters using the fuzzymineR package.

library(fuzzymineR)

# Process Model for Cluster 1

metrics <- mine_fuzzy_model(event_log_2[["1"]])

viz_fuzzy_model(metrics = metrics,
                node_sig_threshold = 0.1,
                edge_sig_threshold = 0.3,
                edge_sig_to_corr_ratio = 1)
