tmsamples

The goal of tmsamples is to simulate term frequency matrices, either document-term matrices (DTMs) or term co-occurrence matrices (TCMs), from the parameters of real or simulated probabilistic topic models.

This corpus simulation is a core part of my dissertation research. The basic idea is: if you can simulate a corpus using the functional form of a topic model, and that corpus retains the gross statistical properties of human language, then simulated corpora can be used to derive rules and metrics for fitting well-specified topic models and, conversely, for avoiding pathologically misspecified ones.
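
To make "the functional form of a topic model" concrete, here is a minimal base R sketch of an LDA-style generative process. It is an illustration only, not the tmsamples implementation: the helper `rdirichlet`, the sizes, and the prior values are all assumptions chosen for the example. It also draws word counts directly from each document's mixture of topics, which has the same distribution as sampling a topic for every token first.

```r
set.seed(42)

# Hypothetical sizes, chosen only for illustration
n_topics <- 3
n_docs   <- 5
n_vocab  <- 20

# Hypothetical helper: draw n rows from Dirichlet(conc) by
# normalizing independent gamma draws
rdirichlet <- function(n, conc) {
  g <- matrix(rgamma(n * length(conc), shape = conc), nrow = n, byrow = TRUE)
  g / rowSums(g)
}

# Topic proportions per document (documents x topics)
theta <- rdirichlet(n_docs, rep(0.5, n_topics))

# Word distributions per topic (topics x vocabulary),
# using an assumed Zipf-like prior proportional to 1 / rank
phi <- rdirichlet(n_topics, 1 / seq_len(n_vocab))

# Number of tokens in each document
doc_lengths <- rpois(n_docs, 50)

# Each document's word distribution is a theta-weighted mixture of topics
word_probs <- theta %*% phi

# One multinomial draw of word counts per document (documents x vocabulary)
dtm <- t(vapply(
  seq_len(n_docs),
  function(d) rmultinom(1, size = doc_lengths[d], prob = word_probs[d, ])[, 1],
  integer(n_vocab)
))
```

Each row of `dtm` sums to the corresponding entry of `doc_lengths`, so the simulated matrix can be compared with a real corpus of similar size.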

Installation

You can install the GitHub version of tmsamples with:

library(remotes)
remotes::install_github("tommyjones/tmsamples")

Example

This basic example samples topic model parameters and then simulates a document-term matrix from them:

library(tmsamples)
#> Loading required package: Matrix


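Nk <- 4    # number of topics
Nd <- 50   # number of documents
Nv <- 1000 # vocabulary size

# prior weights over the Nk topics
alpha <- rgamma(Nk, 0.5)

# Zipf-distributed prior weights over the Nv terms
beta <- generate_zipf(vocab_size = Nv, magnitude = 500, zipf_par = 1.1)

# draw the theta and phi parameters used below
pars <- sample_parameters(alpha, beta, Nd)

# number of tokens per document, Poisson with mean 50
doc_lengths <- rpois(Nd, 50)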

dtm <- sample_documents(
  theta = pars$theta,
  phi = pars$phi,
  doc_lengths = doc_lengths,
  threads = 2 ## threads controls parallel computation
)
#> ==================================================
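
One rough way to check whether a simulated matrix retains a gross statistical property of language is to look at the rank-frequency relationship that Zipf's law describes. The sketch below is an assumption-laden illustration rather than part of tmsamples: `zipf_slope` is a hypothetical helper, and it runs on a stand-in matrix built in base R so that it does not depend on the package being installed.

```r
# Hypothetical helper: slope of log(frequency) on log(rank) for a count matrix
zipf_slope <- function(counts) {
  freq <- sort(colSums(counts), decreasing = TRUE)
  freq <- freq[freq > 0]   # drop unseen terms, since log(0) is -Inf
  rank <- seq_along(freq)
  unname(coef(lm(log(freq) ~ log(rank)))[2])
}

# Stand-in for a simulated DTM: 50 documents of 50 tokens each over a
# 1000-term vocabulary, with term probabilities proportional to 1 / rank
set.seed(1)
stand_in <- t(rmultinom(50, size = 50, prob = 1 / seq_len(1000)))

zipf_slope(stand_in)
```

A clearly negative slope indicates that a few terms dominate while most are rare. If `dtm` from the example above is a sparse matrix, as the Matrix dependency suggests, `zipf_slope(dtm)` should work once Matrix is loaded, because Matrix provides its own `colSums` method.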