
Approach to speed up dist2 for a large number of documents #190

Closed
nsriram13 opened this issue May 23, 2017 · 5 comments

nsriram13 commented May 23, 2017

I am using the RWMD feature to generate a document-to-document distance matrix. It works well on a small set, but is really slow when I try it on the full document set. I am monitoring my CPU usage and I see that it is not utilizing all cores of my machine. Is there an easy way to make this run faster?

The toy code below takes forever to run and can serve as a good test case:

library(text2vec)

data("movie_review")

# tokenize
tokens = movie_review$review %>%
  tolower %>%
  word_tokenizer
v = create_vocabulary(itoken(tokens)) %>%
  prune_vocabulary(term_count_min = 5, doc_proportion_max = 0.5)
corpus = create_corpus(itoken(tokens), vocab_vectorizer(v, skip_grams_window = 5))
dtm = get_dtm(corpus)
tcm = get_tcm(corpus)

# GloVe model training
glove_model = GloVe$new(word_vectors_size = 50, vocabulary = v, x_max = 10)
wv = glove_model$fit(tcm, n_iter = 10)
word_vectors = wv$get_word_vectors()

# generate distance matrix
rwmd_model = RWMD$new(word_vectors)
rwmd_dist = dist2(dtm, dtm, method = rwmd_model, norm = 'none')

Really appreciate the help!

@dselivanov (Owner) commented

It is possible to make it parallel, but the main problem is that the dist2 function has quadratic complexity in the number of rows of the input matrices: for dist2(m1, m2) the cost is proportional to nrow(m1) * nrow(m2). How large are your matrices?

@nsriram13 (Author) commented

I have about 20K documents.
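
A quick back-of-the-envelope sketch of what that means for dist2(dtm, dtm) on the full set (illustrative arithmetic only):

# dist2(dtm, dtm) computes one RWMD evaluation per ordered pair of documents
n_docs = 20000
n_docs * n_docs            # 4e+08 pairs
n_docs * (n_docs - 1) / 2  # even the symmetric half is still ~2e+08 pairs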

@dselivanov (Owner) commented

A few more questions:

  1. Which BLAS do you use? The built-in reference BLAS that ships with R, or a high-performance one (OpenBLAS, Apple Accelerate, MKL)?
  2. What is the size of the vocabulary?

@dselivanov (Owner) commented

I made it faster, but the complexity is still O(n^2) (and it will remain so).

Summarising:

  1. Use a good BLAS - a lot of text2vec functions rely on linear algebra, and the BLAS is a key component of high-performance computing. A quick way to check which BLAS R is linked against is sketched after this list.
  2. I incorporated this idea, so things should be faster.
  3. The main bottleneck is now the rowMins() function. It is already written in Rcpp, but it can probably be tuned/parallelized/SIMD-vectorized in the future.
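
A minimal way to check the first point, assuming R >= 3.4 (which reports the BLAS/LAPACK libraries in sessionInfo()); the matrix product is only a crude benchmark, not part of text2vec:

# Which BLAS/LAPACK is this R session linked against?
sessionInfo()   # look at the "BLAS:" and "LAPACK:" lines

# Crude benchmark: an optimized, multithreaded BLAS is typically several times
# faster than R's reference BLAS on a dense product like this.
m = matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(m %*% m)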

The following code runs in ~12 sec on my MacBook with Apple vecLib BLAS (latest text2vec from GitHub):

library(text2vec)

data("movie_review")

# tokenize
tokens = movie_review$review %>%
  tolower %>%
  word_tokenizer
v = create_vocabulary(itoken(tokens)) %>%
  prune_vocabulary(term_count_min = 5, doc_proportion_max = 0.5)
dtm = create_dtm(itoken(tokens), vocab_vectorizer(v))
tcm = create_tcm(itoken(tokens), vocab_vectorizer(v), skip_grams_window = 5)

# GloVe model training
glove_model = GloVe$new(word_vectors_size = 50, vocabulary = v, x_max = 10)
wv = glove_model$fit(tcm, n_iter = 10)
word_vectors = wv$get_word_vectors()

rwmd_model2 = RWMD$new(word_vectors)
system.time(rwmd_dist2 <- dist2(dtm[1:200, ], method = rwmd_model2, norm = 'none'))
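
Extrapolating that timing to the full 20K-document case under the O(n^2) scaling above gives only a rough estimate (actual time depends on the BLAS, document lengths and hardware):

# ~12 sec for 200 x 200 document pairs, scaled quadratically to 20K x 20K:
(20000 / 200)^2 * 12   # ~1.2e+05 sec, i.e. roughly a day and a half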

dselivanov self-assigned this May 23, 2017
dselivanov added this to the 0.5 milestone May 23, 2017
@nsriram13 (Author) commented

Thanks for the quick fix. I am running the default R build on Windows, so I am not using an optimized BLAS. I tried installing MKL today, but figuring it out on Windows is such a time sink. I will switch to a Linux box and let you know how it goes.
