
Approach to speed up dist2 for a large number of documents #190

Closed
nsriram13 opened this issue May 23, 2017 · 5 comments

nsriram13 commented May 23, 2017

I am using the RWMD feature to generate a document-to-document distance matrix. It works well on a small set, but is really slow when I try it on the full document set. I am monitoring my CPU usage and I see that it is not utilizing all cores of my machine. Is there an easy way to make this run faster?

The toy code below takes forever to run and can serve as a good test case:

library(text2vec)

data("movie_review")

# tokenize
tokens = movie_review$review %>%
  tolower %>%
  word_tokenizer
v = create_vocabulary(itoken(tokens)) %>%
  prune_vocabulary(term_count_min = 5, doc_proportion_max = 0.5)
corpus = create_corpus(itoken(tokens), vocab_vectorizer(v, skip_grams_window = 5))
dtm = get_dtm(corpus)
tcm = get_tcm(corpus)

# GloVe model training
glove_model = GloVe$new(word_vectors_size = 50, vocabulary = v, x_max = 10)
wv = glove_model$fit(tcm, n_iter = 10)
word_vectors = wv$get_word_vectors()

# generate distance matrix
rwmd_model = RWMD$new(word_vectors)
rwmd_dist = dist2(dtm, dtm, method = rwmd_model, norm = 'none')

Really appreciate the help!

@dselivanov (Owner) commented

It is possible to make it parallel, but the main problem is that the dist2 function has quadratic complexity in the number of rows of the input matrices: for dist2(m1, m2) the cost is proportional to nrow(m1) * nrow(m2). How large are your matrices?

@nsriram13 (Author) commented

I have about 20K documents.
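
A quick back-of-the-envelope sketch of what that means for dist2(dtm, dtm) on the full set (illustrative arithmetic only):

# dist2(dtm, dtm) computes one RWMD evaluation per ordered pair of documents
n_docs = 20000
n_docs * n_docs            # 4e+08 pairs
n_docs * (n_docs - 1) / 2  # even the symmetric half is still ~2e+08 pairs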

@dselivanov (Owner) commented

A few more questions:

  1. Which BLAS do you use? The built-in reference BLAS that ships with R, or a high-performance one (OpenBLAS, Apple Accelerate, MKL)?
  2. What is the size of the vocabulary?

@dselivanov (Owner) commented

I made it faster, but the complexity is still O(n^2) (and it will remain so).

Summarising:

  1. Use a good BLAS - a lot of text2vec functions rely on linear algebra, and the BLAS is a key component of high-performance computing. A quick way to check which BLAS R is linked against is sketched after this list.
  2. I incorporated this idea, so things should be faster.
  3. The main bottleneck is now the rowMins() function. It is already written in Rcpp, but it can probably be tuned/parallelized/SIMD-vectorized in the future.
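
A minimal way to check the first point, assuming R >= 3.4 (which reports the BLAS/LAPACK libraries in sessionInfo()); the matrix product is only a crude benchmark, not part of text2vec:

# Which BLAS/LAPACK is this R session linked against?
sessionInfo()   # look at the "BLAS:" and "LAPACK:" lines

# Crude benchmark: an optimized, multithreaded BLAS is typically several times
# faster than R's reference BLAS on a dense product like this.
m = matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(m %*% m)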

The following code runs in ~12 sec on my MacBook with Apple vecLib BLAS (latest text2vec from GitHub):

library(text2vec)

data("movie_review")

# tokenize
tokens = movie_review$review %>%
  tolower %>%
  word_tokenizer
v = create_vocabulary(itoken(tokens)) %>%
  prune_vocabulary(term_count_min = 5, doc_proportion_max = 0.5)
dtm = create_dtm(itoken(tokens), vocab_vectorizer(v))
tcm = create_tcm(itoken(tokens), vocab_vectorizer(v), skip_grams_window = 5)

# GloVe model training
glove_model = GloVe$new(word_vectors_size = 50, vocabulary = v, x_max = 10)
wv = glove_model$fit(tcm, n_iter = 10)
word_vectors = wv$get_word_vectors()

rwmd_model2 = RWMD$new(word_vectors)
system.time(rwmd_dist2 <- dist2(dtm[1:200, ], method = rwmd_model2, norm = 'none'))
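
Extrapolating that timing to the full 20K-document case under the O(n^2) scaling above gives only a rough estimate (actual time depends on the BLAS, document lengths and hardware):

# ~12 sec for 200 x 200 document pairs, scaled quadratically to 20K x 20K:
(20000 / 200)^2 * 12   # ~1.2e+05 sec, i.e. roughly a day and a half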

dselivanov self-assigned this May 23, 2017
dselivanov added this to the 0.5 milestone May 23, 2017
@nsriram13 (Author) commented

Thanks for the quick fix. I am running the default R build on Windows, so I am not using an optimized BLAS. I tried installing MKL today, but figuring it out on Windows is such a time sink. I will switch to a Linux box and let you know how it goes.
