Approach to speed up dist2 for a large number of documents #190
It is possible to make it parallel, but the main problem is that
I have about 20K documents.
A few more questions.
I made it faster, but the complexity is still O(n^2) (and that will remain). Summarising:
The following code runs in ~12 sec on my MacBook with Apple vecLib BLAS (latest text2vec from GitHub):

```r
library(text2vec)
data("movie_review")

# tokenize
tokens = movie_review$review %>%
  tolower %>%
  word_tokenizer

v = create_vocabulary(itoken(tokens)) %>%
  prune_vocabulary(term_count_min = 5, doc_proportion_max = 0.5)

dtm = create_dtm(itoken(tokens), vocab_vectorizer(v))
tcm = create_tcm(itoken(tokens), vocab_vectorizer(v), skip_grams_window = 5)

# GloVe model training
glove_model = GloVe$new(word_vectors_size = 50, vocabulary = v, x_max = 10)
wv = glove_model$fit(tcm, n_iter = 10)
word_vectors = wv$get_word_vectors()

rwmd_model2 = RWMD$new(word_vectors)
system.time(rwmd_dist2 <- dist2(dtm[1:200, ], method = rwmd_model2, norm = 'none'))
```
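To go beyond a small row slice like `dtm[1:200, ]`, one pragmatic option is to split the rows into chunks and compute each chunk's block of the distance matrix on a separate core. A minimal base-R sketch of the idea, using plain Euclidean distance and a random dense matrix as stand-ins for RWMD and the dtm (`chunk_dist` and every name below are illustrative, not text2vec API):

```r
library(parallel)

set.seed(42)
m = matrix(runif(100 * 10), nrow = 100)  # stand-in for a dense dtm

# Distances from the rows in `rows` to every row of m, via the identity
# ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x . y  (pmax guards tiny negatives)
chunk_dist = function(rows, m) {
  cross = tcrossprod(m[rows, , drop = FALSE], m)
  sq = rowSums(m^2)
  d2 = outer(sq[rows], rep(1, nrow(m))) +
       outer(rep(1, length(rows)), sq) - 2 * cross
  sqrt(pmax(d2, 0))
}

# split rows into 4 contiguous chunks; mclapply forks on Unix
chunks = split(seq_len(nrow(m)), cut(seq_len(nrow(m)), 4, labels = FALSE))
res = do.call(rbind, mclapply(chunks, chunk_dist, m = m, mc.cores = 2))
```

The same shape of loop applies to any pairwise metric: each worker only needs its chunk of rows plus the full matrix, so memory per worker stays bounded by the chunk size.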
Thanks for the quick fix. I am running the default R on Windows and hence am not using any optimized BLAS. I tried installing MKL today, but it is just such a time sink figuring it out on Windows. I will switch to a Linux box and let you know how it goes.
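Forking is not available on Windows, but chunked work can still run in parallel there through a PSOCK cluster. A hedged base-R sketch (all names are illustrative; the per-chunk function here just computes row norms, but any per-chunk computation, such as one block of a distance matrix, would slot in the same way):

```r
library(parallel)

set.seed(1)
m = matrix(runif(60 * 8), nrow = 60)   # stand-in for a dense dtm
grp = rep(1:2, length.out = nrow(m))   # assign rows to 2 interleaved chunks
chunks = split(seq_len(nrow(m)), grp)

cl = makeCluster(2)                    # PSOCK workers: portable, incl. Windows
clusterExport(cl, "m")                 # ship the matrix to every worker
row_norms = parLapply(cl, chunks, function(idx)
  sqrt(rowSums(m[idx, , drop = FALSE]^2)))
stopCluster(cl)

# reassemble the per-chunk results back into original row order
norms = unsplit(row_norms, grp)
```

PSOCK workers start fresh R sessions, so anything they need (data, packages) must be exported or loaded explicitly, which adds per-task overhead that forking avoids.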
I am using the RWMD feature to generate a document-to-document distance matrix. It works well on a small set, but is really slow when I try it on the full document set. I am monitoring my CPU usage and I see that it is not utilizing all cores of my computer. Is there an easy way to make this run faster? The toy code below takes forever to run and can be a good test case:
Really appreciate the help!