2vec saveload fixes (#11)

* Make docs clearer on `alpha` parameter in LDA model
* Update Hoffman paper link
* rm whitespace
* Update gensim/models/ldamodel.py
* Update gensim/models/ldamodel.py
* Update gensim/models/ldamodel.py
* re-applying changes from piskvorky#2821
* migrating + regenerating changed docs
* fix forgotten iteritems
* remove extra `model.wv`
* split overlong doc line
* get rid of six in doc2vec
* increase test timeout for Visdom server
* add 32/64 bits report
* add deprecations for init_sims()
* remove vectors_norm + add link to migration guide to deprecation warnings
* rename vectors_norm everywhere, update tests, regen docs
* put back no-op property setter of deprecated vectors_norm
* fix typo
* fix flake8
* disable Keras tests - failing with weird errors on py3.7+3.8, see https://travis-ci.org/github/RaRe-Technologies/gensim/jobs/713448950#L862
* test showing FT failure as W2V
* set .vectors even when ngrams off
* Update gensim/test/test_fasttext.py
* Update gensim/test/test_fasttext.py
* refresh docs for run_annoy tutorial
* Reduce memory use of the term similarity matrix constructor, deprecate the positive_definite parameter, and extend normalization capabilities of the inner_product method (piskvorky#2783)
* Deprecate SparseTermSimilarityMatrix's positive_definite parameter
* Reference paper on efficient implementation of soft cosine similarity
* Add example with Annoy indexer to SparseTermSimilarityMatrix
* Add example of obtaining word embeddings from SparseTermSimilarityMatrix
* Reduce space complexity of SparseTermSimilarityMatrix construction
  Build matrix using arrays and bitfields rather than DOK sparse format
  This work is based on the following blog post by @maciejkula: https://maciejkula.github.io/2015/02/22/incremental-construction-of-sparse-matrices/
* Fix a typo in the soft cosine similarity Jupyter notebook
* Add human-readable string representation for TermSimilarityIndex
* Avoid sparse term similarity matrix computation when nonzero_limit <= 0
* Extend normalization in the inner_product method
  Support the `maintain` vector normalization scheme.
  Support separate vector normalization schemes for queries and documents.
* Remove a note in the docstring of SparseTermSimilarityMatrix
* Rerun continuous integration tests
* Use ==/!= to compare constant literals
* Add human-readable string representation for TermSimilarityIndex (cont.)
* Prod flake8 with a coding style violation in a docstring
* Collapse two lambdas into one internal function
* Revert "Prod flake8 with a coding style violation in a docstring"
  This reverts commit 6557b84.
* Avoid str.format()
* Slice SparseTermSimilarityMatrix.inner_product tests by input types
* Remove similarity_type_code local variable
* Remove starting underscore from local function name
* Save indentation level and define populate_buffers function
* Extract SparseTermSimilarityMatrix constructor body to _create_source
* Extract NON_NEGATIVE_NORM_ASSERTION_MESSAGE to a module-level constant
* Extract cell assignment logic to cell_full local function
* Split variable swapping into three separate statements
* Extract normalization from the body of SparseTermSimilarityMatrix.inner_product
* Wrap overlong line
* Add test_inner_product_zerovector_zerovector and test_inner_product_zerovector_vector tests
* Further split test_inner_product into 63 test cases
* Raise ValueError when dictionary is empty
* Fix doc2vec crash for large sets of doc-vectors (piskvorky#2907)
* Fix AttributeError in WikiCorpus (piskvorky#2901)
* bug fix: wikicorpus getstream from data file-path \n Replace fname with input
* refactor: use property decorator for input
  Co-authored-by: jshah02 <jenisnehal.shah@factset.com>
* intensify cbow+hs tests; bulk testing method
* use increment operator
  Co-authored-by: Radim Řehůřek <me@radimrehurek.com>
* Change num_words to topn in dtm_coherence (piskvorky#2926)
* docstirng fixes
* get rid of python2 constructs

Co-authored-by: S Mono <10430241+xh2@users.noreply.github.com>
Co-authored-by: Gordon Mohr <gojogit@gmail.com>
Co-authored-by: Vít Novotný <witiko@mail.muni.cz>
Co-authored-by: jeni Shah <jenishah@users.noreply.github.com>
Co-authored-by: jshah02 <jenisnehal.shah@factset.com>
Co-authored-by: Megan <megan.stodel@bbc.co.uk>
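
The "Build matrix using arrays and bitfields rather than DOK sparse format" item can be illustrated with a small, standard-library-only sketch. This is not the gensim implementation: the class name `IncrementalCOO` and its methods are hypothetical, and the sketch covers only the flat-array part of the idea (appending row, column, and value triplets to typed arrays, then converting to CSR layout), not the bitfield bookkeeping.

```python
from array import array


class IncrementalCOO:
    """Hypothetical append-only buffers: three flat typed arrays instead of a dict of keys."""

    def __init__(self, n_rows, n_cols):
        self.shape = (n_rows, n_cols)
        self.rows = array("i")  # row index of each stored cell
        self.cols = array("i")  # column index of each stored cell
        self.data = array("f")  # value of each stored cell (32-bit float)

    def append(self, row, col, value):
        """Record one non-zero cell; amortized O(1), no per-cell Python objects kept."""
        self.rows.append(row)
        self.cols.append(col)
        self.data.append(value)

    def to_csr(self):
        """Return (indptr, indices, data) in CSR layout using a counting sort on rows."""
        n_rows = self.shape[0]
        indptr = array("i", [0] * (n_rows + 1))
        for r in self.rows:  # count the cells in each row
            indptr[r + 1] += 1
        for r in range(n_rows):  # prefix sum turns counts into row offsets
            indptr[r + 1] += indptr[r]
        nnz = len(self.data)
        indices = array("i", [0] * nnz)
        data = array("f", [0.0] * nnz)
        next_slot = array("i", indptr[:-1])  # next free position within each row
        for r, c, v in zip(self.rows, self.cols, self.data):
            slot = next_slot[r]
            indices[slot] = c
            data[slot] = v
            next_slot[r] += 1
        return indptr, indices, data


# Usage: a 3x3 identity matrix plus one off-diagonal cell.
m = IncrementalCOO(3, 3)
for i in range(3):
    m.append(i, i, 1.0)
m.append(0, 2, 0.5)
indptr, indices, data = m.to_csr()
```

With SciPy available, the same three buffers could instead be passed to `scipy.sparse.coo_matrix((data, (rows, cols)), shape=shape)`. The point of the sketch is only that flat typed arrays avoid the per-entry overhead of a dictionary-of-keys structure.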
gojomo · Sep 8, 2020 · b5794ee
1 parent 0316084
commit b5794ee
Showing 65 changed files with 1,825 additions and 1,105 deletions.
diff --git a/ISSUE_TEMPLATE.md b/ISSUE_TEMPLATE.md
@@ -22,6 +22,7 @@ Please provide the output of:
 ```python
 import platform; print(platform.platform())
 import sys; print("Python", sys.version)
+import struct; print("Bits", 8 * struct.calcsize("P"))
 import numpy; print("NumPy", numpy.__version__)
 import scipy; print("SciPy", scipy.__version__)
 import gensim; print("gensim", gensim.__version__)

diff --git a/docs/notebooks/soft_cosine_tutorial.ipynb b/docs/notebooks/soft_cosine_tutorial.ipynb
@@ -225,7 +225,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Number of documents: 3\n",
+      "Number of documents: 2274338\n",
       "CPU times: user 2min 1s, sys: 1.9 s, total: 2min 3s\n",
       "Wall time: 2min 56s\n"
      ]
@@ -259,7 +259,7 @@
     "        [preprocess(relcomment[\"RelCText\"]) for relcomment in thread[\"RelComments\"]])\n",
     "    for thread in api.load(\"semeval-2016-2017-task3-subtaskA-unannotated\")]))\n",
     "\n",
-    "print(\"Number of documents: %d\" % len(documents))"
+    "print(\"Number of documents: %d\" % len(corpus))"
    ]
   },
   {

diff --git a/docs/src/_matutils.rst b/docs/src/_matutils.rst
@@ -1,8 +1,8 @@
-:mod:`_matutils` -- Cython matutils
-===================================
+:mod:`_matutils` -- Compiled extension for math utils
+=====================================================
 
 .. automodule:: gensim._matutils
-    :synopsis: Cython math utils
+    :synopsis: Compiled extension for math utils
     :members:
     :inherited-members:
     :undoc-members:

diff --git a/docs/src/apiref.rst b/docs/src/apiref.rst
@@ -50,6 +50,7 @@ Modules:
     models/_fasttext_bin
     models/phrases
     models/poincare
+    viz/poincare
     models/coherencemodel
     models/basemodel
     models/callbacks
@@ -63,7 +64,8 @@ Modules:
     models/wrappers/varembed
     similarities/docsim
     similarities/termsim
-    similarities/index
+    similarities/annoy
+    similarities/nmslib
     sklearn_api/atmodel
     sklearn_api/d2vmodel
     sklearn_api/hdp
@@ -102,4 +104,3 @@ Modules:
     summarization/summariser
     summarization/syntactic_unit
     summarization/textcleaner
-    viz/poincare
diff --git a/docs/src/auto_examples/core/images/sphx_glr_run_similarity_queries_001.png b/docs/src/auto_examples/core/images/sphx_glr_run_similarity_queries_001.png
diff --git a/docs/src/auto_examples/core/images/thumb/sphx_glr_run_similarity_queries_thumb.png b/docs/src/auto_examples/core/images/thumb/sphx_glr_run_similarity_queries_thumb.png
diff --git a/docs/src/auto_examples/core/run_similarity_queries.py.md5 b/docs/src/auto_examples/core/run_similarity_queries.py.md5
@@ -1 +1 @@
-a3eaf7347874a32d1d25a455753206dc
+54804120deb345715247f0eed42b5e0e
diff --git a/docs/src/auto_examples/core/run_similarity_queries.rst b/docs/src/auto_examples/core/run_similarity_queries.rst
@@ -142,7 +142,7 @@ no random-walk static ranks, just a semantic extension over the boolean keyword
 
  .. code-block:: none
 
-    [(0, 0.4618210045327158), (1, 0.07002766527900064)]
+    [(0, 0.46182100453271613), (1, 0.07002766527900031)]
 
 
 
@@ -254,15 +254,15 @@ order, and obtain the final answer to the query `"Human computer interaction"`:
 
  .. code-block:: none
 
-    (2, 0.9984453) Human machine interface for lab abc computer applications
-    (0, 0.998093) A survey of user opinion of computer system response time
-    (3, 0.9865886) The EPS user interface management system
-    (1, 0.93748635) System and human system engineering testing of EPS
-    (4, 0.90755945) Relation of user perceived response time to error measurement
-    (8, 0.050041765) The generation of random binary unordered trees
-    (7, -0.09879464) The intersection graph of paths in trees
-    (6, -0.10639259) Graph minors IV Widths of trees and well quasi ordering
-    (5, -0.12416792) Graph minors A survey
+    0.9984453 The EPS user interface management system
+    0.998093 Human machine interface for lab abc computer applications
+    0.9865886 System and human system engineering testing of EPS
+    0.93748635 A survey of user opinion of computer system response time
+    0.90755945 Relation of user perceived response time to error measurement
+    0.050041765 Graph minors A survey
+    -0.09879464 Graph minors IV Widths of trees and well quasi ordering
+    -0.10639259 The intersection graph of paths in trees
+    -0.12416792 The generation of random binary unordered trees
 
 
 
@@ -319,17 +319,17 @@ on large datasets easily, and to facilitate prototyping of new algorithms for re
 
  .. code-block:: none
 
-    /Volumes/work/workspace/gensim_misha/docs/src/gallery/core/run_similarity_queries.py:194: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
-      plt.show()
+    /Volumes/work/workspace/vew/gensim3.6/lib/python3.6/site-packages/matplotlib/figure.py:445: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
+      % get_backend())
 
 
 
 
 .. rst-class:: sphx-glr-timing
 
-   **Total running time of the script:** ( 0 minutes  0.663 seconds)
+   **Total running time of the script:** ( 0 minutes  1.211 seconds)
 
-**Estimated memory usage:**  6 MB
+**Estimated memory usage:**  39 MB
 
 
 .. _sphx_glr_download_auto_examples_core_run_similarity_queries.py:

diff --git a/docs/src/auto_examples/core/sg_execution_times.rst b/docs/src/auto_examples/core/sg_execution_times.rst
@@ -5,9 +5,9 @@
 
 Computation times
 =================
-**00:00.844** total execution time for **auto_examples_core** files:
+**00:01.211** total execution time for **auto_examples_core** files:
 
-- **00:00.844**: :ref:`sphx_glr_auto_examples_core_run_topics_and_transformations.py` (``run_topics_and_transformations.py``)
+- **00:01.211**: :ref:`sphx_glr_auto_examples_core_run_similarity_queries.py` (``run_similarity_queries.py``)
 - **00:00.000**: :ref:`sphx_glr_auto_examples_core_run_core_concepts.py` (``run_core_concepts.py``)
 - **00:00.000**: :ref:`sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py` (``run_corpora_and_vector_spaces.py``)
-- **00:00.000**: :ref:`sphx_glr_auto_examples_core_run_similarity_queries.py` (``run_similarity_queries.py``)
+- **00:00.000**: :ref:`sphx_glr_auto_examples_core_run_topics_and_transformations.py` (``run_topics_and_transformations.py``)