From 01635ac292d33cfaead11140f52cf969e24ba172 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Radim=20=C5=98eh=C5=AF=C5=99ek?= Date: Fri, 22 Jun 2018 12:29:47 +0200 Subject: [PATCH 01/14] update non-API docs --- docs/src/about.rst | 46 +++++++++++----------- docs/src/distributed.rst | 8 ++-- docs/src/install.rst | 62 ++++++++++++++---------------- docs/src/intro.rst | 82 ++++++++++++++++------------------------ docs/src/support.rst | 25 ++++++------ 5 files changed, 100 insertions(+), 123 deletions(-) diff --git a/docs/src/about.rst b/docs/src/about.rst index 64a65bd333..25194b9404 100644 --- a/docs/src/about.rst +++ b/docs/src/about.rst @@ -2,12 +2,12 @@ .. _about: -============ +===== About -============ +===== History --------- +------- Gensim started off as a collection of various Python scripts for the Czech Digital Mathematics Library `dml.cz `_ in 2008, where it served to generate a short list of the most similar articles to a given article (**gensim = "generate similar"**). @@ -15,19 +15,18 @@ I also wanted to try these fancy "Latent Semantic Methods", but the libraries th realized the necessary computation were `not much fun to work with `_. Naturally, I set out to reinvent the wheel. Our `2010 LREC publication `_ -describes the initial design decisions behind gensim (clarity, efficiency and scalability) -and is fairly representative of how gensim works even today. +describes the initial design decisions behind Gensim: clarity, efficiency and scalability. It is fairly representative of how Gensim works even today. Later versions of gensim improved this efficiency and scalability tremendously. In fact, I made algorithmic scalability of distributional semantics the topic of my `PhD thesis `_. -By now, gensim is---to my knowledge---the most robust, efficient and hassle-free piece +By now, Gensim is---to my knowledge---the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text. 
It stands in contrast to brittle homework-assignment-implementations that do not scale on one hand, and robust java-esque projects that take forever just to run "hello world". In 2011, I started using `Github `_ for source code hosting -and the gensim website moved to its present domain. In 2013, gensim got its current logo and website design. +and the Gensim website moved to its present domain. In 2013, Gensim got its current logo and website design. Licensing @@ -35,39 +34,40 @@ Licensing Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license `_. This means that it's free for both personal and commercial use, but if you make any -modification to gensim that you distribute to other people, you have to disclose +modification to Gensim that you distribute to other people, you have to disclose the source code of these modifications. -Apart from that, you are free to redistribute gensim in any way you like, though you're +Apart from that, you are free to redistribute Gensim in any way you like, though you're not allowed to modify its license (doh!). -My intent here is, of course, to **get more help and community involvement** with the development of gensim. +My intent here is to **get more help and community involvement** with the development of Gensim. The legalese is therefore less important to me than your input and contributions. -Contact me if LGPL doesn't fit your bill but you'd still like to use gensim -- we'll work something out. + +`Contact me `_ if LGPL doesn't fit your bill and you'd like the open source restrictions lifted. .. seealso:: - I also host a document similarity package `gensim.simserver`. This is a high-level - interface to `gensim` functionality, and offers transactional remote (web-based) - document similarity queries and indexing. It uses gensim to do the heavy lifting: - you don't need the `simserver` to use gensim, but you do need gensim to use the `simserver`. 
- Note that unlike gensim, `gensim.simserver` is licensed under `Affero GPL `_, - which is much more restrictive for inclusion in commercial projects. + We also built a high performance commercial server for NLP, document analysis, indexing, search and clustering: https://scaletext.ai. ScaleText is available both on-prem and as SaaS. + + Reach out at info@scaletext.com if you need an industry-grade NLP tool with professional support. + Contributors --------------- +------------ -Credit goes to all the people who contributed to gensim, be it in `discussions `_, +Credit goes to the many people who contributed to Gensim, be it in `discussions `_, ideas, `code contributions `_ or `bug reports `_. + It's really useful and motivating to get feedback, in any shape or form, so big thanks to you all! Some honorable mentions are included in the `CHANGELOG.txt `_. Academic citing ----------------- +--------------- -Gensim has been used in `many students' final theses as well as research papers `_. When citing gensim, -please use `this BibTeX entry `_:: +Gensim has been used in `over a thousand research papers and student theses `_. + +When citing Gensim, please use `this BibTeX entry `_:: @inproceedings{rehurek_lrec, title = {{Software Framework for Topic Modelling with Large Corpora}}, @@ -83,5 +83,3 @@ please use `this BibTeX entry `_:: note={\url{http://is.muni.cz/publication/884893/en}}, language={English} } - - diff --git a/docs/src/distributed.rst b/docs/src/distributed.rst index 38f243222f..eaefe27a5f 100644 --- a/docs/src/distributed.rst +++ b/docs/src/distributed.rst @@ -1,7 +1,7 @@ .. _distributed: Distributed Computing -=================================== +===================== Why distributed computing? 
--------------------------- @@ -42,15 +42,15 @@ installation is quite painless and only involves copying its `*.py` files somewh sudo easy_install Pyro4 -You don't have to install `Pyro` to run `gensim`, but if you don't, you won't be able +You don't have to install Pyro to run Gensim, but if you don't, you won't be able to access the distributed features (i.e., everything will always run in serial mode, the examples on this page don't apply). Core concepts ------------------------------------ +------------- -As always, `gensim` strives for a clear and straightforward API (see :ref:`design`). +As always, Gensim strives for a clear and straightforward API (see :ref:`design`). To this end, *you do not need to make any changes in your code at all* in order to run it over a cluster of computers! diff --git a/docs/src/install.rst b/docs/src/install.rst index 69d101e430..3f3b2984b9 100644 --- a/docs/src/install.rst +++ b/docs/src/install.rst @@ -9,11 +9,11 @@ Quick install Run in your terminal:: - easy_install -U gensim + pip install --upgrade gensim or, alternatively:: - pip install --upgrade gensim + easy_install -U gensim In case that fails, make sure you're installing into a writeable location (or use `sudo`), or read on. @@ -28,9 +28,6 @@ platform that supports Python 2.6+ and NumPy. Gensim depends on the following so * `NumPy `_ >= 1.3. Tested with version 1.9.0, 1.7.1, 1.7.0, 1.6.2, 1.6.1rc2, 1.5.0rc1, 1.4.0, 1.3.0, 1.3.0rc2. * `SciPy `_ >= 0.7. Tested with version 0.14.0, 0.12.0, 0.11.0, 0.10.1, 0.9.0, 0.8.0, 0.8.0b1, 0.7.1, 0.7.0. -**Windows users** are well advised to try the `Enthought distribution `_, -which conveniently includes Python & NumPy & SciPy in a single bundle, and is free for academic use. - Install Python and `easy_install` --------------------------------- @@ -50,20 +47,19 @@ Install SciPy & NumPy ---------------------- These are quite popular Python packages, so chances are there are pre-built binary -distributions available for your platform. 
You can try installing from source using easy_install:: +distributions available for your platform. You can try installing from source using `pip` or `easy_install`:: - easy_install numpy - easy_install scipy + easy_install install numpy + easy_install install scipy -If that doesn't work or if you'd rather install using a binary package, consult -http://www.scipy.org/Download. +If that doesn't work or if you'd rather install using a binary package, consult http://www.scipy.org/Download. -Install `gensim` ------------------ +Install Gensim +-------------- -You can now install (or upgrade) `gensim` with:: +You can now install (or upgrade) Gensim with:: - easy_install --upgrade gensim + easy_install -U gensim That's it! Congratulations, you can proceed to the :doc:`tutorials `. @@ -74,53 +70,51 @@ of computers, in :doc:`distributed`, you should install with:: easy_install gensim[distributed] -The optional `distributed` feature installs `Pyro (PYthon Remote Objects) `_. -If you don't know what distributed computing means, you can ignore it: -`gensim` will work fine for you anyway. +The optional ``distributed`` feature installs `Pyro (PYthon Remote Objects) `_. +If you don't know what distributed computing means, you can ignore it: Gensim will work fine for you anyway. + This optional extension can also be installed separately later with:: - easy_install Pyro4 + pip install Pyro4 ----- There are also alternative routes to install: 1. If you have downloaded and unzipped the `tar.gz source `_ - for `gensim` (or you're installing `gensim` from `github `_), + for Gensim (or you're installing Gensim from `Github `_), you can run:: python setup.py install - to install `gensim` into your ``site-packages`` folder. -2. If you wish to make local changes to the `gensim` code (`gensim` is, after all, a - package which targets research prototyping and modifications), a preferred - way may be installing with:: + to install Gensim into your ``site-packages`` folder. +2. 
If you wish to make local changes to the Gensim code, a preferred way may be installing with:: python setup.py develop + or:: + + pip install -e . + This will only place a symlink into your ``site-packages`` directory. The actual files will stay wherever you unpacked them. -3. If you don't have root priviledges (or just don't want to put the package into - your ``site-packages``), simply unpack the source package somewhere and that's it! No - compilation or installation needed. Just don't forget to set your PYTHONPATH - (or modify ``sys.path``), so that Python can find the unpacked package when importing. -Testing `gensim` ----------------- +Testing Gensim +-------------- To test the package, unzip the `tar.gz source `_ and run:: python setup.py test -Gensim uses Travis CI for continuous integration: |Travis|_ +Gensim uses Travis CI for continuous integration, automatically running the full test suite on each pull request and commit: |Travis|_ -.. |Travis| image:: https://api.travis-ci.org/piskvorky/gensim.png?branch=develop -.. _Travis: https://travis-ci.org/piskvorky/gensim +.. |Travis| image:: https://travis-ci.org/RaRe-Technologies/gensim.svg?branch=develop +.. _Travis: https://travis-ci.org/RaRe-Technologies/gensim Problems? --------- -Use the `gensim discussion group `_ for -questions and troubleshooting. See the :doc:`support page `. +Use the `Gensim discussion group `_ for +questions and troubleshooting. See the :doc:`support page ` for commercial support. diff --git a/docs/src/intro.rst b/docs/src/intro.rst index 3ffc724267..176c866fec 100644 --- a/docs/src/intro.rst +++ b/docs/src/intro.rst @@ -9,14 +9,10 @@ topics from documents, as efficiently (computer-wise) and painlessly (human-wise Gensim is designed to process raw, unstructured digital texts ("*plain text*"). 
-The algorithms in `gensim`, such as **Latent Semantic Analysis**, **Latent Dirichlet Allocation** and **Random Projections** -discover semantic structure of documents by examining statistical -co-occurrence patterns of the words within a corpus of training documents. -These algorithms are unsupervised, which means no human input is necessary -- you only need a corpus of plain text documents. -Once these statistical patterns are found, any plain text documents can be succinctly -expressed in the new, semantic representation and queried for topical similarity -against other documents. +The algorithms in Gensim, such as **Word2Vec**, **FastText**, **Latent Semantic Analysis**, **Latent Dirichlet Allocation** and **Random Projections**, discover semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are **unsupervised**, which means no human input is necessary -- you only need a corpus of plain text documents. + +Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents, words or phrases. .. note:: If the previous paragraphs left you confused, you can read more about the `Vector @@ -27,66 +23,60 @@ against other documents. .. _design: Features ------------------- +-------- * **Memory independence** -- there is no need for the whole training corpus to reside fully in RAM at any one time (can process large, web-scale corpora). +* **Memory sharing** -- trained models can be persisted to disk and loaded back via mmap. Multiple processes can share the same data, cutting down RAM footprint. * Efficient implementations for several popular vector space algorithms, - including **Tf-Idf**, distributed incremental **Latent Semantic Analysis**, - distributed incremental **Latent Dirichlet Allocation (LDA)** or **Random Projection**; adding new ones is easy (really!). 
-* I/O wrappers and converters around **several popular data formats**. -* **Similarity queries** for documents in their semantic representation. - -The creation of `gensim` was motivated by a perceived lack of available, scalable software -frameworks that realize topic modelling, and/or their overwhelming internal complexity (hail Java!). -You can read more about the motivation in our `LREC 2010 workshop paper `_. -If you want to cite `gensim` in your own work, please refer to that article (`BibTeX `_). - -You're welcome to share your results and experiments on the `mailing list `_. + including Word2Vec, Doc2Vec, FastText, TF-IDF, Latent Semantic Analysis (LSI, LSA), + Latent Dirichlet Allocation (LDA) or Random Projection. +* I/O wrappers and readers from several popular data formats. +* Fast similarity queries for documents in their semantic representation. -The **principal design objectives** behind `gensim` are: +The **principal design objectives** behind Gensim are: -1. Straightforward interfaces and low API learning curve for developers. Good - for prototyping. +1. Straightforward interfaces and low API learning curve for developers. Good for prototyping. 2. Memory independence with respect to the size of the input corpus; all intermediate steps and algorithms operate in a streaming fashion, accessing one document at a time. .. seealso:: - If you're interested in document indexing/similarity retrieval, I also maintain a higher-level package - of `document similarity server `_. It uses `gensim` internally. + We also built a high performance commercial server for NLP, document analysis, indexing, search and clustering: https://scaletext.ai. ScaleText is available both on-prem and as SaaS. + + Reach out at info@scaletext.com if you need an industry-grade NLP tool with professional support. + .. 
_availability: Availability ------------ -Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license `_ -and can be downloaded either from its `github repository `_ +Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license `_ and can be downloaded either from its `Github repository `_ or from the `Python Package Index `_. .. seealso:: - See the :doc:`install ` page for more info on `gensim` deployment. + See the :doc:`install ` page for more info on Gensim deployment. Core concepts ------------- -The whole `gensim` package revolves around the concepts of :term:`corpus`, :term:`vector` and -:term:`model`. +The whole Gensim package revolves around the concepts of :term:`corpus`, :term:`vector` and :term:`model`. .. glossary:: Corpus A collection of digital documents. This collection is used to automatically - infer the structure of the documents, their topics, etc. For + infer the vector structure of the documents, their topics, etc. For this reason, the collection is also called a *training corpus*. This inferred latent structure can be later used to assign topics to new documents, which did - not appear in the training corpus. - No human intervention (such as tagging the documents by hand, or creating - other metadata) is required. + not appear in the *training corpus*. + + This inferred latent structure can be later used to discover topics for new documents, which did + not appear in the training corpus. No human intervention (such as annotating or tagging documents by hand, or creating other metadata) is required. Vector In the Vector Space Model (VSM), each document is represented by an @@ -100,8 +90,10 @@ The whole `gensim` package revolves around the concepts of :term:`corpus`, :term The question is usually represented only by its integer id (such as `1`, `2` and `3` here), so that the representation of this document becomes a series of pairs like ``(1, 0.0), (2, 2.0), (3, 5.0)``. 
+ If we know all the questions in advance, we may leave them implicit and simply write ``(0.0, 2.0, 5.0)``. + This sequence of answers can be thought of as a *vector* (in this case a 3-dimensional vector). For practical purposes, only questions to which the answer is (or can be converted to) a single real number are allowed. @@ -120,27 +112,17 @@ The whole `gensim` package revolves around the concepts of :term:`corpus`, :term Gensim does not prescribe any specific corpus format; a corpus is anything that, when iterated over, successively yields these sparse vectors. - For example, `set((((2, 2.0), (3, 5.0)), ((0, 1.0), (3, 1.0))))` is a trivial - corpus of two documents, each with two non-zero `feature-answer` pairs. - + For example, ``[ [(2, 2.0), (3, 5.0)], [(0, 1.0), (3, 1.0)] ]`` + is a simple corpus of two documents, each with two non-zero `feature-answer` pairs. Model - We use **model** as an abstract term referring to a transformation from - one document representation to another. In `gensim` documents are + We use **model** as an abstract term referring to the code and associated data + required to transform one document representation to another. In Gensim, documents are represented as vectors so a model can be thought of as a transformation - between two vector spaces. The details of this transformation are - learned from the training corpus. - - - For example, consider a transformation that takes a raw count of word - occurrences and weights them so that common words are discounted and - rare words are promoted. The exact amount that any particular word is - weighted by is determined by the relative frequency of that word in the - training corpus. When we apply this model we transform from one vector - space (containing the raw word counts) to another (containing the - weighted counts). + between two vector spaces. The parameters of this transformation are learned from the training corpus. 
Gensim + implements multiple models, such as Word2Vec, Latent Semantic Indexing, LDA, FastText etc. .. seealso:: - For some examples on how this works out in code, go to :doc:`tutorials `. + For some examples on how this works out in code, go to :doc:`Tutorials `. diff --git a/docs/src/support.rst b/docs/src/support.rst index a9f83e1380..431fd654c2 100644 --- a/docs/src/support.rst +++ b/docs/src/support.rst @@ -1,18 +1,19 @@ .. _support: -============= +======= Support -============= +======= Open source support --------------------- +------------------- + +The main communication channel is the `Gensim mailing list `_. -The main communication channel is the `gensim mailing list `_. This is the preferred way to **ask for help**, **report problems** and **share insights** with the community. Newbie questions are perfectly fine, just make sure you've read the :doc:`tutorials `. -I discourage sending private emails, because the mailing list serves as a knowledge base for all gensim users, cutting maintenance efforts needed for support. If you feel your problem is too special, data too sensitive, technical scope too demanding, **see the "business" section below**. +I discourage sending private emails, because the mailing list serves as a knowledge base for all Gensim users, cutting maintenance efforts needed for support. If you feel your problem is too special, data too sensitive, technical scope too demanding, **see the "business" section below**. -When posting on the mailing list, try to include all relevant information, such as what it is you are trying to achieve, what went wrong, relevant gensim logs, package versions etc. +When posting on the mailing list, try to include all relevant information, such as what it is you are trying to achieve, what went wrong, relevant Gensim logs, package versions etc. **FAQ** and some useful **snippets of code** are maintained on GitHub: https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ. 
@@ -20,14 +21,16 @@ You can also try asking on StackOverflow, using the `gensim tag `_. +We run a consulting R&D company focused on data mining and unstructured text processing, https://rare-technologies.com. + +If you need commercial support, design validation, technical training or custom system development, `get in touch `_ for a quote. -In case you need commercial support, design validation, technical training or custom system development, `get in touch `_ for a quote. Developer support ------------------ -Developers who `tweak gensim internals `_ are encouraged to report issues at the `GitHub issue tracker `_. -Note that this is not a medium for discussions or asking open-ended questions; please use the mailing list for that. +Developers who `tweak Gensim internals `_ are encouraged to report issues at the `GitHub issue tracker `_. + +Note that Github is not a medium for discussions or asking open-ended questions; please use the `mailing list `_ for that. From 9349437505611a89d6a06e857778dd6a02b61da4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Radim=20=C5=98eh=C5=AF=C5=99ek?= Date: Mon, 25 Jun 2018 17:58:24 +0200 Subject: [PATCH 02/14] Ivan's review comments --- docs/src/install.rst | 52 ++++++++++++++++++++++---------------------- docs/src/intro.rst | 8 +++++-- 2 files changed, 32 insertions(+), 28 deletions(-) diff --git a/docs/src/install.rst b/docs/src/install.rst index 3f3b2984b9..b9a183e498 100644 --- a/docs/src/install.rst +++ b/docs/src/install.rst @@ -7,29 +7,30 @@ Installation Quick install -------------- -Run in your terminal:: +Run in your terminal (recommended):: pip install --upgrade gensim -or, alternatively:: +or, alternatively for `conda` environments:: - easy_install -U gensim + conda install -c conda-forge gensim -In case that fails, make sure you're installing into a writeable location (or use `sudo`), or read on. +In case that fails, make sure you're installing into a writeable location (or use `sudo`). 
----- Dependencies ------------- -Gensim is known to run on Linux, Windows and Mac OS X and should run on any other + +Gensim runs on Linux, Windows and Mac OS X, and should run on any other platform that supports Python 2.6+ and NumPy. Gensim depends on the following software: -* `Python `_ >= 2.6. Tested with versions 2.6, 2.7, 3.3, 3.4 and 3.5. Support for Python 2.5 was discontinued starting gensim 0.10.0; if you *must* use Python 2.5, install gensim 0.9.1. +* `Python `_ >= 2.6. Tested with versions 2.6, 2.7, 3.3, 3.4 and 3.5. Support for Python 2.5 was discontinued starting Gensim 0.10.0; if you *must* use Python 2.5, install Gensim version 0.9.1. * `NumPy `_ >= 1.3. Tested with version 1.9.0, 1.7.1, 1.7.0, 1.6.2, 1.6.1rc2, 1.5.0rc1, 1.4.0, 1.3.0, 1.3.0rc2. * `SciPy `_ >= 0.7. Tested with version 0.14.0, 0.12.0, 0.11.0, 0.10.1, 0.9.0, 0.8.0, 0.8.0b1, 0.7.1, 0.7.0. -Install Python and `easy_install` +Install Python and `pip` --------------------------------- Check what version of Python you have with:: @@ -40,17 +41,21 @@ You can download Python from http://python.org/download. .. note:: Gensim requires Python 2.6 / 3.3 or greater, and will not run under earlier versions. -Next, install the `easy_install utility `_, -which will make installing other Python programs easier. +Make sure you have `pip`, Python's recommended tool for installing and managing Python dependencies:: + + pip --version + +Pip typically comes pre-installed with Python. If not, refer to `Installing pip `_. + Install SciPy & NumPy ---------------------- -These are quite popular Python packages, so chances are there are pre-built binary -distributions available for your platform. You can try installing from source using `pip` or `easy_install`:: +These are popular Python packages, so chances are there are pre-built binary +distributions available for your platform. 
Install them using `pip`:: - easy_install install numpy - easy_install install scipy + pip install numpy + pip install scipy If that doesn't work or if you'd rather install using a binary package, consult http://www.scipy.org/Download. @@ -59,16 +64,15 @@ Install Gensim You can now install (or upgrade) Gensim with:: - easy_install -U gensim + pip install --upgrade gensim That's it! Congratulations, you can proceed to the :doc:`tutorials `. ----- -If you also want to run the algorithms over a cluster -of computers, in :doc:`distributed`, you should install with:: +If you also want to run the algorithms over a cluster of computers, in :doc:`distributed`, you should install with:: - easy_install gensim[distributed] + pip install 'gensim[distributed]' The optional ``distributed`` feature installs `Pyro (PYthon Remote Objects) `_. If you don't know what distributed computing means, you can ignore it: Gensim will work fine for you anyway. @@ -85,19 +89,15 @@ There are also alternative routes to install: for Gensim (or you're installing Gensim from `Github `_), you can run:: - python setup.py install + pip install . to install Gensim into your ``site-packages`` folder. 2. If you wish to make local changes to the Gensim code, a preferred way may be installing with:: - python setup.py develop - - or:: - - pip install -e . + pip install --editable . This will only place a symlink into your ``site-packages`` directory. The actual - files will stay wherever you unpacked them. + files will stay wherever you unpacked them, ready for editing. Testing Gensim @@ -107,12 +107,12 @@ To test the package, unzip the `tar.gz source `, :doc:`FastText `, +:doc:`Latent Semantic Analysis `, :doc:`Latent Dirichlet Allocation ` etc, +automatically discover the semantic structure of documents by examining statistical +co-occurrence patterns within a corpus of training documents. 
These algorithms are **unsupervised**, +which means no human input is necessary -- you only need a corpus of plain text documents. -Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents, words or phrases. +Once these statistical patterns are found, any plain text documents (sentence, phrase, word…) can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents (words, phrases…). .. note:: If the previous paragraphs left you confused, you can read more about the `Vector From e644cbbf0dbfff9c736ef0338dc934e415c2dfa7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Radim=20=C5=98eh=C5=AF=C5=99ek?= Date: Mon, 25 Jun 2018 18:16:15 +0200 Subject: [PATCH 03/14] remove algo links (broken) --- docs/src/intro.rst | 5 ++--- docs/src/models/doc2vec.rst | 6 +++--- docs/src/models/word2vec.rst | 6 +++--- 3 files changed, 8 insertions(+), 9 deletions(-) diff --git a/docs/src/intro.rst b/docs/src/intro.rst index 03d8ba8ed2..b9e4167b0b 100644 --- a/docs/src/intro.rst +++ b/docs/src/intro.rst @@ -10,9 +10,8 @@ topics from documents, as efficiently (computer-wise) and painlessly (human-wise Gensim is designed to process raw, unstructured digital texts ("*plain text*"). -The algorithms in Gensim, such as :doc:`Word2Vec `, :doc:`FastText `, -:doc:`Latent Semantic Analysis `, :doc:`Latent Dirichlet Allocation ` etc, -automatically discover the semantic structure of documents by examining statistical +The algorithms in Gensim, such as Word2Vec, FastText, Latent Semantic Analysis (LSI, LSA), Latent Dirichlet +Allocation (LDA) etc, automatically discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are **unsupervised**, which means no human input is necessary -- you only need a corpus of plain text documents. 
diff --git a/docs/src/models/doc2vec.rst b/docs/src/models/doc2vec.rst index f2da8bb722..b5d2e290b5 100644 --- a/docs/src/models/doc2vec.rst +++ b/docs/src/models/doc2vec.rst @@ -1,8 +1,8 @@ -:mod:`models.doc2vec` -- Deep learning with paragraph2vec -========================================================= +:mod:`models.doc2vec` -- Doc2vec paragraph embeddings +===================================================== .. automodule:: gensim.models.doc2vec - :synopsis: Deep learning with doc2vec + :synopsis: Doc2vec paragraph embeddings :members: :inherited-members: :undoc-members: diff --git a/docs/src/models/word2vec.rst b/docs/src/models/word2vec.rst index 1679429e22..62117c1a6b 100644 --- a/docs/src/models/word2vec.rst +++ b/docs/src/models/word2vec.rst @@ -1,8 +1,8 @@ -:mod:`models.word2vec` -- Deep learning with word2vec -====================================================== +:mod:`models.word2vec` -- Word2vec embeddings +============================================= .. automodule:: gensim.models.word2vec - :synopsis: Deep learning with word2vec + :synopsis: Word2vec embeddings :members: :inherited-members: :undoc-members: From 03571a38b23356173610db8fca641fa0a44e83a4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Radim=20=C5=98eh=C5=AF=C5=99ek?= Date: Mon, 25 Jun 2018 18:22:38 +0200 Subject: [PATCH 04/14] remove "local testing" section from install.rst --- docs/src/install.rst | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/docs/src/install.rst b/docs/src/install.rst index b9a183e498..ca9a307ece 100644 --- a/docs/src/install.rst +++ b/docs/src/install.rst @@ -15,7 +15,7 @@ or, alternatively for `conda` environments:: conda install -c conda-forge gensim -In case that fails, make sure you're installing into a writeable location (or use `sudo`). +In case that fails, make sure you're installing into a writeable location (or use `sudo`), or keep reading. ----- @@ -31,7 +31,7 @@ platform that supports Python 2.6+ and NumPy. 
Gensim depends on the following so Install Python and `pip` ---------------------------------- +------------------------ Check what version of Python you have with:: @@ -103,16 +103,12 @@ There are also alternative routes to install: Testing Gensim -------------- -To test the package, unzip the `tar.gz source `_ and run:: - - python setup.py test - -Gensim uses continuous integration, automatically running a full test suite and documentation build -on each pull request: |Travis|_ +Gensim uses continuous integration, automatically running a full test suite on each pull request: |Travis|_ .. |Travis| image:: https://travis-ci.org/RaRe-Technologies/gensim.svg?branch=develop .. _Travis: https://travis-ci.org/RaRe-Technologies/gensim + Problems? --------- From ea7ae56f5d61e39869d7031ed9b6685d34c52248 Mon Sep 17 00:00:00 2001 From: ivan Date: Tue, 26 Jun 2018 08:07:44 +0500 Subject: [PATCH 05/14] add links into intro page --- docs/src/intro.rst | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/docs/src/intro.rst b/docs/src/intro.rst index b9e4167b0b..0335226582 100644 --- a/docs/src/intro.rst +++ b/docs/src/intro.rst @@ -10,8 +10,9 @@ topics from documents, as efficiently (computer-wise) and painlessly (human-wise Gensim is designed to process raw, unstructured digital texts ("*plain text*"). -The algorithms in Gensim, such as Word2Vec, FastText, Latent Semantic Analysis (LSI, LSA), Latent Dirichlet -Allocation (LDA) etc, automatically discover the semantic structure of documents by examining statistical +The algorithms in Gensim, such as :class:`~gensim.models.word2vec.Word2Vec`, :class:`~gensim.models.fasttext.FastText`, +Latent Semantic Analysis (LSI, LSA, see :class:`~gensim.models.lsimodel.LsiModel`), Latent Dirichlet +Allocation (LDA, see :class:`~gensim.models.ldamodel.LdaModel`) etc, automatically discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. 
These algorithms are **unsupervised**, which means no human input is necessary -- you only need a corpus of plain text documents. @@ -32,8 +33,9 @@ Features reside fully in RAM at any one time (can process large, web-scale corpora). * **Memory sharing** -- trained models can be persisted to disk and loaded back via mmap. Multiple processes can share the same data, cutting down RAM footprint. * Efficient implementations for several popular vector space algorithms, - including Word2Vec, Doc2Vec, FastText, TF-IDF, Latent Semantic Analysis (LSI, LSA), - Latent Dirichlet Allocation (LDA) or Random Projection. + including :class:`~gensim.models.word2vec.Word2Vec`, :class:`~gensim.models.doc2vec.Doc2Vec`, :class:`~gensim.models.fasttext.FastText`, + TF-IDF, Latent Semantic Analysis (LSI, LSA, see :class:`~gensim.models.lsimodel.LsiModel`), + Latent Dirichlet Allocation (LDA, see :class:`~gensim.models.ldamodel.LdaModel`) or Random Projection (see :class:`~gensim.models.rpmodel.RpModel`). * I/O wrappers and readers from several popular data formats. * Fast similarity queries for documents in their semantic representation. @@ -124,7 +126,9 @@ The whole Gensim package revolves around the concepts of :term:`corpus`, :term:` required to transform one document representation to another. In Gensim, documents are represented as vectors so a model can be thought of as a transformation between two vector spaces. The parameters of this transformation are learned from the training corpus. Gensim - implements multiple models, such as Word2Vec, Latent Semantic Indexing, LDA, FastText etc. + implements multiple models, such as :class:`~gensim.models.word2vec.Word2Vec`, + :class:`~gensim.models.lsimodel.LsiModel`, :class:`~gensim.models.ldamodel.LdaModel`, + :class:`~gensim.models.fasttext.FastText` etc. .. 
seealso:: From cc2bf97eed12f5e306218cb62ab11860ea547bef Mon Sep 17 00:00:00 2001 From: ivan Date: Tue, 26 Jun 2018 08:16:32 +0500 Subject: [PATCH 06/14] add other channels to support --- docs/src/support.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/src/support.rst b/docs/src/support.rst index 431fd654c2..44b213b865 100644 --- a/docs/src/support.rst +++ b/docs/src/support.rst @@ -9,6 +9,8 @@ Open source support The main communication channel is the `Gensim mailing list `_. +Additional channels are `twitter @gensim_py `_ and `Gitter piskvorky/gensim `_. + This is the preferred way to **ask for help**, **report problems** and **share insights** with the community. Newbie questions are perfectly fine, just make sure you've read the :doc:`tutorials `. I discourage sending private emails, because the mailing list serves as a knowledge base for all Gensim users, cutting maintenance efforts needed for support. If you feel your problem is too special, data too sensitive, technical scope too demanding, **see the "business" section below**. From 2ac13cef509fcb76a21a74f3bda807e94a25083b Mon Sep 17 00:00:00 2001 From: ivan Date: Tue, 26 Jun 2018 08:56:07 +0500 Subject: [PATCH 07/14] fix install page (correct dependencies, badges for all CI, drop useless parts) --- docs/src/install.rst | 96 +++++++++----------------------------------- 1 file changed, 18 insertions(+), 78 deletions(-) diff --git a/docs/src/install.rst b/docs/src/install.rst index ca9a307ece..65cbd2b9c2 100644 --- a/docs/src/install.rst +++ b/docs/src/install.rst @@ -17,97 +17,37 @@ or, alternatively for `conda` environments:: In case that fails, make sure you're installing into a writeable location (or use `sudo`), or keep reading. ------ - -Dependencies -------------- - -Gensim runs on Linux, Windows and Mac OS X, and should run on any other -platform that supports Python 2.6+ and NumPy. Gensim depends on the following software: - -* `Python `_ >= 2.6. 
Tested with versions 2.6, 2.7, 3.3, 3.4 and 3.5. Support for Python 2.5 was discontinued starting Gensim 0.10.0; if you *must* use Python 2.5, install Gensim version 0.9.1. -* `NumPy `_ >= 1.3. Tested with version 1.9.0, 1.7.1, 1.7.0, 1.6.2, 1.6.1rc2, 1.5.0rc1, 1.4.0, 1.3.0, 1.3.0rc2. -* `SciPy `_ >= 0.7. Tested with version 0.14.0, 0.12.0, 0.11.0, 0.10.1, 0.9.0, 0.8.0, 0.8.0b1, 0.7.1, 0.7.0. - - -Install Python and `pip` ------------------------- - -Check what version of Python you have with:: - - python --version - -You can download Python from http://python.org/download. - -.. note:: Gensim requires Python 2.6 / 3.3 or greater, and will not run under earlier versions. - -Make sure you have `pip`, Python's recommended tool for installing and managing Python dependencies:: - - pip --version - -Pip typically comes pre-installed with Python. If not, refer to `Installing pip `_. - - -Install SciPy & NumPy ----------------------- - -These are popular Python packages, so chances are there are pre-built binary -distributions available for your platform. Install them using `pip`:: - - pip install numpy - pip install scipy - -If that doesn't work or if you'd rather install using a binary package, consult http://www.scipy.org/Download. - -Install Gensim --------------- - -You can now install (or upgrade) Gensim with:: - - pip install --upgrade gensim - -That's it! Congratulations, you can proceed to the :doc:`tutorials `. +That's it! Congratulations, you can proceed to the :doc:`tutorials ` ----- -If you also want to run the algorithms over a cluster of computers, in :doc:`distributed`, you should install with:: +Code dependencies +----------------- - pip install 'gensim[distributed]' - -The optional ``distributed`` feature installs `Pyro (PYthon Remote Objects) `_. -If you don't know what distributed computing means, you can ignore it: Gensim will work fine for you anyway. 
- -This optional extension can also be installed separately later with:: - - pip install Pyro4 - ------ - -There are also alternative routes to install: - -1. If you have downloaded and unzipped the `tar.gz source `_ - for Gensim (or you're installing Gensim from `Github `_), - you can run:: - - pip install . - - to install Gensim into your ``site-packages`` folder. -2. If you wish to make local changes to the Gensim code, a preferred way may be installing with:: - - pip install --editable . - - This will only place a symlink into your ``site-packages`` directory. The actual - files will stay wherever you unpacked them, ready for editing. +Gensim runs on Linux, Windows and Mac OS X, and should run on any other +platform that supports Python 2.7+ and NumPy. Gensim depends on the following software: +* `Python `_ >= 2.7 (tested with versions 2.7, 3.5 and 3.6) +* `NumPy `_ >= 1.11.3 +* `SciPy `_ >= 0.18.1 +* `Six `_ >= 1.5.0 +* `smart_open `_ >= 1.2.1 Testing Gensim -------------- -Gensim uses continuous integration, automatically running a full test suite on each pull request: |Travis|_ +Gensim uses continuous integration, automatically running a full test suite on each pull request: +|Travis|_ |CircleCI|_ |AppVeyor|_ .. |Travis| image:: https://travis-ci.org/RaRe-Technologies/gensim.svg?branch=develop .. _Travis: https://travis-ci.org/RaRe-Technologies/gensim +.. |CircleCI| image:: https://circleci.com/gh/RaRe-Technologies/gensim/tree/develop.svg?style=shield +.. _CircleCI: https://circleci.com/gh/RaRe-Technologies/gensim + +.. |AppVeyor| image:: https://ci.appveyor.com/api/projects/status/r2au32ucpn8gr0tl/branch/develop?svg=true +.. _AppVeyor: https://ci.appveyor.com/api/projects/status/r2au32ucpn8gr0tl/branch/develop?svg=true + Problems? 
--------- From 005035087dc25e31cbc33c228e9f98bf3b12fdb6 Mon Sep 17 00:00:00 2001 From: ivan Date: Tue, 26 Jun 2018 08:59:23 +0500 Subject: [PATCH 08/14] fix distributed --- docs/src/distributed.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/src/distributed.rst b/docs/src/distributed.rst index eaefe27a5f..8b27fedc4f 100644 --- a/docs/src/distributed.rst +++ b/docs/src/distributed.rst @@ -37,10 +37,10 @@ Prerequisites For communication between nodes, `gensim` uses `Pyro (PYthon Remote Objects) `_, version >= 4.27. This is a library for low-level socket communication -and remote procedure calls (RPC) in Python. `Pyro` is a pure-Python library, so its +and remote procedure calls (RPC) in Python. `Pyro4` is a pure-Python library, so its installation is quite painless and only involves copying its `*.py` files somewhere onto your Python's import path:: - sudo easy_install Pyro4 + pip install Pyro4 You don't have to install Pyro to run Gensim, but if you don't, you won't be able to access the distributed features (i.e., everything will always run in serial mode, From 282133658ceaa7a14886205db983ffd4ff3d446a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Radim=20=C5=98eh=C5=AF=C5=99ek?= Date: Tue, 26 Jun 2018 07:29:28 +0200 Subject: [PATCH 09/14] minor doc fixes --- docs/src/install.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/src/install.rst b/docs/src/install.rst index 65cbd2b9c2..d4dc6545e2 100644 --- a/docs/src/install.rst +++ b/docs/src/install.rst @@ -15,9 +15,9 @@ or, alternatively for `conda` environments:: conda install -c conda-forge gensim -In case that fails, make sure you're installing into a writeable location (or use `sudo`), or keep reading. +That's it! Congratulations, you can proceed to the :doc:`tutorials `. -That's it! Congratulations, you can proceed to the :doc:`tutorials ` +In case that failed, make sure you're installing into a writeable location (or use `sudo`). 
----- @@ -36,8 +36,8 @@ platform that supports Python 2.7+ and NumPy. Gensim depends on the following so Testing Gensim -------------- -Gensim uses continuous integration, automatically running a full test suite on each pull request: -|Travis|_ |CircleCI|_ |AppVeyor|_ +Gensim uses continuous integration, automatically running a full test suite on each pull request with +Travis |Travis|_, CircleCI |CircleCI|_ and AppVeyor |AppVeyor|_. .. |Travis| image:: https://travis-ci.org/RaRe-Technologies/gensim.svg?branch=develop .. _Travis: https://travis-ci.org/RaRe-Technologies/gensim From cde078980accccfeb08576463d2983498d06b1d8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Radim=20=C5=98eh=C5=AF=C5=99ek?= Date: Tue, 26 Jun 2018 07:34:30 +0200 Subject: [PATCH 10/14] update twitter account --- docs/src/gensim_theme/layout.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/gensim_theme/layout.html b/docs/src/gensim_theme/layout.html index feac733791..4b4bd9fc43 100644 --- a/docs/src/gensim_theme/layout.html +++ b/docs/src/gensim_theme/layout.html @@ -174,7 +174,7 @@

Get Expert Help From The Gensim Authors

From 56ec4e9b99e84814ebbacd3af046c69cb310fdfd Mon Sep 17 00:00:00 2001 From: ivan Date: Tue, 26 Jun 2018 10:55:42 +0500 Subject: [PATCH 11/14] Add description for each CI service --- docs/src/install.rst | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/docs/src/install.rst b/docs/src/install.rst index d4dc6545e2..c4f236ad7c 100644 --- a/docs/src/install.rst +++ b/docs/src/install.rst @@ -37,7 +37,16 @@ Testing Gensim -------------- Gensim uses continuous integration, automatically running a full test suite on each pull request with -Travis |Travis|_, CircleCI |CircleCI|_ and AppVeyor |AppVeyor|_. + ++------------+-----------------------------------------------------------------------------------------+--------------+ +| CI service | Task | Build badge | ++============+=========================================================================================+==============+ +| Travis | Run tests on Linux and check `code-style `_ | |Travis|_ | ++------------+-----------------------------------------------------------------------------------------+--------------+ +| AppVeyor | Run tests on Windows | |AppVeyor|_ | ++------------+-----------------------------------------------------------------------------------------+--------------+ +| CicleCI | Build documentation | |CircleCI|_ | ++------------+-----------------------------------------------------------------------------------------+--------------+ .. |Travis| image:: https://travis-ci.org/RaRe-Technologies/gensim.svg?branch=develop .. 
_Travis: https://travis-ci.org/RaRe-Technologies/gensim From c7ecb82bea9df9d160d8b3864d6a4597e0985826 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Radim=20=C5=98eh=C5=AF=C5=99ek?= Date: Tue, 26 Jun 2018 08:28:12 +0200 Subject: [PATCH 12/14] expand glossary --- docs/src/install.rst | 2 +- docs/src/intro.rst | 93 ++++++++++++++++++++++++++++---------------- 2 files changed, 61 insertions(+), 34 deletions(-) diff --git a/docs/src/install.rst b/docs/src/install.rst index c4f236ad7c..2a7f3d0790 100644 --- a/docs/src/install.rst +++ b/docs/src/install.rst @@ -45,7 +45,7 @@ Gensim uses continuous integration, automatically running a full test suite on e +------------+-----------------------------------------------------------------------------------------+--------------+ | AppVeyor | Run tests on Windows | |AppVeyor|_ | +------------+-----------------------------------------------------------------------------------------+--------------+ -| CicleCI | Build documentation | |CircleCI|_ | +| CircleCI | Build documentation | |CircleCI|_ | +------------+-----------------------------------------------------------------------------------------+--------------+ .. |Travis| image:: https://travis-ci.org/RaRe-Technologies/gensim.svg?branch=develop diff --git a/docs/src/intro.rst b/docs/src/intro.rst index 0335226582..e0329fee73 100644 --- a/docs/src/intro.rst +++ b/docs/src/intro.rst @@ -7,7 +7,6 @@ Introduction Gensim is a :ref:`free ` Python library designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible. - Gensim is designed to process raw, unstructured digital texts ("*plain text*"). The algorithms in Gensim, such as :class:`~gensim.models.word2vec.Word2Vec`, :class:`~gensim.models.fasttext.FastText`, @@ -74,17 +73,26 @@ The whole Gensim package revolves around the concepts of :term:`corpus`, :term:` .. glossary:: Corpus - A collection of digital documents. 
This collection is used to automatically - infer the vector structure of the documents, their topics, etc. For - this reason, the collection is also called a *training corpus*. This inferred - latent structure can be later used to assign topics to new documents, which did - not appear in the *training corpus*. + A collection of digital documents. Corpora serve two roles in Gensim: + + 1. Input for model training + The corpus is used to automatically train a machine learning model, such as + :class:`~gensim.models.lsimodel.LsiModel` or :class:`~gensim.models.ldamodel.LdaModel`. + + The models use this *training corpus* to look for common themes and topics, initializing + their internal model parameters. + + Gensim is unique in its focus on *unsupervised* models so that no human intervention, + such as costly annotations or tagging documents by hand, is required. - This inferred latent structure can be later used to discovert topics for new documents, which did - not appear in the training corpus. No human intervention (such as annotating or tagging documents by hand, or creating other metadata) is required. + 2. Documents to organize. + After training, a topic model can be used to extract topics from new documents (documents + not seen in the training corpus). - Vector - In the Vector Space Model (VSM), each document is represented by an + Such corpora can be :doc:`indexed `, queried by semantic similarity, clustered etc. + + Vector space model + In a Vector Space Model (VSM), each document is represented by an array of features. For example, a single feature may be thought of as a question-answer pair: @@ -99,8 +107,9 @@ The whole Gensim package revolves around the concepts of :term:`corpus`, :term:` If we know all the questions in advance, we may leave them implicit and simply write ``(0.0, 2.0, 5.0)``. - This sequence of answers can be thought of as a *vector* (in this case a 3-dimensional vector). 
For practical purposes, only questions to which the answer is (or - can be converted to) a single real number are allowed. + This sequence of answers can be thought of as a **vector** (in this case a 3-dimensional dense vector). + For practical purposes, only questions to which the answer is (or + can be converted to) a *single floating point number* are allowed in Gensim. The questions are the same for each document, so that looking at two vectors (representing two documents), we will hopefully be able to make @@ -108,28 +117,46 @@ The whole Gensim package revolves around the concepts of :term:`corpus`, :term:` therefore the original documents must be similar, too". Of course, whether such conclusions correspond to reality depends on how well we picked our questions. - Sparse Vector - Typically, the answer to most questions will be ``0.0``. To save space, - we omit them from the document's representation, and write only ``(2, 2.0), - (3, 5.0)`` (note the missing ``(1, 0.0)``). - Since the set of all questions is known in advance, all the missing features - in a sparse representation of a document can be unambiguously resolved to zero, ``0.0``. - - Gensim does not prescribe any specific corpus format; - a corpus is anything that, when iterated over, successively yields these sparse vectors. - - For example, ``[ [(2, 2.0), (3, 5.0)], [(0, 1.0), (3, 1.0)] ]`` - is a simple corpus of two documents, each with two non-zero `feature-answer` pairs. - - Model - We use **model** as an abstract term referring to the code and associated data - required to transform one document representation to another. In Gensim, documents are - represented as vectors so a model can be thought of as a transformation - between two vector spaces. The parameters of this transformation are learned from the training corpus. 
Gensim - implements multiple models, such as :class:`~gensim.models.word2vec.Word2Vec`, + Gensim Sparse Vector, Bag-of-words Vector + To save space, in Gensim we omit all vector elements with value 0.0. For example, instead of the + 3-dimensional dense vector ``(0.0, 2.0, 5.0)``, we write only ``[(2, 2.0), (3, 5.0)]`` (note the missing ``(1, 0.0)``). Each vector element is a pair (2-tuple) of ``(feature_id, feature_value)``. The values of all missing features in this sparse representation can be unambiguously resolved to zero, ``0.0``. + + Documents in Gensim are represented by such sparse vectors (sometimes called bag-of-words vectors). + + Gensim streamed corpus + Gensim does not prescribe any specific corpus format. A corpus is simply a sequence + of sparse vectors (see above). + + For example, ``[ [(2, 2.0), (3, 5.0)], [(3, 1.0)] ]`` + is a simple corpus of two documents = two sparse vectors: the first with two non-zero elements, + the second with one non-zero element. This particular corpus is represented as a plain Python ``list``. + + However, the full power of Gensim comes from the fact that a corpus doesn't have to be a ``list``, + or a ``NumPy`` array, or a ``Pandas`` dataframe, or whatever. Gensim *accepts any object that, + when iterated over, successively yields these sparse bag-of-words vectors*. + + This flexibility allows you to create your own corpus classes that stream the sparse vectors directly from disk, network, database, dataframes, etc. The models in Gensim are implemented such that they don't require all vectors to reside in RAM at once. You can even create the sparse vectors on the fly! + + See our `tutorial on streamed data processing in Python `_. + + For a built-in example of an efficient corpus format streamed directly from disk, see + the Matrix Market format in :mod:`~gensim.corpora.mmcorpus`. For a minimal blueprint example on + how to create your own streamed corpora, check out the `source code of CSV corpus `_. 
+ + Model, Transformation + Gensim uses **model** to refer to the code and associated data (model parameters) + required to transform one document representation to another. + + In Gensim, documents are represented as vectors (see above) so a model can be thought of as a transformation + from one vector space to another. The parameters of this transformation are learned from the training corpus. + + Trained models (the data parameters) can be persisted to disk and later loaded back, either to continue + training on new training documents or to transform new documents. + + Gensim implements multiple models, such as :class:`~gensim.models.word2vec.Word2Vec`, :class:`~gensim.models.lsimodel.LsiModel`, :class:`~gensim.models.ldamodel.LdaModel`, - :class:`~gensim.models.fasttext.FastText` etc. + :class:`~gensim.models.fasttext.FastText` etc. See the :doc:`API reference ` for a full list. .. seealso:: - For some examples on how this works out in code, go to :doc:`Tutorials `. + For some examples on how all this works out in code, go to :doc:`Tutorials `. From 5162fe1a6d7b5b05660d460375a5561498c117ee Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Radim=20=C5=98eh=C5=AF=C5=99ek?= Date: Tue, 26 Jun 2018 08:29:55 +0200 Subject: [PATCH 13/14] drop caps --- docs/src/intro.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/intro.rst b/docs/src/intro.rst index e0329fee73..5b601b20b2 100644 --- a/docs/src/intro.rst +++ b/docs/src/intro.rst @@ -117,7 +117,7 @@ The whole Gensim package revolves around the concepts of :term:`corpus`, :term:` therefore the original documents must be similar, too". Of course, whether such conclusions correspond to reality depends on how well we picked our questions. - Gensim Sparse Vector, Bag-of-words Vector + Gensim sparse vector, Bag-of-words vector To save space, in Gensim we omit all vector elements with value 0.0. 
For example, instead of the 3-dimensional dense vector ``(0.0, 2.0, 5.0)``, we write only ``[(2, 2.0), (3, 5.0)]`` (note the missing ``(1, 0.0)``). Each vector element is a pair (2-tuple) of ``(feature_id, feature_value)``. The values of all missing features in this sparse representation can be unambiguously resolved to zero, ``0.0``. From 51c79c97addf41c50cf8366a1f09f36f560a1675 Mon Sep 17 00:00:00 2001 From: ivan Date: Tue, 26 Jun 2018 12:47:33 +0500 Subject: [PATCH 14/14] fix documentation building --- docs/src/install.rst | 2 +- docs/src/intro.rst | 2 -- 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/src/install.rst b/docs/src/install.rst index 2a7f3d0790..61039cb1d8 100644 --- a/docs/src/install.rst +++ b/docs/src/install.rst @@ -45,7 +45,7 @@ Gensim uses continuous integration, automatically running a full test suite on e +------------+-----------------------------------------------------------------------------------------+--------------+ | AppVeyor | Run tests on Windows | |AppVeyor|_ | +------------+-----------------------------------------------------------------------------------------+--------------+ -| CircleCI | Build documentation | |CircleCI|_ | +| CircleCI | Build documentation | |CircleCI|_ | +------------+-----------------------------------------------------------------------------------------+--------------+ .. |Travis| image:: https://travis-ci.org/RaRe-Technologies/gensim.svg?branch=develop diff --git a/docs/src/intro.rst b/docs/src/intro.rst index 5b601b20b2..bcb60efa27 100644 --- a/docs/src/intro.rst +++ b/docs/src/intro.rst @@ -68,8 +68,6 @@ or from the `Python Package Index `_. Core concepts ------------- -The whole Gensim package revolves around the concepts of :term:`corpus`, :term:`vector` and :term:`model`. - .. glossary:: Corpus