diff --git a/docs/src/_index.rst.unused b/docs/src/_index.rst.unused
new file mode 100644
index 0000000000..71390c1060
--- /dev/null
+++ b/docs/src/_index.rst.unused
@@ -0,0 +1,100 @@
+
+:github_url: https://github.com/RaRe-Technologies/gensim
+
+Gensim documentation
+===================================
+
+============
+Introduction
+============
+
+Gensim is a free Python library designed to automatically extract semantic
+topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.
+
+Gensim is designed to process raw, unstructured digital texts ("plain text").
+
+The algorithms in Gensim, such as **Word2Vec**, **FastText**, **Latent Semantic Analysis**, **Latent Dirichlet Allocation** and **Random Projections**, discover semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are **unsupervised**, which means no human input is necessary -- you only need a corpus of plain text documents.
+
+Once these statistical patterns are found, any plain text documents can be succinctly
+expressed in the new, semantic representation and queried for topical similarity
+against other documents, words or phrases.
+
+.. note::
+   If the previous paragraphs left you confused, you can read more about the `Vector
+   Space Model `_ and `unsupervised
+   document analysis `_ on Wikipedia.
+
+
+.. _design:
+
+Features
+--------
+
+* **Memory independence** -- there is no need for the whole training corpus to
+  reside fully in RAM at any one time (can process large, web-scale corpora).
+* **Memory sharing** -- trained models can be persisted to disk and loaded back via mmap. Multiple processes can share the same data, cutting down RAM footprint.
+* Efficient implementations for several popular vector space algorithms,
+  including Word2Vec, Doc2Vec, FastText, TF-IDF, Latent Semantic Analysis (LSI, LSA),
+  Latent Dirichlet Allocation (LDA) and Random Projections.
+* I/O wrappers and readers for several popular data formats.
+* Fast similarity queries for documents in their semantic representation.
+
+The **principal design objectives** behind Gensim are:
+
+1. Straightforward interfaces and a low API learning curve for developers. Good for prototyping.
+2. Memory independence with respect to the size of the input corpus; all intermediate
+   steps and algorithms operate in a streaming fashion, accessing one document
+   at a time.
+
+.. seealso::
+
+   We built a high performance server for NLP, document analysis, indexing, search and clustering: https://scaletext.ai.
+   ScaleText is a commercial product, available both on-prem and as SaaS.
+   Reach out at info@scaletext.com if you need an industry-grade tool with professional support.
+
+.. _availability:
+
+Availability
+------------
+
+Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license `_ and can be downloaded either from its `github repository `_ or from the `Python Package Index `_.
+
+.. seealso::
+
+   See the :doc:`install ` page for more info on Gensim deployment.
+
+
+.. toctree::
+   :glob:
+   :maxdepth: 1
+   :caption: Getting started
+
+   install
+   intro
+   support
+   about
+   license
+   citing
+
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Tutorials
+
+   tutorial
+   tut1
+   tut2
+   tut3
+
+
+.. toctree::
+   :maxdepth: 1
+   :caption: API Reference
+
+   apiref
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
diff --git a/docs/src/_license.rst.unused b/docs/src/_license.rst.unused
new file mode 100644
index 0000000000..d85983aa44
--- /dev/null
+++ b/docs/src/_license.rst.unused
@@ -0,0 +1,26 @@
+:orphan:
+
+.. _license:
+
+Licensing
+---------
+
+Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license `_.
+
+This means that it's free for both personal and commercial use, but if you make any
+modification to Gensim that you distribute to other people, you have to disclose
+the source code of these modifications.
+
+Apart from that, you are free to redistribute Gensim in any way you like, though you're
+not allowed to modify its license (doh!).
+
+My intent here is to **get more help and community involvement** with the development of Gensim.
+The legalese is therefore less important to me than your input and contributions.
+
+`Contact me `_ if LGPL doesn't fit your bill but you'd like the LGPL restrictions lifted.
+
+.. seealso::
+
+   We built a high performance server for NLP, document analysis, indexing, search and clustering: https://scaletext.ai.
+   ScaleText is a commercial product, available both on-prem and as SaaS.
+   Reach out at info@scaletext.com if you need an industry-grade tool with professional support.
diff --git a/gensim/models/doc2vec.py b/gensim/models/doc2vec.py
index bf1eac8264..ef29142230 100644
--- a/gensim/models/doc2vec.py
+++ b/gensim/models/doc2vec.py
@@ -20,13 +20,13 @@
 `_.
 
 **Make sure you have a C compiler before installing Gensim, to use the optimized doc2vec routines** (70x speedup
-compared to plain NumPy implementation `_).
+compared to plain NumPy implementation, https://rare-technologies.com/parallelizing-word2vec-in-python/).
 
-Examples
---------
+Usage examples
+==============
 
-Initialize & train a model
+Initialize & train a model:
 
 >>> from gensim.test.utils import common_texts
 >>> from gensim.models.doc2vec import Doc2Vec, TaggedDocument
@@ -34,7 +34,7 @@
 >>> documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
 >>> model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
 
-Persist a model to disk
+Persist a model to disk:
 
 >>> from gensim.test.utils import get_tmpfile
 >>>
@@ -43,11 +43,11 @@
 >>> model.save(fname)
 >>> model = Doc2Vec.load(fname)  # you can continue training with the loaded model!
 
-If you're finished training a model (=no more updates, only querying, reduce memory usage), you can do
+If you're finished training a model (i.e. no more updates, only querying, and you want to reduce memory usage), you can do:
 
 >>> model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
 
-Infer vector for new document
+Infer vector for a new document:
 
 >>> vector = model.infer_vector(["system", "response"])
diff --git a/gensim/models/fasttext.py b/gensim/models/fasttext.py
index 0b69ddcc8b..503c35ef7c 100644
--- a/gensim/models/fasttext.py
+++ b/gensim/models/fasttext.py
@@ -13,6 +13,7 @@
 This module contains a fast native C implementation of Fasttext with Python
 interfaces. It is **not** only a wrapper around Facebook's implementation.
+
 For a tutorial see `this notebook `_.
 
@@ -22,14 +23,14 @@
 Usage examples
 --------------
 
-Initialize and train a model
+Initialize and train a model:
 
 >>> from gensim.test.utils import common_texts
 >>> from gensim.models import FastText
 >>>
 >>> model = FastText(common_texts, size=4, window=3, min_count=1, iter=10)
 
-Persist a model to disk with
+Persist a model to disk with:
 
 >>> from gensim.test.utils import get_tmpfile
 >>>
@@ -38,7 +39,7 @@
 >>> model.save(fname)
 >>> model = FastText.load(fname)  # you can continue training with the loaded model!
 
-Retrieve word-vector for vocab and out-of-vocab word
+Retrieve a word vector for an in-vocabulary and an out-of-vocabulary word:
 
 >>> existent_word = "computer"
 >>> existent_word in model.wv.vocab
@@ -50,7 +51,7 @@
 False
 >>> oov_vec = model.wv[oov_word]  # numpy vector for OOV word
 
-You can perform various NLP word tasks with the model, some of them are already built-in
+You can perform various NLP word tasks with the model; some of them are already built-in:
 
 >>> similarities = model.wv.most_similar(positive=['computer', 'human'], negative=['interface'])
 >>> most_similar = similarities[0]
@@ -62,13 +63,13 @@
 >>>
 >>> sim_score = model.wv.similarity('computer', 'human')
 
-Correlation with human opinion on word similarity
+Correlation with human opinion on word similarity:
 
 >>> from gensim.test.utils import datapath
 >>>
 >>> similarities = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
 
-And on word analogies
+And on word analogies:
 
 >>> analogies_result = model.wv.accuracy(datapath('questions-words.txt'))
diff --git a/gensim/models/word2vec.py b/gensim/models/word2vec.py
index d163784c1c..d319de123a 100755
--- a/gensim/models/word2vec.py
+++ b/gensim/models/word2vec.py
@@ -27,12 +27,12 @@
 visit https://rare-technologies.com/word2vec-tutorial/.
 
 **Make sure you have a C compiler before installing Gensim, to use the optimized word2vec routines**
-(70x speedup compared to plain NumPy implementation, https://rare-technologies.com/parallelizing-word2vec-in-python/.
+(70x speedup compared to plain NumPy implementation, https://rare-technologies.com/parallelizing-word2vec-in-python/).
 
 Usage examples
 ==============
 
-Initialize a model with e.g.
+Initialize a model with e.g.:
 
 >>> from gensim.test.utils import common_texts, get_tmpfile
 >>> from gensim.models import Word2Vec
@@ -45,13 +45,13 @@
 The training is streamed, meaning `sentences` can be a generator, reading input data from
 disk on-the-fly, without loading the entire corpus into RAM.
 
-It also means you can continue training the model later
+It also means you can continue training the model later:
 
 >>> model = Word2Vec.load("word2vec.model")
 >>> model.train([["hello", "world"]], total_examples=1, epochs=1)
 (0, 2)
 
-The trained word vectors are stored in a :class:`~gensim.models.KeyedVectors` instance in `model.wv`:
+The trained word vectors are stored in a :class:`~gensim.models.keyedvectors.KeyedVectors` instance in `model.wv`:
 
 >>> vector = model.wv['computer']  # numpy vector of a word
@@ -68,7 +68,8 @@
 >>> wv = KeyedVectors.load("model.wv", mmap='r')
 >>> vector = wv['computer']  # numpy vector of a word
 
-Gensim can also load word vectors in the "word2vec C format", as this :class:`~gensim.models.KeyedVectors` instance::
+Gensim can also load word vectors in the "word2vec C format", as a
+:class:`~gensim.models.keyedvectors.KeyedVectors` instance::
 
 >>> from gensim.test.utils import datapath
 >>>
@@ -84,7 +85,7 @@
 are already built-in - you can see it in :mod:`gensim.models.keyedvectors`.
 
 If you're finished training a model (i.e. no more updates, only querying),
-you can switch to the :class:`~gensim.models.KeyedVectors` instance
+you can switch to the :class:`~gensim.models.keyedvectors.KeyedVectors` instance:
 
 >>> word_vectors = model.wv
 >>> del model
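A note on the streaming claim these docstrings repeat ("the training is streamed, meaning `sentences` can be a generator"): what actually matters for multi-epoch training is that the corpus is a *restartable* iterable, since each epoch makes a fresh pass. A minimal stdlib-only sketch of such a corpus, similar in spirit to gensim's `LineSentence` helper (the class name, file contents and whitespace tokenization here are illustrative, not part of this diff):

```python
import os
import tempfile


class LineSentenceCorpus:
    """Stream one tokenized sentence per line from disk, so the whole
    corpus never has to reside in RAM at once (memory independence)."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file on every __iter__ call makes the corpus
        # restartable, unlike a plain generator, which is exhausted
        # after a single pass.
        with open(self.path, encoding="utf8") as fh:
            for line in fh:
                tokens = line.strip().lower().split()
                if tokens:
                    yield tokens


# Demo on a throwaway two-sentence corpus.
path = os.path.join(tempfile.gettempdir(), "tiny_corpus.txt")
with open(path, "w", encoding="utf8") as fh:
    fh.write("Human machine interface\ngraph of trees\n")

corpus = LineSentenceCorpus(path)
first_pass = list(corpus)
second_pass = list(corpus)  # a second epoch sees the same data again
```

An object like this can be passed wherever the docstrings above accept `sentences`; a one-shot generator would only support a single training pass.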