Fix documentation for various modules #2096

Merged
merged 34 commits into from
Jun 22, 2018
Changes from 24 commits
6504f9d
doc fixes
piskvorky Jun 20, 2018
329c9de
doc fixes to matutils
piskvorky Jun 20, 2018
0214fc3
docsim doc fixes
piskvorky Jun 20, 2018
a68dedf
doc fixes to interfaces module
piskvorky Jun 20, 2018
6dcbcaf
doc fixes to Dictionary
piskvorky Jun 20, 2018
c0c4f90
doc fixes to MatrixMarket classes
piskvorky Jun 20, 2018
0780f3c
doc fixes to WikiCorpus
piskvorky Jun 20, 2018
0e94769
Merge branch 'develop' into docstring_fixes
piskvorky Jun 20, 2018
bc68562
minor code style changes in HashDictionary
piskvorky Jun 20, 2018
2737a53
fixing TfidfModel bugs + docs
piskvorky Jun 20, 2018
25bc583
fixes to phrases docs
piskvorky Jun 20, 2018
1e886de
fix PEP8
menshikh-iv Jun 20, 2018
4a2db73
fix documentation building
menshikh-iv Jun 20, 2018
5fcdf2c
cleanup mmcorpus-related
menshikh-iv Jun 20, 2018
d8d055a
cleanup dictionary
menshikh-iv Jun 20, 2018
d0e8417
cleanup hashdictionary
menshikh-iv Jun 20, 2018
489b4cf
cleanup wikicorpus
menshikh-iv Jun 20, 2018
b8c3f4b
cleanup interfaces
menshikh-iv Jun 20, 2018
a2c5ff3
cleanup matutils
menshikh-iv Jun 20, 2018
034206e
rename smartirs signature
piskvorky Jun 20, 2018
cc0fa64
Merge branch 'docstring_fixes' of https://github.com/RaRe-Technologie…
menshikh-iv Jun 20, 2018
3646c90
minor docs style fixes
piskvorky Jun 20, 2018
31849c1
Merge branch 'docstring_fixes' of https://github.com/RaRe-Technologie…
menshikh-iv Jun 20, 2018
2040093
regenerate *.c for mmreader (after last Radim fix)
menshikh-iv Jun 20, 2018
aa27a5f
fix bool parameters
menshikh-iv Jun 21, 2018
191af45
regenerate _mmreader.c again
menshikh-iv Jun 21, 2018
1686544
cleanup phrases
menshikh-iv Jun 21, 2018
64580f3
cleanup utils
menshikh-iv Jun 21, 2018
86c0190
Fix paper for phrases according to #2098, catch by @davidchall
menshikh-iv Jun 21, 2018
b1353bf
cleanup docsim
menshikh-iv Jun 21, 2018
3bb51a2
- cleanup tfidfmodel
menshikh-iv Jun 21, 2018
27b0e66
typo fix
piskvorky Jun 21, 2018
e2c72fa
add back smartirs tests
piskvorky Jun 21, 2018
a569b76
retrying saved test files
piskvorky Jun 21, 2018
1,292 changes: 744 additions & 548 deletions gensim/corpora/_mmreader.c

Large diffs are not rendered by default.

30 changes: 15 additions & 15 deletions gensim/corpora/_mmreader.pyx
@@ -17,7 +17,7 @@ logger = logging.getLogger(__name__)


cdef class MmReader(object):
- """Matrix market file reader (fast Cython version), used for :class:`~gensim.corpora.mmcorpus.MmCorpus`.
+ """Matrix market file reader (fast Cython version), used internally in :class:`~gensim.corpora.mmcorpus.MmCorpus`.

Wrap a term-document matrix on disk (in matrix-market format), and present it
as an object which supports iteration over the rows (~documents).
@@ -32,10 +32,10 @@ cdef class MmReader(object):
Number of non-zero terms.

Notes
- ----------
- Note that the file is read into memory one document at a time, not the whole
- matrix at once (unlike scipy.io.mmread). This allows us to process corpora
- which are larger than the available RAM.
+ -----
+ Note that the file is read into memory one document at a time, not the whole matrix at once
+ (unlike e.g. `scipy.io.mmread` and other implementations).
+ This allows us to process corpora which are larger than the available RAM.

"""
cdef public input
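Review note: the streaming behaviour described in the Notes section above can be illustrated with a small pure-Python sketch. This is not the Cython code from this PR; `iter_entries` and `SAMPLE` are hypothetical names, and the sketch assumes a coordinate-format file whose data lines read `doc_id term_id value` with 1-based ids.

```python
import io

# Hypothetical sample: 3 documents, 4 terms, 4 non-zero entries.
SAMPLE = (
    "%%MatrixMarket matrix coordinate real general\n"
    "3 4 4\n"
    "1 1 0.5\n"
    "1 3 2.0\n"
    "3 2 1.0\n"
    "3 4 1.5\n"
)


def iter_entries(fin):
    """Lazily yield (doc_id, term_id, value) triples, one input line at a time."""
    header_seen = False
    for line in fin:
        line = line.strip()
        if not line or line.startswith('%'):
            continue  # skip blank lines, the banner and comments
        if not header_seen:
            header_seen = True  # first data line is "num_docs num_terms num_nnz"
            continue
        doc_id, term_id, value = line.split()
        # Matrix Market ids are 1-based; convert to 0-based here.
        yield int(doc_id) - 1, int(term_id) - 1, float(value)


entries = list(iter_entries(io.StringIO(SAMPLE)))
```

Because `iter_entries` is a generator, only the current line is held in memory while iterating; the `list(...)` call is there only to make the result easy to inspect.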
@@ -48,11 +48,11 @@ cdef class MmReader(object):
Parameters
----------
input : {str, file-like object}
- Path to input file in MM format or a file-like object that supports `seek()`
- (e.g. :class:`~gzip.GzipFile`, :class:`~bz2.BZ2File`).
+ Path to the input file in MM format or a file-like object that supports `seek()`
+ (e.g. smart_open objects).

transposed : bool, optional
- if True, expects lines to represent doc_id, term_id, value. Else, expects term_id, doc_id, value.
+ Do lines represent `doc_id, term_id, value`, instead of `term_id, doc_id, value`?

"""
logger.info("initializing cython corpus reader from %s", input)
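Review note: a minimal sketch of what the `transposed` flag means for a single data line, as I read the docstring above. `parse_line` is a hypothetical helper, not part of `MmReader`, and the conversion to 0-based ids is an assumption of this sketch.

```python
def parse_line(line, transposed=True):
    """Split one data line into (doc_id, term_id, value), 0-based ids."""
    first, second, value = line.split()
    if transposed:
        # Line layout: doc_id, term_id, value
        doc_id, term_id = int(first), int(second)
    else:
        # Line layout: term_id, doc_id, value
        term_id, doc_id = int(first), int(second)
    return doc_id - 1, term_id - 1, float(value)


as_doc_first = parse_line("2 5 1.5")
as_term_first = parse_line("2 5 1.5", transposed=False)
```

The same line yields document 1 / term 4 with `transposed=True`, and document 4 / term 1 with `transposed=False`.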
@@ -83,7 +83,7 @@ cdef class MmReader(object):
)

def __len__(self):
- """Get size of corpus (number of documents)."""
+ """Get the corpus size: total number of documents."""
return self.num_docs

def __str__(self):
@@ -105,18 +105,18 @@ cdef class MmReader(object):
break

def __iter__(self):
- """Iterate through corpus.
+ """Iterate through all documents in the corpus.

Notes
------
Note that the total number of vectors returned is always equal to the number of rows specified
- in the header, empty documents are inserted and yielded where appropriate, even if they are not explicitly
+ in the header: empty documents are inserted and yielded where appropriate, even if they are not explicitly
stored in the Matrix Market file.

Yields
------
(int, list of (int, number))
- Document id and Document in BoW format
+ Document id and document in sparse bag-of-words format.

"""
cdef long long docid, termid, previd
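Review note: the `__iter__` Notes say that the number of yielded vectors always matches the row count in the header, with empty documents filled in. A rough pure-Python sketch of that behaviour follows. `iter_docs` and `SAMPLE` are hypothetical, and the sketch assumes `doc_id term_id value` lines sorted by document id; it is not the Cython implementation.

```python
import io

# Hypothetical sample: the header declares 5 documents,
# but only documents 2 and 4 (1-based) have stored entries.
SAMPLE = (
    "%%MatrixMarket matrix coordinate real general\n"
    "5 3 3\n"
    "2 1 1.0\n"
    "2 3 2.0\n"
    "4 2 0.5\n"
)


def iter_docs(fin):
    """Yield (doc_id, bow) for every row declared in the header, empty rows included."""
    num_docs = None
    prev_id, doc = -1, []
    for line in fin:
        line = line.strip()
        if not line or line.startswith('%'):
            continue
        if num_docs is None:
            num_docs = int(line.split()[0])  # header: num_docs num_terms num_nnz
            continue
        doc_id, term_id, value = line.split()
        doc_id, term_id, value = int(doc_id) - 1, int(term_id) - 1, float(value)
        if doc_id != prev_id:
            if prev_id >= 0:
                yield prev_id, doc  # previous document is complete
            for missing in range(prev_id + 1, doc_id):
                yield missing, []  # ids skipped in the file are empty documents
            prev_id, doc = doc_id, []
        doc.append((term_id, value))
    if prev_id >= 0:
        yield prev_id, doc
    for missing in range(prev_id + 1, num_docs or 0):
        yield missing, []  # trailing empty documents


docs = list(iter_docs(io.StringIO(SAMPLE)))
```

Only one document is buffered at a time, which is consistent with the memory claim in the class docstring.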
@@ -165,17 +165,17 @@ cdef class MmReader(object):
yield previd, []

def docbyoffset(self, offset):
- """Get document at file offset `offset` (in bytes).
+ """Get the document at file offset `offset` (in bytes).

Parameters
----------
offset : int
- Offset, in bytes, of desired document.
+ File offset, in bytes, of the desired document.

Returns
------
list of (int, str)
- Document in BoW format.
+ Document in sparse bag-of-words format.

"""
# empty documents are not stored explicitly in MM format, so the index marks
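Review note: the idea behind `docbyoffset` (seek to a byte offset, then read entries until the document id changes) can be sketched in plain Python as below. `doc_by_offset` and `DATA` are hypothetical, the sketch assumes `doc_id term_id value` lines, and it deliberately leaves out how empty documents are marked in the index, since the comment in the hunk above is cut off.

```python
import io

# Hypothetical sample held in memory as bytes, so that offsets are byte offsets.
DATA = (
    b"%%MatrixMarket matrix coordinate real general\n"
    b"2 3 3\n"
    b"1 1 1.0\n"
    b"1 3 2.0\n"
    b"2 2 0.5\n"
)


def doc_by_offset(fin, offset):
    """Read the single document whose first entry starts at byte `offset`."""
    fin.seek(offset)  # requires a seekable file-like object
    doc, first_id = [], None
    for raw in fin:
        if not raw.strip():
            continue
        doc_id, term_id, value = raw.decode('ascii').split()
        if first_id is None:
            first_id = doc_id
        elif doc_id != first_id:
            break  # the next document starts here
        doc.append((int(term_id) - 1, float(value)))
    return doc


first_doc = doc_by_offset(io.BytesIO(DATA), DATA.index(b"1 1 1.0"))
second_doc = doc_by_offset(io.BytesIO(DATA), DATA.index(b"2 2 0.5"))
```

This is also why the `input` parameter asks for a file-like object that supports `seek()`: random access by offset is not possible on a plain forward-only stream.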