This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add mxnet.text APIs #8763

Merged
merged 65 commits into apache:master
Jan 11, 2018

Conversation

astonzhang
Member

@astonzhang astonzhang commented Nov 22, 2017

Description

Add mxnet.text APIs (new features). This is intended to be used in natural language processing applications.

  • Text processing utilities.

    • count_tokens_from_str
  • Text indexer class

    • Build indices for the unknown token, reserved tokens, and input counter keys. Indexed tokens can be used by instances of mxnet.text.embeddings.TextEmbedding, such as instances of mxnet.text.glossary.Glossary.
  • Text pre-trained embedding class.

    • This is the text embedding base class. To load text embeddings from an externally hosted pre-trained text embedding file, such as those of GloVe and FastText, use TextEmbedding.create(embedding_name, pretrained_file_name). To get all the available embedding_name and pretrained_file_name, use TextEmbedding.get_embedding_and_pretrained_file_names().
    • Alternatively, to load embedding vectors from a custom pre-trained text embedding file, use mxnet.text.embeddings.CustomEmbedding.
    • For the same token, its index and embedding vector may vary across different instances of mxnet.text.embedding.TextEmbedding.
  • GloVe pre-trained text embedding

    • GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. (Source from https://nlp.stanford.edu/projects/glove/)
  • The fastText pre-trained text embedding

    • FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. (Source from https://fasttext.cc/)
  • Custom pre-trained text embedding

    • This is to load embedding vectors from a user-defined pre-trained text embedding file.
  • Text glossary class.

    • This provides indexing and embedding for text and special tokens in a glossary. For each indexed token in a glossary, an embedding vector is associated with it. Such embedding vectors can be loaded from externally hosted or custom pre-trained text embedding files, for example via instances of mxnet.text.embedding.TextEmbedding.
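
The indexing contract described above (unknown token at index 0, then reserved tokens, then counter keys by descending frequency, with out-of-vocabulary tokens mapped to the unknown index) can be sketched in plain Python. The class and method names here are illustrative, not the actual mxnet.text API:

```python
from collections import Counter


class TokenIndexer:
    """Hypothetical sketch of the token-indexing contract; not the
    actual mxnet.text implementation."""

    def __init__(self, counter, unknown_token='<unk>', reserved_tokens=None):
        # Index 0 is the unknown token, then reserved tokens, then the
        # counter keys ordered by descending frequency (ties: alphabetical).
        self.idx_to_token = [unknown_token] + list(reserved_tokens or [])
        for token, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
            if token not in self.idx_to_token:
                self.idx_to_token.append(token)
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}

    def to_indices(self, tokens):
        # Out-of-vocabulary tokens map to the unknown index (0).
        return [self.token_to_idx.get(t, 0) for t in tokens]


vocab = TokenIndexer(Counter({'hello': 2, 'world': 1}))
```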

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • For user-facing API changes, API doc string has been updated. For new C++ functions in header files, their functionalities and arguments are well-documented.
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

  • Text utils, tests, and API doc
  • Text indexer, tests, and API doc
  • Text embedding, tests, and API doc
  • FastText embedding, tests, and API doc
  • Glove embedding, tests, and API doc
  • Custom embedding, tests, and API doc
  • Text glossary, tests, and API doc
  • get_registry(base_class) in mxnet.registry and API doc. Tested in the test cases of Text embeddings.

Comments

  • New feature: this is the first version of mxnet.text APIs.
  • Extend use cases in mxnet.registry

"""Text utilities."""

from . import text
from .text import *
Member

utils?

Member Author

I just followed images/images. Which one is better?

Member

I think these are utility functions, so a name space like mx.text.utils seems to be a good fit.

Member Author

Completely agree. Maybe images.images needs to change to images.utils.



def count_tokens_from_str(tokens, token_delim=" ", seq_delim="\n",
                          to_lower=False):
Member

Consider adding the counter as an optional argument, with the default value being an empty counter. This way, the same function can be used either to create a new counter or to update an existing one, which removes the assumption of having to store a whole corpus in memory.
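
One way the suggested signature could look (a sketch, not the merged implementation; the splitting logic here is simplified):

```python
from collections import Counter


def count_tokens_from_str(source_str, token_delim=' ', seq_delim='\n',
                          to_lower=False, counter=None):
    # Pass an existing counter to update it in place, so a large corpus
    # can be counted chunk by chunk instead of being held in memory.
    if counter is None:
        counter = Counter()
    tokens = source_str.replace(seq_delim, token_delim).split(token_delim)
    counter.update(t.lower() if to_lower else t for t in tokens if t)
    return counter


c = count_tokens_from_str('a b b\nc')
c = count_tokens_from_str('b c', counter=c)  # second chunk, same counter
```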

Member Author

resolved.

@piiswrong
Contributor

These look too ad hoc to be put into the repo.

I'm considering making a text preprocessing package similar to parts of NLTK and PyTorch's text package.

@szha
Member

szha commented Nov 22, 2017

Aston is already doing it.

seqs = _get_test_str_of_tokens(token_delim, seq_delim)

with open(os.path.join(path, '1.txt'), 'w') as fout:
    fout.write(seqs)
Contributor

Please try to mock all file operations in unit tests. See https://docs.python.org/3/library/unittest.mock.html#mock-open.
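
The suggested pattern with mock_open, sketched on a hypothetical helper (no real file is created or read):

```python
from unittest import mock


def load_corpus(path):
    # File I/O helper whose open() call the test mocks out below.
    with open(path) as f:
        return f.read()


m = mock.mock_open(read_data='a b b\nc\n')
with mock.patch('builtins.open', m):
    data = load_corpus('1.txt')  # no real file is touched

m.assert_called_once_with('1.txt')
```

(`builtins.open` is the Python 3 target; under Python 2, which this PR still supported, the target would be `__builtin__.open`.)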



from mxnet.test_utils import *
from mxnet.text import utils as tu
from mxnet.text import glossary as glos
Contributor

Are there some missing tests for the Glossary class?

Member Author

Yes, work in progress.

@szha szha self-assigned this Dec 7, 2017
@marcoabreu
Contributor

Hello @astonzhang, please rebase your PR.

from . import utils
from .utils import *
from . import glossary
from .glossary import *
Member

Remove from .glossary import * unless the classes are intended to exist in mx.text namespace.

Member Author

Thanks. I probably will import it, because mxnet.text.Embedding and mxnet.text.Glossary look fine and are consistently used in the documentation.

Member

If so, then I think the utils and glossary namespaces are unnecessary. There’s no point in keeping both.

Member Author

resolved.

@marcoabreu
Contributor

@astonzhang Again, please rebase your PR as it creates invalid CI requests

@astonzhang
Member Author

@marcoabreu resolved.

top_k_freq : None or int, default None
The number of top frequent tokens in the keys of `counter` that will be
indexed. If None, all the tokens in the keys of `counter` will be
indexed.
Member

What’s the behavior when counter size is smaller than k?

Member Author

resolved.

----------
counter : collections.Counter
Counts text and special token frequencies in the text data, where
special token frequencies are cleared to zero.
Member

Given that this came from the user, it’s probably not necessary to expose the counter as a property. Otherwise we need to define how this property should change based on topk or other constructor arguments, and define what mutating this property means.

Member Author

resolved.

assert self.idx_to_vec is not None, \
    'mxnet.text.Glossary._idx_to_vec has not been initialized. Use ' \
    'mxnet.text.Glossary.__init__() or ' \
    'mxnet.text.Glossary.set_idx_to_embed() to initialize it.'
Member

The assertion message includes internal implementation details. Consider removing such reference and change to something like “Glossary has not been initialized. Do X...”

Member Author

Resolved.

'token, please specify it explicitly as the '
'unknown special token %s in tokens. This is '
'to avoid unintended updates.' %
(token, self.idx_to_token[Glossary.unk_idx()]))
Member

@szha commented Dec 9, 2017

It’s common to use one set (i.e. training set) for generating the vocabulary and reuse the same vocabulary on another set for indexing. Returning the index for unknown and warning the user would likely make this interface easier to use.
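
The fallback behavior suggested here could look like the following sketch (names and the unknown-index position are illustrative assumptions, not the merged API):

```python
import warnings

UNK_IDX = 0  # assumed position of the unknown token


def to_indices(tokens, token_to_idx):
    # Return the unknown index for out-of-vocabulary tokens and warn,
    # instead of raising, so a vocabulary built on a training set can
    # be reused for indexing another set.
    indices = []
    for token in tokens:
        if token not in token_to_idx:
            warnings.warn('token %r is out of vocabulary' % (token,))
        indices.append(token_to_idx.get(token, UNK_IDX))
    return indices
```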

Member Author

Thanks. Since update_idx_to_vec is a member function of a specific Glossary instance, it is fair to assume that the input tokens match the token indices of this specific Glossary instance.

Member

OK

self._idx_to_vec[nd.array(indices)] = new_vectors


class Embedding(object):
Member

Given that embedding is already the name of layer in gluon, using the same name could result in unintended name collision, especially when doing wild-card imports. Consider using another name.

Member Author

Resolved.

# data format is changed.
assert check_sha1(download_file_path, expected_download_hash), \
    'The downloaded file %s does not match its expected SHA-1 ' \
    'hash. This is caused by the changes at the externally ' \
Member

“This is caused” -> “This is likely caused” since failed/partial/interrupted download can also cause this.

Member Author

Thanks. The immediately previous if-block makes sure that it can only be caused by external changes.

vector for every special token, such as an unknown token and a padding
token.
"""
with open(pretrain_file_path, 'r', encoding='utf8') as f:
Member

Make sure code is tested on python2

Member Author

Thanks. Will test on Py2 when adding test cases.

vec_len = None
all_elems = []
idx_to_token = []
for line in tqdm(lines, total=len(lines)):
Member

Let’s avoid adding dependencies such as tqdm.

Member Author

Thanks. Since https://github.com/apache/incubator-mxnet/blob/master/example/gluon/tree_lstm/dataset.py uses tqdm, I assumed that the tqdm dependency was already accepted. Let me know if you still prefer removing it.

Member Author

Resolved. (tqdm is removed)

for i in elems[1:]]

if len(elems) == 1:
    logging.warning('WARNING: Token %s with 1-dimensional vector '
Member

warnings.warn

Member Author

resolved.

@astonzhang astonzhang changed the title from "[WIP] Add text apis" to "Add mxnet.text APIs" Dec 11, 2017
@piiswrong
Contributor

can we move this into gluon.data and make use of Dataset?

@astonzhang
Member Author

Thanks. Embedding vectors are a little different from text data (training corpus data). This package is meant to host all text-related utilities, such as the indexer and embeddings.

if reserved_tokens is not None:
    for reserved_token in reserved_tokens:
        assert reserved_token != unknown_token, \
            '`reserved_token` cannot contain `unknown_token`.'
Member

assert unknown_token not in reserved_tokens.

Member Author

resolved.

    self._reserved_tokens = None
else:
    # Python 2 does not support list.copy().
    self._reserved_tokens = reserved_tokens[:]
Member

remove comment. comment should be about what's in the code instead of what's absent.

Member Author

resolved.

`counter` that can be indexed. Note that this argument does not count
any token from `reserved_tokens`. If this argument is None or larger
than its largest possible value restricted by `counter` and
`reserved_tokens`, this argument becomes positive infinity.
Member

this argument becomes positive infinity -> it has no effect.

Member

So the intention of having this argument is to put a maximum size limit on the index. This would incur some complexity, especially when there is a tie in the counter. For example, suppose you want to limit it to 3 in the case where reserved_tokens = []; counter = {'a': 5, 'b': 5, 'c': 3, 'd': 3}, you would need to further clarify whether 'c' or 'd' is kept. The secondary alphabetic ordering should be documented.
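
A deterministic tie-breaking rule for the reviewer's example (a sketch of one possible ordering: descending frequency, then alphabetical; the function name is illustrative):

```python
from collections import Counter


def most_freq_tokens(counter, top_k):
    # Sort by descending frequency, breaking ties alphabetically so the
    # cutoff is deterministic and can be documented.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ordered[:top_k]]


# With top_k=3, 'c' beats 'd' on the alphabetical tie-break.
picked = most_freq_tokens(Counter({'a': 5, 'b': 5, 'c': 3, 'd': 3}), 3)
```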

Member Author

resolved.

    # 1 is the unknown token count.
    token_cap = 1 + len(reserved_tokens) + len(counter)
else:
    token_cap = 1 + len(reserved_tokens) + most_freq_count
Member

token_cap = 1 + len(reserved_tokens) + (most_freq_count if most_freq_count else len(counter))

Member Author

resolved.

token_cap = 1 + len(reserved_tokens) + most_freq_count

for token, freq in token_freqs:
    if freq < min_freq or len(self._idx_to_token) == token_cap:
Member

You can use for i in range(token_cap) in the loop to avoid evaluating the second condition token_cap times.

Member Author

Thanks. Because of the `if token not in reserved_tokens:` condition, the length of self._idx_to_token does not always increase by 1 per iteration, so we probably cannot use for i in range(token_cap) here.

    'The length of new_vectors must be equal to the number of tokens.'
assert new_vectors.shape[1] == self.vec_len, \
    'The width of new_vectors must be equal to the dimension of ' \
    'embeddings of the glossary.'
Member

assert new_vectors.shape == (len(tokens), self.vec_len)

Member Author

resolved

else:
    raise ValueError('Token %s is unknown. To update the embedding '
                     'vector for an unknown token, please specify '
                     'it explicitly as the `unknown_token` %s in '
Member

How can a user add a new token in embedding? Should there be a separate method for that?

Member Author

Thanks. The Embedding class always loads from pre-trained files. Users can add/set new tokens via the glossary rather than via the embedding.

Member Author

from mxnet import ndarray as nd

y = nd.array([[ 0.,  1.,  2.,  3.,  4.],
              [ 5.,  6.,  7.,  8.,  9.],
              [10., 11., 12., 13., 14.],
              [15., 16., 17., 18., 19.]])

x = nd.array([1, 3])

%timeit y[x]
%timeit nd.Embedding(x, y, 4, y.shape[1])

196 µs ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
83.5 µs ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Contributor

y[x] is advanced indexing which is usually slower than other indexing operations due to two reasons.

  1. Overhead of sanity checking and preprocessing advanced indices before calling backend ops.
  2. In this case, the backend op used is gather_nd, which is designed for retrieving scattered elements from an ndarray. For a regular index like [1, 3] on the first dimension, operators such as take or slice are much more efficient than gather_nd.
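
The equivalence of the two access patterns can be sketched with NumPy, used here only as a stand-in for the backend behavior described above (this does not reproduce the MXNet timings):

```python
import numpy as np

y = np.arange(20, dtype=np.float32).reshape(4, 5)
x = np.array([1, 3])

rows_fancy = y[x]                  # advanced (fancy) indexing
rows_take = np.take(y, x, axis=0)  # explicit take on the first axis

# Both return rows 1 and 3; a backend can serve this regular pattern
# with a cheap take/slice instead of a general gather.
assert np.array_equal(rows_fancy, rows_take)
```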

self._idx_to_vec[nd.array(indices)] = new_vectors

@staticmethod
def register(embed_cls):
Member

Looks like this is following what was done in optimizer module. I think both embedding and optimizer should reuse mx.registry here instead of creating a new one. See example in mx.initializer
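
The register/create pattern under discussion boils down to something like this minimal, self-contained sketch (names are illustrative; mx.registry generalizes the same idea over a base class):

```python
_EMBEDDING_REGISTRY = {}  # hypothetical module-level registry


def register(embed_cls):
    # Decorator-style registration keyed by lowercased class name,
    # so create() can look the class up later.
    _EMBEDDING_REGISTRY[embed_cls.__name__.lower()] = embed_cls
    return embed_cls


def create(embed_name, **kwargs):
    try:
        embed_cls = _EMBEDDING_REGISTRY[embed_name.lower()]
    except KeyError:
        raise ValueError('Cannot find embedding %s. Valid names: %s'
                         % (embed_name, ', '.join(_EMBEDDING_REGISTRY)))
    return embed_cls(**kwargs)


@register
class GloVe:
    def __init__(self, dim=50):
        self.dim = dim
```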

Member Author

resolved

return embed_cls

@staticmethod
def create(embed_name, **kwargs):
Member

Looks like this is following what was done in optimizer module. I think both embedding and optimizer should reuse mx.registry here instead of creating a new one. See example in mx.initializer

Member Author

resolved

', '.join(embed_cls.pretrain_file_sha1.keys())))

@staticmethod
def get_embed_names_and_pretrain_files():
Member

list_pretrained_embeddings?

Member Author

Thanks. It returns a string rather than printing the string, so I guess "get_" is better than "list_".

@szha szha merged commit 6c1f4f7 into apache:master Jan 11, 2018
@szha
Member

szha commented Jan 11, 2018

@astonzhang great contribution. This should help greatly reduce the commonly repeated effort of text indexing and embedding. Thanks for going through these many iterations and for continuing to push for higher quality.

larroy pushed a commit to larroy/mxnet that referenced this pull request Jan 11, 2018
* Add text utils

* Leftovers

* revise

* before load embeddings

* glossary done

* Add/revise text utils, revise test cases

* Add docstrings

* clean package init

* remove play

* Resolve issues and complete docstrings

* disable pylint

* Remove tqdm dependency

* Add encoding

utf8

utf

utf

utf

* remove non-ascii

* fix textcase

* remove decode in glossary

* py2 unicode

* Fix py2 error

* add tests

* Test all embds

* test some embeds

* Add getter for glossary

* remove util from path, revise interfaces of glossary

* skip some test, before major revise

* Add TextIndexer, only TextEmbed needs revised before major revise

* before major revise

* minor update

* Revise TextIndexer with test

* lint

* lint

* Revise TextEmbed, FastText, Glove, CustmonEmbed with test

* Revision done except for docstr

* Add unit tests for utils

* almost no pylint disable, yeah

* doc minor updates

* re-run

* re-run

* except for register

* except for register

* Revise register/create, add get_registry

* revise

* More readability

* py2 compatibility

* Update doc

* Revise based on feedbacks from NLP team

* add init

* Support indexing for any hashable and comparable token

* Add test cases

* remove type cmp

* Fix doc error and add API descriptions

* Fix api doc error

* add members explicitly

* re-order modules in text.md

* url in one line

* add property desc for all inherited classes for rst parsing

* escape \n

* update glossary example

* escape \n

* add use case

* Make doc more user-friendly

* proper imports, gluon.nn.Embedding use case

* fix links

* re-org link level

* tokens_to_indices

* to_indices, to_tokens
@astonzhang
Member Author

@szha Thank you very much for repeatedly going through this PR with me!

@astonzhang astonzhang deleted the text branch January 12, 2018 15:39
piiswrong added a commit to piiswrong/mxnet that referenced this pull request Jan 12, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jan 12, 2018
szha pushed a commit that referenced this pull request Jan 12, 2018
@szha szha mentioned this pull request Jan 12, 2018
szha added a commit to szha/mxnet that referenced this pull request Jan 12, 2018
szha added a commit to szha/mxnet that referenced this pull request Jan 12, 2018
szha added a commit to szha/mxnet that referenced this pull request Jan 12, 2018
szha added a commit to szha/mxnet that referenced this pull request Jan 12, 2018
szha added a commit to szha/mxnet that referenced this pull request Jan 12, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jan 12, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jan 12, 2018
piiswrong pushed a commit that referenced this pull request Jan 13, 2018
CodingCat pushed a commit to CodingCat/mxnet that referenced this pull request Jan 16, 2018
CodingCat pushed a commit to CodingCat/mxnet that referenced this pull request Jan 16, 2018
yuxiangw pushed a commit to yuxiangw/incubator-mxnet that referenced this pull request Jan 25, 2018
yuxiangw pushed a commit to yuxiangw/incubator-mxnet that referenced this pull request Jan 25, 2018
yuxiangw pushed a commit to yuxiangw/incubator-mxnet that referenced this pull request Jan 25, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
6 participants