Conversation
python/mxnet/text/__init__.py
Outdated
"""Text utilities.""" | ||
|
||
from . import text | ||
from .text import * |
utils?
I just followed the images/images convention. Which one is better?
I think these are utility functions, so a namespace like mx.text.utils seems to be a good fit.
Completely agree. Maybe images.images needs to change to images.utils
python/mxnet/text/text.py
Outdated
def count_tokens_from_str(tokens, token_delim=" ", seq_delim="\n",
                          to_lower=False):
consider adding the counter as an optional argument, with the default value being an empty counter. this way, the same function can be used to either create new counter or update existing counter, which effectively removes the assumption of having to store a whole corpus in memory.
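A minimal sketch of that suggestion (the parameter names and tokenization details here are assumptions, not the PR's final code):

from collections import Counter

def count_tokens_from_str(source_str, token_delim=' ', seq_delim='\n',
                          to_lower=False, counter=None):
    """Counts tokens, updating `counter` in place if one is given."""
    if counter is None:
        counter = Counter()
    if to_lower:
        source_str = source_str.lower()
    # Sequence delimiters also separate tokens, so normalize them first.
    tokens = source_str.replace(seq_delim, token_delim).split(token_delim)
    counter.update(token for token in tokens if token)
    return counter

With this, counter = count_tokens_from_str(chunk) starts a new counter and count_tokens_from_str(next_chunk, counter=counter) keeps updating it, so the whole corpus never has to sit in memory at once.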
resolved.
These look too ad hoc to be put into the repo. I'm considering making a text preprocessing package similar to parts of NLTK and PyTorch's text package.
Aston is already doing it.
tests/python/unittest/test_text.py
Outdated
seqs = _get_test_str_of_tokens(token_delim, seq_delim)

with open(os.path.join(path, '1.txt'), 'w') as fout:
    fout.write(seqs)
Please try to mock all file operations in unit tests. See https://docs.python.org/3/library/unittest.mock.html#mock-open.
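For reference, a minimal sketch of the mock-based approach (the test name and file contents are illustrative):

from unittest import mock

def test_read_tokens_without_real_file():
    fake_contents = 'Life is great!\nlife is good.\n'
    # Patch the built-in open so the test never touches the filesystem.
    # (On Python 2 the patch target would be '__builtin__.open'.)
    with mock.patch('builtins.open', mock.mock_open(read_data=fake_contents)):
        with open('1.txt') as fin:  # no real '1.txt' is needed
            assert fin.read() == fake_contents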
I tried to be consistent with our existing unit tests:
https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_image.py#L108
tests/python/unittest/test_text.py
Outdated
from mxnet.test_utils import *
from mxnet.text import utils as tu
from mxnet.text import glossary as glos
Are there some missing tests for the Glossary class?
Yes, work in progress.
Hello @astonzhang, please rebase your PR.
python/mxnet/text/__init__.py
Outdated
from . import utils
from .utils import *
from . import glossary
from .glossary import *
Remove `from .glossary import *` unless the classes are intended to exist in the mx.text namespace.
Thanks. I probably will import it because mxnet.text.Embedding and mxnet.text.Glossary look fine and are consistently used in the documentation.
If so, then I think the utils and glossary namespaces are unnecessary. There’s no point in keeping both.
resolved.
@astonzhang Again, please rebase your PR as it creates invalid CI requests.
@marcoabreu resolved.
python/mxnet/text/glossary.py
Outdated
top_k_freq : None or int, default None
    The number of top frequent tokens in the keys of `counter` that will be
    indexed. If None, all the tokens in the keys of `counter` will be
    indexed.
What’s the behavior when counter size is smaller than k?
resolved.
python/mxnet/text/glossary.py
Outdated
----------
counter : collections.Counter
    Counts text and special token frequencies in the text data, where
    special token frequency is cleared to zero.
Given that this came from the user, it’s probably not necessary to expose the counter as a property. Otherwise we need to define how this property should change based on topk or other constructor arguments, and what mutating this property means.
resolved.
python/mxnet/text/glossary.py
Outdated
assert self.idx_to_vec is not None, \
    'mxnet.text.Glossary._idx_to_vec has not been initialized. Use ' \
    'mxnet.text.Glossary.__init__() or ' \
    'mxnet.text.Glossary.set_idx_to_embed() to initialize it.'
The assertion message includes internal implementation details. Consider removing such references and changing it to something like “Glossary has not been initialized. Do X...”
Resolved.
python/mxnet/text/glossary.py
Outdated
'token, please specify it explicitly as the '
'unknown special token %s in tokens. This is '
'to avoid unintended updates.' %
(token, self.idx_to_token[Glossary.unk_idx()]))
It’s common to use one set (i.e. training set) for generating the vocabulary and reuse the same vocabulary on another set for indexing. Returning the index for unknown and warning the user would likely make this interface easier to use.
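A sketch of the suggested behavior (the attribute names here are hypothetical, not the PR's actual fields):

import warnings

def to_index(self, token):
    """Maps a token to its index, falling back to the unknown index."""
    if token in self._token_to_idx:
        return self._token_to_idx[token]
    # Tokens unseen at vocabulary-build time map to the unknown token,
    # with a warning rather than an error.
    warnings.warn('"%s" is out of vocabulary; returning the unknown-token '
                  'index.' % token)
    return self._token_to_idx[self._unknown_token]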
Thanks. Since update_idx_to_vec is a member function of a specific Glossary instance, it is fair to assume that the input tokens match the token indices of this specific Glossary instance.
OK
python/mxnet/text/glossary.py
Outdated
self._idx_to_vec[nd.array(indices)] = new_vectors


class Embedding(object):
Given that Embedding is already the name of a layer in Gluon, using the same name could result in unintended name collisions, especially when doing wildcard imports. Consider using another name.
Resolved.
python/mxnet/text/glossary.py
Outdated
# data format is changed.
assert check_sha1(download_file_path, expected_download_hash), \
    'The downloaded file %s does not match its expected SHA-1 ' \
    'hash. This is caused by the changes at the externally ' \
“This is caused” -> “This is likely caused” since failed/partial/interrupted download can also cause this.
Thanks. The immediately preceding if-block makes sure that it can only be caused by external changes.
python/mxnet/text/glossary.py
Outdated
vector for every special token, such as an unknown token and a padding
token.
"""
with open(pretrain_file_path, 'r', encoding='utf8') as f:
Make sure code is tested on python2
Thanks. Will test on Py2 when adding test cases.
python/mxnet/text/glossary.py
Outdated
vec_len = None
all_elems = []
idx_to_token = []
for line in tqdm(lines, total=len(lines)):
Let’s avoid adding dependencies such as tqdm.
Thanks. Since https://github.com/apache/incubator-mxnet/blob/master/example/gluon/tree_lstm/dataset.py uses tqdm, I assumed the tqdm dependency was already accepted. Let me know if you still prefer removing it.
Resolved. (tqdm is removed)
python/mxnet/text/glossary.py
Outdated
             for i in elems[1:]]

if len(elems) == 1:
    logging.warning('WARNING: Token %s with 1-dimensional vector '
Use `warnings.warn`.
resolved.
Can we move this into gluon.data and make use of Dataset?
Thanks. Embedding vectors are a little different from text data (the training corpus). This package is meant to host all text-related utilities, such as indexers and embeddings.
python/mxnet/text/embedding.py
Outdated
if reserved_tokens is not None:
    for reserved_token in reserved_tokens:
        assert reserved_token != unknown_token, \
            '`reserved_token` cannot contain `unknown_token`.'
`assert unknown_token not in reserved_tokens`.
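That is, a single membership test in place of the loop (sketch):

if reserved_tokens is not None:
    assert unknown_token not in reserved_tokens, \
        '`reserved_tokens` cannot contain `unknown_token`.'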
resolved.
python/mxnet/text/embedding.py
Outdated
self._reserved_tokens = None
else:
    # Python 2 does not support list.copy().
    self._reserved_tokens = reserved_tokens[:]
Remove the comment. Comments should be about what's in the code instead of what's absent.
resolved.
python/mxnet/text/embedding.py
Outdated
`counter` that can be indexed. Note that this argument does not count
any token from `reserved_tokens`. If this argument is None or larger
than its largest possible value restricted by `counter` and
`reserved_tokens`, this argument becomes positive infinity.
this argument becomes positive infinity -> it has no effect.
So the intention of having this argument is to put a maximum size limit on the index. This would incur some complexity, especially when there is a tie in the counter. For example, suppose you want to limit it to 3 in the case where `reserved_tokens = []; counter = {'a': 5, 'b': 5, 'c': 3, 'd': 3}`: you would need to further clarify whether 'c' or 'd' is kept. The secondary alphabetic ordering should be documented.
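For example, sorting with the token itself as a secondary key makes the cut-off deterministic (a sketch, assuming alphabetic tie-breaking):

from collections import Counter

counter = Counter({'a': 5, 'b': 5, 'c': 3, 'd': 3})
# Primary key: descending frequency. Secondary key: the token itself,
# so ties break alphabetically and the top-k cut-off is deterministic.
token_freqs = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
print([token for token, _ in token_freqs[:3]])  # ['a', 'b', 'c']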
resolved.
python/mxnet/text/embedding.py
Outdated
    # 1 is the unknown token count.
    token_cap = 1 + len(reserved_tokens) + len(counter)
else:
    token_cap = 1 + len(reserved_tokens) + most_freq_count
token_cap = 1 + len(reserved_tokens) + (most_freq_count if most_freq_count else len(counter))
resolved.
python/mxnet/text/embedding.py
Outdated
token_cap = 1 + len(reserved_tokens) + most_freq_count

for token, freq in token_freqs:
    if freq < min_freq or len(self._idx_to_token) == token_cap:
You can use `for i in range(token_cap)` in the loop to avoid evaluating the second condition token_cap times.
Thanks. Due to the `if token not in reserved_tokens:` condition, where the `self._idx_to_token` length may not always self-increment by 1, we probably cannot use `for i in range(token_cap)` here.
python/mxnet/text/embedding.py
Outdated
    'The length of new_vectors must be equal to the number of tokens.'
assert new_vectors.shape[1] == self.vec_len, \
    'The width of new_vectors must be equal to the dimension of ' \
    'embeddings of the glossary.'
assert new_vectors.shape == (len(tokens), self.vec_len)
resolved
else:
    raise ValueError('Token %s is unknown. To update the embedding '
                     'vector for an unknown token, please specify '
                     'it explicitly as the `unknown_token` %s in '
How can a user add a new token in embedding? Should there be a separate method for that?
Thanks. The Embedding class always loads from pre-trained files. Users can add/set new tokens via the glossary rather than via the embedding.
from mxnet import ndarray as nd

y = nd.array([[ 0.,  1.,  2.,  3.,  4.],
              [ 5.,  6.,  7.,  8.,  9.],
              [10., 11., 12., 13., 14.],
              [15., 16., 17., 18., 19.]])
x = nd.array([1, 3])
%timeit y[x]
196 µs ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit nd.Embedding(x, y, 4, y.shape[1])
83.5 µs ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
`y[x]` is advanced indexing, which is usually slower than other indexing operations for two reasons:
- Overhead of sanity checking and preprocessing advanced indices before calling backend ops.
- In this case, the backend op used is `gather_nd`, which is expected for retrieving scattered elements from an ndarray. For a regular shape index like `[1, 3]` indexing the first dimension, operators such as `take` or `slice` are much more efficient than `gather_nd`.
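For a lookup along the first axis, a sketch of the cheaper paths (producing the same rows as y[x] above):

from mxnet import ndarray as nd

y = nd.arange(20).reshape((4, 5))
x = nd.array([1, 3])
rows_take = nd.take(y, x)              # `take` along axis 0
rows_embed = nd.Embedding(x, y, 4, 5)  # same lookup via the Embedding op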
python/mxnet/text/embedding.py
Outdated
self._idx_to_vec[nd.array(indices)] = new_vectors

@staticmethod
def register(embed_cls):
Looks like this is following what was done in the optimizer module. I think both embedding and optimizer should reuse `mx.registry` here instead of creating a new one. See the example in `mx.initializer`.
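A sketch of the mx.initializer pattern applied here (the base-class name and nickname are illustrative):

from mxnet import registry

class TextEmbedding(object):
    """Base class for pre-trained token embeddings (illustrative)."""

# Module-level helpers generated from the shared registry, as in
# mx.initializer, instead of hand-rolled @staticmethods.
register = registry.get_register_func(TextEmbedding, 'token embedding')
create = registry.get_create_func(TextEmbedding, 'token embedding')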
resolved
python/mxnet/text/embedding.py
Outdated
return embed_cls

@staticmethod
def create(embed_name, **kwargs):
Looks like this is following what was done in the optimizer module. I think both embedding and optimizer should reuse `mx.registry` here instead of creating a new one. See the example in `mx.initializer`.
resolved
python/mxnet/text/embedding.py
Outdated
    ', '.join(embed_cls.pretrain_file_sha1.keys())))

@staticmethod
def get_embed_names_and_pretrain_files():
list_pretrained_embeddings?
Thanks. It returns a string rather than printing it, so I guess "get_" is better than "list_".
@astonzhang great contribution. This should help greatly reduce the commonly repeated effort of text indexing and embedding. Thanks for going through these many iterations and keeping pushing for higher quality.
* Add text utils * Leftovers * revise * before load embeddings * glossary done * Add/revise text utils, revise test cases * Add docstrings * clean package init * remove play * Resolve issues and complete docstrings * disable pylint * Remove tqdm dependency * Add encoding utf8 utf utf utf * remove non-ascii * fix textcase * remove decode in glossary * py2 unicode * Fix py2 error * add tests * Test all embds * test some embeds * Add getter for glossary * remove util from path, revise interfaces of glossary * skip some test, before major revise * Add TextIndexer, only TextEmbed needs revised before major revise * before major revise * minor update * Revise TextIndexer with test * lint * lint * Revise TextEmbed, FastText, Glove, CustmonEmbed with test * Revision done except for docstr * Add unit tests for utils * almost no pylint disable, yeah * doc minor updates * re-run * re-run * except for register * except for register * Revise register/create, add get_registry * revise * More readability * py2 compatibility * Update doc * Revise based on feedbacks from NLP team * add init * Support indexing for any hashable and comparable token * Add test cases * remove type cmp * Fix doc error and add API descriptions * Fix api doc error * add members explicitly * re-order modules in text.md * url in one line * add property desc for all inherited classes for rst parsing * escape \n * update glossary example * escape \n * add use case * Make doc more user-friendly * proper imports, gluon.nn.Embedding use case * fix links * re-org link level * tokens_to_indices * to_indices, to_tokens
@szha Thank you very much for repeatedly going through this PR with me!
This reverts commit 6c1f4f7.
Description
Add mxnet.text APIs (new features). This is intended to be used in natural language processing applications.
- Text processing utilities.
- Text indexer class. It can be used by instances of `mxnet.text.embedding.TextEmbedding`, such as instances of `mxnet.text.glossary.Glossary`.
- Text pre-trained embedding class. A pre-trained embedding is created via `TextEmbedding.create(embedding_name, pretrained_file_name)`. To get all the available `embedding_name` and `pretrained_file_name`, use `TextEmbedding.get_embedding_and_pretrained_file_names()`. User-provided embeddings are loaded via `mxnet.text.embedding.CustomEmbedding`, a subclass of `mxnet.text.embedding.TextEmbedding`. Supported embeddings:
  - GloVe pre-trained text embedding
  - The fastText pre-trained text embedding
  - Custom pre-trained text embedding
- Text glossary class. It indexes text tokens and associates them with embedding vectors from instances of `mxnet.text.embedding.TextEmbedding`.

Checklist

Essentials
- Passed code style checking (`make lint`)

Changes
- `get_registry(base_class)` added in mxnet.registry with API doc. Tested in the test cases of Text embeddings.

Comments