Add vocabulary and embedding (#10074)
* [MXNET-67] Sync master with v1.1.0 branch (#10031)
* [REVIEW REQUIRED] Revert PR #9484 & add additional dependency licenses to LICENSE file (#9701)
* Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (#9484)". This reverts commit 8930d96.
* Some more LICENSE fixes
* Adding some more packages to the LICENSE file
* Adding dependencies of dependencies
* Update v1.1.0 change log to NEWS.md
* Sync README.md from v1.1.0 branch
* Revert to correct Jenkins URL in README
* Parallelization for ROIpooling OP (#9958)
  * Parallelization for roipooling
  * Remove some useless computation
  * Remove useless muls
  * Add author and retriggering
  * Retrigger again
* Comments to copy and copyto are corrected (#10040)
* Bug fix and performance optimization for rtc (#10018)
  1. The `super().__init__()` bug is fixed in Python 2.
  2. The kernel is initialized in the stage of operator init.
* Update custom_softmax_rtc.py: fix unnecessary format
* Set embedding
* Code and test revised
* API implementation done
* License and news
* README and cpp
* pylint disable
* Add API doc
* Less pylint disable
* Remove contrib
* Move to gluon, revise API doc
* Fix import order
* Re-test
* Relative imports
* Re-run test
* Revise implementation, test case, and API doc
* Re-test
1 parent 911ff3b; commit 9e4f9d1. Showing 7 changed files with 2,739 additions and 0 deletions.
@@ -0,0 +1,347 @@
# Gluon Text API

## Overview

The `mxnet.gluon.text` APIs refer to classes and functions related to text data processing, such
as building indices and loading pre-trained embedding vectors for text tokens, and storing them in
the `mxnet.ndarray.NDArray` format.

This document lists the text APIs in `mxnet.gluon`:

```eval_rst
.. autosummary::
    :nosignatures:

    mxnet.gluon.text.embedding
    mxnet.gluon.text.vocab
    mxnet.gluon.text.utils
```

All the code demonstrated in this document assumes that the following modules or packages are
imported.

```python
>>> from mxnet import gluon
>>> from mxnet import nd
>>> from mxnet.gluon import text
>>> import collections

```

### Indexing words and using pre-trained word embeddings in `gluon`

As a common use case, let us index words, attach pre-trained word embeddings to them, and use
such embeddings in `gluon`, all in just a few lines of code.

To begin with, suppose that we have a simple text data set in the string format. We can count word
frequency in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and whose values are word
frequencies. This allows us to filter out infrequent words (see details in the
[Vocabulary API specifications](#mxnet.gluon.text.vocab.Vocabulary)).
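
For instance, the counter obtained above would contain counts like the following (shown here for
illustration, assuming the return value behaves like a `collections.Counter`; the exact display
order may differ):

```python
>>> counter
Counter({'world': 3, 'hello': 2, 'nice': 1, 'hi': 1})

```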
Suppose that we want to build indices for all the keys in `counter`. We need a `Vocabulary`
instance with `counter` as its argument.

```python
>>> my_vocab = text.Vocabulary(counter)

```

To attach word embeddings to the indexed words in `my_vocab`, let us go on to create a fastText
word embedding instance by specifying the embedding name `fasttext` and the pre-trained file name
`wiki.simple.vec`.

```python
>>> fasttext = text.embedding.create('fasttext', file_name='wiki.simple.vec')

```

We can then attach the word embedding `fasttext` to the indexed words in `my_vocab`.

```python
>>> my_vocab.set_embedding(fasttext)

```

Now we are ready to access the fastText word embedding vectors for indexed words, such as 'hello'
and 'world'.

```python
>>> my_vocab.embedding[['hello', 'world']]

[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
 ...
   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
 [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
 ...
   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```

To demonstrate how to use pre-trained word embeddings in the `gluon` package, let us first obtain
the indices of the words 'hello' and 'world'.

```python
>>> my_vocab[['hello', 'world']]
[2, 1]

```

We can obtain the vector representations for the words 'hello' and 'world' by specifying their
indices (2 and 1) and the weight matrix `my_vocab.embedding.idx_to_vec` in
`mxnet.gluon.nn.Embedding`.

```python
>>> input_dim, output_dim = my_vocab.embedding.idx_to_vec.shape
>>> layer = gluon.nn.Embedding(input_dim, output_dim)
>>> layer.initialize()
>>> layer.weight.set_data(my_vocab.embedding.idx_to_vec)
>>> layer(nd.array([2, 1]))

[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
 ...
   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
 [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
 ...
   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```

## Vocabulary

The vocabulary builds indices for text tokens and can have token embeddings attached to it. The
input counter, whose keys are candidate tokens to index, may be obtained via
[`count_tokens_from_str`](#mxnet.gluon.text.utils.count_tokens_from_str).

```eval_rst
.. currentmodule:: mxnet.gluon.text.vocab
.. autosummary::
    :nosignatures:

    Vocabulary
```

Suppose that we have a simple text data set in the string format. We can count word frequency in
the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and whose values are word
frequencies. This allows us to filter out infrequent words. Suppose that we want to build indices
for the 2 most frequent keys in `counter` with the unknown token representation '(unk)' and a
reserved token '(pad)'.

```python
>>> my_vocab = text.Vocabulary(counter, max_size=2, unknown_token='(unk)',
...     reserved_tokens=['(pad)'])

```

We can access properties such as `token_to_idx` (mapping tokens to indices), `idx_to_token`
(mapping indices to tokens), `unknown_token` (representation of any unknown token) and
`reserved_tokens` (reserved tokens).

```python
>>> my_vocab.token_to_idx
{'(unk)': 0, '(pad)': 1, 'world': 2, 'hello': 3}
>>> my_vocab.idx_to_token
['(unk)', '(pad)', 'world', 'hello']
>>> my_vocab.unknown_token
'(unk)'
>>> my_vocab.reserved_tokens
['(pad)']
>>> len(my_vocab)
4
>>> my_vocab[['hello', 'world']]
[3, 2]
```

Besides the specified unknown token '(unk)' and the reserved token '(pad)', the 2 most frequent
words, 'world' and 'hello', are also indexed.
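
We can also map indices back to tokens. The following is a brief sketch, assuming that the
`to_tokens` method (listed in the API reference below) accepts a list of indices:

```python
>>> my_vocab.to_tokens([3, 2])
['hello', 'world']

```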

### Attach token embedding to vocabulary

Token embeddings can be attached to a vocabulary instance.

To begin with, suppose that we have a simple text data set in the string format. We can count word
frequency in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and whose values are word
frequencies. This allows us to filter out infrequent words. Suppose that we want to build indices
for the 2 most frequent keys in `counter`.

```python
>>> my_vocab = text.Vocabulary(counter, max_size=2)

```

Let us define a fastText word embedding instance with the pre-trained file `wiki.simple.vec`.

```python
>>> fasttext = text.embedding.create('fasttext', file_name='wiki.simple.vec')

```

We can then attach the word embedding `fasttext` to the indexed words in `my_vocab`.

```python
>>> my_vocab.set_embedding(fasttext)

```

Now we are ready to access the fastText word embedding vectors for the indexed words.

```python
>>> my_vocab.embedding[['hello', 'world']]

[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
 ...
   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
 [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
 ...
   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```

Let us now define a GloVe word embedding with the pre-trained file `glove.6B.50d.txt` and
re-attach this GloVe text embedding instance to the vocabulary.

```python
>>> glove = text.embedding.create('glove', file_name='glove.6B.50d.txt')
>>> my_vocab.set_embedding(glove)

```

Now we are ready to access the GloVe word embedding vectors for the indexed words.

```python
>>> my_vocab.embedding[['hello', 'world']]

[[ -0.38497001   0.80092001
 ...
    0.048833     0.67203999]
 [ -0.41486001   0.71847999
 ...
   -0.37639001  -0.67541999]]
<NDArray 2x50 @cpu(0)>

```

If a token is unknown to `my_vocab`, its embedding vector is initialized according to the default
specification in `glove` (all elements are 0).

```python
>>> my_vocab.embedding['nice']

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 ...
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
<NDArray 50 @cpu(0)>

```

## Text token embedding

To load token embeddings from an externally hosted pre-trained token embedding file, such as those
of GloVe and FastText, use
[`embedding.create(embedding_name, file_name)`](#mxnet.gluon.text.embedding.create).

To get all the available `embedding_name` and `file_name` values, use
[`embedding.get_file_names()`](#mxnet.gluon.text.embedding.get_file_names).

```python
>>> text.embedding.get_file_names()
{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', ...],
 'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec', ...]}

```

Alternatively, to load embedding vectors from a custom pre-trained text token embedding file, use
[`TokenEmbedding.from_file`](#mxnet.gluon.text.embedding.TokenEmbedding.from_file).
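
For example, a custom file might store one token per line followed by the elements of its vector.
A minimal sketch, assuming such a whitespace-delimited file at the hypothetical path
`my_embedding.txt` (see the `from_file` entry in the API reference below for the exact signature):

```python
>>> my_embedding = text.embedding.TokenEmbedding.from_file('my_embedding.txt')

```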

```eval_rst
.. currentmodule:: mxnet.gluon.text.embedding
.. autosummary::
    :nosignatures:

    register
    create
    get_file_names
    TokenEmbedding
    GloVe
    FastText
```

See [Attach token embedding to vocabulary](#attach-token-embedding-to-vocabulary) for how to attach
token embeddings to a vocabulary and use them.

### Implement a new text token embedding

To implement a new text token embedding, create a subclass of
`mxnet.gluon.text.embedding.TokenEmbedding` and decorate it with
`@mxnet.gluon.text.embedding.TokenEmbedding.register`. See
[`embedding.py`](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/text/embedding.py)
for examples.
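
The following is a minimal sketch of such a subclass; the class name, default file name, and
loading logic are illustrative assumptions, not part of the API:

```python
@text.embedding.TokenEmbedding.register
class MyTextEmbedding(text.embedding.TokenEmbedding):
    """A hypothetical custom token embedding."""

    def __init__(self, file_name='my_embedding.txt', **kwargs):
        super(MyTextEmbedding, self).__init__(**kwargs)
        # A real subclass would locate (or download) the pre-trained file named
        # by `file_name` here and load its vectors, as the built-in GloVe and
        # FastText classes in embedding.py do.
```

Once registered, such a class can presumably be instantiated by name through
[`embedding.create`](#mxnet.gluon.text.embedding.create), in the same way as the built-in
embeddings.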

## Text utilities

The following functions provide utilities for text data processing.

```eval_rst
.. currentmodule:: mxnet.gluon.text.utils
.. autosummary::
    :nosignatures:

    count_tokens_from_str
```
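
Assuming the returned object behaves like a `collections.Counter` (consistent with the
`import collections` at the top of this document), the most frequent tokens can be inspected
directly:

```python
>>> counter = text.utils.count_tokens_from_str(" hello world \n hello nice world \n hi world \n")
>>> counter.most_common(2)
[('world', 3), ('hello', 2)]

```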

## API Reference

<script type="text/javascript" src='../../_static/js/auto_module_index.js'></script>

```eval_rst
.. automodule:: mxnet.gluon.text.embedding
    :members: register, create, get_file_names
.. autoclass:: mxnet.gluon.text.embedding.TokenEmbedding
    :members: from_file
.. autoclass:: mxnet.gluon.text.embedding.GloVe
.. autoclass:: mxnet.gluon.text.embedding.FastText
.. automodule:: mxnet.gluon.text.vocab
.. autoclass:: mxnet.gluon.text.vocab.Vocabulary
    :members: set_embedding, to_tokens
.. automodule:: mxnet.gluon.text.utils
    :members: count_tokens_from_str
```

<script>auto_index("api-reference");</script>
@@ -0,0 +1,26 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

# coding: utf-8
# pylint: disable=wildcard-import
"""This module includes utilities for indexing and embedding text."""

# Expose the vocabulary classes at the package level (e.g. text.Vocabulary).
from .vocab import *

# Expose the embedding submodule (e.g. text.embedding.create).
from . import embedding

# Expose the utility functions at the package level
# (e.g. text.count_tokens_from_str).
from .utils import *