Using pre-trained word vectors in embedding layer #490

Closed
qingqing01 opened this issue Nov 16, 2016 · 12 comments

@qingqing01
Contributor

The following question comes from an email.

Thank you for your work on Paddle. I think the design is very interesting.

I would like to use pretrained word vectors in an embedding layer. I want the weights to be static, because my training data is small. For clarity, here's how I would implement the desired behaviour with Keras:

    model.add(
        Embedding(
            embeddings.shape[0],
            embeddings.shape[1],
            input_length=shape['max_length'],
            trainable=False,
            weights=[embeddings],
            mask_zero=True
        )
    )

Is there a way to implement this with the Paddle Python bindings? Unfortunately I haven't been able to find this in the documentation or source yet.

@qingqing01
Contributor Author

qingqing01 commented Nov 16, 2016

The following answer comes from Jie. I'm pasting it here to help other users.

Thanks for your interest in Paddle. The situation you describe is very common in NLP tasks, and PaddlePaddle definitely supports this kind of requirement.
If the parameter name of the word vectors is “embeddings”, you need to do the following:

  • In the run script, add two command-line options:
    • --init_model_path='path_of_the_existing_word_vectors' # you can also point this at other files for initialization; keep the parameter names the same as those used in the network config.
    • --load_missing_parameter_strategy='rand' # so that the remaining parameters are initialized randomly when only the pre-trained part is loaded.
  • In the network config file:
    • In the word-vector (embedding) layer, set parameter_name='embeddings' and add is_static=True, or more straightforwardly set learning_rate=0.

@qingqing01
Contributor Author

qingqing01 commented Nov 16, 2016

Layer Config

The embedding_layer is as follows:

emb = embedding_layer(
    input=data_layer,
    size=word_vector_dim,
    param_attr=ParamAttr(name='embeddings', is_static=True))

or

emb = embedding_layer(
    input=data_layer,
    size=word_vector_dim,
    param_attr=ParamAttr(name='embeddings', learning_rate=0.0))

In the Chinese word embedding model tutorial, the embedding weight file is named language_embedding: http://www.paddlepaddle.org/doc/demo/embedding_model/index.html
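
Putting the pieces together, a minimal config fragment might look like the sketch below (the vocabulary size and vector dimension are illustrative, and the import follows the old v1-style config convention):

from paddle.trainer_config_helpers import *

word_dict_len = 10000      # illustrative vocabulary size
word_vector_dim = 256      # must match the width of the pre-trained vectors

# The input is a sequence of integer word indices.
word = data_layer(name='word', size=word_dict_len)

# The parameter name 'embeddings' must match the weight file loaded via
# --init_model_path; is_static=True keeps the weights fixed during training.
emb = embedding_layer(
    input=word,
    size=word_vector_dim,
    param_attr=ParamAttr(name='embeddings', is_static=True))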

How to save the model in PaddlePaddle format

If you need to convert your model into PaddlePaddle's parameter format, you can use the following Python function.
Note that the output file name must be the same as the parameter name used in the layer config above ('embeddings' here).

import struct
import numpy as np

def write_parameter(outfile, weights):
    """
    :param outfile: Output file name. **Note**: it must be the same as the
                    parameter name used in the layer config above.
    :type outfile: string.
    :param weights: the parameter values.
    :type weights: 1-dimensional numpy array or list of floats.
    """
    version = 0
    value_size = 4  # 4 bytes per value, i.e. float32
    # Serialize all values as float32 bytes.
    data = np.asarray(weights, dtype=np.float32).tobytes()
    size = len(data) // value_size  # number of float values
    with open(outfile, 'wb') as fo:
        # Header: version (int32), value size (uint32), value count (uint64).
        fo.write(struct.pack('iIQ', version, value_size, size))
        fo.write(data)

# weights is a 2-dimensional array; each row is one word vector.
weights = np.array([[w_11, w_12, w_13, w_14],
                    [w_21, w_22, w_23, w_24],
                    ...])
write_parameter("embeddings", weights.flatten())
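
As a quick sanity check, assuming the header layout written by write_parameter above (version, value size, value count), you can read the file back and verify its shape with a helper like read_parameter below (a hypothetical utility, not a Paddle API):

import struct
import numpy as np

def read_parameter(infile, width):
    """Read back a parameter file written by write_parameter, for verification.
    `width` is the word vector dimension used before flattening."""
    header_fmt = 'iIQ'  # version (int32), value size (uint32), value count (uint64)
    with open(infile, 'rb') as f:
        version, value_size, size = struct.unpack(
            header_fmt, f.read(struct.calcsize(header_fmt)))
        assert value_size == 4          # float32 values
        data = np.frombuffer(f.read(), dtype=np.float32)
    assert data.size == size
    return data.reshape(-1, width)

# e.g. print(read_parameter("embeddings", word_vector_dim).shape)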

The command line arguments

For the first point described above, refer to the document
http://www.paddlepaddle.org/doc/ui/cmd_argument/use_case.html#use-model-to-initialize-network

@backyes
Contributor

backyes commented Nov 16, 2016

@qingqing01

We need more explanation about the write_parameter function, especially its feats argument:

def write_parameter(outfile, feats):

I guess feats is an array type in Python, but the code fragment you pasted explains nothing about it.

In addition, what is the format of feats? Is the second float value of feats the second value of the first row, or the second value of the first column? Different answers mean different implementations of write_parameter, right?

@qingqing01
Contributor Author

@backyes Thanks, you are right. I fixed the code above and renamed feats -> weights. Is it clear now?
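
For reference, NumPy's flatten() uses row-major (C) order by default, so consecutive floats in the file belong to the same word vector, i.e. the second float is the second value of the first row. A small sketch:

import numpy as np

weights = np.array([[1.0, 2.0, 3.0, 4.0],
                    [5.0, 6.0, 7.0, 8.0]], dtype=np.float32)
# Row-major flattening: the second float is w_12, the second value
# of the first word vector.
print(weights.flatten())   # [1. 2. 3. 4. 5. 6. 7. 8.]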

@backyes
Contributor

backyes commented Nov 17, 2016

@qingqing01 That's clear now.

@OleNet
Contributor

OleNet commented Nov 30, 2016

I1130 10:24:10.250222 18388 Util.cpp:113] Calling runInitFunctions 
I1130 10:24:10.250636 18388 Util.cpp:126] Call runInitFunctions done. 
I1130 10:24:10.493953 18388 Trainer.cpp:169] trainer mode: Normal 
I1130 10:24:10.494329 18388 MultiGradientMachine.cpp:108] numLogicalDevices=1 numThreads=4 numDevices=4 
I1130 10:24:10.555085 18388 PyDataProvider2.cpp:219] loading dataprovider dataprovider::process 
I1130 10:24:10.565850 18388 PyDataProvider2.cpp:219] loading dataprovider dataprovider::process 
I1130 10:24:10.566042 18388 GradientMachine.cpp:123] Loading parameters from test/sentiment/thirdparty/emb/embeddings 
I1130 10:24:10.566069 18388 Parameter.cpp:344] **missing parameters** [test/sentiment/thirdparty/emb/embeddings/embeddings] while loading model. 
I1130 10:24:10.566082 18388 Parameter.cpp:354] embeddings missing, set to random. 
I1130 10:24:10.721670 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/_lstm_transform___bidirectional_lstm_0___fw.w0] while loading model. 
I1130 10:24:10.721714 18388 Parameter.cpp:354] _lstm_transform___bidirectional_lstm_0___fw.w0 missing, set to random. 
I1130 10:24:10.743702 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/___bidirectional_lstm_0___fw.w0] while loading model. 
I1130 10:24:10.743721 18388 Parameter.cpp:354] ___bidirectional_lstm_0___fw.w0 missing, set to random. 
I1130 10:24:10.829339 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/___bidirectional_lstm_0___fw.wbias] while loading model. 
I1130 10:24:10.829360 18388 Parameter.cpp:354] ___bidirectional_lstm_0___fw.wbias missing, set to random. 
I1130 10:24:10.831689 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/_lstm_transform___bidirectional_lstm_0___bw.w0] while loading model. 
I1130 10:24:10.831707 18388 Parameter.cpp:354] _lstm_transform___bidirectional_lstm_0___bw.w0 missing, set to random. 
I1130 10:24:10.851577 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/___bidirectional_lstm_0___bw.w0] while loading model. 
I1130 10:24:10.851593 18388 Parameter.cpp:354] ___bidirectional_lstm_0___bw.w0 missing, set to random. 
I1130 10:24:10.930817 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/___bidirectional_lstm_0___bw.wbias] while loading model.
I1130 10:24:10.930835 18388 Parameter.cpp:354] ___bidirectional_lstm_0___bw.wbias missing, set to random. 
I1130 10:24:10.931156 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/___fc_layer_0__.w0] while loading model. 
I1130 10:24:10.931169 18388 Parameter.cpp:354] ___fc_layer_0__.w0 missing, set to random. 
I1130 10:24:10.974350 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/___fc_layer_0__.wbias] while loading model. 
I1130 10:24:10.974370 18388 Parameter.cpp:354] ___fc_layer_0__.wbias missing, set to random. 
I1130 10:24:10.976465 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/___embedding_1__.w0] while loading model. 
I1130 10:24:10.976483 18388 Parameter.cpp:354] ___embedding_1__.w0 missing, set to random. 
I1130 10:24:11.120676 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/_lstm_transform___bidirectional_lstm_1___fw.w0] while loading model. 
I1130 10:24:11.120702 18388 Parameter.cpp:354] _lstm_transform___bidirectional_lstm_1___fw.w0 missing, set to random. 
I1130 10:24:11.140681 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/___bidirectional_lstm_1___fw.w0] while loading model. 
I1130 10:24:11.140712 18388 Parameter.cpp:354] ___bidirectional_lstm_1___fw.w0 missing, set to random. 
I1130 10:24:11.220038 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/___bidirectional_lstm_1___fw.wbias] while loading model. 
I1130 10:24:11.220065 18388 Parameter.cpp:354] ___bidirectional_lstm_1___fw.wbias missing, set to random. 
I1130 10:24:11.220399 18388 Parameter.cpp:344] missing parameters [test/sentiment/thirdparty/emb/embeddings/_lstm_transform___bidirectional_lstm_1___bw.w0] while loading model. 
paddle cluster_train \
  --config=test/sentiment/cluster_job_config/job_config.py \
 ...
  --init_model_path=test/sentiment/thirdparty/emb/ \
  --load_missing_parameter_strategy=rand

word_data1 = data_layer("word1", input_dim)
emb = embedding_layer(input=word_data1, size=emb_dim,
                      param_attr=ParamAttr(name='embeddings'))

My commands and config are shown above, but the log reports "missing parameters". Which step went wrong?

@luotao1
Contributor

luotao1 commented Nov 30, 2016

You can check whether there is actually a model file under test/sentiment/thirdparty/emb/.

@OleNet
Contributor

OleNet commented Nov 30, 2016

Under my test/sentiment/thirdparty/emb/ there is one embeddings file,
generated with the code mentioned above:

import struct
import numpy as np

def write_parameter(outfile, weights):
    """
    :param outfile: Output file name. **Note**: it must be the same as the parameter name used in the layer config above.
   ...

@luotao1
Contributor

luotao1 commented Nov 30, 2016

missing parameters [test/sentiment/thirdparty/emb/embeddings/embeddings] contains embeddings twice; check whether the path is correct.

@OleNet
Contributor

OleNet commented Nov 30, 2016

I found the problem: the path given to paddle cluster_train --init_model_path is resolved against the paths on the cluster, not the local machine.
Wrong version:

paddle cluster_train \
  --config=test/sentiment/cluster_job_config/job_config.py \
  --num_nodes=1 \
  --num_passes=20 \
  --log_period=100 \
  --dot_period=10 \
  --trainer_count=16 \
  --saving_period=1 \
  --thirdparty=./test/sentiment/thirdparty \
  --config_args=is_local=0 \
  --use_gpu gpu \
...
  --init_model_path=test/sentiment/thirdparty/emb/ \
  --load_missing_parameter_strategy=rand

Corrected version:

paddle cluster_train \
  --config=test/sentiment/cluster_job_config/job_config.py \
  --num_nodes=1 \
  --num_passes=20 \
  --log_period=100 \
  --dot_period=10 \
  --trainer_count=16 \
  --saving_period=1 \
  --thirdparty=./test/sentiment/thirdparty \
  --config_args=is_local=0 \
  --use_gpu gpu \
...
  --init_model_path=./thirdparty/thirdparty/emb/ \
  --load_missing_parameter_strategy=rand

@backyes backyes closed this as completed Nov 30, 2016
@keain

keain commented Dec 4, 2016

word = data_layer(name='word_data', size=word_dict_len)
word_embedding = embedding_layer(size=word_dim, input=word, param_attr=ptt)

Suppose the input data is of integer_value_sequence type with length L. After the two statements above, is it equivalent to turning an input matrix of size L * word_dict_len into a matrix of size L * word_dim?

@qingqing01
Contributor Author

@keain It turns the L integer_value entries (integer indices) into an L * word_dim matrix.

"turning an input matrix of size L * word_dict_len"

This phrasing is not quite accurate, but your understanding of the output is correct.
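
Conceptually (a plain NumPy sketch for illustration, not Paddle code), the embedding lookup behaves like this:

import numpy as np

word_dict_len, word_dim = 5, 3
embeddings = np.random.rand(word_dict_len, word_dim).astype(np.float32)

# An integer_value_sequence of length L = 4: word indices, not one-hot vectors.
sequence = [2, 0, 4, 2]

# Row lookup: the output is an L x word_dim matrix.
output = embeddings[sequence]
print(output.shape)   # (4, 3)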
