
add network configuration and training script #19

Merged
6 commits merged on May 11, 2017

Conversation

pkuyym (Contributor) commented May 2, 2017

Resolves #18

param_attr=paddle.attr.Param(name='sigmoid_w'),
bias_attr=paddle.attr.Param(name='sigmoid_b'))

parameters = paddle.parameters.create([cost, prediction])
Contributor

The parameters.create call on line 61, the optimizer on line 63, and the trainer on line 69 do not logically belong in network_conf.py. This file is the network configuration, while those lines are run logic; move them into train_v2.py. Line 71 needs to be updated accordingly.

Contributor Author

@luotao1 Thanks

print "Pass %d, Batch %d, Cost %f" % (
event.pass_id, event.batch_id, event.cost)

feeding = dict(zip(input_data_lst, xrange(len(input_data_lst))))
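The `feeding` construction above is plain Python: it maps each data layer's name to the column index of the corresponding field in every sample the reader yields. A minimal sketch with hypothetical layer names (`xrange` is Python 2; `range` behaves the same here):

```python
# Hypothetical data-layer names -- the real list comes from the network config.
input_data_lst = ['firstw', 'secondw', 'thirdw', 'fourthw', 'target']

# Same construction as the line under review: name -> column index.
feeding = dict(zip(input_data_lst, range(len(input_data_lst))))

print(feeding)
# {'firstw': 0, 'secondw': 1, 'thirdw': 2, 'fourthw': 3, 'target': 4}
```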
Contributor

  1. Although feeding is useful, it is not easy for novice users to understand. Also, having the network configuration return an input_list does not feel right.

  2. The network configuration's return value should be cost during training and prediction during inference, not three values as it is now.

@lcy-seso what do you think?

import gzip


def decode_res(infer_res, dict_size):
Contributor

Please add comments to this function; it is hard to tell from the code what it does. A simple data example would help show what transformation it performs.

ins_lst = []
ins_lbls = []

ins_buffer = paddle.reader.shuffle(
Contributor

For prediction, do we still need to shuffle?


for i, ins in enumerate(ins_lst):
print idx_word_dict[ins[0]] + ' ' + idx_word_dict[ins[1]] + \
' -> ' + predict_words[i] + ' ( ' + gt_words[i] + ' )'
Contributor

Please add more comments in this code as well. At the very least, explain what each variable name means.

@@ -0,0 +1,77 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

Collaborator

The source files should be given more meaningful names, considering that we will add more examples to this directory.

Contributor Author (pkuyym) May 4, 2017

@lcy-seso Agreed. How about renaming predict_v2.py to hsigmoid_predict.py, train_v2.py to hsigmoid_train.py, and network_conf.py to hsigmoid_conf.py?



def network_conf(is_train, hidden_size, embed_size, dict_size):
def word_embed(in_layer):
Collaborator

There is no need to define the function word_embed; you can directly use this API call: http://www.paddlepaddle.org/develop/doc/api/v2/config/layer.html#embedding.

first_word_embed = word_embed(first_word)
second_word_embed = word_embed(second_word)
third_word_embed = word_embed(third_word)
fourth_word_embed = word_embed(fourth_word)
Collaborator (lcy-seso) May 4, 2017

For lines 19 to 32, please call paddle.v2.layer.embedding directly.

Also, in lines 29 to 32, please use embedding instead of embed.

I think embed_first_word or first_word_embedding are both OK, but first_word_embed is not a good name.

third_word_embed = word_embed(third_word)
fourth_word_embed = word_embed(fourth_word)

context_embed = paddle.layer.concat(input=[
Collaborator

Rename context_embed to context_embedding.

embed_context or context_embedding are both OK, but make sure to keep it consistent throughout the entire source code.

import paddle.v2 as paddle


def network_conf(is_train, hidden_size, embed_size, dict_size):
Collaborator

Consider providing a default value for is_train.


def main():
paddle.init(use_gpu=False, trainer_count=1)
word_dict = paddle.dataset.imikolov.build_dict(typo_freq=2)
Collaborator

It is better to rename typo_freq to min_word_freq, because the word typo is ambiguous.

Contributor Author

@lcy-seso This requires updating the imikolov dataset; should I create a PR?

Collaborator (lcy-seso) May 4, 2017

Yes, I think it is better to create another PR to modify the dataset and make sure it is done. Besides, before that PR is finished, we can create an issue to track our progress on fixing it.

with gzip.open('./models/model_pass_00000.tar.gz') as f:
parameters = paddle.parameters.Parameters.from_tar(f)

ins_num = 10 # use 10 instances in total for prediction
Collaborator

Consider iterating over the entire testing dataset.


# Output format: word1 word2 word3 word4 -> predict label
for i, ins in enumerate(ins_lst):
print idx_word_dict[ins[0]] + ' ' + \
Collaborator

The style of printing is not consistent with line 46.


def main():
paddle.init(use_gpu=False, trainer_count=1)
word_dict = paddle.dataset.imikolov.build_dict(typo_freq=2)
Collaborator

It is better to rename typo_freq to min_word_freq because the word typo is ambiguous.

@@ -0,0 +1,56 @@
#!/usr/bin/env python
Collaborator

The source files should be given more meaningful names, considering that we will add more examples to this directory.

@pkuyym pkuyym force-pushed the develop branch 2 times, most recently from 117cbe4 to 9195764 Compare May 10, 2017 06:28
@pkuyym pkuyym force-pushed the develop branch 2 times, most recently from 34c8da0 to f26f4a6 Compare May 10, 2017 11:33
lcy-seso (Collaborator) left a comment

After the last two sections of the document are revised, this can be merged.

return predict_lbls
```

The function's input is the predicted probabilities for a batch of samples together with the vocabulary size. The loop inside decodes each sample's output probabilities: following the left-0 / right-1 rule, it keeps walking the path until it reaches a leaf node. Note that the dataset used here needs a long training time to produce good results; the prediction program uses the first-pass model only for convenience of demonstration, so quality is not guaranteed.
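The left-0 / right-1 decoding described here can be sketched in plain Python. This is a hedged reconstruction rather than the PR's exact code: it assumes a heap-style complete binary tree in which node `i` has children `2i+1` and `2i+2`, the `dict_size - 1` non-leaf nodes come first, and a leaf's word id is its index minus `dict_size - 1`:

```python
def decode_res(infer_res, dict_size):
    """Decode per-node probabilities into predicted word ids.

    infer_res: one row of non-leaf-node probabilities per sample.
    dict_size: vocabulary size (so there are dict_size - 1 non-leaf nodes).
    """
    predict_lbls = []
    for probs in infer_res:
        idx = 0                        # start at the root
        while idx < dict_size - 1:     # still at a non-leaf node
            # left-0 / right-1 rule: treat probs[idx] as the chance of going right
            idx = 2 * idx + 1 if probs[idx] < 0.5 else 2 * idx + 2
        predict_lbls.append(idx - (dict_size - 1))  # leaf index -> word id
    return predict_lbls

# With dict_size = 4 there are 3 non-leaf nodes; two decoded samples:
print(decode_res([[0.1, 0.1, 0.1], [0.9, 0.1, 0.9]], 4))  # [0, 3]
```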
Collaborator

Two more sections are still needed:

  1. Add a "How to run" section; one or two sentences are enough, e.g.: training can be run directly with python hsigmoid_train.py.
  2. Add a "Training with your own data" section: paste a snippet of reader code there and show a call to create_reader.

Contributor Author

done

Collaborator

Please still add one or two sentences explaining what the raw data looks like. For example: assume each line of the raw data is one sentence with words separated by spaces, then show one sample line. Something along those lines.

luotao1 (Contributor) left a comment

Looks good; some wording still needs revision.

@@ -1 +1,125 @@
TBD
# Accelerating Word Embedding Training with Hsigmoid
Contributor

Change the title to Hsigmoid加速词向量训练 (use 词向量, "word vector", instead of "Word Embedding").

Contributor Author

Done


<p align="center">
<img src="images/binary_tree.png" width="220" hspace='10'/> <img src="images/path_to_1.png" width="220" hspace='10'/> <br/>
The left figure is a balanced binary classification tree; the right figure shows the path from the root node to class 1
Contributor

The figures need numbers. Change line 9 to: 图1. (a) is a balanced binary tree, (b) is the path from the root node to class 1.
Draw (a) and (b) under the two subfigures.

Contributor Author

Done


<p align="center">
<img src="images/network_conf.png" width = "70%" align="center"/><br/>
Network configuration structure
Contributor

The figure needs a number.

Contributor Author

Done

TBD
# Accelerating Word Embedding Training with Hsigmoid
## Background
In natural language processing, the traditional approach is to represent words with one-hot vectors. For example, with the vocabulary ['我', '你', '喜欢'], the three vectors [1,0,0], [0,1,0] and [0,0,1] can represent '我', '你' and '喜欢' respectively. This representation is simple, but when the vocabulary is large it leads to a dimensionality explosion, and since any two word vectors are orthogonal, each vector carries little information. To avoid or mitigate these drawbacks of the one-hot representation, word embeddings are now commonly used instead: a low-dimensional dense real vector replaces the high-dimensional sparse one-hot vector. There are many ways to train an embedding table; neural network models such as CBOW are one of them. These models are essentially classifiers, and when the vocabulary (i.e., the number of classes) is large, the traditional softmax becomes very time-consuming. For such scenarios PaddlePaddle provides layers such as hsigmoid to speed up model training.
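The contrast between the two representations can be made concrete in a few lines of Python (the dense vectors below are arbitrary illustrative numbers, not trained weights):

```python
vocab = ['我', '你', '喜欢']

# One-hot: dimension equals the vocabulary size, exactly one 1 per word.
one_hot = {w: [1 if i == j else 0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Any two one-hot vectors are orthogonal (dot product 0), so they
# carry no similarity information between words.
dot = sum(a * b for a, b in zip(one_hot['我'], one_hot['你']))

# A word embedding replaces each one-hot vector with a low-dimensional
# dense real vector (values here are made up for illustration).
embedding = {'我': [0.2, -0.1], '你': [0.5, 0.3], '喜欢': [-0.4, 0.8]}

print(one_hot['你'], dot)  # [0, 1, 0] 0
```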
Contributor

  1. Change the comma before 而且任意两个词 ("moreover, any two words") to a semicolon.
  2. Change 词嵌入向量 to 词向量 (word vector).
  3. Phrase it as: 词向量也就是word embedding,即使用一个低维稠密的实向量取代高维稀疏的one-hot向量.
  4. Change 训练embedding词表 to 训练词向量 (training word vectors): there are many methods and neural network models are one of them, including CBOW (besides CBOW, list a few more).
  5. "These models are all essentially classifiers; when the vocabulary, i.e. the number of classes, is large, the traditional softmax is very time-consuming." (change the comma into a period)
  6. "PaddlePaddle provides layers such as hsigmoid." For this "such as": which other layers are there? Please list them; if there are none, drop it.

Somewhere you could add a link to the word-vector chapter of the book.

Contributor Author

Done

## Background
In natural language processing, the traditional approach is to represent words with one-hot vectors. For example, with the vocabulary ['我', '你', '喜欢'], the three vectors [1,0,0], [0,1,0] and [0,0,1] can represent '我', '你' and '喜欢' respectively. This representation is simple, but when the vocabulary is large it leads to a dimensionality explosion, and since any two word vectors are orthogonal, each vector carries little information. To avoid or mitigate these drawbacks of the one-hot representation, word embeddings are now commonly used instead: a low-dimensional dense real vector replaces the high-dimensional sparse one-hot vector. There are many ways to train an embedding table; neural network models such as CBOW are one of them. These models are essentially classifiers, and when the vocabulary (i.e., the number of classes) is large, the traditional softmax becomes very time-consuming. For such scenarios PaddlePaddle provides layers such as hsigmoid to speed up model training.
## Hsigmoid Layer
The Hsigmoid layer comes from the paper \[[1](#参考文献)\]. The idea is to build a binary classification tree to reduce computational complexity: every leaf node of the tree represents one class, and every non-leaf node is a binary classifier. For example, suppose there are 4 classes, 0, 1, 2 and 3. Softmax computes a score for each of the 4 classes and normalizes them into probabilities; when there are many classes, computing every class's probability is very time-consuming. The Hsigmoid layer instead builds a balanced binary tree over the classes, as follows:
Contributor

  1. Does Hsigmoid stand for Hierarchical sigmoid? Please say what the H means.
  2. Change the commas in 0,1,2,3 to Chinese enumeration commas (、).
  3. Change the comma to a period before 当类别数 ("when the number of classes...").
  4. "Computing the probability of every class is very time-consuming" (drop 将). Perhaps mention that the time complexity drops from N to log(N)?

Contributor Author

Done

Each non-leaf node of the binary tree is a binary classifier (for example, a sigmoid). If the bit is 0, we take the left child and continue classifying; otherwise we take the right child, until a leaf node is reached. In this way every class corresponds to one path; for example, the path from the root to class 1 is encoded 0, 1. During training we follow the path of the true class and compute the loss of each classifier along it, then combine all the losses into the final loss; see the paper for the theoretical details. During prediction, the model outputs the probability of every non-leaf classifier; from these probabilities we can obtain the path encoding, and walking the path encoding yields the final predicted class. Implementation details are given below.
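The path encoding can be sketched in plain Python. This is a hedged illustration, assuming a heap-style complete binary tree (node `i` has children `2i+1` and `2i+2`, and the class leaves follow the `num_classes - 1` non-leaf nodes):

```python
import math

def path_code(label, num_classes):
    """Return the 0/1 path encoding (0 = left, 1 = right) from the root
    to the leaf of `label` in a complete binary tree stored heap-style."""
    idx = label + (num_classes - 1)       # leaf index of this class
    code = []
    while idx > 0:
        parent = (idx - 1) // 2
        code.append(0 if idx == 2 * parent + 1 else 1)
        idx = parent
    return code[::-1]                     # root-to-leaf order

# With 4 classes, the path from the root to class 1 is encoded 0, 1:
print(path_code(1, 4))        # [0, 1]
# The path length is log2(N), which is why hsigmoid cuts the cost of an
# N-way softmax from O(N) down to O(log N):
print(int(math.log2(4)))      # 2
```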

# Data Preparation
This article uses the Penn Treebank (PTB) dataset (Tomas Mikolov's preprocessed version), which contains three files: train, valid and test. train is used as training data and valid as test data. We train a 5-gram model, in which the first 4 words of each sample predict the 5th. PaddlePaddle provides the python package paddle.dataset.imikolov for the PTB dataset, which downloads and preprocesses the data automatically. Preprocessing wraps every sentence in the dataset with the start symbol \<s> and the end symbol \<e>, then slides a window of the chosen size (5 here) from left to right, emitting one sample per step. For example, "I have a dream that one day" yields \<s> I have a dream, I have a dream that, have a dream that one, a dream that one day, and dream that one day \<e>. PaddlePaddle converts the words into ids as the final input.
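The windowing step described above can be sketched in a few lines of Python (the word-to-id conversion is left out; only the padding and sliding window are shown):

```python
def ngram_windows(sentence, n=5):
    """Pad a sentence with <s> and <e>, then slide a window of size n."""
    words = ['<s>'] + sentence.split() + ['<e>']
    return [words[i:i + n] for i in range(len(words) - n + 1)]

for window in ngram_windows("I have a dream that one day"):
    print(' '.join(window))
# <s> I have a dream
# I have a dream that
# have a dream that one
# a dream that one day
# dream that one day <e>
```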
Contributor

  1. Please add a link for the dataset.
  2. Say 即用每条数据的前4个词来预测第5个词 (i.e., the first 4 words of each sample are used to predict the 5th).
  3. For the python package paddle.dataset.imikolov, please link to its source code.
  4. "Preprocessing adds ... around every sentence of the dataset," (change the period to a comma) "then ..."
  5. In the last sentence, change 最终输入 (final input) to 预处理的输出 (the output of preprocessing).

Collaborator

Let me explain. The data does need describing, but in my opinion we do not need to explain the PTB dataset itself. The key point is: given a user's own input data, how do they use this configuration to retrain the model on that data?

Contributor Author

Done


Each non-leaf node of the binary tree is a binary classifier (for example, a sigmoid). If the bit is 0, we take the left child and continue classifying; otherwise we take the right child, until a leaf node is reached. In this way every class corresponds to one path; for example, the path from the root to class 1 is encoded 0, 1. During training we follow the path of the true class and compute the loss of each classifier along it, then combine all the losses into the final loss; see the paper for the theoretical details. During prediction, the model outputs the probability of every non-leaf classifier; from these probabilities we can obtain the path encoding, and walking the path encoding yields the final predicted class. Implementation details are given below.

# Data Preparation
Contributor

数据准备 (Data Preparation) is preceded by a single #; it should be ##. An article should have only one #-level heading.

Contributor Author

Done


# Data Preparation
This article uses the Penn Treebank (PTB) dataset (Tomas Mikolov's preprocessed version), which contains three files: train, valid and test. train is used as training data and valid as test data. We train a 5-gram model, in which the first 4 words of each sample predict the 5th. PaddlePaddle provides the python package paddle.dataset.imikolov for the PTB dataset, which downloads and preprocesses the data automatically. Preprocessing wraps every sentence in the dataset with the start symbol \<s> and the end symbol \<e>, then slides a window of the chosen size (5 here) from left to right, emitting one sample per step. For example, "I have a dream that one day" yields \<s> I have a dream, I have a dream that, have a dream that one, a dream that one day, and dream that one day \<e>. PaddlePaddle converts the words into ids as the final input.
# Implementation
Contributor

The 编程实现 (Implementation) heading on line 16 can be dropped; start the description directly from the 网络结构 (Network Structure) section on line 17.

Contributor Author

Done

This article uses the Penn Treebank (PTB) dataset (Tomas Mikolov's preprocessed version), which contains three files: train, valid and test. train is used as training data and valid as test data. We train a 5-gram model, in which the first 4 words of each sample predict the 5th. PaddlePaddle provides the python package paddle.dataset.imikolov for the PTB dataset, which downloads and preprocesses the data automatically. Preprocessing wraps every sentence in the dataset with the start symbol \<s> and the end symbol \<e>, then slides a window of the chosen size (5 here) from left to right, emitting one sample per step. For example, "I have a dream that one day" yields \<s> I have a dream, I have a dream that, have a dream that one, a dream that one day, and dream that one day \<e>. PaddlePaddle converts the words into ids as the final input.
# Implementation
## Network Structure
This article obtains word vectors by training an N-gram language model, concretely using the first 4 words to predict the current word. The network input is word ids; the embedding table is queried for the embedding vectors, the 4 words' embedding vectors are concatenated and fed into a fully connected hidden layer, followed by an hsigmoid layer. The detailed network structure is shown in the figure below:
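The dataflow just described can be sketched in plain Python (all sizes, ids and random weights below are illustrative; the real model uses paddle.layer.embedding, paddle.layer.concat, paddle.layer.fc and paddle.layer.hsigmoid):

```python
import math
import random

dict_size, embed_size, hidden_size = 100, 8, 16
random.seed(0)

# Word-vector table: one low-dimensional vector per word id.
embed_table = [[random.uniform(-1, 1) for _ in range(embed_size)]
               for _ in range(dict_size)]
# Weights of the fully connected hidden layer.
W_hidden = [[random.uniform(-1, 1) for _ in range(hidden_size)]
            for _ in range(4 * embed_size)]

word_ids = [3, 17, 42, 7]    # ids of the 4 context words
# Embedding lookup + concatenation of the 4 word vectors.
context = [x for wid in word_ids for x in embed_table[wid]]

# Fully connected hidden layer with tanh activation.
hidden = [math.tanh(sum(context[i] * W_hidden[i][j] for i in range(len(context))))
          for j in range(hidden_size)]
print(len(context), len(hidden))  # 32 16 -- `hidden` feeds the hsigmoid layer
```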
Contributor

  1. "The detailed network structure is shown in Figure 2" (详细网络结构见图2).
  2. embedding can be changed to 词向量 (word vector).
  3. hsigmoid was written Hsigmoid earlier; please keep it consistent.

Contributor Author

Done


Note that at prediction time we need to transpose the hsigmoid parameters once; the number of output classes here is the dictionary size minus 1, which equals the number of non-leaf nodes.

## Prediction Stage
Contributor

Change 预测阶段 (Prediction Stage) to 如何预测 (How to Predict).

Contributor Author

Done

pkuyym (Contributor Author) commented May 11, 2017

@luotao1 @lcy-seso Thanks for the review. I have made the fixes according to the review comments and added a section describing user-defined data.

lcy-seso (Collaborator) left a comment

This PR is almost done; only a small modification remains.

return predict_lbls
```

The function's input is the predicted probabilities for a batch of samples together with the vocabulary size. The loop inside decodes each sample's output probabilities: following the left-0 / right-1 rule, it keeps walking the path until it reaches a leaf node. Note that the dataset used here needs a long training time to produce good results; the prediction program uses the first-pass model only for convenience of demonstration, so quality is not guaranteed.
Collaborator

Please still add one or two sentences explaining what the raw data looks like. For example: assume each line of the raw data is one sentence with words separated by spaces, then show one sample line. Something along those lines.

@pkuyym pkuyym force-pushed the develop branch 4 times, most recently from ddd86f5 to 070b34c Compare May 11, 2017 07:31
pkuyym (Contributor Author) commented May 11, 2017

@lcy-seso The changes are done; please review. Thanks.

lcy-seso (Collaborator) left a comment

Some small modifications.

return predict_lbls
```

The prediction program's input data format is the same as in training, e.g. have a dream that one. The program generates a set of probabilities from have a dream that, decodes them into a predicted word, and uses one as the ground-truth word for easy evaluation. The decode function's input is the predicted probabilities of a batch of samples together with the vocabulary size; the loop inside decodes each sample's output probabilities by following the left-0 / right-1 rule, walking the path until a leaf node is reached. Note that the dataset used here needs a long training time to produce good results; the prediction program uses the first-pass model only for convenience of demonstration, so quality is not guaranteed.
Collaborator

Change 不保证效果 to 学习效果不能保证 ("the learning quality cannot be guaranteed").

Contributor Author

done

return reader


def train(word_idx, n):
Collaborator

train --> train_data

Contributor Author

done

:param n: sliding window size
:type n: int
"""
return reader_creator('./data/ptb.train.txt', word_idx, n)
Collaborator

Please turn the hard-coded "./data/ptb.train.txt" into a parameter; we do not ship this file, so such hard-coding should be minimized.

Contributor Author

done


### Custom Data
Users can train the model on their own dataset. The key to using a custom dataset is implementing the reader interface for data processing; a wrapper example follows:

Collaborator

Describe in plain language what this function does.

Contributor Author

done

lcy-seso (Collaborator) left a comment

A small modification.

This article uses the Penn Treebank (PTB) dataset ([Tomas Mikolov's preprocessed version](http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz)), which contains three files: train, valid and test. train is used as training data and valid as test data. We train a 5-gram model, i.e. the first 4 words of each sample are used to predict the 5th. PaddlePaddle provides the python package [paddle.dataset.imikolov](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/imikolov.py) for the PTB dataset, which downloads and preprocesses the data automatically. Preprocessing wraps every sentence in the dataset with the start symbol \<s> and the end symbol \<e>, then slides a window of the chosen size (5 here) from left to right, emitting one sample per step. For example, "I have a dream that one day" yields \<s> I have a dream, I have a dream that, have a dream that one, a dream that one day, and dream that one day \<e>. PaddlePaddle converts the words into ids as the output of preprocessing.

### Custom Data
Users can train the model on their own dataset. The key is implementing the reader interface: the reader must produce an iterator that parses each line of the file and converts it into the program's input format; a wrapper example follows:
Collaborator

Please replace 然后转换成程序的输入格式 ("then converts it into the program's input format") with a clear, precise description; as written it carries no information.

lcy-seso (Collaborator) left a comment

Still needs modifications.

This article uses the Penn Treebank (PTB) dataset ([Tomas Mikolov's preprocessed version](http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz)), which contains three files: train, valid and test. train is used as training data and valid as test data. We train a 5-gram model, i.e. the first 4 words of each sample are used to predict the 5th. PaddlePaddle provides the python package [paddle.dataset.imikolov](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/imikolov.py) for the PTB dataset, which downloads and preprocesses the data automatically. Preprocessing wraps every sentence in the dataset with the start symbol \<s> and the end symbol \<e>, then slides a window of the chosen size (5 here) from left to right, emitting one sample per step. For example, "I have a dream that one day" yields \<s> I have a dream, I have a dream that, have a dream that one, a dream that one day, and dream that one day \<e>. PaddlePaddle converts the words into ids as the output of preprocessing.

### Custom Data
Users can train the model on their own dataset. The key is implementing the reader interface: the reader must produce an iterator that parses each line of the file and returns a paddle list, e.g. [1, 2, 3, 4, 5], the ids of the first through fourth words; paddle then converts this list into the `paddle.data_type.inter_value` type as the data layer's input. A wrapper example follows:
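The reader interface described above can be sketched in plain Python. This is a hedged illustration, not the PR's exact code: file reading is replaced by an in-memory list of lines, and the dictionary is hypothetical:

```python
def reader_creator(lines, word_dict, n):
    """Return a no-argument callable whose iterator yields one python
    list of word ids per n-gram window ('lines' stands in for a file)."""
    def reader():
        for line in lines:
            words = ['<s>'] + line.strip().split() + ['<e>']
            ids = [word_dict.get(w, word_dict['<unk>']) for w in words]
            for i in range(len(ids) - n + 1):
                yield ids[i:i + n]    # e.g. [1, 2, 3, 4, 5]
    return reader

word_dict = {'<s>': 0, '<e>': 1, '<unk>': 2,
             'I': 3, 'have': 4, 'a': 5, 'dream': 6}
reader = reader_creator(["I have a dream"], word_dict, n=5)
print(list(reader()))  # [[0, 3, 4, 5, 6], [3, 4, 5, 6, 1]]
```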
Collaborator

  1. 返回一个paddle list --> 返回一个 python list (return a python list)
  2. 分别是第一个到第四个词的id --> 分别是第一个到第四个词在字典中的id (the ids of the first through fourth words in the dictionary)
  3. paddle --> PaddlePaddle

Contributor Author

Done

lcy-seso (Collaborator) left a comment

The last comments.

</p>

Each non-leaf node of the binary tree is a binary classifier (for example, a sigmoid). If the bit is 0, we take the left child and continue classifying; otherwise we take the right child, until a leaf node is reached. In this way every class corresponds to one path; for example, the path from the root to class 1 is encoded 0, 1. During training we follow the path of the true class and compute the loss of each classifier along it, then combine all the losses into the final loss; see the paper for the theoretical details. During prediction, the model outputs the probability of every non-leaf classifier; from these probabilities we can obtain the path encoding, and walking the path encoding yields the final predicted class. Implementation details are given below.
Each non-leaf node of the binary tree is a binary classifier (sigmoid). If the bit is 0, we take the left child and continue classifying; otherwise we take the right child, until a leaf node is reached. In this way every class corresponds to one path; for example, the path from the root to class 1 is encoded 0、1. During training we follow the path of the true class and compute the loss of each classifier along it, then combine all the losses into the final loss. During prediction, the model outputs the probability of every non-leaf classifier; from these probabilities we can obtain the path encoding, and walking the path encoding yields the final predicted class. The traditional softmax has computational complexity N (N is the dictionary size); Hsigmoid reduces this to log(N). See the paper \[[1](#参考文献)\] for the theoretical details.
Collaborator

二叉树中每个非叶子节点是一个二类别分类器(sigmoid) --> 二叉树中每个非叶子节点是一个二类别分类器(以sigmoid为激活) (a binary classifier with sigmoid as its activation)


## Data Preparation
### PTB Data
This article uses the Penn Treebank (PTB) dataset ([Tomas Mikolov's preprocessed version](http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz)), which contains three files: train, valid and test. train is used as training data and valid as test data. We train a 5-gram model, i.e. the first 4 words of each sample are used to predict the 5th. PaddlePaddle provides the python package [paddle.dataset.imikolov](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/imikolov.py) for the PTB dataset, which downloads and preprocesses the data automatically. Preprocessing wraps every sentence in the dataset with the start symbol \<s> and the end symbol \<e>, then slides a window of the chosen size (5 here) from left to right, emitting one sample per step. For example, "I have a dream that one day" yields \<s> I have a dream, I have a dream that, have a dream that one, a dream that one day, and dream that one day \<e>. PaddlePaddle converts the words into ids as the output of preprocessing.
Collaborator

"PaddlePaddle converts the words into ids as the output of preprocessing" --> "PaddlePaddle converts each word into its id in the dictionary as the output of preprocessing".

## Network Structure
This article obtains word vectors by training an N-gram language model, concretely using the first 4 words to predict the current word. The network input is word ids; the embedding table is queried for the embedding vectors, the 4 words' embedding vectors are concatenated and fed into a fully connected hidden layer, followed by an hsigmoid layer. The detailed network structure is shown in the figure below
This article obtains word vectors by training an N-gram language model, concretely using the first 4 words to predict the current word. The network input is the words' ids; the word-vector table is queried for the word vectors, the 4 words' vectors are concatenated and fed into a fully connected hidden layer, followed by an Hsigmoid layer. The detailed network structure is shown in Figure 2
Collaborator

词的id (the word's id) --> 词在字典中的id (the word's id in the dictionary)

Contributor Author

done

lcy-seso (Collaborator) left a comment

The example looks good to me. Thanks for the work.

@luotao1 luotao1 merged commit f5724ef into PaddlePaddle:develop May 11, 2017
frankwhzhang pushed a commit that referenced this pull request Mar 7, 2019

Successfully merging this pull request may close these issues.

Add basic network configuration and training script for word embedding task
3 participants