
sentiment analysis translation (#270) #444

Closed
wants to merge 23 commits into from

Conversation

westeast (Contributor):

Translated the sentiment analysis section into Chinese.

@coveralls

Coverage Status

Coverage decreased (-0.004%) to 62.901% when pulling 476e94f on westeast:master into ef5e483 on baidu:develop.

@qingqing01 (Contributor) left a comment:

Also, the link after the sentiment analysis entry in Paddle/doc_cn/demo/index.rst needs to be updated.


### IMDB Data Preparation

In this example, we use only the labeled training and test sets, and by default build the dictionary on the test set rather than using imdb.vocab from the IMDB dataset as the dictionary. The training set has been randomly shuffled, while the test set has not. The script `tokenizer.perl` from the Moses toolkit is used to tokenize words and punctuation. Run the following command to preprocess the data.
Contributor:

This sentence is mistranslated: "默认在测试集上构建字典" ("build the dictionary on the test set by default") -> "默认在训练集上构建字典" ("build the dictionary on the training set by default").

```
dict.txt labels.list test.list test_part_000 train.list train_part_000
```
* test\_part\_000 and train\_part\_000: all the labeled training and test set data. The training set has been randomly shuffled.
Contributor:

Swap the order of "训练集和测试集" ("training and test sets") so it matches the test_xx and train_xx order above.
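For illustration, the dictionary-building step that produces a `dict.txt`-style vocabulary from the tokenized reviews can be sketched as follows. This is a minimal Python sketch, not the tutorial's actual preprocessing script; the input lines and the frequency cutoff are assumptions.

```python
from collections import Counter

def build_dict(tokenized_lines, cutoff=1):
    """Build a (word, frequency) list from tokenized text lines.

    Each line is assumed to be whitespace-separated tokens, as produced
    by a tokenizer such as Moses' tokenizer.perl.
    """
    counter = Counter()
    for line in tokenized_lines:
        counter.update(line.split())
    # Keep words at or above the frequency cutoff, most frequent first.
    return [(w, c) for w, c in counter.most_common() if c >= cutoff]

lines = ["this movie is great", "this movie is terrible"]
vocab = build_dict(lines)
```

Writing one `word<TAB>count` pair per line of `vocab` would then yield a dictionary file in the spirit of `dict.txt`.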


## Training the Model

In this task, we use the LSTM architecture of recurrent neural networks (RNNs) to train a sentiment analysis model. LSTM was introduced mainly to overcome the vanishing gradient problem. An LSTM network is similar to a standard recurrent neural network with a hidden layer, but each ordinary node in the hidden layer is replaced by a memory cell. Each memory cell contains four main elements: an input gate, a neuron with a self-recurrent connection, a forget gate, and an output gate. More details can be found in the literature [4]. The biggest advantage of the LSTM architecture is that it can memorize information over long time intervals without the loss of short-term memory. At each time step when a new word arrives, the historical information stored in the memory cell block is updated, so that the model iteratively learns a reasonable sequence representation of the words.
Contributor:

"标循环现神经网络" is a typo; it should be "标准循环神经网络" ("standard recurrent neural network").

<center>![LSTM](../../../doc/demo/sentiment_analysis/lstm.png)</center>
<center>Figure 1. LSTM [3]</center>
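The memory-cell mechanics described above (input gate, forget gate, output gate, and a self-recurrent cell state) can be made concrete with a single LSTM time step. This is an illustrative NumPy sketch with randomly initialized weights, not PaddlePaddle's implementation; the stacked parameter layout is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W, U, b hold the stacked parameters for the input, forget, and
    output gates plus the candidate cell update (4 * hidden rows).
    """
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # pre-activations, shape (4*hidden,)
    i = sigmoid(z[0:hidden])            # input gate
    f = sigmoid(z[hidden:2 * hidden])   # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate update
    c = f * c_prev + i * g              # new cell state (long-term memory)
    h = o * np.tanh(c)                  # new hidden state (output)
    return h, c

rng = np.random.default_rng(0)
hidden, inputs = 3, 2
W = rng.normal(size=(4 * hidden, inputs))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = lstm_step(rng.normal(size=inputs), np.zeros(hidden), np.zeros(hidden), W, U, b)
```

The additive update `c = f * c_prev + i * g` is what lets gradients flow over long time intervals, which is the vanishing-gradient remedy the paragraph refers to.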

Sentiment analysis is one of the most typical problems in natural language understanding. It aims to predict the sentiment attitude expressed in a sequence. Usually, only a few keywords, such as adjectives and adverbs, play the main role in predicting the sentiment of a sequence or paragraph. However, some review contexts are very long, such as those in the IMDB dataset. We use LSTM for this task because of its improved design with gating mechanisms. First, it can summarize representations from the character level to the context level with variable context length (adapted via gate values). Second, it can exploit extensible context at the sentence level, while most other methods only use n-gram-level knowledge. Third, it learns the paragraph representation directly rather than composing context-level information.
Contributor:

  1. "word level" should be translated as "词级别"; "首先,它能够从字级" ("First, it can ... from the character level") needs to change here.
  2. "(其通过门值来适配)" ("adapted via gate values") is an inaccurate translation; the parenthetical can be dropped.


#### Bidirectional LSTM

Figure 2 shows the bidirectional LSTM network, composed of a fully connected layer and a softmax layer.
Contributor:

"由全连接层和softmax层组成" ("composed of a fully connected layer and a softmax layer") -> "后面连全连接层和softmax层" ("followed by a fully connected layer and a softmax layer").

Contributor (Author):

Changed to "后面连全连接层和softmax层" ("followed by a fully connected layer and a softmax layer").

- CurrentCost=xx: the current cost over the latest log_period batches.
- Eval: classification\_error\_evaluator=xx: the classification error from batch 0 to the current batch.
- CurrentEval: classification\_error\_evaluator: the classification error over the latest log_period batches.
- Pass=0: going through the whole training set once is called a pass; 0 means the first pass through the training set.
Contributor:

Change every "批" above to "batch". Write log_period as-is; don't translate it as "日志周期".
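To make the field list above concrete, a log line carrying these key=value fields could be parsed like this. The line shown is a hypothetical example in the spirit of the fields above, not PaddlePaddle's exact trainer output format.

```python
import re

def parse_train_log(line):
    """Extract numeric key=value fields (e.g. CurrentCost=0.52,
    classification_error_evaluator=0.31) from one trainer log line.
    """
    return {k: float(v) for k, v in re.findall(r"(\w+)=([\d.]+)", line)}

line = ("Pass=0 Batch=100 CurrentCost=0.52 "
        "Eval: classification_error_evaluator=0.31")
fields = parse_train_log(line)
```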

- Pass=0: going through the whole training set once is called a pass; 0 means the first pass through the training set.

By default, we use the `stacked_lstm_net` network, which converges faster than `bidirectional_lstm_net` when the same number of samples are passed. If you want to use bidirectional LSTM, If you want to use bidirectional LSTM, just remove the comment in the last line and comment out `stacked_lstm_net`.

Contributor:

Remove the stray "If you want to use bidirectional LSTM,".

```
2>&1 | tee 'test.log'
```

The function `get_best_pass` obtains the best model for testing by computing the classification error rate. In this example, we use the IMDB test dataset as validation by default. Unlike training, here you need to specify `--job=test` and the model path, i.e. `--model_list=$model_list`. If it runs successfully, the log will be saved to `demo/sentiment/test.log`. For example, in our test the best model is `model_output/pass-00002`, with a classification error of 0.115645, as follows:
Contributor:

"通过计算分类错误率" ("by computing the classification error rate") -> "依据分类错误率" ("according to the classification error rate").
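The selection logic of `get_best_pass` can be sketched as follows. This is a hedged reimplementation, not the tutorial's actual shell function: it scans log lines pairing a model path with a classification error and keeps the pass with the lowest error. The line format shown is an assumption.

```python
def get_best_pass(log_lines):
    """Return (model_path, error) for the lowest classification error.

    Each line is assumed to look like:
        "<model_path> classification error: <float>"
    (an illustrative format, not the exact test.log layout).
    """
    best = None
    for line in log_lines:
        path, err = line.rsplit(":", 1)
        path = path.split()[0]
        err = float(err)
        if best is None or err < best[1]:
            best = (path, err)
    return best

logs = [
    "model_output/pass-00000 classification error: 0.201470",
    "model_output/pass-00001 classification error: 0.136030",
    "model_output/pass-00002 classification error: 0.115645",
]
best_model, best_err = get_best_pass(logs)
```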

* -d data/pre-imdb/dict.txt: set the dictionary file.
* -i data/aclImdb/test/pos/10014_7.txt: set one example file to predict.

Note that you should make sure the default model path `model_output/pass-00002` exists, or change it to another model path.
Contributor:

Remove "你" ("you").

:glob:

Training Locally <sentiment_analysis.md>
internal/cluster_train.md
Contributor:

Remove internal/cluster_train.md.

@coveralls

Coverage Status

Coverage increased (+0.04%) to 62.945% when pulling 56e31f7 on westeast:master into ef5e483 on baidu:develop.

On the other hand, crawling users' reviews of products and analyzing their sentiment helps us understand users' preferences for different companies, different products, and even competitors' products.

This tutorial will guide you through training a Long Short-Term Memory (LSTM) network to classify the sentiment of sentences from the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) (sometimes known as the [Internet Movie Database (IMDB)](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf)). This dataset contains movie reviews along with their associated binary sentiment polarity labels, namely positive and negative.

Contributor:

"相关联的二进制情绪极性标签" ("associated binary sentiment polarity labels") -> "类别标签" ("class labels").

```

* **Data definition**:
* get\_config\_arg(): get the command-line arguments set via `--config_args=xx` i.
Contributor:

Remove the stray "i" before "设置".
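The `--config_args=xx` flag passes comma-separated key=value pairs into the network config. A minimal sketch of such a lookup follows; `parse_config_args` is a hypothetical helper for illustration, not Paddle's actual `get_config_arg` implementation.

```python
def parse_config_args(config_args, name, default=None):
    """Look up `name` in a comma-separated "k1=v1,k2=v2" string,
    mimicking how a --config_args=... flag might be consumed.

    Returns `default` when the key is absent or the string is empty.
    """
    if not config_args:
        return default
    pairs = dict(item.split("=", 1) for item in config_args.split(","))
    return pairs.get(name, default)

batch = parse_config_args("batch_size=128,is_test=1", "batch_size", "64")
```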

* -n $config: set the network configuration.
* -w $model: set the model path.
* -b $label: set the label-class dictionary, which maps integer labels to string labels.
* -d data/pre-imdb/dict.txt: set the field file.
Contributor:

"字段" ("field") -> "字典" ("dictionary").

@coveralls

Coverage Status

Coverage increased (+0.03%) to 62.933% when pulling 3fbdd49 on westeast:master into ef5e483 on baidu:develop.


@coveralls

Coverage Status

Coverage increased (+0.07%) to 62.974% when pulling 36d60e3 on westeast:master into ef5e483 on baidu:develop.

@coveralls

Coverage Status

Coverage increased (+0.06%) to 62.965% when pulling 7984320 on westeast:master into ef5e483 on baidu:develop.

Revert "sentiment  analysis translation detail fix"

This reverts commit 0f00419.

sentiment  analysis translation  fix errors

sentiment  analysis translation  fix errors

sentiment  analysis translation  fix errors

sentiment analysis translation fix errors

sentiment  analysis doc_cn update for qingqing01 review

sentiment  analysis translation  fix index link

sentiment analysis doc_cn update for qingqing01 review

sentiment  analysis translation  update link

sentiment  analysis translation  update link
@coveralls

Coverage Status

Coverage increased (+0.06%) to 62.97% when pulling 20185fd on westeast:master into ef5e483 on baidu:develop.

@westeast westeast closed this Nov 15, 2016