使用 Python 进行自动化特征工程 (Automated Feature Engineering with Python) #4262

Merged
merged 11 commits into from
Aug 10, 2018

@SimonMing47 commented Aug 7, 2018

Translation complete; resolves #3990.

@SimonMing47 commented Aug 7, 2018

@leviding @fanyijihua The translation is complete.

@leviding added the 后端 (backend) label Aug 7, 2018
@leviding changed the title from "Translate automated feature engineering in python" to "Python 中的自动特征工程" Aug 7, 2018
@leviding changed the title from "Python 中的自动特征工程" to "使用 Python 进行自动化特征工程" Aug 7, 2018
@yqian1991

Claiming this for proofreading.

@fanyijihua

@yqian1991 Sounds good 🍺

@yqian1991 left a comment

@mingxing47 @leviding Proofreading is done. The translation reads smoothly; there are only a few formatting issues.

> * 校对者:

# Automated Feature Engineering in Python
# Python 中的自动特征工程
Contributor

The translation is fine; just a personal suggestion: “Python 中的特征工程自动化”.


![](https://cdn-images-1.medium.com/max/1000/1*lg3OxWVYDsJFN-snBY7M5w.jpeg)

Machine learning is increasingly moving from hand-designed models to automatically optimized pipelines using tools such as [H20](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html), [TPOT](https://epistasislab.github.io/tpot/), and [auto-sklearn](https://automl.github.io/auto-sklearn/stable/). These libraries, along with methods such as [random search](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf), aim to simplify the model selection and tuning parts of machine learning by finding the best model for a dataset with little to no manual intervention. However, feature engineering, an [arguably more valuable aspect](https://www.featurelabs.com/blog/secret-to-data-science-success/) of the machine learning pipeline, remains almost entirely a human labor.
机器学习正在利用诸如 [H20](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html)[TPOT](https://epistasislab.github.io/tpot/) [auto-sklearn](https://automl.github.io/auto-sklearn/stable/) 等工具越来越多地从手工设计模型向自动化优化管道迁移。以上这些类库,连同如 [random search](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) 等方法一起,目的都是在通过找到适合于几乎不需要人工干预的数据集的最佳模型来简化机器学习的模型选择和调优部分。然而,特征工程,作为机器学习管道中一个[可以说是更有价值的方面](https://www.featurelabs.com/blog/secret-to-data-science-success/),几乎全部是手工活。
Contributor

“目的都是通过找到适合于几乎不需要人工干预的数据集的最佳模型来简化机器学习的模型选择和调优部分”
=>
"目的是在不需要人工干预的情况下找到适合于数据集的最佳模型,以此来简化器学习的模型选择和调优部分"

Splitting this into shorter clauses might make it easier to follow.


[Feature engineering](https://en.wikipedia.org/wiki/Feature_engineering), also known as feature creation, is the process of constructing new features from existing data to train a machine learning model. This step can be more important than the actual model used because a machine learning algorithm only learns from the data we give it, and creating features that are relevant to a task is absolutely crucial (see the excellent paper [“A Few Useful Things to Know about Machine Learning”](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)).
[特征工程](https://en.wikipedia.org/wiki/Feature_engineering),也成为特征创建,是从已有数据中创建出新特征并且用于训练机器学习模型的过程。这个步骤可能要比实际使用的模型更加重要,因为机器学习算法仅仅从我们提供给他的数据中进行学习,创建出与任务相关的特征是非常关键的(可以参照这篇文章 ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) —— 《了解机器学习的一些有用的事》,译者注)。
Contributor

”成为“ => "称为"


[Feature engineering](https://www.datacamp.com/community/tutorials/feature-engineering-kaggle) means building additional features out of existing data which is often spread across multiple related tables. Feature engineering requires extracting the relevant information from the data and getting it into a single table which can then be used to train a machine learning model.
[特征工程](https://www.datacamp.com/community/tutorials/feature-engineering-kaggle)意味着从分布在多个相关表格中的现有数据集中构建出额外的特性。特征工程需要从数据中提取相关信息,并且将其放入一个单独的表中,然后可以用来训练机器学习模型。
Contributor

“特性” => "特征"


The process of constructing features is very time-consuming because each new feature usually requires several steps to build, especially when using information from more than one table. We can group the operations of feature creation into two categories: **transformations** and **aggregations**. Let’s look at a few examples to see these concepts in action.
构建特征的过程非常耗时,因为每获取一项新的特征都需要很多步骤才能构建出来,尤其是当需要从多于一张表格中获取信息时。我们可以把特征创建的操作分成两类:**转换**和**聚集**。让我们通过几个例子的实战来看看这些概念。
Contributor

聚集 => 聚合
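
For readers of this thread, the two categories named in the paragraph above can be illustrated with a minimal plain-Python sketch. The `clients` and `loans` records, their field names, and all values are invented for illustration and are not taken from the article's dataset. A transformation derives a new value from the columns of a single table, while an aggregation summarizes the many child rows that belong to one parent.

```python
from datetime import date
from statistics import mean

# Hypothetical toy data, invented for illustration only.
clients = [
    {"client_id": 1, "joined": date(2015, 3, 14)},
    {"client_id": 2, "joined": date(2017, 11, 2)},
]
loans = [
    {"client_id": 1, "loan_amount": 1000},
    {"client_id": 1, "loan_amount": 3000},
    {"client_id": 2, "loan_amount": 5000},
]

# Transformation: uses only columns of one table, one row at a time.
for client in clients:
    client["join_month"] = client["joined"].month

# Aggregation: summarizes the child rows (loans) of each parent (client).
for client in clients:
    amounts = [
        loan["loan_amount"]
        for loan in loans
        if loan["client_id"] == client["client_id"]
    ]
    client["mean_loan_amount"] = mean(amounts) if amounts else None
```

The transformation never looks outside the `clients` table, whereas the aggregation has to cross the one-to-many link to `loans`.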


For more information on featuretools, including advanced usage, check out the [online documentation](https://docs.featuretools.com/). To see how featuretools is used in practice, read about the [work of Feature Labs](https://www.featurelabs.com/), the company behind the open-source library.
要获取更多关于特征工具的信息,包括这些工具的高级用法,可以查阅[在线文档](https://docs.featuretools.com/)。要查看特征工具如何在实践中应用,可以参见 [Feature Labs 的工作成果](https://www.featurelabs.com/),这是一个开源库背后的公司。
Contributor

这是一个开源库背后的公司
=> 这就是开发 featuretools 这个开源库的公司


This process involves grouping the loans table by the client, calculating the aggregations, and then merging the resulting data into the client data. Here’s how we would do that in Python using the [language of Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html).
这个过程包括了根据客户进行贷款表格分组、计算聚合、然后把计算结果数据合并到客户数据中。如下代码展示了我们如果使用 Python 中的 [language of Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)库进行计算的过程:
Contributor

Python 中的 language of Pandas库进行计算的过程
=>
Python 中的 language of Pandas 库进行计算的过程
Add a space before “库”.
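
The three steps the quoted paragraph describes (group the loans by client, calculate the aggregations, merge the result into the client data) can be sketched as follows. This is a plain-Python stand-in using lists of dicts rather than the article's own Pandas code; in Pandas the steps would typically map to `groupby`, `agg`, and `merge`. The table contents and the output column names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy tables as lists of dicts (stand-ins for DataFrames).
clients = [{"client_id": 1}, {"client_id": 2}, {"client_id": 3}]
loans = [
    {"client_id": 1, "loan_amount": 1000},
    {"client_id": 1, "loan_amount": 3000},
    {"client_id": 2, "loan_amount": 5000},
]

# Step 1: group the loan amounts by client.
grouped = defaultdict(list)
for loan in loans:
    grouped[loan["client_id"]].append(loan["loan_amount"])

# Step 2: calculate the aggregations for each group.
stats = {
    client_id: {
        "mean_loan_amount": sum(amounts) / len(amounts),
        "max_loan_amount": max(amounts),
        "min_loan_amount": min(amounts),
    }
    for client_id, amounts in grouped.items()
}

# Step 3: left-merge the statistics into the client rows.
# Clients without any loans keep their row and get None for each statistic.
empty = {"mean_loan_amount": None, "max_loan_amount": None, "min_loan_amount": None}
merged = [{**client, **stats.get(client["client_id"], empty)} for client in clients]
```

Keeping clients that have no loans (a left merge) mirrors the usual goal of ending up with exactly one feature row per client.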


The best way to think of a **relationship** between two tables is the [analogy of parent to child](https://stackoverflow.com/questions/7880921/what-is-a-parent-table-and-a-child-table-in-database). This is a one-to-many relationship: each parent can have multiple children. In the realm of tables, a parent table has one row for every parent, but the child table may have multiple rows corresponding to multiple children of the same parent.
考虑两个表之间的**关系**的最佳方式是[父亲与孩子的类比](https://stackoverflow.com/questions/7880921/what-is- par-table -and- child-table-in-database)。这是一对多的关系:每个父亲可以有多个孩子。在表领域中,父亲在每个父表中都有一行,但是子表中可能有多个行对应于同一个父亲的多个孩子。
Contributor

[父亲与孩子的类比](https://stackoverflow.com/questions/7880921/what-is- par-table -and- child-table-in-database)
=>
父亲与孩子的类比

The formatting here is broken; please take a look.
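
The parent-to-child analogy in the quoted paragraph can be checked mechanically. The sketch below is a plain-Python illustration, not featuretools code; the helper name `is_one_to_many` and the toy rows are hypothetical. It treats a link as one-to-many when the key is unique in the parent table and every child row points at an existing parent.

```python
from collections import Counter

# Hypothetical parent (clients) and child (loans) tables.
clients = [{"client_id": 1}, {"client_id": 2}]
loans = [
    {"loan_id": 10, "client_id": 1},
    {"loan_id": 11, "client_id": 1},
    {"loan_id": 12, "client_id": 2},
]


def is_one_to_many(parent_rows, child_rows, key):
    """Return True if `key` is unique in the parent and known for every child."""
    parent_keys = [row[key] for row in parent_rows]
    key_set = set(parent_keys)
    unique_in_parent = len(parent_keys) == len(key_set)
    children_have_parent = all(row[key] in key_set for row in child_rows)
    return unique_in_parent and children_have_parent


# Number of child rows per parent: one parent row, possibly many children.
children_per_parent = Counter(row["client_id"] for row in loans)
```

Swapping the arguments fails the check, because `client_id` repeats in `loans` and therefore cannot serve as the parent key.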


To [formalize a relationship in featuretools](https://docs.featuretools.com/loading_data/using_entitysets.html#adding-a-relationship), we only need to specify the variable that links two tables together. The `clients` and the `loans` table are linked via the `client_id` variable and `loans` and `payments` are linked with the `loan_id`. The syntax for creating a relationship and adding it to the entityset is shown below:
要[在 featuretools 中格式化关系](https://docs.featuretools.com/loading_data/using_entitysets.html#add -a-relationship),我们只需指定将两个表链接在一起的变量。 `clients` `loans` 表通过 `loan_id` 变量链接, `loans` `payments` 通过 `loan_id` 联系在一起。创建关系并将其添加到实体集的语法如下所示:
Contributor

[在 featuretools 中格式化关系](https://docs.featuretools.com/loading_data/using_entitysets.html#add -a-relationship)
=>
在 featuretools 中格式化关系

The formatting is broken.


New features are created in featuretools using these primitives either by themselves or stacking multiple primitives. Below is a list of some of the feature primitives in featuretools (we can also [define custom primitives](https://docs.featuretools.com/guides/advanced_custom_primitives.html)):
新特性是在 featruetools 中创建的,使用这些特征基元本身或叠加多个特征基元。下面是 featuretools 中的一些特征基元列表(我们还可以[定义自定义特征基元](https://docs.featuretools.com/guides/advanced_custom_basics .html)
Contributor

[定义自定义特征基元](https://docs.featuretools.com/guides/advanced_custom_basics .html)
=>
定义自定义特征基元

The formatting is broken. These formatting problems are probably caused by your changes to the link syntax.
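
What "stacking multiple primitives" means in the quoted paragraph can be sketched in plain Python. This is not the featuretools API: the `aggregate` helper, the table rows, and the column names are hypothetical, and the feature name in the comments only imitates the style of names that stacked features tend to get. The idea is to apply one aggregation primitive across a child table, then apply a second primitive to that result one level up.

```python
from statistics import mean

# Hypothetical toy tables: each client has loans, each loan has payments.
loans = [
    {"loan_id": 10, "client_id": 1},
    {"loan_id": 11, "client_id": 1},
    {"loan_id": 12, "client_id": 2},
]
payments = [
    {"loan_id": 10, "payment_amount": 100},
    {"loan_id": 10, "payment_amount": 200},
    {"loan_id": 11, "payment_amount": 500},
    {"loan_id": 12, "payment_amount": 50},
]


def aggregate(child_rows, key, value, primitive):
    """Apply an aggregation primitive to `value`, grouped by `key`."""
    groups = {}
    for row in child_rows:
        groups.setdefault(row[key], []).append(row[value])
    return {group_key: primitive(values) for group_key, values in groups.items()}


# Depth 1: SUM(payments.payment_amount) for each loan.
total_paid = aggregate(payments, "loan_id", "payment_amount", sum)

# Depth 2: stack MEAN on top of SUM, one value per client,
# i.e. MEAN(loans.SUM(payments.payment_amount)).
loans_with_total = [
    {**loan, "total_paid": total_paid.get(loan["loan_id"], 0)} for loan in loans
]
mean_total_paid = aggregate(loans_with_total, "client_id", "total_paid", mean)
```

Passing the primitive in as a plain function (`sum`, `mean`) is what makes stacking cheap here: the same helper is reused at each level of the table hierarchy.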

@leviding commented Aug 8, 2018

@mingxing47 You can go ahead and make the changes now.

@SimonMing47

@yqian1991 Thanks for proofreading.

The revisions are done.

@SimonMing47

@fanyijihua @yqian1991 @leviding The proofreading revisions are done.

@ghost commented Aug 9, 2018

Claiming this for proofreading.

@fanyijihua

@park-ma All set 🍻

@ghost left a comment

Suggestion: add code highlighting.
Also, the translation quality is very high, and the previous proofreader did careful work; there is almost nothing left to change.


Before we can quite get to deep feature synthesis, we need to understand [feature primitives](https://docs.featuretools.com/automated_feature_engineering/primitives.html). We already know what these are, but we have just been calling them by different names! These are simply the basic operations that we use to form new features:
在深入了解特性合成之前,我们需要了解[特征基元](https://docs.featuretools.com/automated_feature_engineering/primartives.html)。我们已经知道它们是什么了,但是我们只是用不同的名字称呼它们!这些是我们用来形成新特征的基本操作:

Use a full-width colon “:”.

@ghost commented Aug 9, 2018

@mingxing47 @leviding Proofreading is done.

@leviding added the enhancement 等待译者修改 (awaiting translator revisions) label Aug 9, 2018
fix bug
Add Python code highlighting
@SimonMing47

Thanks to @park-ma for proofreading.
@leviding The proofreading revisions are done, and Python code highlighting has been added.

@leviding added the 标注 待管理员 Review (awaiting admin review) label Aug 9, 2018
@leviding removed the enhancement 等待译者修改 (awaiting translator revisions) label Aug 9, 2018
@@ -2,109 +2,109 @@
> * 原文作者:[William Koehrsen](https://towardsdatascience.com/@williamkoehrsen?source=post_header_lockup)
> * 译文出自:[掘金翻译计划](https://github.com/xitu/gold-miner)
> * 本文永久链接:[https://github.com/xitu/gold-miner/blob/master/TODO1/automated-feature-engineering-in-python.md](https://github.com/xitu/gold-miner/blob/master/TODO1/automated-feature-engineering-in-python.md)
> * 译者:
> * 译者:[mingxing47](https://github.com/mingxing47)
> * 校对者:
Member

The proofreader information needs to be filled in.

Contributor Author

The proofreader information has been added.

@leviding added AI and removed 后端 (backend) labels Aug 10, 2018
@leviding left a comment

Awesome!

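@leviding merged commit 08ae3a8 into xitu:master Aug 10, 2018

@leviding

@mingxing47 It's merged~ Please publish it to Juejin soon and send me the link so I can add your points promptly.

The Juejin Translation Project (掘金翻译计划) has its own Zhihu column, and you are welcome to submit there too; I recommend using a handy plugin.
Column: https://zhuanlan.zhihu.com/juejinfanyi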

@leviding added 翻译完成 (translation complete) and removed 标注 待管理员 Review (awaiting admin review) labels Aug 10, 2018
@leviding

@mingxing47 When you publish, please put it in the 人工智能 (Artificial Intelligence) category.

@SimonMing47

@leviding Hi, it has been published on Juejin: https://juejin.im/post/5b6ea0e4e51d4519044adff0
Zhihu column: https://zhuanlan.zhihu.com/p/41809504

Successfully merging this pull request may close these issues.

使用 Python 进行自动化特征工程
4 participants