使用 Python 进行自动化特征工程 (Automated Feature Engineering in Python) #4262
@leviding @fanyijihua Translation complete.
Claiming the proofreading.
@yqian1991 Sounds good 🍺
@mingxing47 @leviding Proofreading complete. The translation reads smoothly; there are just a few formatting issues.
> * 校对者: | ||
|
||
# Automated Feature Engineering in Python | ||
# Python 中的自动特征工程 |
The translation is fine; this is only a personal suggestion: "Python 中的特征工程自动化" ("Feature Engineering Automation in Python").
|
||
![](https://cdn-images-1.medium.com/max/1000/1*lg3OxWVYDsJFN-snBY7M5w.jpeg) | ||
|
||
Machine learning is increasingly moving from hand-designed models to automatically optimized pipelines using tools such as [H20](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html), [TPOT](https://epistasislab.github.io/tpot/), and [auto-sklearn](https://automl.github.io/auto-sklearn/stable/). These libraries, along with methods such as [random search](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf), aim to simplify the model selection and tuning parts of machine learning by finding the best model for a dataset with little to no manual intervention. However, feature engineering, an [arguably more valuable aspect](https://www.featurelabs.com/blog/secret-to-data-science-success/) of the machine learning pipeline, remains almost entirely a human labor. | ||
机器学习正在利用诸如 [H20](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html)、[TPOT](https://epistasislab.github.io/tpot/) 和 [auto-sklearn](https://automl.github.io/auto-sklearn/stable/) 等工具越来越多地从手工设计模型向自动化优化管道迁移。以上这些类库,连同如 [random search](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) 等方法一起,目的都是在通过找到适合于几乎不需要人工干预的数据集的最佳模型来简化机器学习的模型选择和调优部分。然而,特征工程,作为机器学习管道中一个[可以说是更有价值的方面](https://www.featurelabs.com/blog/secret-to-data-science-success/),几乎全部是手工活。 |
“目的都是通过找到适合于几乎不需要人工干预的数据集的最佳模型来简化机器学习的模型选择和调优部分”
=>
"目的是在不需要人工干预的情况下找到适合于数据集的最佳模型,以此来简化机器学习的模型选择和调优部分"
(That is: "the aim is to find the best model for a dataset without manual intervention, thereby simplifying the model selection and tuning parts of machine learning.")
Splitting the sentence into shorter clauses may make it easier to follow.
|
||
[Feature engineering](https://en.wikipedia.org/wiki/Feature_engineering), also known as feature creation, is the process of constructing new features from existing data to train a machine learning model. This step can be more important than the actual model used because a machine learning algorithm only learns from the data we give it, and creating features that are relevant to a task is absolutely crucial (see the excellent paper [“A Few Useful Things to Know about Machine Learning”](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)). | ||
[特征工程](https://en.wikipedia.org/wiki/Feature_engineering),也成为特征创建,是从已有数据中创建出新特征并且用于训练机器学习模型的过程。这个步骤可能要比实际使用的模型更加重要,因为机器学习算法仅仅从我们提供给他的数据中进行学习,创建出与任务相关的特征是非常关键的(可以参照这篇文章 ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) —— 《了解机器学习的一些有用的事》,译者注)。 |
"成为" ("to become") => "称为" ("to be called")
|
||
[Feature engineering](https://www.datacamp.com/community/tutorials/feature-engineering-kaggle) means building additional features out of existing data which is often spread across multiple related tables. Feature engineering requires extracting the relevant information from the data and getting it into a single table which can then be used to train a machine learning model. | ||
[特征工程](https://www.datacamp.com/community/tutorials/feature-engineering-kaggle)意味着从分布在多个相关表格中的现有数据集中构建出额外的特性。特征工程需要从数据中提取相关信息,并且将其放入一个单独的表中,然后可以用来训练机器学习模型。 |
"特性" ("characteristic") => "特征" ("feature")
|
||
The process of constructing features is very time-consuming because each new feature usually requires several steps to build, especially when using information from more than one table. We can group the operations of feature creation into two categories: **transformations** and **aggregations**. Let’s look at a few examples to see these concepts in action. | ||
构建特征的过程非常耗时,因为每获取一项新的特征都需要很多步骤才能构建出来,尤其是当需要从多于一张表格中获取信息时。我们可以把特征创建的操作分成两类:**转换**和**聚集**。让我们通过几个例子的实战来看看这些概念。 |
聚集 ("gathering") => 聚合 ("aggregation")
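For readers following the hunk above, here is a minimal sketch of what a transformation looks like: each new value is computed from columns of the same row in a single table. The table contents and the column names (`joined`, `income`, `join_month`, `log_income`) are assumptions made for illustration, not data from the article. An aggregation, by contrast, would combine several child rows (for example, all loans of one client) into one value.

```python
import math
from datetime import date

# Hypothetical clients table: one row per client (column names are illustrative).
clients = [
    {"client_id": 1, "joined": date(2015, 3, 14), "income": 42000.0},
    {"client_id": 2, "joined": date(2017, 11, 2), "income": 98000.0},
]


def add_transformations(rows):
    """Return copies of the rows with two transformation features added.

    Each new value is computed from columns of the same row only.
    """
    out = []
    for row in rows:
        new_row = dict(row)
        new_row["join_month"] = row["joined"].month      # date -> month number
        new_row["log_income"] = math.log(row["income"])  # natural log of income
        out.append(new_row)
    return out


features = add_transformations(clients)
print(features[0]["join_month"], round(features[0]["log_income"], 3))
```

The function returns new rows rather than modifying `clients` in place, so the original table stays available for building other features.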
|
||
For more information on featuretools, including advanced usage, check out the [online documentation](https://docs.featuretools.com/). To see how featuretools is used in practice, read about the [work of Feature Labs](https://www.featurelabs.com/), the company behind the open-source library. | ||
要获取更多关于特征工具的信息,包括这些工具的高级用法,可以查阅[在线文档](https://docs.featuretools.com/)。要查看特征工具如何在实践中应用,可以参见 [Feature Labs 的工作成果](https://www.featurelabs.com/),这是一个开源库背后的公司。 |
这是一个开源库背后的公司 ("this is the company behind an open-source library")
=> 这就是开发 featuretools 这个开源库的公司 ("this is the company that develops the open-source library featuretools")
|
||
This process involves grouping the loans table by the client, calculating the aggregations, and then merging the resulting data into the client data. Here’s how we would do that in Python using the [language of Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html). | ||
这个过程包括了根据客户进行贷款表格分组、计算聚合、然后把计算结果数据合并到客户数据中。如下代码展示了我们如果使用 Python 中的 [language of Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)库进行计算的过程: |
Python 中的 language of Pandas库进行计算的过程
=>
Python 中的 language of Pandas 库进行计算的过程
Add a space before 库 ("library").
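The Pandas code that the hunk above introduces is not part of this diff excerpt. As a dependency-free sketch of the same three steps (group the loans by client, calculate the aggregations, merge the result into the client data), the following uses only the Python standard library instead of Pandas. The table contents and column names (`loan_amount`, `mean_loan`, and so on) are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tables; the column names are assumptions for illustration.
clients = [
    {"client_id": 1, "income": 42000},
    {"client_id": 2, "income": 98000},
    {"client_id": 3, "income": 51000},  # a client with no loans
]
loans = [
    {"loan_id": 10, "client_id": 1, "loan_amount": 5000},
    {"loan_id": 11, "client_id": 1, "loan_amount": 9000},
    {"loan_id": 12, "client_id": 2, "loan_amount": 7000},
]

# Step 1: group the loan amounts by client.
amounts_by_client = defaultdict(list)
for loan in loans:
    amounts_by_client[loan["client_id"]].append(loan["loan_amount"])

# Step 2: calculate the aggregations for each group.
stats = {
    cid: {"mean_loan": mean(a), "max_loan": max(a), "min_loan": min(a)}
    for cid, a in amounts_by_client.items()
}

# Step 3: merge the statistics back into the client rows (a left join:
# clients without loans get None for every statistic).
empty = {"mean_loan": None, "max_loan": None, "min_loan": None}
client_features = [{**c, **stats.get(c["client_id"], empty)} for c in clients]

for row in client_features:
    print(row)
```

Even in this small form, one aggregated feature set takes three separate steps, which is the manual effort the article argues automated feature engineering can remove.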
|
||
The best way to think of a **relationship** between two tables is the [analogy of parent to child](https://stackoverflow.com/questions/7880921/what-is-a-parent-table-and-a-child-table-in-database). This is a one-to-many relationship: each parent can have multiple children. In the realm of tables, a parent table has one row for every parent, but the child table may have multiple rows corresponding to multiple children of the same parent. | ||
考虑两个表之间的**关系**的最佳方式是[父亲与孩子的类比](https://stackoverflow.com/questions/7880921/what-is- par-table -and- child-table-in-database)。这是一对多的关系:每个父亲可以有多个孩子。在表领域中,父亲在每个父表中都有一行,但是子表中可能有多个行对应于同一个父亲的多个孩子。 |
[父亲与孩子的类比](https://stackoverflow.com/questions/7880921/what-is- par-table -and- child-table-in-database)
=>
父亲与孩子的类比
The link formatting is wrong here; please take a look.
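To make the parent-to-child analogy in the hunk above concrete, here is a small sketch that checks whether two tables actually form a one-to-many relationship: the linking key must be unique in the parent table, and every child row must point at an existing parent. The helper `is_one_to_many` and the table contents are hypothetical and written for this illustration only.

```python
from collections import Counter

# Hypothetical parent table (one row per client) and child table (loans).
clients = [{"client_id": 1}, {"client_id": 2}]
loans = [
    {"loan_id": 10, "client_id": 1},
    {"loan_id": 11, "client_id": 1},
    {"loan_id": 12, "client_id": 2},
]


def is_one_to_many(parent_rows, child_rows, key):
    """Check that `key` is unique in the parent table and that every
    child row refers to an existing parent."""
    parent_keys = [row[key] for row in parent_rows]
    key_set = set(parent_keys)
    unique_in_parent = len(parent_keys) == len(key_set)
    children_have_parent = all(row[key] in key_set for row in child_rows)
    return unique_in_parent and children_have_parent


# Number of child rows (loans) belonging to each parent (client).
children_per_parent = Counter(row["client_id"] for row in loans)

print(is_one_to_many(clients, loans, "client_id"))
print(dict(children_per_parent))
```

Swapping the arguments (treating `loans` as the parent) fails the check, because `client_id` is repeated in the loans table.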
|
||
To [formalize a relationship in featuretools](https://docs.featuretools.com/loading_data/using_entitysets.html#adding-a-relationship), we only need to specify the variable that links two tables together. The `clients` and the `loans` table are linked via the `client_id` variable and `loans` and `payments` are linked with the `loan_id`. The syntax for creating a relationship and adding it to the entityset are shown below: | ||
要[在 featuretools 中格式化关系](https://docs.featuretools.com/loading_data/using_entitysets.html#add -a-relationship),我们只需指定将两个表链接在一起的变量。 `clients` 和 `loans` 表通过 `loan_id` 变量链接, `loans` 和 `payments` 通过 `loan_id` 联系在一起。创建关系并将其添加到实体集的语法如下所示: |
[在 featuretools 中格式化关系](https://docs.featuretools.com/loading_data/using_entitysets.html#add -a-relationship)
=>
在 featuretools 中格式化关系
The link formatting is wrong.
|
||
New features are created in featuretools using these primitives either by themselves or stacking multiple primitives. Below is a list of some of the feature primitives in featuretools (we can also [define custom primitives](https://docs.featuretools.com/guides/advanced_custom_primitives.html)): | ||
新特性是在 featruetools 中创建的,使用这些特征基元本身或叠加多个特征基元。下面是 featuretools 中的一些特征基元列表(我们还可以[定义自定义特征基元](https://docs.featuretools.com/guides/advanced_custom_basics .html): |
[定义自定义特征基元](https://docs.featuretools.com/guides/advanced_custom_basics .html)
=>
定义自定义特征基元
The link formatting is wrong. These formatting problems are probably because you altered the format of the links.
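The hunk above says that new features come from using primitives on their own or from stacking several of them. The following is a plain-Python sketch of that idea and does not use the featuretools API: the helper `aggregate` and all table contents are hypothetical. It applies one aggregation primitive (`sum`) to payments per loan, then stacks a second one (`mean`) on top of the result per client.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child and grandchild tables of a clients table.
loans = [
    {"loan_id": 10, "client_id": 1},
    {"loan_id": 11, "client_id": 1},
    {"loan_id": 12, "client_id": 2},
]
payments = [
    {"loan_id": 10, "payment_amount": 100},
    {"loan_id": 10, "payment_amount": 300},
    {"loan_id": 11, "payment_amount": 200},
    {"loan_id": 12, "payment_amount": 500},
]


def aggregate(rows, group_key, value_key, func):
    """Apply one aggregation primitive `func` to `value_key` per `group_key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {key: func(values) for key, values in groups.items()}


# First primitive: total payments for each loan.
total_paid = aggregate(payments, "loan_id", "payment_amount", sum)

# Attach that feature to the loans, then stack a second primitive on top:
# the mean, per client, of the per-loan payment totals.
loans_with_total = [
    {**loan, "total_paid": total_paid.get(loan["loan_id"], 0)} for loan in loans
]
mean_total_paid = aggregate(loans_with_total, "client_id", "total_paid", mean)

print(mean_total_paid)
```

The stacked feature spans two relationships (payments to loans, loans to clients), which is why a single reusable helper applied twice is enough to build it.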
@mingxing47 You can make the revisions now.
@yqian1991 Thanks for proofreading.
@fanyijihua @yqian1991 @leviding Proofreading revisions complete.
Claiming the proofreading.
@park-ma All good 🍻
Suggest adding code highlighting.
Also, the translation quality is high and the previous proofreader was thorough; there is almost nothing left to change.
|
||
Before we can quite get to deep feature synthesis, we need to understand [feature primitives](https://docs.featuretools.com/automated_feature_engineering/primitives.html). We already know what these are, but we have just been calling them by different names! These are simply the basic operations that we use to form new features: | ||
在深入了解特性合成之前,我们需要了解[特征基元](https://docs.featuretools.com/automated_feature_engineering/primartives.html)。我们已经知道它们是什么了,但是我们只是用不同的名字称呼它们!这些是我们用来形成新特征的基本操作: |
Use a full-width colon “:”.
@mingxing47 @leviding Proofreading complete.
@@ -2,109 +2,109 @@ | |||
> * 原文作者:[William Koehrsen](https://towardsdatascience.com/@williamkoehrsen?source=post_header_lockup) | |||
> * 译文出自:[掘金翻译计划](https://github.com/xitu/gold-miner) | |||
> * 本文永久链接:[https://github.com/xitu/gold-miner/blob/master/TODO1/automated-feature-engineering-in-python.md](https://github.com/xitu/gold-miner/blob/master/TODO1/automated-feature-engineering-in-python.md) | |||
> * 译者: | |||
> * 译者:[mingxing47](https://github.com/mingxing47) | |||
> * 校对者: |
Add the proofreader information.
The proofreader information has been added.
Awesome!
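@mingxing47 It's merged~ Please publish it on Juejin soon and send me the link so that I can add your points promptly. The Juejin Translation Project (掘金翻译计划) has its own Zhihu column, and you are welcome to submit there too; a handy plugin is recommended for that.
@mingxing47 When you publish, please put it under the 人工智能 (Artificial Intelligence) category.
Translation complete, resolve #3990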