Modify the Chinese and English documents of QuantileDiscretizer, OnehotEncoder, Bucketizer.

Modify the adult example in pyalink

See #51
lqb11 authored and qiuxiafei committed Feb 28, 2020
1 parent a94ab7c commit 9e2a716
Showing 23 changed files with 365 additions and 381 deletions.
18 changes: 11 additions & 7 deletions docs/cn/bucketizer.md
@@ -1,19 +1,23 @@
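## Overview
Given a set of cut points, Bucketizer splits a continuous variable into buckets. Both single-column and multi-column input are supported; supply the cut points for one column or for several columns accordingly.

The cut points for each column must be strictly increasing, and there must be at least three of them.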
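
To make the idea of bucketing by cut points concrete, here is a minimal sketch that uses only Python's standard `bisect` module and does not call the Alink API. The helper names `validate_cuts` and `bucket_index` are hypothetical. The sketch assumes that the outer intervals extend to infinity implicitly and that a left-open setting means intervals of the form `(a, b]`; it checks only the strictly-increasing rule, not a minimum number of points.

```python
from bisect import bisect_left, bisect_right


def validate_cuts(cuts):
    """Raise ValueError unless the cut points are strictly increasing."""
    for prev, cur in zip(cuts, cuts[1:]):
        if cur <= prev:
            raise ValueError(f"cut points must be strictly increasing: {cuts}")


def bucket_index(value, cuts, left_open=True):
    """Return the 0-based index of the bucket that `value` falls into.

    Assumed semantics (not taken from the Alink source):
    left_open=True  -> buckets (-inf, c0], (c0, c1], ..., (ck, +inf)
    left_open=False -> buckets (-inf, c0), [c0, c1), ..., [ck, +inf)
    """
    validate_cuts(cuts)
    if left_open:
        # A value equal to a cut point belongs to the bucket on its left.
        return bisect_left(cuts, value)
    # A value equal to a cut point belongs to the bucket on its right.
    return bisect_right(cuts, value)


if __name__ == "__main__":
    for v in [1.1, 2.0, 2.2]:
        print(v, bucket_index(v, [2.0]), bucket_index(v, [2.0], left_open=False))
```

The two `bisect` variants differ only in how they treat a value that equals a cut point, which is exactly the distinction between left-open and right-open intervals.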

## Parameters

<!-- This is the start of auto-generated parameter info -->
<!-- DO NOT EDIT THIS PART!!! -->

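| Name | Display name | Description | Type | Required? | Default |
| --- | --- | --- | --- | --- | --- |
| handleInvalid | How to handle invalid values | Either skip (skip the value) or error (throw an exception). | String | | "error" |
| selectedCols | Names of the columns to compute on | List of names of the columns to compute on | String[] | | |
| splitsArray | Cut points for multiple columns | Cut points for multiple columns | String[] | | |
| selectedCols | Selected column names | List of names of the columns to compute on | String[] | ✓ | |
| reservedCols | Reserved column names | Columns reserved by the algorithm | String[] | | null |
| outputCols | Output column names | Array of output column names; optional, defaults to null | String[] | | null |
| reservedCols | Reserved column names | Columns reserved by the algorithm | String[] | | null |<!-- This is the end of auto-generated parameter info -->
| handleInvalid | Unknown-token handling strategy | Strategy for handling unknown tokens: "keep", "skip", "error" | String | | "keep" |
| encode | Encoding mode | Encoding mode: "INDEX", "VECTOR", "ASSEMBLED_VECTOR" | String | | INDEX |
| dropLast | Whether to drop the last element | Whether to drop the last element | Boolean | | true |
| leftOpen | Whether intervals are left-open and right-closed | Whether intervals are left-open and right-closed | Boolean | | true |
| cutsArray | Cut points for multiple columns | Cut points for multiple columns | double[][] | ✓ | |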

<!-- This is the end of auto-generated parameter info -->

## Script example
#### Script code
@@ -29,7 +33,7 @@ data = np.array([
df = pd.DataFrame({"double": data[:, 0], "bool": data[:, 1], "number": data[:, 2], "str": data[:, 3]})
inOp = BatchOperator.fromDataframe(df, schemaStr='double double, bool boolean, number int, str string')
bucketizer = Bucketizer().setSelectedCols(["double"]).setSplitsArray(["-Infinity:2:Infinity"])
bucketizer = Bucketizer().setSelectedCols(["double"]).setCutsArray([[2]])
bucketizer.transform(inOp).print()
```
#### Script output
17 changes: 9 additions & 8 deletions docs/cn/bucketizerbatchop.md
@@ -1,19 +1,20 @@
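## Overview
Given a set of cut points, this operator splits a continuous variable into buckets. Both single-column and multi-column input are supported; supply the cut points for one column or for several columns accordingly.

The cut points for each column must be strictly increasing, and there must be at least three of them.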

## Parameters

<!-- This is the start of auto-generated parameter info -->
<!-- DO NOT EDIT THIS PART!!! -->
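| Name | Display name | Description | Type | Required? | Default |
| --- | --- | --- | --- | --- | --- |
| handleInvalid | How to handle invalid values | Either skip (skip the value) or error (throw an exception). | String | | "error" |
| selectedCols | Names of the columns to compute on | List of names of the columns to compute on | String[] | | |
| splitsArray | Cut points for multiple columns | Cut points for multiple columns | String[] | | |
| selectedCols | Selected column names | List of names of the columns to compute on | String[] | ✓ | |
| reservedCols | Reserved column names | Columns reserved by the algorithm | String[] | | null |
| outputCols | Output column names | Array of output column names; optional, defaults to null | String[] | | null |
| reservedCols | Reserved column names | Columns reserved by the algorithm | String[] | | null |<!-- This is the end of auto-generated parameter info -->
| handleInvalid | Unknown-token handling strategy | Strategy for handling unknown tokens: "keep", "skip", "error" | String | | "keep" |
| encode | Encoding mode | Encoding mode: "INDEX", "VECTOR", "ASSEMBLED_VECTOR" | String | | INDEX |
| dropLast | Whether to drop the last element | Whether to drop the last element | Boolean | | true |
| leftOpen | Whether intervals are left-open and right-closed | Whether intervals are left-open and right-closed | Boolean | | true |
| cutsArray | Cut points for multiple columns | Cut points for multiple columns | double[][] | ✓ | |<!-- This is the end of auto-generated parameter info -->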

## Script example
#### Script code
@@ -31,10 +32,10 @@ df = pd.DataFrame({"double": data[:, 0], "bool": data[:, 1], "number": data[:, 2
inOp1 = BatchOperator.fromDataframe(df, schemaStr='double double, bool boolean, number int, str string')
inOp2 = StreamOperator.fromDataframe(df, schemaStr='double double, bool boolean, number int, str string')
bucketizer = BucketizerBatchOp().setSelectedCols(["double"]).setSplitsArray(["-Infinity:2:Infinity"])
bucketizer = BucketizerBatchOp().setSelectedCols(["double"]).setCutsArray([[2]])
bucketizer.linkFrom(inOp1).print()
bucketizer = BucketizerStreamOp().setSelectedCols(["double"]).setSplitsArray(["-Infinity:2:Infinity"])
bucketizer = BucketizerStreamOp().setSelectedCols(["double"]).setCutsArray([[2]])
bucketizer.linkFrom(inOp2).print()
StreamOperator.execute()
18 changes: 10 additions & 8 deletions docs/cn/bucketizerstreamop.md
@@ -1,18 +1,20 @@
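## Overview
Given a set of cut points, this operator splits a continuous variable into buckets. Both single-column and multi-column input are supported; supply the cut points for one column or for several columns accordingly.

The cut points for each column must be strictly increasing, and there must be at least three of them.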

## Parameters

<!-- This is the start of auto-generated parameter info -->
<!-- DO NOT EDIT THIS PART!!! -->
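| Name | Display name | Description | Type | Required? | Default |
| --- | --- | --- | --- | --- | --- |
| handleInvalid | How to handle invalid values | Either skip (skip the value) or error (throw an exception). | String | | "error" |
| selectedCols | Names of the columns to compute on | List of names of the columns to compute on | String[] | | |
| splitsArray | Cut points for multiple columns | Cut points for multiple columns | String[] | | |
| selectedCols | Selected column names | List of names of the columns to compute on | String[] | ✓ | |
| reservedCols | Reserved column names | Columns reserved by the algorithm | String[] | | null |
| outputCols | Output column names | Array of output column names; optional, defaults to null | String[] | | null |
| reservedCols | Reserved column names | Columns reserved by the algorithm | String[] | | null |<!-- This is the end of auto-generated parameter info -->
| handleInvalid | Unknown-token handling strategy | Strategy for handling unknown tokens: "keep", "skip", "error" | String | | "keep" |
| encode | Encoding mode | Encoding mode: "INDEX", "VECTOR", "ASSEMBLED_VECTOR" | String | | INDEX |
| dropLast | Whether to drop the last element | Whether to drop the last element | Boolean | | true |
| leftOpen | Whether intervals are left-open and right-closed | Whether intervals are left-open and right-closed | Boolean | | true |
| cutsArray | Cut points for multiple columns | Cut points for multiple columns | double[][] | ✓ | |<!-- This is the end of auto-generated parameter info -->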

## Script example
#### Script code
@@ -30,10 +32,10 @@ df = pd.DataFrame({"double": data[:, 0], "bool": data[:, 1], "number": data[:, 2
inOp1 = BatchOperator.fromDataframe(df, schemaStr='double double, bool boolean, number int, str string')
inOp2 = StreamOperator.fromDataframe(df, schemaStr='double double, bool boolean, number int, str string')
bucketizer = BucketizerBatchOp().setSelectedCols(["double"]).setSplitsArray(["-Infinity:2:Infinity"])
bucketizer = BucketizerBatchOp().setSelectedCols(["double"]).setCutsArray([[2]])
bucketizer.linkFrom(inOp1).print()
bucketizer = BucketizerStreamOp().setSelectedCols(["double"]).setSplitsArray(["-Infinity:2:Infinity"])
bucketizer = BucketizerStreamOp().setSelectedCols(["double"]).setCutsArray([[2]])
bucketizer.linkFrom(inOp2).print()
StreamOperator.execute()
68 changes: 25 additions & 43 deletions docs/cn/onehotencoder.md
@@ -9,66 +9,48 @@ One-hot encoding (also called 独热编码): for each feature, if it has m possible
<!-- OLD_TABLE -->
<!-- This is the start of auto-generated parameter info -->
<!-- DO NOT EDIT THIS PART!!! -->
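| Name | Display name | Description | Type | Required? | Default |
| Name | Display name | Description | Type | Required? | Default |
| --- | --- | --- | --- | --- | --- |
| dropLast | Whether to drop the last element | Dropping the last element ensures linear independence. Defaults to true | Boolean | | true |
| ignoreNull | Whether to ignore null | If ignored, null values are not encoded | Boolean | | false |
| discreteThresholdsArray | Discrete-count thresholds | Discrete-count thresholds, one array element per column | Integer[] | | |
| discreteThresholds | Discrete-count threshold | Discrete-count threshold; discrete values that occur fewer times than this threshold do not form a category of their own | Integer | | Integer.MIN_VALUE |
| selectedCols | Selected column names | List of names of the columns to compute on | String[] | ✓ | |
| selectedCols | Selected column names | List of names of the columns to compute on | String[] | ✓ | |
| reservedCols | Reserved column names | Columns reserved by the algorithm | String[] | | null |
| outputCol | Output column name | Output column name; required | String | ✓ | |<!-- This is the end of auto-generated parameter info -->
| outputCols | Output column names | Array of output column names; optional, defaults to null | String[] | | null |
| handleInvalid | Unknown-token handling strategy | Strategy for handling unknown tokens: "keep", "skip", "error" | String | | "keep" |
| encode | Encoding mode | Encoding mode: "INDEX", "VECTOR", "ASSEMBLED_VECTOR" | String | | "ASSEMBLED_VECTOR" |
| dropLast | Whether to drop the last element | Whether to drop the last element | Boolean | | true |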

<!-- This is the end of auto-generated parameter info -->
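
The effect of dropping the last element can be illustrated with a few lines of plain Python. This is a generic sketch of one-hot encoding, not Alink's implementation: the function name `one_hot` and the fixed category order are assumptions, and Alink may order categories differently or reserve extra slots for null or unknown values.

```python
def one_hot(value, categories, drop_last=True):
    """Encode `value` as a one-hot list over `categories`.

    With drop_last=True the last category gets no slot of its own and is
    represented by the all-zero vector, so the columns stay linearly
    independent.
    """
    width = len(categories) - 1 if drop_last else len(categories)
    vector = [0.0] * width
    index = categories.index(value)  # raises ValueError for unknown values
    if index < width:
        vector[index] = 1.0
    return vector


if __name__ == "__main__":
    cats = ["A", "B", "C"]
    for v in cats:
        print(v, one_hot(v, cats), one_hot(v, cats, drop_last=False))
```

Without dropping the last element, the slots of every encoded row sum to 1, which makes them collinear with an intercept term; removing one slot avoids that redundancy.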


## Script example
#### Script
```python
import numpy as np
import pandas as pd
data = np.array([
["assisbragasm", 1],
["assiseduc", 1],
["assist", 1],
["assiseduc", 1],
["assistebrasil", 1],
["assiseduc", 1],
["assistebrasil", 1],
["assistencialgsamsung", 1]
[1.1, True, "2", "A"],
[1.1, False, "2", "B"],
[1.1, True, "1", "B"],
[2.2, True, "1", "A"]
])
df = pd.DataFrame({"double": data[:, 0], "bool": data[:, 1], "number": data[:, 2], "str": data[:, 3]})

# load data
df = pd.DataFrame({"query": data[:, 0], "weight": data[:, 1]})

inOp = dataframeToOperator(df, schemaStr='query string, weight long', op_type='batch')

# one hot train
one_hot = OneHotEncoder()\
.setSelectedCols(["query"])\
.setDropLast(False)\
.setIgnoreNull(False)\
.setOutputCol("predicted_r")\
.setReservedCols(["weight"])


model = one_hot.fit(inOp)
model.transform(inOp).print()
inOp1 = BatchOperator.fromDataframe(df, schemaStr='double double, bool boolean, number int, str string')

# stream predict
inOp2 = dataframeToOperator(df, schemaStr='query string, weight long', op_type='stream')
model.transform(inOp2).print()

StreamOperator.execute()
onehot = OneHotEncoder().setSelectedCols(["double", "bool"]).setDiscreteThresholds(2).setEncode("ASSEMBLED_VECTOR").setOutputCols(["pred"]).setDropLast(False)
onehot.fit(inOp).transform(inOp).collectToDataframe()
```

#### Output

```python
weight predicted_r
0 1 $6$4:1.0
1 1 $6$3:1.0
2 1 $6$2:1.0
3 1 $6$3:1.0
4 1 $6$1:1.0
5 1 $6$3:1.0
6 1 $6$1:1.0
7 1 $6$0:1.0

double bool number str pred
0 1.1 True 2 A $6$0:1.0 3:1.0
1 1.1 False 2 B $6$0:1.0 5:1.0
2 1.1 True 1 B $6$0:1.0 3:1.0
3 2.2 True 1 A $6$2:1.0 3:1.0
```
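
Values such as `$6$0:1.0 3:1.0` in the output above look like a compact sparse-vector notation. The sketch below assumes the layout is `$size$index:value index:value ...`; that reading is inferred from the sample rows rather than from a format specification, and the helper names `parse_sparse_vector` and `to_dense` are hypothetical.

```python
def parse_sparse_vector(text):
    """Parse a string like "$6$0:1.0 3:1.0" into (size, {index: value}).

    Assumed layout: "$<size>$<index>:<value> <index>:<value> ...".
    """
    if not text.startswith("$"):
        raise ValueError(f"expected a leading '$': {text!r}")
    # Splitting on the first two '$' yields ['', '<size>', '<entries>'].
    _, size, body = text.split("$", 2)
    entries = {}
    for item in body.split():
        index, value = item.split(":")
        entries[int(index)] = float(value)
    return int(size), entries


def to_dense(text):
    """Expand the sparse notation into a plain list of floats."""
    size, entries = parse_sparse_vector(text)
    dense = [0.0] * size
    for index, value in entries.items():
        dense[index] = value
    return dense


if __name__ == "__main__":
    print(to_dense("$6$0:1.0 3:1.0"))
```

Read this way, each row of the newer output sets one slot per encoded column inside a single assembled vector of length 6.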


76 changes: 39 additions & 37 deletions docs/cn/onehotpredictbatchop.md
@@ -13,59 +13,61 @@ One-hot encoding (also called 独热编码): for each feature, if it has m possible
<!-- DO NOT EDIT THIS PART!!! -->
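| Name | Display name | Description | Type | Required? | Default |
| --- | --- | --- | --- | --- | --- |
| selectedCols | Selected column names | List of names of the columns to compute on | String[] | ✓ | |
| reservedCols | Reserved column names | Columns reserved by the algorithm | String[] | | null |
| outputCol | Output column name | Output column name; required | String | ✓ | |<!-- This is the end of auto-generated parameter info -->
| outputCols | Output column names | Array of output column names; optional, defaults to null | String[] | | null |
| handleInvalid | Unknown-token handling strategy | Strategy for handling unknown tokens: "keep", "skip", "error" | String | | "keep" |
| encode | Encoding mode | Encoding mode: "INDEX", "VECTOR", "ASSEMBLED_VECTOR" | String | | "ASSEMBLED_VECTOR" |
| dropLast | Whether to drop the last element | Whether to drop the last element | Boolean | | true |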


<!-- This is the end of auto-generated parameter info -->

## Script example
#### Script
```python
import numpy as np
import pandas as pd
data = np.array([
["assisbragasm", 1],
["assiseduc", 1],
["assist", 1],
["assiseduc", 1],
["assistebrasil", 1],
["assiseduc", 1],
["assistebrasil", 1],
["assistencialgsamsung", 1]
[1.1, True, "2", "A"],
[1.1, False, "2", "B"],
[1.1, True, "1", "B"],
[2.2, True, "1", "A"]
])

# load data
df = pd.DataFrame({"query": data[:, 0], "weight": data[:, 1]})

inOp = dataframeToOperator(df, schemaStr='query string, weight long', op_type='batch')

# one hot train
one_hot = OneHotTrainBatchOp().setSelectedCols(["query"]).setDropLast(False).setIgnoreNull(False)
model = inOp.link(one_hot)

# batch predict
predictor = OneHotPredictBatchOp().setOutputCol("predicted_r").setReservedCols(["weight"])
print(BatchOperator.collectToDataframe(predictor.linkFrom(model, inOp)))

# stream predict
inOp2 = dataframeToOperator(df, schemaStr='query string, weight long', op_type='stream')
predictor = OneHotPredictStreamOp(model).setOutputCol("predicted_r").setReservedCols(["weight"])
predictor.linkFrom(inOp2).print()

df = pd.DataFrame({"double": data[:, 0], "bool": data[:, 1], "number": data[:, 2], "str": data[:, 3]})

inOp1 = BatchOperator.fromDataframe(df, schemaStr='double double, bool boolean, number int, str string')
inOp2 = StreamOperator.fromDataframe(df, schemaStr='double double, bool boolean, number int, str string')

onehot = OneHotTrainBatchOp().setSelectedCols(["double", "bool", "number", "str"]).setDiscreteThresholds(2)
predictBatch = OneHotPredictBatchOp().setSelectedCols(["double", "bool"]).setEncode("ASSEMBLED_VECTOR").setOutputCols(["pred"]).setDropLast(False)
onehot.linkFrom(inOp1)
predictBatch.linkFrom(onehot, inOp1)
[model,predict] = collectToDataframes(onehot, predictBatch)
print(model)
print(predict)

predictStream = OneHotPredictStreamOp(onehot).setSelectedCols(["double", "bool"]).setEncode("ASSEMBLED_VECTOR").setOutputCols(["vec"])
predictStream.linkFrom(inOp2)
predictStream.print(refreshInterval=-1)
StreamOperator.execute()
```
#### Output

```python
weight predicted_r
0 1 $6$4:1.0
1 1 $6$3:1.0
2 1 $6$2:1.0
3 1 $6$3:1.0
4 1 $6$1:1.0
5 1 $6$3:1.0
6 1 $6$1:1.0
7 1 $6$0:1.0
double bool number str pred
0 1.1 True 2 A $6$0:1.0 3:1.0
1 1.1 False 2 B $6$0:1.0 5:1.0
2 1.1 True 1 B $6$0:1.0 3:1.0
3 2.2 True 1 A $6$2:1.0 3:1.0

```








