[Api draft][connector] Add Kudu source and sink connector #2250

Closed · wants to merge 78 commits

Commits (78)
770bade
Update .asf.yaml (#2103)
CalvinKirs Jul 1, 2022
df251bf
Merge pull request #2083 from apache/api-draft
CalvinKirs Jul 1, 2022
19867ee
Revert "Update .asf.yaml (#2103)" (#2104)
CalvinKirs Jul 1, 2022
a505d50
Ignore some dead-link check (#2108)
CalvinKirs Jul 1, 2022
cd34f1d
[Connector-V2]unify connector v2 modules name (#2106)
EricJoy2048 Jul 1, 2022
7ec0ada
[bugfix] Check isOpen before closing (#2107)
ic4y Jul 1, 2022
66f8161
[Feat][UI] Add multi-language related configuration. (#2116)
songjianet Jul 3, 2022
528c6fd
[Bug] [seatunnel-connector-flink-fake] Type is not supported: BigInte…
gleiyu Jul 3, 2022
418e408
Fix the bug Unrecognized field "TwoPhaseCommit" after doris 0.15 (#2054)
smallhibiscus Jul 4, 2022
da4ad17
ignore backend ci for ui update (#2121)
EricJoy2048 Jul 4, 2022
ffcf3f5
[Connector-V2]Hive Source (#2123)
CalvinKirs Jul 4, 2022
a31fdee
[SeaTunnel API] [Sink] remove useless context field (#2124)
Hisoka-X Jul 5, 2022
e807b34
[Bump]Upgrade commons version (#2131)
CalvinKirs Jul 5, 2022
e7cfa2b
[Connector-V2][Doc] Add File Sink Connector V2 document (#2120)
EricJoy2048 Jul 5, 2022
e1e4426
[seatunnel-1947][seatunnel-server] init & add interface for script/us…
dijiekstra Jul 5, 2022
3dd7dc0
[Core][Plugin] Fix same plugin can't create twice error. (#2132)
Hisoka-X Jul 5, 2022
67613d1
Fix the data output exception when accessing Hive using Spark JDBC So…
Bingz2 Jul 6, 2022
6d29e3e
Update setup.md (#2139)
zzzzzzzs Jul 6, 2022
bc82e91
[Feat][UI] Add theme-related configuration. (#2136)
songjianet Jul 6, 2022
b0e4204
fix the doc error (#2143)
EricJoy2048 Jul 7, 2022
59ce8a2
[doc][connector-v2] pulsar source options doc (#2128)
ashulin Jul 7, 2022
e2283da
[Connector-V2] Add File Sink Connector (#2117)
EricJoy2048 Jul 7, 2022
e19660a
[Improvement][new api] refer to https://github.com/apache/incubator-s…
lhyundeadsoul Jul 7, 2022
eb179cf
[Core][Starter] Fix same source and sink register plugin set twice (#…
Hisoka-X Jul 8, 2022
fd2c075
Update license.yml (#2153)
kezhenxu94 Jul 10, 2022
509add5
Update new-license.md (#2155)
CalvinKirs Jul 10, 2022
eaedc0a
[Connector-V2]Add Hudi Source (#2147)
Emor-nj Jul 11, 2022
a3a2b5d
[Bug] [connector-v2] When outputting data to clickhouse, a ClassCastE…
gaaraG Jul 12, 2022
b5a8947
[Bug] [seatunnel-core] Failed to get APP_DIR path bug fixed (#2165)
youyangkou Jul 12, 2022
dde91db
fix KafkaTableStream source json parse issue (#2168)
immustard Jul 12, 2022
cf9707c
[Feature][transform] Add a module to set default value for null field…
Interest1-wyt Jul 12, 2022
a0c0ab9
[Core]ConnectorV2 no longer provides binary packages (#2162)
EricJoy2048 Jul 13, 2022
d8d74f8
[Core][Starter] When use cluster mode, but starter app root dir also …
Hisoka-X Jul 14, 2022
e4e6d7d
[doc][connector-v2] http source options doc (#2145)
zhuangchong Jul 14, 2022
8e60ac2
[Feat][UI] Add routes configuration. (#2177)
songjianet Jul 14, 2022
5560123
[api-draft][flink-1.13] Ensure checkpoint execution for all data (#1988)
ashulin Jul 14, 2022
5acda5a
[flink-1.13] esured checkpoint enable (#2178)
ashulin Jul 15, 2022
42b3ccd
fix (#2175)
EricJoy2048 Jul 15, 2022
9464fb2
Delete a repeated dependency libary. (#2180)
ljl1988com Jul 16, 2022
960d3b9
update flinkCommand to sparkCommand in spark example (#2184)
zhaomin1423 Jul 16, 2022
6aa580d
update doc about module desc to keep consistent with the real module …
zhaomin1423 Jul 16, 2022
23ad4ee
[Connector-V2] Add Hive sink connector v2 (#2158)
EricJoy2048 Jul 16, 2022
def6958
[Feat][UI] Add login page. (#2183)
songjianet Jul 16, 2022
ebaf72c
[bug]fix commandArgs -t(--check) conflict with flink deployment t…
sandyfog Jul 16, 2022
5dbc2df
[Bug][spark-connector-v2-example] fix the bug of no class found. (#21…
TyrantLucifer Jul 18, 2022
7c56d71
update the condition to 1 = 0 about get table operation (#2186)
zhaomin1423 Jul 19, 2022
4dc871f
[Docs] Add connectors-v2 to docs item (#2187)
CalvinKirs Jul 19, 2022
dda2bb7
[Feat][UI] Add dashboard layout. (#2198)
songjianet Jul 19, 2022
6d08b5f
[checkstyle] Improved validation scope of MagicNumber (#2194)
ashulin Jul 19, 2022
c64c00f
[Bug][Connector]Hudi Source loads the data twice
superzhang0929 Jul 19, 2022
9baed60
add unknown exception message (#2204)
zhuangchong Jul 19, 2022
48a4272
[Bug] [seatunnel-api-flink] Connectors dependencies repeat additions …
gaaraG Jul 19, 2022
c61bbdd
[Bug][Script]Fix the problem that the help command is invalid
lvlv-feifei Jul 19, 2022
35f72cd
[Fix][CI] Add remove jar from /tmp/seatunnel-dependencies before run
EricJoy2048 Jul 19, 2022
99d7cea
[Feat][UI] Add dashboard default router. (#2216)
songjianet Jul 19, 2022
b8c9bb2
[Feat][UI] Add the header component in the dashboard layout. (#2218)
songjianet Jul 19, 2022
2dd3476
[Core][Starter] Change jar connector load logic (#2193)
Hisoka-X Jul 19, 2022
9dbace9
[Docs]Fix Flink engine version requirements (#2220)
CalvinKirs Jul 20, 2022
a41b0d1
[Feat][UI] Add the setting dropdown in the dashboard layout. (#2225)
songjianet Jul 20, 2022
e96fbd8
[Feat][UI] Add the user dropdown in the dashboard layout. (#2228)
songjianet Jul 20, 2022
62ca075
[Bug][hive-connector-v2] Resolve the schema inconsistency bug (#2229)…
TyrantLucifer Jul 21, 2022
5fc205b
[doc] Correct v2 connector avoid duplicate slug (#2231)
zhongjiajie Jul 21, 2022
8250fc8
[Build]Optimize license check (#2232)
CalvinKirs Jul 21, 2022
cf5327c
[Core][Starter] Fix connector v2 can't deserialize on spark (#2221)
Hisoka-X Jul 21, 2022
db04651
[Bug][connector-hive] filter '_SUCCESS' file in file list (#2235) (#2…
TyrantLucifer Jul 22, 2022
8c426ef
StateT of SeaTunnelSource should extend `Serializable` (#2214)
lhyundeadsoul Jul 22, 2022
31d43f3
[Feat][UI] Add the table in the user manage. (#2234)
songjianet Jul 22, 2022
1737798
0
2013650523 Jul 22, 2022
8801c60
[Feat][UI] Add the table in the data pipes. (#2247)
songjianet Jul 23, 2022
eedef28
Update pom.xml
2013650523 Jul 23, 2022
146fa24
Update plugin-mapping.properties
2013650523 Jul 23, 2022
9b1f10b
Update pom.xml
2013650523 Jul 23, 2022
8c28e80
add email sink connector
2013650523 Jul 23, 2022
7414bcc
[CI]Fix License Check not found snapshot jar (#2242)
CalvinKirs Jul 23, 2022
b71c993
using constants replace the hard code string (#2251)
zhaomin1423 Jul 23, 2022
ed790c8
Delete seatunnel-connectors-v2/connector-email directory
2013650523 Jul 23, 2022
5ba8049
Update plugin-mapping.properties
2013650523 Jul 23, 2022
48018fa
Merge branch 'dev' into api-draft
2013650523 Jul 23, 2022
3 changes: 3 additions & 0 deletions .dlc.json
@@ -17,6 +17,9 @@
   },
   {
     "pattern": "^/docs/category"
+  },
+  {
+    "pattern": "^https://opencollective.com"
   }
 ],
 "timeout": "10s",
13 changes: 9 additions & 4 deletions .github/workflows/backend.yml
@@ -23,6 +23,7 @@ on:
     paths-ignore:
       - 'docs/**'
       - '**/*.md'
+      - 'seatunnel-ui/**'
 
 concurrency:
   group: backend-${{ github.event.pull_request.number || github.ref }}
@@ -88,7 +89,7 @@ jobs:
         java: [ '8', '11' ]
         os: [ 'ubuntu-latest', 'windows-latest' ]
     runs-on: ${{ matrix.os }}
-    timeout-minutes: 50
+    timeout-minutes: 80
     steps:
       - uses: actions/checkout@v3
         with:
@@ -114,7 +115,7 @@
     name: Dependency licenses
     needs: [ sanity-check ]
     runs-on: ubuntu-latest
-    timeout-minutes: 30
+    timeout-minutes: 40
     steps:
       - uses: actions/checkout@v3
         with:
@@ -154,7 +155,9 @@ jobs:
           cache: 'maven'
       - name: Run Unit tests
         run: |
-          ./mvnw -T 2C -B clean verify -D"maven.test.skip"=false -D"checkstyle.skip"=true -D"scalastyle.skip"=true -D"license.skipAddThirdParty"=true --no-snapshot-updates
+          ./mvnw -B -T 1C clean verify -D"maven.test.skip"=false -D"checkstyle.skip"=true -D"scalastyle.skip"=true -D"license.skipAddThirdParty"=true --no-snapshot-updates
+        env:
+          MAVEN_OPTS: -Xmx2048m
 
   integration-test:
     name: Integration Test
@@ -175,4 +178,6 @@
           cache: 'maven'
       - name: Run Integration tests
         run: |
-          ./mvnw -T 2C -B verify -DskipUT=true -DskipIT=false -D"checkstyle.skip"=true -D"scalastyle.skip"=true -D"license.skipAddThirdParty"=true --no-snapshot-updates
+          ./mvnw -T 1C -B verify -DskipUT=true -DskipIT=false -D"checkstyle.skip"=true -D"scalastyle.skip"=true -D"license.skipAddThirdParty"=true --no-snapshot-updates
+        env:
+          MAVEN_OPTS: -Xmx2048m
2 changes: 2 additions & 0 deletions .github/workflows/code-analysys.yml
@@ -21,9 +21,11 @@ on:
     paths-ignore:
       - 'docs/**'
       - '**/*.md'
+      - 'seatunnel-ui/**'
 jobs:
   build:
     runs-on: ubuntu-latest
+    timeout-minutes: 120
     steps:
       - uses: actions/checkout@v2
         with:
3 changes: 2 additions & 1 deletion .github/workflows/codeql.yaml
@@ -22,12 +22,13 @@ on:
     paths-ignore:
       - 'docs/**'
       - '**/*.md'
+      - 'seatunnel-ui/**'
 
 jobs:
   analyze:
     name: Analyze
     runs-on: ubuntu-latest
-    timeout-minutes: 30
+    timeout-minutes: 60
     env:
       JAVA_TOOL_OPTIONS: -Xmx2G -Xms2G -Dhttp.keepAlive=false -Dmaven.test.skip=true -Dcheckstyle.skip=true -Dlicense.skipAddThirdParty=true -Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false -Dmaven.wagon.http.retryHandler.count=3 -Dmaven.wagon.httpconnectionManager.ttlSeconds=120
3 changes: 2 additions & 1 deletion .github/workflows/docker.yml
@@ -25,6 +25,7 @@ on:
     paths-ignore:
       - 'docs/**'
       - '**/*.md'
+      - 'seatunnel-ui/**'
 
 concurrency:
   group: docker-${{ github.event.pull_request.number || github.ref }}
@@ -34,7 +35,7 @@
   check:
     name: Spark
     runs-on: ubuntu-latest
-    timeout-minutes: 30
+    timeout-minutes: 60
     steps:
       - uses: actions/checkout@v2
       - name: Set up JDK 1.8
2 changes: 1 addition & 1 deletion .github/workflows/license.yml
@@ -36,7 +36,7 @@ jobs:
     steps:
       - uses: actions/checkout@v2
       - name: Check License Header
-        uses: apache/skywalking-eyes@main
+        uses: apache/skywalking-eyes/header@501a28d2fb4a9b962661987e50cf0219631b32ff
   auto-license:
     name: Auto License
     runs-on: ubuntu-latest
1 change: 1 addition & 0 deletions .github/workflows/publish-docker.yaml
@@ -24,6 +24,7 @@ on:
     paths-ignore:
       - 'docs/**'
       - '**/*.md'
+      - 'seatunnel-ui/**'
 
 env:
   HUB: ghcr.io/${{ github.repository }}
3 changes: 2 additions & 1 deletion .gitignore
@@ -13,6 +13,7 @@ target/
 # Intellij Idea files
 .idea/
 *.iml
+.idea/*
 
 .DS_Store
 
@@ -40,4 +41,4 @@ Test.scala
 test.conf
 log4j.properties
 spark-warehouse
-*.flattened-pom.xml
\ No newline at end of file
+*.flattened-pom.xml
262 changes: 262 additions & 0 deletions docs/en/connector-v2/sink/File.mdx
@@ -0,0 +1,262 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# File

## Description

Output data to a local, HDFS, or S3 file.

## Options

<Tabs
groupId="engine-type"
defaultValue="LocalFile"
values={[
{label: 'LocalFile', value: 'LocalFile'},
{label: 'HdfsFile', value: 'HdfsFile'},
]}>
<TabItem value="LocalFile">

| name | type | required | default value |
| --------------------------------- | ------ | -------- | ------------------------------------------------------------- |
| path | string | yes | - |
| file_name_expression | string | no | "${transactionId}" |
| file_format | string | no | "text" |
| filename_time_format | string | no | "yyyy.MM.dd" |
| field_delimiter | string | no | '\001' |
| row_delimiter | string | no | "\n" |
| partition_by | array | no | - |
| partition_dir_expression | string | no | "\${k0}=\${v0}\/\${k1}=\${v1}\/...\/\${kn}=\${vn}\/" |
| is_partition_field_write_in_file | boolean| no | false |
| sink_columns | array | no | When this parameter is empty, all fields are sink columns |
| is_enable_transaction | boolean| no | true |
| save_mode | string | no | "error" |

### path [string]

The target directory path is required. An HDFS file path starts with `hdfs://`, and a local file path starts with `file://`.
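
For illustration, both path styles under the rules above (the directories and the namenode address are made up):

```bash
# Local file system target
path="file:///tmp/seatunnel/output"
# HDFS target
path="hdfs://namenode:8020/tmp/seatunnel/output"
```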

### file_name_expression [string]

`file_name_expression` describes the expression used to generate file names in `path`. You can use the variables `${now}` or `${uuid}` in `file_name_expression`, for example `test_${uuid}_${now}`; `${now}` represents the current time, and its format can be defined by specifying the option `filename_time_format`.

Please note that if `is_enable_transaction` is `true`, `${transactionId}_` will automatically be prepended to the file name.
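
As a sketch of how these options combine (the transaction id format is not documented here, so it is shown as a placeholder):

```bash
file_name_expression="test_${uuid}_${now}"
filename_time_format="yyyy.MM.dd"
file_format="text"
# with is_enable_transaction=true, a produced file could be named:
# <transactionId>_test_<uuid>_2022.07.23.txt
```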

### file_format [string]

Currently, the only supported `file_format` is `text`.

Please note that the final file name ends with the suffix of the file format; for a text file, the suffix is `txt`.

### filename_time_format [string]

When the `file_name_expression` parameter contains a time variable such as `xxxx-${now}`, `filename_time_format` specifies the time format of the path, and the default value is `yyyy.MM.dd`. The commonly used time formats are listed as follows:

| Symbol | Description |
| ------ | ------------------ |
| y | Year |
| M | Month |
| d | Day of month |
| H | Hour in day (0-23) |
| m | Minute in hour |
| s | Second in minute |

See [Java SimpleDateFormat](https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) for detailed time format syntax.
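
For example, a few illustrative mappings from format string to the rendered `${now}` value, assuming a hypothetical timestamp of 2022-07-23 14:05:09:

```bash
# "yyyy.MM.dd"      ->  2022.07.23
# "yyyy-MM-dd HH"   ->  2022-07-23 14
# "yyyyMMddHHmmss"  ->  20220723140509
```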

### field_delimiter [string]

The separator between columns in a row of data.

### row_delimiter [string]

The separator between rows in a file.
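
As a plain-text sketch, two rows of made-up (`name`, `age`) data written with `field_delimiter="\t"` and the default row delimiter would be laid out as:

```
tom	25
jerry	23
```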

### partition_by [array]

Partition the data based on the selected fields.

### partition_dir_expression [string]

If `partition_by` is specified, the corresponding partition directories are generated based on the partition information, and the final files are placed into those partition directories.

The default `partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`, where `k0` is the first partition field and `v0` is the value of the first partition field.
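
A sketch of the resulting directory layout for a hypothetical `partition_by=["age", "sex"]` with the default `partition_dir_expression`:

```
/<path>/age=20/sex=male/<data files>
/<path>/age=20/sex=female/<data files>
/<path>/age=27/sex=male/<data files>
```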

### is_partition_field_write_in_file [boolean]

If `is_partition_field_write_in_file` is `true`, the partition fields and their values are written into the data file itself.

For example, if you want to write Hive data files, its value should be `false`.
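
To illustrate, take a made-up row `(name="tom", age=25)` with `partition_by=["age"]` and the default `\001` field delimiter (shown below as `<\001>`):

```
# is_partition_field_write_in_file=false, the file under age=25/ holds:
tom
# is_partition_field_write_in_file=true, the file under age=25/ holds:
tom<\001>25
```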

### sink_columns [array]

Which columns to write to the file; by default, all columns obtained from the `Transform` or `Source` are written. The order of the fields determines the order in which the file is actually written.

### is_enable_transaction [boolean]

If `is_enable_transaction` is `true`, we ensure that data will not be lost or duplicated when it is written to the target directory.

Please note that if `is_enable_transaction` is `true`, `${transactionId}_` will automatically be prepended to the file name.

Only `true` is supported for now.

### save_mode [string]

Storage mode; currently supports `overwrite`, `append`, `ignore`, and `error`. For the specific meaning of each mode, see [save-modes](https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes).

Streaming jobs do not support `overwrite`.

</TabItem>
<TabItem value="HdfsFile">

In order to use this connector, you must ensure that your Spark/Flink cluster is already integrated with Hadoop. The tested Hadoop version is 2.x.

| name | type | required | default value |
| --------------------------------- | ------ | -------- | ------------------------------------------------------------- |
| path | string | yes | - |
| file_name_expression | string | no | "${transactionId}" |
| file_format | string | no | "text" |
| filename_time_format | string | no | "yyyy.MM.dd" |
| field_delimiter | string | no | '\001' |
| row_delimiter | string | no | "\n" |
| partition_by | array | no | - |
| partition_dir_expression | string | no | "\${k0}=\${v0}\/\${k1}=\${v1}\/...\/\${kn}=\${vn}\/" |
| is_partition_field_write_in_file | boolean| no | false |
| sink_columns | array | no | When this parameter is empty, all fields are sink columns |
| is_enable_transaction | boolean| no | true |
| save_mode | string | no | "error" |

### path [string]

The target directory path is required. An HDFS file path starts with `hdfs://`, and a local file path starts with `file://`.

### file_name_expression [string]

`file_name_expression` describes the expression used to generate file names in `path`. You can use the variables `${now}` or `${uuid}` in `file_name_expression`, for example `test_${uuid}_${now}`; `${now}` represents the current time, and its format can be defined by specifying the option `filename_time_format`.

Please note that if `is_enable_transaction` is `true`, `${transactionId}_` will automatically be prepended to the file name.

### file_format [string]

Currently, the only supported `file_format` is `text`.

Please note that the final file name ends with the suffix of the file format; for a text file, the suffix is `txt`.

### filename_time_format [string]

When the `file_name_expression` parameter contains a time variable such as `xxxx-${now}`, `filename_time_format` specifies the time format of the path, and the default value is `yyyy.MM.dd`. The commonly used time formats are listed as follows:

| Symbol | Description |
| ------ | ------------------ |
| y | Year |
| M | Month |
| d | Day of month |
| H | Hour in day (0-23) |
| m | Minute in hour |
| s | Second in minute |

See [Java SimpleDateFormat](https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) for detailed time format syntax.

### field_delimiter [string]

The separator between columns in a row of data.

### row_delimiter [string]

The separator between rows in a file.

### partition_by [array]

Partition the data based on the selected fields.

### partition_dir_expression [string]

If `partition_by` is specified, the corresponding partition directories are generated based on the partition information, and the final files are placed into those partition directories.

The default `partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`, where `k0` is the first partition field and `v0` is the value of the first partition field.

### is_partition_field_write_in_file [boolean]

If `is_partition_field_write_in_file` is `true`, the partition fields and their values are written into the data file itself.

For example, if you want to write Hive data files, its value should be `false`.

### sink_columns [array]

Which columns to write to the file; by default, all columns obtained from the `Transform` or `Source` are written. The order of the fields determines the order in which the file is actually written.

### is_enable_transaction [boolean]

If `is_enable_transaction` is `true`, we ensure that data will not be lost or duplicated when it is written to the target directory.

Please note that if `is_enable_transaction` is `true`, `${transactionId}_` will automatically be prepended to the file name.

Only `true` is supported for now.

### save_mode [string]

Storage mode; currently supports `overwrite`, `append`, `ignore`, and `error`. For the specific meaning of each mode, see [save-modes](https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes).

Streaming jobs do not support `overwrite`.
</TabItem>
</Tabs>

## Example

<Tabs
groupId="engine-type"
defaultValue="LocalFile"
values={[
{label: 'LocalFile', value: 'LocalFile'},
{label: 'HdfsFile', value: 'HdfsFile'},
]}>
<TabItem value="LocalFile">

```bash

LocalFile {
path="file:///tmp/hive/warehouse/test2"
field_delimiter="\t"
row_delimiter="\n"
partition_by=["age"]
partition_dir_expression="${k0}=${v0}"
is_partition_field_write_in_file=true
file_name_expression="${transactionId}_${now}"
file_format="text"
sink_columns=["name","age"]
filename_time_format="yyyy.MM.dd"
is_enable_transaction=true
save_mode="error"
}

```
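
For context, a sink block like the one above normally sits inside the `sink` section of a complete SeaTunnel job config. A minimal sketch, in which the `env` settings and the `FakeSource` source are illustrative rather than part of this PR:

```bash
env {
  execution.parallelism = 1
}

source {
  FakeSource {
    result_table_name = "fake"
  }
}

sink {
  LocalFile {
    path = "file:///tmp/hive/warehouse/test2"
    file_format = "text"
    save_mode = "error"
  }
}
```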

</TabItem>

<TabItem value="HdfsFile">

```bash

HdfsFile {
path="hdfs:///tmp/hive/warehouse/test2"
field_delimiter="\t"
row_delimiter="\n"
partition_by=["age"]
partition_dir_expression="${k0}=${v0}"
is_partition_field_write_in_file=true
file_name_expression="${transactionId}_${now}"
file_format="text"
sink_columns=["name","age"]
filename_time_format="yyyy.MM.dd"
is_enable_transaction=true
save_mode="error"
}

```

</TabItem>
</Tabs>