[Connector-V2] Add Hudi Source #2147

Merged · 5 commits · Jul 11, 2022
57 changes: 57 additions & 0 deletions docs/en/connector-v2/source/Hudi.md
@@ -0,0 +1,57 @@
# Hudi

## Description

Used to read data from Hudi. Currently, only Hudi COW (copy-on-write) tables and Snapshot Queries in Batch Mode are supported.

## Options

| name | type | required | default value |
|--------------------------|---------|----------|---------------|
| table.path | string | yes | - |
| table.type | string | yes | - |
| conf.files | string | yes | - |
| use.kerberos | boolean | no | false |
| kerberos.principal | string | no | - |
| kerberos.principal.file | string | no | - |

### table.path [string]

`table.path` The HDFS root path of the Hudi table, such as 'hdfs://nameservice/data/hudi/hudi_table/'.

### table.type [string]

`table.type` The type of the Hudi table. Currently only 'cow' is supported; 'mor' is not supported yet.

### conf.files [string]

`conf.files` The list of environment configuration file paths (local paths, separated by ';'), used to initialize the HDFS client that reads the Hudi table files. For example, '/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml'.

### use.kerberos [boolean]

`use.kerberos` Whether to enable Kerberos authentication. Default is false.

### kerberos.principal [string]

`kerberos.principal` When Kerberos is enabled, the Kerberos principal must be set, such as 'test_user@xxx'.

### kerberos.principal.file [string]

`kerberos.principal.file` When Kerberos is enabled, the Kerberos keytab file path must be set, such as '/home/test/test_user.keytab'.

## Examples

```hocon
source {
  Hudi {
    table.path = "hdfs://nameservice/data/hudi/hudi_table/"
    table.type = "cow"
    conf.files = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
    use.kerberos = true
    kerberos.principal = "test_user@xxx"
    kerberos.principal.file = "/home/test/test_user.keytab"
  }
}
```
1 change: 1 addition & 0 deletions plugin-mapping.properties
@@ -105,3 +105,4 @@ seatunnel.sink.Jdbc = connector-jdbc
seatunnel.sink.HdfsFile = connector-file-hadoop
seatunnel.sink.LocalFile = connector-file-local
seatunnel.source.Pulsar = connector-pulsar
seatunnel.source.Hudi = connector-hudi
8 changes: 7 additions & 1 deletion pom.xml
@@ -109,7 +109,7 @@
<neo4j.connector.spark.version>4.1.0</neo4j.connector.spark.version>
<iceberg.version>0.13.1</iceberg.version>
<flink.version>1.13.6</flink.version>
<hudi.version>0.10.0</hudi.version>
<hudi.version>0.11.1</hudi.version>
<orc.version>1.5.6</orc.version>
<hive.exec.version>2.3.9</hive.exec.version>
<commons.logging.version>1.2</commons.logging.version>
@@ -499,6 +499,12 @@
<version>${flink.version}</version>
</dependency>

<dependency>
<groupId>org.apache.hudi</groupId>
<artifactId>hudi-hadoop-mr-bundle</artifactId>
<version>${hudi.version}</version>
</dependency>

<dependency>
<groupId>org.apache.hudi</groupId>
<artifactId>hudi-spark-bundle_${scala.binary.version}</artifactId>
5 changes: 5 additions & 0 deletions seatunnel-connectors-v2-dist/pom.xml
@@ -91,6 +91,11 @@
<artifactId>connector-file-local</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.seatunnel</groupId>
<artifactId>connector-hudi</artifactId>
<version>${project.version}</version>
</dependency>
</dependencies>

<build>
61 changes: 61 additions & 0 deletions seatunnel-connectors-v2/connector-hudi/pom.xml
@@ -0,0 +1,61 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--

Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>seatunnel-connectors-v2</artifactId>
<groupId>org.apache.seatunnel</groupId>
<version>${revision}</version>
</parent>
<modelVersion>4.0.0</modelVersion>

<artifactId>connector-hudi</artifactId>

<dependencies>

<dependency>
<groupId>org.apache.seatunnel</groupId>
<artifactId>seatunnel-hive-shade</artifactId>
<version>${project.version}</version>
</dependency>

<dependency>
<groupId>org.apache.seatunnel</groupId>
<artifactId>seatunnel-api</artifactId>
<version>${project.version}</version>
</dependency>

<dependency>
<groupId>org.apache.hudi</groupId>
<artifactId>hudi-hadoop-mr-bundle</artifactId>
</dependency>

<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>

<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
</dependency>
</dependencies>
</project>
@@ -0,0 +1,34 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.seatunnel.connectors.seatunnel.hudi.config;

public class HudiSourceConfig {

public static final String TABLE_PATH = "table.path";

public static final String TABLE_TYPE = "table.type";

public static final String CONF_FILES = "conf.files";

public static final String USE_KERBEROS = "use.kerberos";

public static final String KERBEROS_PRINCIPAL = "kerberos.principal";

public static final String KERBEROS_PRINCIPAL_FILE = "kerberos.principal.file";

}
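The class above only declares string option keys; actual lookup and validation happen elsewhere (e.g. in `HudiSource.prepare`). A minimal sketch of how such keys could be validated, using a plain `Map` as a hypothetical stand-in for SeaTunnel's `Config` type:

```java
import java.util.HashMap;
import java.util.Map;

public class HudiConfigDemo {
    // Hypothetical stand-ins for the keys defined in HudiSourceConfig.
    static final String TABLE_PATH = "table.path";
    static final String TABLE_TYPE = "table.type";
    static final String CONF_FILES = "conf.files";

    // Returns the first required key that is missing or empty, or null when all are present.
    static String firstMissingKey(Map<String, String> config, String... requiredKeys) {
        for (String key : requiredKeys) {
            if (!config.containsKey(key) || config.get(key).isEmpty()) {
                return key;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put(TABLE_PATH, "hdfs://nameservice/data/hudi/hudi_table/");
        config.put(TABLE_TYPE, "cow");
        // conf.files deliberately omitted, so it is reported as missing.
        System.out.println(firstMissingKey(config, TABLE_PATH, TABLE_TYPE, CONF_FILES));
    }
}
```

Prints `conf.files`, the first required key absent from the map.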
@@ -0,0 +1,29 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.seatunnel.connectors.seatunnel.hudi.exception;

public class HudiPluginException extends Exception {

public HudiPluginException(String message) {
super(message);
}

public HudiPluginException(String message, Throwable cause) {
super(message, cause);
}
}
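A checked exception like this is thrown with a plain message or wrapping a lower-level cause. A self-contained usage sketch (the exception type is re-declared locally here for illustration; the real one lives in the `hudi.exception` package, and `requireParquetFile` is a hypothetical helper):

```java
public class HudiPluginExceptionDemo {
    // Local re-declaration of the connector's exception type for a runnable demo.
    static class HudiPluginException extends Exception {
        HudiPluginException(String message) { super(message); }
        HudiPluginException(String message, Throwable cause) { super(message, cause); }
    }

    // Hypothetical helper: fail when a table path resolved to no parquet file.
    static void requireParquetFile(String filePath, String tablePath) throws HudiPluginException {
        if (filePath == null) {
            throw new HudiPluginException(
                String.format("%s has no parquet file, please check!", tablePath));
        }
    }

    public static void main(String[] args) {
        try {
            requireParquetFile(null, "hdfs://nameservice/data/hudi/hudi_table/");
        } catch (HudiPluginException e) {
            // The caught message carries the offending table path.
            System.out.println(e.getMessage());
        }
    }
}
```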
@@ -0,0 +1,145 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.seatunnel.connectors.seatunnel.hudi.source;

import static org.apache.seatunnel.connectors.seatunnel.hudi.config.HudiSourceConfig.CONF_FILES;
import static org.apache.seatunnel.connectors.seatunnel.hudi.config.HudiSourceConfig.KERBEROS_PRINCIPAL;
import static org.apache.seatunnel.connectors.seatunnel.hudi.config.HudiSourceConfig.KERBEROS_PRINCIPAL_FILE;
import static org.apache.seatunnel.connectors.seatunnel.hudi.config.HudiSourceConfig.TABLE_PATH;
import static org.apache.seatunnel.connectors.seatunnel.hudi.config.HudiSourceConfig.TABLE_TYPE;
import static org.apache.seatunnel.connectors.seatunnel.hudi.config.HudiSourceConfig.USE_KERBEROS;

import org.apache.seatunnel.api.common.PrepareFailException;
import org.apache.seatunnel.api.common.SeaTunnelContext;
import org.apache.seatunnel.api.serialization.DefaultSerializer;
import org.apache.seatunnel.api.serialization.Serializer;
import org.apache.seatunnel.api.source.Boundedness;
import org.apache.seatunnel.api.source.SeaTunnelSource;
import org.apache.seatunnel.api.source.SourceReader;
import org.apache.seatunnel.api.source.SourceSplitEnumerator;
import org.apache.seatunnel.api.table.type.SeaTunnelDataType;
import org.apache.seatunnel.api.table.type.SeaTunnelRow;
import org.apache.seatunnel.api.table.type.SeaTunnelRowType;
import org.apache.seatunnel.common.config.CheckConfigUtil;
import org.apache.seatunnel.common.config.CheckResult;
import org.apache.seatunnel.common.constants.PluginType;
import org.apache.seatunnel.connectors.seatunnel.hudi.exception.HudiPluginException;
import org.apache.seatunnel.connectors.seatunnel.hudi.util.HudiUtil;

import org.apache.seatunnel.shade.com.typesafe.config.Config;

import com.google.auto.service.AutoService;

import java.io.IOException;

@AutoService(SeaTunnelSource.class)
public class HudiSource implements SeaTunnelSource<SeaTunnelRow, HudiSourceSplit, HudiSourceState> {

private SeaTunnelContext seaTunnelContext;

private SeaTunnelRowType typeInfo;

private String filePath;

private String tablePath;

private String confFiles;

private boolean useKerberos = false;

@Override
public String getPluginName() {
return "Hudi";
}

@Override
public void prepare(Config pluginConfig) {
CheckResult result = CheckConfigUtil.checkAllExists(pluginConfig, TABLE_PATH, CONF_FILES);
if (!result.isSuccess()) {
throw new PrepareFailException(getPluginName(), PluginType.SOURCE, result.getMsg());
}
// The default hudi table type is cow
// TODO: support hudi mor table
// TODO: support Incremental Query and Read Optimized Query
if (!"cow".equalsIgnoreCase(pluginConfig.getString(TABLE_TYPE))) {
throw new PrepareFailException(getPluginName(), PluginType.SOURCE, "Do not support hudi mor table yet!");
}
try {
this.confFiles = pluginConfig.getString(CONF_FILES);
this.tablePath = pluginConfig.getString(TABLE_PATH);
if (CheckConfigUtil.isValidParam(pluginConfig, USE_KERBEROS)) {
this.useKerberos = pluginConfig.getBoolean(USE_KERBEROS);
if (this.useKerberos) {
CheckResult kerberosCheckResult = CheckConfigUtil.checkAllExists(pluginConfig, KERBEROS_PRINCIPAL, KERBEROS_PRINCIPAL_FILE);
if (!kerberosCheckResult.isSuccess()) {
throw new PrepareFailException(getPluginName(), PluginType.SOURCE, kerberosCheckResult.getMsg());
}
HudiUtil.initKerberosAuthentication(HudiUtil.getConfiguration(this.confFiles), pluginConfig.getString(KERBEROS_PRINCIPAL), pluginConfig.getString(KERBEROS_PRINCIPAL_FILE));
}
}
this.filePath = HudiUtil.getParquetFileByPath(this.confFiles, tablePath);
if (this.filePath == null) {
throw new HudiPluginException(String.format("%s has no parquet file, please check!", tablePath));
}
// should read the schema from config or from hudi metadata (wait until catalog support is done)
this.typeInfo = HudiUtil.getSeaTunnelRowTypeInfo(this.confFiles, this.filePath);

} catch (HudiPluginException | IOException e) {
throw new PrepareFailException(getPluginName(), PluginType.SOURCE, "Prepare HudiSource error.", e);
}

}

@Override
public void setSeaTunnelContext(SeaTunnelContext seaTunnelContext) {
this.seaTunnelContext = seaTunnelContext;
}

@Override
public SeaTunnelDataType<SeaTunnelRow> getProducedType() {
return this.typeInfo;
}

@Override
public SourceReader<SeaTunnelRow, HudiSourceSplit> createReader(SourceReader.Context readerContext) throws Exception {
return new HudiSourceReader(this.confFiles, readerContext, typeInfo);
}

@Override
public Boundedness getBoundedness() {
// Only support Snapshot Query now.
// After supporting Incremental Query and Read Optimized Query, we should support UNBOUNDED.
// TODO: support UNBOUNDED
return Boundedness.BOUNDED;
}

@Override
public SourceSplitEnumerator<HudiSourceSplit, HudiSourceState> createEnumerator(SourceSplitEnumerator.Context<HudiSourceSplit> enumeratorContext) throws Exception {
return new HudiSourceSplitEnumerator(enumeratorContext, tablePath, this.confFiles);
}

@Override
public SourceSplitEnumerator<HudiSourceSplit, HudiSourceState> restoreEnumerator(SourceSplitEnumerator.Context<HudiSourceSplit> enumeratorContext, HudiSourceState checkpointState) throws Exception {
return new HudiSourceSplitEnumerator(enumeratorContext, tablePath, this.confFiles, checkpointState);
}

@Override
public Serializer<HudiSourceState> getEnumeratorStateSerializer() {
return new DefaultSerializer<>();
}
}
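The `prepare` method above follows a common validation flow: check that required options exist, then reject unsupported table types before touching HDFS. A minimal sketch of that flow, using a plain `Map` as a hypothetical stand-in for SeaTunnel's `Config`/`CheckConfigUtil` machinery (the message strings here are illustrative, not the connector's exact ones):

```java
import java.util.HashMap;
import java.util.Map;

public class HudiPrepareDemo {
    // Hypothetical mirror of the HudiSource.prepare validation steps.
    static String validate(Map<String, String> config) {
        // Step 1: required options must be present (cf. CheckConfigUtil.checkAllExists).
        for (String key : new String[]{"table.path", "conf.files"}) {
            if (!config.containsKey(key)) {
                return "missing required option: " + key;
            }
        }
        // Step 2: only copy-on-write tables are supported so far.
        if (!"cow".equalsIgnoreCase(config.get("table.type"))) {
            return "unsupported table.type: " + config.get("table.type");
        }
        return "ok";
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("table.path", "hdfs://nameservice/data/hudi/hudi_table/");
        config.put("conf.files", "/home/test/hdfs-site.xml;/home/test/core-site.xml");
        config.put("table.type", "mor"); // MOR is rejected, matching the connector's check
        System.out.println(validate(config));
    }
}
```

In the real connector a failed check throws `PrepareFailException` instead of returning a string; the ordering (existence checks first, then the table-type gate) is the same.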