hdfsreader plugin reports an error when reading text files #66

Closed
wgzhao opened this issue Dec 11, 2020 · 0 comments
Labels: bug (Something isn't working)

wgzhao (Owner) commented Dec 11, 2020

Describe the bug

The hdfsreader plugin reports an error when reading a text file.

The job JSON file is as follows:

{
    "job": {
        "setting": {
            "speed": {
                "byte": -1,
                "channel": 1
            }
        },
        "content": [
            {
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "print": "true"
                    }
                },
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": [
                            {
                                "index": 0,
                                "type": "string"
                            },
                            {
                                "index": 1,
                                "type": "long"
                            },
                            {
                                "index": 2,
                                "type": "date"
                            },
                            {
                                "index": 3,
                                "type": "boolean"
                            },
                            {
                                "index": 4,
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://sandbox-hdp.hortonworks.com:8020",
                        "path": "/tmp/out_orc",
                        "fileType": "text",
                        "fieldDelimiter": "\u0001",
                        "fileName": "test_none",
                        "encoding": "UTF-8"
                    }
                }
            }
        ]
    }
}

The execution output is as follows:

....
2020-12-11 21:27:24.903 [job-0] INFO  DFSUtil - get HDFS all files in path = [/tmp/out_orc]
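2020-12-11 21:27:26.459 [job-0] ERROR DFSUtil - Failed to check the type of file [hdfs://sandbox-hdp.hortonworks.com:8020/tmp/out_orc/test_none__48bb0c2c_c520_4406_ab12_8039dc277296]. Five file formats are currently supported: ORC, SEQUENCE, RCFile, TEXT, and CSV. Please check that the file type and the file itself are correct.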
2020-12-11 21:27:26.472 [job-0] INFO  StandAloneJobContainerCommunicator - Total 0 records, 0 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 0.00%
2020-12-11 21:27:26.474 [job-0] ERROR Engine - Code:[HdfsReader-10], Description:[Error reading file].  - Code:[HdfsReader-10], Description:[Error reading file].  - Failed to check the type of file [hdfs://sandbox-hdp.hortonworks.com:8020/tmp/out_orc/test_none__48bb0c2c_c520_4406_ab12_8039dc277296]. Five file formats are currently supported: ORC, SEQUENCE, RCFile, TEXT, and CSV. Please check that the file type and the file itself are correct. - java.lang.RuntimeException: hdfs://sandbox-hdp.hortonworks.com:8020/tmp/out_orc/test_none__48bb0c2c_c520_4406_ab12_8039dc277296 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [101, 115, 116, 10]
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:531)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:712)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:609)
	at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:152)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.isParquetFile(DFSUtil.java:893)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.checkHdfsFileType(DFSUtil.java:724)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.addSourceFileByType(DFSUtil.java:222)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.addSourceFileIfNotEmpty(DFSUtil.java:152)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.getHDFSAllFilesNORegex(DFSUtil.java:209)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.getHDFSAllFiles(DFSUtil.java:179)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.getAllFiles(DFSUtil.java:141)
	at com.alibaba.datax.plugin.reader.hdfsreader.HdfsReader$Job.prepare(HdfsReader.java:172)
	at com.alibaba.datax.core.job.JobContainer.prepareJobReader(JobContainer.java:702)
	at com.alibaba.datax.core.job.JobContainer.prepare(JobContainer.java:312)
	at com.alibaba.datax.core.job.JobContainer.start(JobContainer.java:115)
	at com.alibaba.datax.core.Engine.start(Engine.java:90)
	at com.alibaba.datax.core.Engine.entry(Engine.java:151)
	at com.alibaba.datax.core.Engine.main(Engine.java:169)
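
The stack trace shows DFSUtil.checkHdfsFileType probing the file through DFSUtil.isParquetFile, and the Parquet footer reader throwing because the last four bytes are not the magic number [80, 65, 82, 49] (ASCII "PAR1"). The bytes it found, [101, 115, 116, 10], decode to ASCII "est\n", which is what a plain text file would end with. A minimal sketch of a probe that returns false instead of throwing might look like the following. This is not the DataX code and not necessarily the fix applied for this issue: the class ParquetMagicCheck and its methods are hypothetical names, and the sketch reads a local file with java.io rather than going through HDFS.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of DataX or Parquet.
public class ParquetMagicCheck {
    // "PAR1" in ASCII, i.e. the bytes [80, 65, 82, 49] named in the error message.
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // True only when the given tail bytes are exactly the Parquet magic number.
    public static boolean hasParquetMagic(byte[] tail) {
        return Arrays.equals(tail, MAGIC);
    }

    // Reads the last four bytes of a local file and compares them to the magic.
    // A file shorter than four bytes is reported as "not Parquet" rather than an error.
    public static boolean isParquetFile(String localPath) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(localPath, "r")) {
            if (file.length() < MAGIC.length) {
                return false;
            }
            byte[] tail = new byte[MAGIC.length];
            file.seek(file.length() - MAGIC.length);
            file.readFully(tail);
            return hasParquetMagic(tail);
        }
    }
}
```

With a probe of this shape, a text file such as the one in this report would simply be classified as "not Parquet", and the type check could move on to the remaining formats.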

Environment

  • OS: CentOS 7.7.1908
  • JDK Version: openjdk 14
  • DataX Version: 3.1.4
wgzhao added the bug label Dec 11, 2020
wgzhao self-assigned this Dec 11, 2020
wgzhao closed this as completed Dec 12, 2020