Questions about IIOAdapter #536

AnyListen · 2017-05-17T03:47:02Z

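IIOAdapter has two methods that must be implemented: open and create.
I want to keep the dictionaries on HDFS. Reading is easy to implement, but I am not sure what exactly the create method is for.
What I need to confirm:

Is create used to generate the bin files?
On Linux the bin files do not seem to be generated, so can create simply return null?
If bin or other files have to be generated when the dictionaries are initialized, will multiple nodes initializing at the same time against HDFS dictionaries cause write conflicts?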

AnyListen · 2017-05-18T06:41:15Z

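Below is the HDFS dictionary loader class I implemented. I have tested it and it works.
The root entry in the configuration file must be set to something like: http://ns1:50070/webhdfs/v1/hanlp/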

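import com.hankcs.hanlp.corpus.io.IIOAdapter;
import com.hankcs.hanlp.corpus.io.IOUtil;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class HdfsIoAdapter implements IIOAdapter {
    // Reads a dictionary file through the WebHDFS REST API and buffers it in memory.
    public InputStream open(String s) throws IOException {
        URL url = new URL(s+"?op=OPEN");
        System.out.println(url.toString());
        HttpURLConnection con = (HttpURLConnection)url.openConnection();
        // Treat any HTTP error (for example a missing file) as "not available".
        if (con.getResponseCode() >= 400){
            return null;
        }
        int contentLength = con.getContentLength();
        InputStream is = con.getInputStream();
        // Fall back to available() only when no Content-Length is reported;
        // available() is an estimate and may be smaller than the real size.
        if (contentLength <= 0){
            contentLength = is.available();
        }
        byte[] buffer = new byte[contentLength];
        IOUtil.readBytesFromOtherInputStream(is,buffer);
        // A ByteArrayInputStream reports its exact remaining size from available().
        return new ByteArrayInputStream(buffer);
    }

    // Writing is not implemented in this adapter.
    public OutputStream create(String s) throws IOException {
        return null;
    }
}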

There is one problem. My initial open method looked like this:

    public InputStream open(String s) throws IOException {
        URL url = new URL(s+"?op=OPEN");
        // Returns the raw network stream without buffering it first.
        return url.openStream();
    }

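With this version, is.available() inside readBytesFromOtherInputStream of the IOUtil class reports the wrong stream length. It is always 25536, which is probably the default HDFS buffer size rather than the real size. The data read is therefore wrong, and that triggers many errors further on.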
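The underlying issue is that InputStream.available() only estimates how many bytes can be read without blocking; it is not the total length of a network stream. A minimal sketch of reading to end of stream instead, using only the JDK (the class name StreamReads and its methods are illustrative and are not part of HanLP):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamReads {
    private StreamReads() {}

    /** Reads every byte until end of stream; never consults available(). */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 only at end of stream, however small each chunk is.
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** Copies a stream into memory, closes it, and returns an in-memory stream. */
    static InputStream buffered(InputStream in) throws IOException {
        try (InputStream source = in) {
            return new ByteArrayInputStream(readFully(source));
        }
    }
}
```

An adapter's open could return StreamReads.buffered(con.getInputStream()), so that any later call to available() sees the exact remaining size, at the cost of holding the whole file in memory.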

As for the create method of IIOAdapter, it seems to be used only for writing bin files, and by default it apparently does not need to be implemented.

AnyListen · 2017-05-18T06:53:21Z

I would also suggest adding a method to the IIOAdapter interface that checks whether a file exists.

hankcs · 2017-05-18T09:25:52Z

This problem is a duplicate: int availableBytes = is.available(); in com.hankcs.hanlp.corpus.io.ByteArrayOtherStream.ensureAvailableBytes #528
You can generate the bin locally and then upload it to HDFS.
What would the file-existence check be used for?

AnyListen · 2017-05-21T02:50:16Z

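Regarding point 2: generating the bin file and uploading it to HDFS is possible, but in many cases it is not practical. The bin file has to be regenerated every time the custom dictionary changes, so with this procedure the bin must be deleted or replaced each time, which is tedious. That is why I asked whether the Linux environment skips bin generation for custom dictionaries. At least in my usage so far, I have not seen a bin file generated for the custom dictionary.
The file-existence check is mainly so that the HanLP.Config.IOAdapter object can be used to test whether a file or path exists. That makes it easy to verify that the path is configured correctly and to switch to a built-in fallback when it is not. Currently, when a file does not exist, the code throws an exception and logs the error. With an existence check, the exception can be avoided and more custom handling becomes possible.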

hankcs · 2017-05-21T06:23:52Z

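I see what you mean.

A bin is generated on every platform alike. The difference is that all user dictionaries are merged into a single CustomDictionary.txt.bin. The cache was designed to speed up loading significantly; it is one to two orders of magnitude faster than Java's own DataInputStream, which makes cold starts much friendlier when debugging. It also brings some trouble, though: it is not a general-purpose design, and many people would rather not have to think about it. A configuration option that controls whether the cache mechanism is enabled will be added in the future. Alternatively, the bin could simply be skipped when open on the bin file returns null or throws an exception.
My thinking is this: an IOAdapter may perform any logic inside open, including checking whether the path is correct and switching to fallback data when a file does not exist, but it must return a valid InputStream in the end. The models are currently loaded statically, and there is no hook for retrying after a failed load; a failure is final. This may be a poor design, and the IO design of an algorithm library does tend to be weaker. In practice, any exception that is thrown can be caught: if an exception is caught at startup, notify the upper business layer that initialization failed. See: When the segmentation model fails to load, Tomcat dies #116 (comment)
As of this issue, com.hankcs.hanlp.corpus.io.IOUtil#readBytesFromOtherInputStream(java.io.InputStream) is still incorrect and will cause the problem you describe.
In any case, I am open on this question and welcome further discussion.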

AnyListen · 2017-05-21T15:43:16Z

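On Linux, I do not see any generated bin file in the custom dictionary folder. I had also deleted CustomDictionary.txt.bin before using it.
On the second point, I see what you mean. My current setup uses the HDFS dictionaries first, the local dictionaries as a fallback, and the mini dictionary inside the jar as the last resort, which is why the current exception handling does not feel very friendly to me. It is indeed possible to put all of the fallback logic inside the open method as you suggest, and I can try that in practice.
I will also keep following the third problem.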
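Placing the fallback chain inside open, as discussed above, could look roughly like the sketch below. StreamOpener is a hypothetical stand-in for the open half of IIOAdapter so that the example compiles on its own, and FallbackOpener is an illustrative name. Each source is assumed to map the requested path to its own storage (HDFS, local disk, or the jar).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the open half of IIOAdapter. */
interface StreamOpener {
    InputStream open(String path) throws IOException;
}

/** Tries each source in order and returns the first stream that opens. */
final class FallbackOpener implements StreamOpener {
    private final List<StreamOpener> sources;

    FallbackOpener(StreamOpener... sources) {
        this.sources = Arrays.asList(sources);
    }

    @Override
    public InputStream open(String path) throws IOException {
        IOException last = null;
        for (StreamOpener source : sources) {
            try {
                InputStream in = source.open(path);
                if (in != null) {
                    return in;      // first usable source wins
                }
            } catch (IOException e) {
                last = e;           // remember the failure and try the next source
            }
        }
        // Every source failed: throw instead of handing back null.
        throw new IOException("no source could open " + path, last);
    }
}
```

With three sources registered in priority order, a caller only ever sees a valid stream or a single IOException, which matches the requirement that open must return a valid InputStream.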

hankcs · 2017-05-22T00:28:15Z

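It may be a permissions problem; you can turn on debug mode to check.
The exception handling is indeed too crude; feel free to change it as you need.
I have just committed a patch: e6f0617. Testing is welcome.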

AnyListen · 2017-05-22T08:44:06Z

I will run an integration test on the new version tomorrow and report the results then. Thanks.

AnyListen · 2017-05-24T01:58:35Z

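Regarding point 3: tested, and the current approach reads the data from the stream correctly.
Exception handling is a matter of opinion. It can be left as it is for now, since there is no way to know whether the user has a fallback dictionary.
I still think IIOAdapter needs a method that checks whether a path exists, though. See the code below: in ioType.equals("hdfs") && !HdfsUtils.isDirExits(rootPath) and in the File check at the end, I use backend-specific path checks. If IIOAdapter had a path-checking method, I could handle both uniformly with HanLP.Config.IOAdapter.isFileExits(rootPath) instead of writing code for each case. This is just one use case, but the pattern is very useful whenever there are one or more fallback options. I hope you will consider it.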

        String rootPath = settings.get("analysis.hanlp.rootPath", null);
        if (TextUtility.isBlank(rootPath)){
            return;
        }
        String ioType = settings.get("analysis.hanlp.ioType", "file");
        if (!isSuccess || (ioType.equals("hdfs") && !HdfsUtils.isDirExits(rootPath))){
            HanLP.Config.IOAdapter = new FileIOAdapter();
            rootPath = settings.get("analysis.hanlp.localDicPath", null);
            ioType = "file";
            if (TextUtility.isBlank(rootPath)){
                return;
            }
        }
        File f = new File(rootPath);
        if (!f.exists()){
            return;
        }
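The request above amounts to one extra method on the adapter. A sketch of what that might look like follows; PathChecker, LocalPathChecker, and RootResolver are hypothetical names, and HanLP's IIOAdapter declares only open and create. An HDFS-backed implementation would answer exists through its own client.

```java
import java.io.File;

/** Hypothetical extension point; not part of HanLP's IIOAdapter. */
interface PathChecker {
    boolean exists(String path);
}

/** Local-disk implementation backed by java.io.File. */
final class LocalPathChecker implements PathChecker {
    @Override
    public boolean exists(String path) {
        return path != null && new File(path).exists();
    }
}

final class RootResolver {
    private RootResolver() {}

    /** Returns the first non-blank candidate root that exists, or null if none does. */
    static String firstExisting(PathChecker checker, String... candidates) {
        for (String candidate : candidates) {
            if (candidate != null && !candidate.trim().isEmpty() && checker.exists(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

With such a method, the HDFS-specific check and the trailing File check in the snippet above could collapse into a single call against whichever adapter is configured.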

hankcs · 2017-05-24T06:21:17Z

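Thanks for the test feedback.
I will think further about an interface for checking whether a path exists. If many people need it, it will be added.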

SupDataKing · 2017-06-08T11:20:31Z

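Reading the dictionary files from HDFS causes a Java heap size overflow.
Configuration file:
#IOAdapter=com.hankcs.hanlp.corpus.io.FileIOAdapter
IOAdapter=com.fb.etl.main.HdfsIoAdapter
The HdfsIoAdapter class is based on the one provided by @AnyListen.
Calling it directly from a local machine (reading the dictionaries from HDFS):
List sentenceList = HanLP.segment(text) triggers the problem.

I traced it. When com.hankcs.hanlp.dictionary.CoreDictionary.Attribute loads
the CoreNatureDictionary.txt.bin file, the size of the Attribute array is 1008813135, and then an out-of-memory error is reported. On a single local machine the size is only 153115.

I do not know what causes this and hope someone can help. If there is another way to read dictionaries from HDFS, please let me know.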

hankcs · 2017-06-08T11:27:05Z

Check whether the md5 of the bin files in the two locations matches.
Try https://github.com/hualongdata/hanlp-ext
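The checksum comparison suggested above can also be done from Java against whatever stream the adapter returns. This is a minimal sketch using only the JDK; the class name Md5 is illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5 {
    private Md5() {}

    /** Returns the lowercase hex MD5 digest of everything remaining in the stream. */
    static String hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] chunk = new byte[8192];
        int n;
        // Read until end of stream so the digest covers the whole file.
        while ((n = in.read(chunk)) != -1) {
            md.update(chunk, 0, n);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Computing Md5.hex once over a FileInputStream of the local bin and once over the stream returned by the HDFS adapter's open shows whether the adapter delivers the same bytes; a mismatch points at the read path rather than at the file on HDFS.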

SupDataKing · 2017-06-09T01:22:06Z

Thanks @hankcs, it works after debugging with hanlp-ext as a reference!

nylqd · 2017-06-14T10:08:56Z

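The arguments passed in are all paths, so have you tried using the HDFS API directly? FSDataInputStream and FSDataOutputStream can be used directly as InputStream and OutputStream.
Below is the implementation I use. It has worked fine so far and is offered for reference only.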

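// Shared Hadoop client state, initialized once when the class is loaded.
private static Configuration conf = new Configuration();
private static FileSystem fileSystem = null;

static {
    // "server ip" is a placeholder for the NameNode address.
    // fs.default.name is the older name of fs.defaultFS.
    conf.set("fs.default.name", "server ip");
    try {
        fileSystem = FileSystem.get(conf);
    } catch (IOException e) {
        System.out.println("where is my hdfs?");
        e.printStackTrace();
    }
}

// FSDataInputStream is an InputStream, so it can be returned directly.
public InputStream open(String s) throws IOException {
    Path path = new Path(s);

    return fileSystem.open(path);
}

// FSDataOutputStream is an OutputStream, so it can be returned directly.
public OutputStream create(String s) throws IOException {
    Path path = new Path(s);

    return fileSystem.create(path);
}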

hankcs added the duplicated label May 18, 2017

hankcs added a commit that referenced this issue May 21, 2017

Fix readBytesFromOtherInputStream on HDFS: #536 (comment)

e6f0617

hankcs added the improvement label May 21, 2017

AnyListen closed this as completed May 22, 2017

AnyListen reopened this May 24, 2017

AnyListen closed this as completed May 26, 2017
