int availableBytes = is.available(); in com.hankcs.hanlp.corpus.io.ByteArrayOtherStream.ensureAvailableBytes #528

Closed
realgzq opened this issue May 10, 2017 · 1 comment

realgzq commented May 10, 2017

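Notes

Please confirm the following:

  • I have carefully read the documents below and could not find an answer:
  • I have searched for my problem on Google and in the issue tracker and could not find an answer.
  • I understand that the open-source community is a free community brought together by shared interest and bears no responsibility or obligation. I will be polite and thank everyone who helps me.
  • I am entering an x in these brackets to confirm the items above.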

Version

The current latest version is: portable-1.3.3
The version I am using is: portable-1.3.2

My problem

When HanLP is deployed under Resin, a NullPointerException is thrown while loading the dictionary files.

... 53 more

Caused by: java.lang.NullPointerException
at com.hankcs.hanlp.corpus.io.ByteArrayOtherStream.ensureAvailableBytes(ByteArrayOtherStream.java:72)
at com.hankcs.hanlp.corpus.io.ByteArrayStream.nextInt(ByteArrayStream.java:56)
at com.hankcs.hanlp.collection.trie.DoubleArrayTrie.load(DoubleArrayTrie.java:531)
at com.hankcs.hanlp.collection.trie.DoubleArrayTrie.load(DoubleArrayTrie.java:516)
at com.hankcs.hanlp.dictionary.common.CommonDictionary.loadDat(CommonDictionary.java:90)
at com.hankcs.hanlp.dictionary.common.CommonDictionary.load(CommonDictionary.java:44)
at com.hankcs.hanlp.dictionary.nr.PersonDictionary.(PersonDictionary.java:57)

Reproducing the problem

It reproduces every time.

Steps

Under Resin, calling the following code throws the exception:
List<com.hankcs.hanlp.seg.common.Term> segment = HanLP.segment(episodeName);

Triggering code

com.hankcs.hanlp.corpus.io.ByteArrayOtherStream.ensureAvailableBytes(ByteArrayOtherStream.java:72)

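After debugging, I found that Resin's InputStream implementation of is.available() reports at most 8192 bytes per call rather than the full amount of remaining data. After the first read of 8192 bytes, with the data not yet fully consumed, the code mistakenly closes the InputStream, so the second read fails with the error above: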
if (readBytes == availableBytes)
{
is.close();
is = null;
}
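
The failure mode described above can be reproduced outside Resin with a small sketch. Everything here is an assumption for illustration: CappedStream is a hypothetical stand-in for a container stream whose available() is capped at 8192, and bytesLeftWhenOldCheckCloses only imitates the readBytes == availableBytes check; neither is HanLP or Resin code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class CappedAvailableDemo {
    /** Hypothetical stream: available() never reports more than `cap` bytes. */
    static class CappedStream extends ByteArrayInputStream {
        private final int cap;

        CappedStream(byte[] data, int cap) {
            super(data);
            this.cap = cap;
        }

        @Override
        public synchronized int available() {
            return Math.min(cap, super.available());
        }
    }

    /**
     * Imitates the old check: read available() bytes once, and if the read
     * returned exactly that many, treat the stream as finished. Instead of
     * closing, this counts how many bytes would have been thrown away.
     */
    static int bytesLeftWhenOldCheckCloses(InputStream is) throws IOException {
        int availableBytes = is.available();
        byte[] buf = new byte[availableBytes];
        int readBytes = is.read(buf, 0, availableBytes);
        if (readBytes != availableBytes) {
            return 0; // the old check would not have closed the stream here
        }
        int left = 0;
        while (is.read() != -1) {
            left++;
        }
        return left;
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new CappedStream(new byte[20000], 8192);
        System.out.println(bytesLeftWhenOldCheckCloses(is)); // prints 11808
    }
}
```

With 20000 bytes of data and a cap of 8192, the first read satisfies the equality check while 11808 bytes are still unread, which is the point where the stream gets closed and set to null.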

Below is the documentation for InputStream.available():
/**
 * Returns an estimate of the number of bytes that can be read (or
 * skipped over) from this input stream without blocking by the next
 * invocation of a method for this input stream. The next invocation
 * might be the same thread or another thread. A single read or skip of this
 * many bytes will not block, but may read or skip fewer bytes.
 *
 * [Reporter's note: the next paragraph states that not all bytes may be returned.]
 *
 * Note that while some implementations of {@code InputStream} will return
 * the total number of bytes in the stream, many will not. It is
 * never correct to use the return value of this method to allocate
 * a buffer intended to hold all data in this stream.
 *
 * A subclass' implementation of this method may choose to throw an
 * {@link IOException} if this input stream has been closed by
 * invoking the {@link #close()} method.
 *
 * The {@code available} method for class {@code InputStream} always
 * returns {@code 0}.
 *
 * This method should be overridden by subclasses.
 *
 * @return an estimate of the number of bytes that can be read (or skipped
 * over) from this input stream without blocking or {@code 0} when
 * it reaches the end of the input stream.
 * @exception IOException if an I/O error occurs.
 */
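
Since the documentation says available() must not be used to size a buffer for the whole stream, a common alternative is to read in fixed-size chunks until read() returns -1. The following is only a minimal sketch of that pattern; the class name ReadFully, the method readAll, and the 4096-byte chunk size are arbitrary choices made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadFully {
    /** Reads every remaining byte of the stream without calling available(). */
    static byte[] readAll(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096]; // arbitrary chunk size
        int n;
        // read() returns -1 only at end of stream; any other value is the
        // number of bytes actually placed into the chunk.
        while ((n = is.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = readAll(new ByteArrayInputStream(new byte[20000]));
        System.out.println(data.length); // prints 20000
    }
}
```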

Suggested code change:
if(is.available()==0)
{
is.close();
is = null;
}

Expected output

Actual output

Other information

hankcs (Owner) commented May 10, 2017

Thanks for the correction. I had indeed never tested on Resin, and my understanding of available() was lacking at the time.

Only FileInputStream's available() represents the total file size; see its source: http://blog.csdn.net/hongweideng/article/details/6897818.

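Your suggested fix is basically correct, but with network-based IO available() may return 0 from time to time; see http://www.cnblogs.com/MyFavorite/archive/2010/10/19/1855758.html. So the only way to know whether the end of the file has been reached is to actually read some data. At present, every HanLP bin file format carries redundant size fields, so the loader never requests more data after reaching the end of the file. The close operation has therefore been moved into com.hankcs.hanlp.corpus.io.ByteArrayOtherStream#close.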
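
As a rough illustration of the approach described here (read to find out how much data there is, and close only in close()), here is a minimal sketch. It is not the actual ByteArrayOtherStream code: LazyIntReader and TrickleStream are hypothetical names, and the big-endian byte order is an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical reader; never consults available() and closes only in close(). */
class LazyIntReader implements Closeable {
    /** Hypothetical network-like stream: available() is 0, reads yield 1 byte. */
    static class TrickleStream extends ByteArrayInputStream {
        TrickleStream(byte[] data) {
            super(data);
        }

        @Override
        public synchronized int available() {
            return 0;
        }

        @Override
        public synchronized int read(byte[] b, int off, int len) {
            return super.read(b, off, Math.min(len, 1));
        }
    }

    private final InputStream is;
    private final byte[] buf = new byte[4];

    LazyIntReader(InputStream is) {
        this.is = is;
    }

    /** Reads one int (big-endian assumed), looping because reads may be short. */
    int nextInt() throws IOException {
        int filled = 0;
        while (filled < 4) {
            int n = is.read(buf, filled, 4 - filled);
            if (n == -1) {
                throw new EOFException("stream ended after " + filled + " of 4 bytes");
            }
            filled += n;
        }
        return ((buf[0] & 0xFF) << 24) | ((buf[1] & 0xFF) << 16)
                | ((buf[2] & 0xFF) << 8) | (buf[3] & 0xFF);
    }

    @Override
    public void close() throws IOException {
        is.close();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0, 0, 1, 2, 0, 0, 0, 7};
        try (LazyIntReader reader = new LazyIntReader(new TrickleStream(data))) {
            System.out.println(reader.nextInt()); // prints 258
            System.out.println(reader.nextInt()); // prints 7
        }
    }
}
```

Even though TrickleStream always reports 0 available bytes, both values are read correctly, because end of stream is detected only by read() returning -1.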

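This issue should now be fixed. You are welcome to test it, and I will keep following up.

If there are still problems, feel free to reopen the issue.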
