Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

input streams opened from Paths fail when the underlying file is a pipe #1084

Open
lbergelson opened this issue Feb 8, 2018 · 13 comments
Open

Comments

@lbergelson
Copy link
Member

lbergelson commented Feb 8, 2018

Opening a unix pipe using Files.newInputStream() results in an InputStream. However, if you call available() on that input stream it crashes with an error in native code due to an illegal seek. The implementation of available() is unsafe when using an unseekable backing stream. Unfortunately, this is a bug in the java standard library and we can't easily change it.

This is problematic because BufferedInputStream uses available as part of it's call to read().

There are some potential solutions.

  1. avoid using Files.newInputStream(). This would be really unfortunate since one of our goals is to transition away from files entirely.
  2. check if the input is not a regular file and wrap the resulting stream in a new wrapper that redefines available() in a safe way. This is probably the right thing to do but might be slightly tricky to implement correctly
  3. avoid use of available() by not using BufferedInputStream, GzipInputStream, or other classes that may use it.. This would be difficult and library code may make use of it without us realizing.

This is related to #1083 but this problem existed before #1077

lbergelson added a commit that referenced this issue Feb 8, 2018
fixing an issue that prevented reading a bam file from an unseekable file, ex: a unix pipe
fixes #1083 which was introduced by #1077

this only fixes the case where the bam is being opened as file, a similar issue exists for Paths, #1084
support for reading pipes as paths has never worked
@lbergelson
Copy link
Member Author

We should also file a bug with java.

lbergelson added a commit that referenced this issue Feb 8, 2018
fixing an issue that prevented reading a bam file from an unseekable file, ex: a unix pipe
fixes #1083 which was introduced by #1077

this only fixes the case where the bam is being opened as file, a similar issue exists for Paths, #1084
support for reading pipes as paths has never worked
lbergelson added a commit that referenced this issue Feb 8, 2018
* re-enable reading bam files from pipes
fixing an issue that prevented reading a bam file from an unseekable file, ex: a unix pipe
fixes #1083 which was introduced by #1077

this only fixes the case where the bam is being opened as file, a similar issue exists for Paths (#1084)
support for reading pipes as paths has never worked
@nh13
Copy link
Member

nh13 commented May 18, 2018

@lbergelson I just hit this issue with when upgrading from 2.13.0 to 2.14.3. Not supporting /dev/stdin as a path is a serious regression causing production issues on my end. I have confirmed it works with 2.13.0 and now fails (unexpectedly) in 2.14.3: fulcrumgenomics/fgbio#404

@nh13
Copy link
Member

nh13 commented May 18, 2018

@tfenne

@nh13
Copy link
Member

nh13 commented May 18, 2018

I can also confirm that reading from stdin works in 2.14.1 and not in 2.14.2, with the latter including #1077 and on.

@tfenne
Copy link
Member

tfenne commented May 18, 2018

@nh13 & @lbergelson I've opened PR #1118 which cleans up the test for named pipes a little bit, and more importantly makes it so that it now executes twice, once where the input resource is a java.io.File and again when the resource is a Path. The first version succeeds (which is what the test had previously) but the latter version fails.

@magicDGS since it sounds like it was your PR that affected this, could you take a look please? It looks to me like your implementation of available() is causing position() in SeekablePathStream to get called, which throws an exception if the path is a pipe. I'm not sure how to fix this in a way that still works for you usage.

@lbergelson
Copy link
Member Author

lbergelson commented May 18, 2018

@nh13 Did reading from Stdin as a path work before? I was under the impression that it had never worked.

There's a bug in the java implementation of Files.newInputStream() which makes it annoying to work around.
The following code fails:

    @Test
    public void testStdin() throws IOException {
        final InputStream inputStream = Files.newInputStream(Paths.get("/dev/stdin/"));
        inputStream.available();
    }
java.io.IOException: Illegal seek

	at sun.nio.ch.FileChannelImpl.position0(Native Method)
	at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264)
	at sun.nio.ch.ChannelInputStream.available(ChannelInputStream.java:116)
	at htsjdk.samtools.SamReaderFactoryTest.testStdin(SamReaderFactoryTest.java:539)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:85)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:639)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:816)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1124)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108)
	at org.testng.TestRunner.privateRun(TestRunner.java:774)
	at org.testng.TestRunner.run(TestRunner.java:624)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:359)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:354)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:312)
	at org.testng.SuiteRunner.run(SuiteRunner.java:261)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1215)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1140)
	at org.testng.TestNG.run(TestNG.java:1048)
	at org.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:72)
	at org.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:123)

@nh13
Copy link
Member

nh13 commented May 18, 2018

@lbergelson it fails in 2.14.2 and works in 2.14.1: fulcrumgenomics/fgbio#404

@lbergelson
Copy link
Member Author

If it worked before it worked with the very nasty caveat that the if you were reading something GZIP'd from a pipe you could have random corruption of the data you were reading. See broadinstitute/gatk#4224

The issue is that the GzipInputStream uses available() to determine end of stream in some cases, which is a misuse of available() and a violation of the contract, but it's been unpatched in java for a long time and there's no reason to think it will be fixed in the future. So if you have a stream that has a default implementation of available() then you risk corrupted data.

@nh13
Copy link
Member

nh13 commented May 18, 2018

That's concerning with respect to GzipInputStream, but we should still support it for SAM/BAM/CRAM right?

@tfenne
Copy link
Member

tfenne commented May 18, 2018

@lbergelson The regression we're seeing is when reading BAM specifically, which uses BlockCompressedInputStream which is totally independent from GzipInputStream - it does it's own stream handling and uses a BlockGunzipper to do the decompression. As @nh13 states, it was definitely working previously.

I think part of the problem is that we have too many code-paths for opening SAM/BAM/CRAM. Am I right in saying that all java.io.Files can be equally well modeled by java.nio.file.Paths? If that's the case perhaps we can eliminate FileInputResource in SamReaderFactory and have it just build the Path equivalent with file.toPath? That would simplify things and highlight problems like this faster. Thoughts?

@tfenne
Copy link
Member

tfenne commented May 18, 2018

Also, if the JDK GzipInputStream is so fundamentally broken, perhaps instead of @magicDGS's fix, we should just have our own one internally that is a copy/paste of the JDK's with a fix in place? It wouldn't be very hard to do.

@lbergelson
Copy link
Member Author

lbergelson commented May 18, 2018

Ah, you're right about it for bam. I was thinking of the tribble case.

We definitely have to many ways to open things. I would love to get rid of java.io.File completely because java.nio.file.Path should be able to handle everything that File can. I was assuming we would do that as part of the big changes for htsjdk3 that we started talking about a while ago. We could speed that bit up and do it first though if it would help fix this problem.

Having our own GzipInputStream with a fix isn't a bad idea, it's a pretty commonly used wrapper though, so it's very possible that other libraries will make use of it on top of htsjdk, which means we want our streams to be compatible with it I think.

@magicDGS
Copy link
Member

I can't look at it soon because I have some technical problems with my computer which are slowing down ny development. I definitely agree that this is a regression that should be fixed, and I will think about it it but I usually don't work with pipes in Java. Maybe wrapping the stream in s buffered one will help.

On the other hand, I guess that htsjdk3 should be java.nio.Path implemented, and forget about File...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants