PrintBGZFBlockInformation: a tool to dump information about blocks in a BGZF file #4239

Merged: droazen merged 5 commits into master from dr_bgzf_block_info_tool on Jan 15, 2019

droazen (Collaborator) commented Jan 23, 2018

No description provided.

droazen (Collaborator, Author) commented Jan 23, 2018

@jamesemery please review (after using the tool to help diagnose #4224)

codecov-io commented Jan 25, 2018

Codecov Report

Merging #4239 into master will increase coverage by 0.004%.
The diff coverage is 84%.

@@               Coverage Diff               @@
##              master     #4239       +/-   ##
===============================================
+ Coverage     87.048%   87.052%   +0.004%     
- Complexity     31445     31479       +34     
===============================================
  Files           1921      1923        +2     
  Lines         144977    145146      +169     
  Branches       16062     16081       +19     
===============================================
+ Hits          126199    126352      +153     
- Misses         12935     12944        +9     
- Partials        5843      5850        +7
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...te/hellbender/tools/PrintBGZFBlockInformation.java | 79.798% <79.798%> (ø) | 21 <21> (?) |
| ...ools/PrintBGZFBlockInformationIntegrationTest.java | 92.157% <92.157%> (ø) | 7 <7> (?) |
| ...institute/hellbender/engine/VariantWalkerBase.java | 100% <0%> (ø) | 14% <0%> (+1%) ⬆️ |
| ...ender/utils/haplotype/HaplotypeBAMDestination.java | 100% <0%> (ø) | 4% <0%> (ø) ⬇️ |
| ...roadinstitute/hellbender/engine/VariantWalker.java | 93.333% <0%> (ø) | 12% <0%> (ø) ⬇️ |
| ...nstitute/hellbender/engine/VariantLocusWalker.java | 89.091% <0%> (ø) | 17% <0%> (ø) ⬇️ |
| ...nstitute/hellbender/engine/MultiVariantWalker.java | 100% <0%> (ø) | 16% <0%> (ø) ⬇️ |
| ...aplotypecaller/HaplotypeCallerIntegrationTest.java | 88.262% <0%> (+0.027%) | 84% <0%> (ø) ⬇️ |

... and 6 more

jamesemery (Collaborator) left a comment

Mostly minor changes. I might insist, though, that for this to be a useful tool for diagnosing problems with (valid) .vcf.gz files that have null blocks before the final block, there should at least be a check for that case, as the code looks like it would stop reading blocks after that point.

Also, the error message for an empty file could probably be improved.

@Argument(fullName = "bgzf-file", doc = "The BGZF-format file for which to print block information", optional = false)
private String bgzfPathString;

@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME, doc = "File to which to write block information (if not specified, prints to standard output", optional = true)
Collaborator

missing parenthesis

Collaborator, Author

Fixed


if ( ! AbstractFeatureReader.hasBlockCompressedExtension(bgzfPathString) ) {
throw new UserException("File " + bgzfPathString + " does not end in a recognized BGZF file extension (" +
AbstractFeatureReader.BLOCK_COMPRESSED_EXTENSIONS + ")");
Collaborator

Minor nit, but you should probably use a string joiner on this set, as I suspect plain concatenation will mangle the formatting.

Contributor

Maybe a UserException.BadInput instead to be more specific?

Collaborator, Author

Done
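
A minimal sketch of the string-joiner suggestion in this thread. The class name and the extension values below are illustrative assumptions; the real contents of `AbstractFeatureReader.BLOCK_COMPRESSED_EXTENSIONS` are not shown in this PR, and this is not necessarily how the fix was written.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class ExtensionMessageSketch {
    // Hypothetical stand-in for AbstractFeatureReader.BLOCK_COMPRESSED_EXTENSIONS.
    static final Set<String> EXTENSIONS = new LinkedHashSet<>(Arrays.asList(".gz", ".bgz"));

    static String buildMessage(String path) {
        // String.join controls the separator explicitly instead of relying on Set.toString().
        return "File " + path + " does not end in a recognized BGZF file extension ("
                + String.join(", ", EXTENSIONS) + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildMessage("calls.vcf"));
    }
}
```

Concatenating the set directly would go through `Set.toString()`, which wraps the elements in square brackets; joining explicitly keeps the message format under the caller's control.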

@Argument(fullName = "bgzf-file", doc = "The BGZF-format file for which to print block information", optional = false)
private String bgzfPathString;

@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME, doc = "File to which to write block information (if not specified, prints to standard output", optional = true)
Collaborator

Missing ')'

Collaborator, Author

Fixed

public class PrintBGZFBlockInformationIntegrationTest extends CommandLineProgramTest {

@Test
public void testNormalInput() throws IOException {
Collaborator

You should probably add a test asserting that this behaves rationally for bad input (e.g., a file ending on a non-empty block, or one with null blocks interspersed among occupied blocks).

Collaborator, Author

I'll add a test with a corrupt BGZF file.

}
}

private BGZFBlockMetadata processNextBlock( InputStream stream, String streamSource) throws IOException {
Collaborator

This looks like it was pulled from htsjdk. Ideally, this logic would live in a common utility method exposed by htsjdk.

Collaborator, Author

Ideally, yes, but beyond the scope of this PR.

@@ -0,0 +1,1712 @@
Block at file offset 0
Collaborator

Perhaps these output files should have a header pointing to the file they are summarizing to avoid confusion?

Collaborator, Author

Good idea, added.

@ExperimentalFeature
@CommandLineProgramProperties(
summary = "Print information about the compressed blocks in a BGZF format file",
oneLineSummary = "Print information about the compressed blocks in a BGZF format file",
Collaborator

Maybe expand this comment to more explicitly summarize the compressed and uncompressed size.

Collaborator, Author

Not necessary I think -- the tool-level docs have that information.

jamesemery assigned droazen and unassigned jamesemery on Jan 25, 2018
jamesemery (Collaborator) commented Jan 25, 2018

Also, this seems to misbehave when run on a GZIP file that is NOT block-gzipped. I get the following result, which appears to be flawed in a number of ways:

Block at file offset 0
	- compressed size: 25442
	- uncompressed size: 2114545489

15:24:38.967 INFO  PrintBGZFBlockInformation - Shutting down engine
[January 25, 2018 3:24:38 PM EST] org.broadinstitute.hellbender.tools.diagnostics.PrintBGZFBlockInformation done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=190840832
***********************************************************************

A USER ERROR has occurred: Error while parsing BGZF file. Error message was: testBrokenFile/gnomADaccuracyTest.SynDip.unBlocked.vcf.gz has invalid uncompressedLength: -758824605

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
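
One plausible way to reject a plain GZIP file up front is to validate the block header before trusting any size fields. The sketch below follows the BGZF header layout described in the SAM/BAM specification (gzip magic bytes, the FEXTRA flag, and a `BC` extra subfield carrying BSIZE). It assumes the `BC` subfield is the first extra subfield, and the class and method names are hypothetical; it is not the code in this PR or in htsjdk.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class BgzfHeaderSketch {
    static final int HEADER_LENGTH = 18; // fixed gzip header plus the BGZF "BC" extra subfield

    /**
     * Returns the total size in bytes of the BGZF block that starts with 'header',
     * or -1 if the bytes do not look like a BGZF block header.
     */
    static int blockSize(byte[] header) {
        if (header.length < HEADER_LENGTH) {
            return -1;
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        boolean gzipMagic = (buf.get(0) & 0xFF) == 31 && (buf.get(1) & 0xFF) == 139 && buf.get(2) == 8;
        boolean hasExtraField = (buf.get(3) & 0x04) != 0;       // FLG.FEXTRA must be set for BGZF
        boolean hasBcSubfield = buf.get(12) == 0x42 && buf.get(13) == 0x43  // SI1 = 'B', SI2 = 'C'
                && (buf.getShort(14) & 0xFFFF) == 2;             // SLEN: payload is the 2-byte BSIZE
        if (!gzipMagic || !hasExtraField || !hasBcSubfield) {
            return -1;
        }
        return (buf.getShort(16) & 0xFFFF) + 1;                  // BSIZE stores (total block size - 1)
    }

    public static void main(String[] args) {
        byte[] bgzf = {0x1f, (byte) 0x8b, 8, 4, 0, 0, 0, 0, 0, (byte) 0xff, 6, 0, 0x42, 0x43, 2, 0, 0x1b, 0};
        byte[] plainGzip = {0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0};
        System.out.println(blockSize(bgzf));      // 28
        System.out.println(blockSize(plainGzip)); // -1
    }
}
```

With a check like this, a regular GZIP file would be reported as "not BGZF" at the first block rather than producing size values read from unrelated bytes.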

magicDGS (Contributor) left a comment

Some minor comments for documentation and simplification.

* A diagnostic tool that prints information about the compressed blocks in a BGZF format file,
* such as a .vcf.gz file.
*
* The output looks like this:
Contributor

To be properly formatted in the docgen, this requires the <p> HTML tag (I guess).

Collaborator, Author

Done

*
* The output looks like this:
*
* Block at file offset 0
Contributor

This probably can be in a <code> tag.

Collaborator, Author

Done

* ...
* etc.
*
* The output can be redirected to a file using the -O option.
Contributor

Also <p>.

Collaborator, Author

Done

programGroup = DiagnosticsAndQCProgramGroup.class
)
public class PrintBGZFBlockInformation extends CommandLineProgram {
private final Logger logger = LogManager.getLogger(PrintBGZFBlockInformation.class);
Contributor

CommandLineProgram already has a protected logger initialized for this.getClass().

Collaborator, Author

Good point -- removed

bgzfPath = IOUtils.getPath(bgzfPathString);

if ( ! Files.exists(bgzfPath) ) {
throw new UserException("File " + bgzfPathString + " does not exist");
Contributor

Maybe a UserException.CouldNotReadInputFile would be better, to narrow it down?

Collaborator, Author

Changed to UserException.CouldNotReadInputFile


try ( InputStream bgzfInputStream = Files.newInputStream(bgzfPath) ) {
BGZFBlockMetadata blockInfo = null;
while ( (blockInfo = processNextBlock(bgzfInputStream, bgzfPathString)) != null ) {
outStream.println(String.format("Block at file offset %d", blockInfo.blockOffset));
Contributor

Better to use outStream.printf() and drop the String.format().

Collaborator, Author

Done
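
For reference, a small sketch of the `printf` form suggested in this thread. The class and method names are hypothetical, and the stream is backed by a byte array purely so the result can be inspected; the PR's actual change may differ in detail.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

class PrintfSketch {
    static String render(long blockOffset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        PrintStream outStream = new PrintStream(bytes);
        // printf formats and writes in one call; %n emits the platform line
        // separator that println() would otherwise have appended.
        outStream.printf("Block at file offset %d%n", blockOffset);
        outStream.flush();
        return bytes.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(0));
    }
}
```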

outStream.println();
}
} catch ( IOException e ) {
throw new UserException("Error while parsing BGZF file. Error message was: " + e.getMessage(), e);
Contributor

Better a UserException.CouldNotReadInputFile, no?

Collaborator, Author

Yes, fixed.

@CommandLineProgramProperties(
summary = "Print information about the compressed blocks in a BGZF format file",
oneLineSummary = "Print information about the compressed blocks in a BGZF format file",
programGroup = DiagnosticsAndQCProgramGroup.class
Collaborator

I think this would be out-of-place in DiagnosticsAndQCProgramGroup, which currently contains things like metrics collectors, amongst other things. Other utilities like this are in OtherProgramGroup.

Collaborator, Author

Moved to OtherProgramGroup

droazen (Collaborator, Author) commented Jan 11, 2019

@jamesemery I added a pretty comprehensive set of tests for various kinds of corrupt BGZF files (as well as a regular GZIP file), and patched the tool to report something sensible in these cases. You can see the new tool output in the *.out files included in the branch. Needs another review from you before we can merge.

jamesemery (Collaborator) left a comment

Do with these comments what you will. The tests are good now, and they were the primary issue with this branch before, so once you have acknowledged/responded, 👍

}

// Emit a warning at the end if we encountered any terminator blocks before the final block:
if ( sawNonFinalTerminatorBlock ) {
Collaborator

I would almost think that this should also print out the location of the first one found, just so it's user-friendly and doesn't confuse anybody about there being two empty blocks in a row at the end.

Collaborator

Either that or clearly separate this from the very similarly formatted output line that appears right before it in the output.

Collaborator, Author

Done -- I added block numbers, and reference them in the error messages.
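
The idea of numbering blocks and reporting terminator blocks that appear before the end can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: it works on an in-memory byte array, assumes every block header is well formed (BSIZE at header offset 16, ISIZE in the last four bytes of each block, per the BGZF layout in the SAM/BAM specification), and all names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

class TerminatorScanSketch {
    // The standard 28-byte empty BGZF block that marks end-of-file.
    static final byte[] EMPTY_BLOCK = {
        0x1f, (byte) 0x8b, 0x08, 0x04, 0, 0, 0, 0, 0, (byte) 0xff, 0x06, 0x00,
        0x42, 0x43, 0x02, 0x00, 0x1b, 0x00, 0x03, 0x00, 0, 0, 0, 0, 0, 0, 0, 0
    };

    /** Returns the 1-based numbers of empty (terminator) blocks that are not the last block. */
    static List<Integer> nonFinalTerminators(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        List<Integer> terminators = new ArrayList<>();
        int offset = 0;
        int blockNumber = 0;
        while (offset + 18 <= file.length) {
            int blockSize = (buf.getShort(offset + 16) & 0xFFFF) + 1; // BSIZE stores size - 1
            if (blockSize < 26 || offset + blockSize > file.length) {
                break; // truncated or malformed block: stop scanning
            }
            blockNumber++;
            int uncompressedSize = buf.getInt(offset + blockSize - 4); // ISIZE
            if (uncompressedSize == 0) {
                terminators.add(blockNumber);
            }
            offset += blockSize;
        }
        // An empty block in the final position is the expected end-of-file marker.
        if (!terminators.isEmpty() && terminators.get(terminators.size() - 1) == blockNumber) {
            terminators.remove(terminators.size() - 1);
        }
        return terminators;
    }

    public static void main(String[] args) {
        byte[] twoTerminators = new byte[56];
        System.arraycopy(EMPTY_BLOCK, 0, twoTerminators, 0, 28);
        System.arraycopy(EMPTY_BLOCK, 0, twoTerminators, 28, 28);
        System.out.println(nonFinalTerminators(twoTerminators)); // [1]
    }
}
```

Reporting the block number alongside the warning lets a reader distinguish an unexpected mid-file terminator from the legitimate one at the end, which is the confusion raised in this thread.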

IntegrationTestSpec.assertEqualTextFiles(actualOutput, expectedOutput);
}

@Test(expectedExceptions= UserException.CouldNotReadInputFile.class)
Collaborator

Add a brief comment explaining what this test means.

Collaborator, Author

Added explanatory comments to all tests

jamesemery (Collaborator) left a comment

This output looks clearer, 👍

droazen merged commit 2d0a4d1 into master on Jan 15, 2019
droazen deleted the dr_bgzf_block_info_tool branch on January 15, 2019 18:23
5 participants