Efficient, high-quality streaming parsers and writers for 16 text-based formats used in bioinformatics.
The goal is to have the best possible parsers for the most hated and problematic formats.
Supported formats:
- VCF (4.2)
- VFF
- GenBank
- BED
- GFF2, GFF3, GTF, and GVF
- FASTA
- FASTA alignment
- FASTQ
- UCSC liftOver format
- pre-MAKEPED LINKAGE
- BGEE expression format
- Turtle and RDF
- Delimited text (e.g., CSV)
Features & choices:
- Reads and writes Java Streams, keeping only essential metadata in memory.
- Parses every part of a format, leaving nothing as text unnecessarily.
- Has a consistent API. Coordinates are always 0-indexed and text is always escaped as per the specification.
- Immutable, thread-safe, null-pointer-safe (`Optional<>`), and arbitrary-precision.
- All methods are declared in interfaces, records, enums, or final classes.
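To make those conventions concrete, here is a hedged sketch (not the library's real API; `BedLikeParser` and `Feature` are hypothetical names) of what they look like in practice: an immutable record as the data class, `Optional` instead of null, and a line parser that is a plain `Function<String, R>`:

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the library's conventions, not its real API.
final class BedLikeParser implements Function<String, BedLikeParser.Feature> {

    // Immutable data class; the optional 'name' column is Optional, never null.
    record Feature(String chromosome, long start, long end, Optional<String> name) { }

    @Override
    public Feature apply(String line) {
        String[] fields = line.split("\t");
        return new Feature(
                fields[0],
                Long.parseLong(fields[1]),  // 0-based start, as in BED itself
                Long.parseLong(fields[2]),
                fields.length > 3 ? Optional.of(fields[3]) : Optional.empty());
    }
}
```

Because the parser is a `Function`, it drops straight into `Stream.map`, which is how the real parsers are used in the examples below.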
This example reads, filters, and writes a VCF file.
import org.pharmgkb.parsers.vcf.*;
import org.pharmgkb.parsers.vcf.model.*;
Stream<VcfPosition> mitochondrialCalls = new VcfDataParser().parseFile(path)
    .filter(p -> p.chromosome().isMitochondrial());
new VcfDataWriter().writeToFile(mitochondrialCalls, filteredPath);
Compatible with Java 21 LTS and higher. You can get the artifacts from Maven Central.
Maven:
<dependency>
    <groupId>org.pharmgkb</groupId>
    <artifactId>bioio</artifactId>
    <version>0.3.0</version>
</dependency>
Gradle:
implementation 'org.pharmgkb:bioio:0.3.0'
SBT:
"org.pharmgkb" % "bioio" % "0.3.0"
Releases contain both fat JARs (with dependencies bundled) and thin JARs (without dependencies), independently for each subproject (e.g. bioio-vcf for VCF, or bioio-gff for GFF/GTF/GVF).
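If you only need one format, you can depend on the corresponding subproject artifact directly rather than on the aggregate JAR. Assuming the artifact IDs match the subproject names above (an assumption; check the release listing), a Gradle dependency might look like:

```
implementation 'org.pharmgkb:bioio-vcf:0.3.0'
```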
You can build artifacts from a source checkout using Gradle:
- To JAR all subprojects, run `gradle jarAll`
- To build a single subproject (e.g. VCF), run `gradle :vcf:jar`
This long list of examples showcases many of the parsers.
For added flavor, they also use various methods for IO (`parseAll`, etc.) and various `Stream` functions (`parallel()`, `collect`, `flatMap`, etc.)
// Store GFF3 (or GVF, or GTF) features into a list
List<Gff3Feature> features = new GffParser.Builder().build().collectAll(inputFile);
features.get(0).type(); // the parser unescaped this string
// Now write the lines:
new Gff3Writer.Builder().build().writeToFile(features.stream(), outputFile);
// The writer percent-encodes GFF3 fields as necessary
// From a BED file, get distinct chromosome names that start with "chr", in parallel
Files.lines(file)
.map(new BedParser())
.parallel()
.map(BedFeature::chromosome)
.distinct()
.filter(chr -> chr.startsWith("chr"));
// You can also use new BedParser().parseAll(file)
// From a pre-MAKEPED file, who are Harry Johnson's children?
Pedigree pedigree = new PedigreeParser.Builder().build().apply(Files.lines(file));
NavigableSet<Individual> children = pedigree.getFamily("Johnsons")
.find("Harry Johnson")
.children();
// Traverse through a family pedigree in topological order
Pedigree pedigree = new PedigreeParser.Builder().build().apply(Files.lines(file));
Stream<Individual> ordered = pedigree.family("Johnsons")
    .topologicalOrder();
// "Lift over" coordinates using a UCSC chain file
// Filter out those that couldn't be lifted over
GenomeChain chain = new GenomeChainParser().apply(Files.lines(hg19ToGrch38ChainFile));
List<Locus> liftedOver = lociList.parallelStream()
    .map(chain)
    .filter(Optional::isPresent)
    .map(Optional::get)
    .toList();
// You can also use new GenomeChainParser().parse(hg19ToGrch38ChainFile)
// Print formal species names from a GenBank file
Path input = Paths.get("plasmid.genbank");
new GenbankParser().parseAll(input)
    .filter(record -> record instanceof SourceAnnotation)
    .map(record -> ((SourceAnnotation) record).formalName())
    .forEach(System.out::println);
// Parse a GenBank file
// Get the set of "color" properties of features on the complement starting before the sequence
Set<String> properties = new GenbankParser().parseAll(input)
    .filter(record -> record instanceof FeaturesAnnotation)
    .flatMap(record -> ((FeaturesAnnotation) record).features())
    .filter(feature -> feature.range().isComplement())
    .filter(feature -> feature.range().start() < 0)
    .flatMap(feature -> feature.properties().entrySet().stream())
    .filter(prop -> prop.getKey().equals("color"))
    .map(prop -> prop.getValue())
    .collect(Collectors.toSet());
// Read FASTA bases with a buffered random-access reader
RandomAccessFastaStream stream = new RandomAccessFastaStream.Builder(file)
.setnCharsInBuffer(4096)
.build();
char base = stream.read("gene_1", 58523);
// Suppose you have a 2GB FASTA file
// and a method smithWaterman that returns AlignmentResults
// Align each sequence and get the top 10 results, in parallel
MultilineFastaSequenceParser parser = new MultilineFastaSequenceParser.Builder().build();
List<AlignmentResult> topScores = parser.parseAll(Files.lines(fastaFile))
    .parallel()
    .peek(sequence -> logger.info("Aligning {}", sequence.header()))
    .map(sequence -> smithWaterman(sequence.sequence(), reference))
    .sorted() // assuming AlignmentResult implements Comparable
    .limit(10)
    .toList();
// Stream Triples in Turtle format from a URL
/*
@prefix myPrefix: <https://abc#owner> .
<https://abc#cat> "belongsTo" @myPrefix ;
"hasSynonym" <https://abc#feline> .
*/
TripleParser parser = new TripleParser(true); // usePrefixes=true will replace prefixes
List<Triple> triples;
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        ((HttpURLConnection) myUrl.openConnection()).getInputStream()
))) {
    triples = reader.lines().map(parser).toList();
}
// triples contains: [ https://abc#cat belongsTo https://abc#owner ,
//                     https://abc#cat hasSynonym https://abc#feline ]
List<Prefix> prefixes = parser.prefixes();
// Parse VCF, validate it,
// and write a new VCF file containing only positions whose QUAL field
// is at least 10, each with its FILTER field cleared
// short-circuits during read:
VcfMetadataCollection metadata = new VcfMetadataParser().parse(input);
Stream<VcfPosition> data = new VcfDataParser().parseAll(input)
.filter(p ->
p.quality().stream().anyMatch(q -> q.greaterThanOrEqual("10"))
).map(p -> new VcfPosition.Builder(p).clearFilters().build())
// verify consistent with metadata:
.peek(new VcfValidator.Builder(metadata).warnOnly().build());
new VcfMetadataWriter().writeToFile(metadata.lines(), output);
new VcfDataWriter().appendToFile(data, output);
// From a VCF file, associate every GT with its number of occurrences, in parallel
Map<String, Long> genotypeCounts = new VcfDataParser().parseAll(input)
.parallel()
.flatMap(p -> p.samples().stream())
.filter(s -> s.containsKey(ReservedFormatProperty.Genotype))
.map(s -> s.get(ReservedFormatProperty.Genotype).get())
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
// Parse a tab-delimited matrix of arbitrary-precision numbers
Stream<GeneralizedBigDecimal> values = MatrixParser.tabs().parseAll(file).map(GeneralizedBigDecimal::new);
- Where possible, a parser is a `Function<String, R>` or `Function<Stream<String>, R>`, and a writer is a `Function<R, String>` or `Function<R, Stream<String>>`. Java 8+ Streams are expected to be used.
- Null values are banned from public methods in favor of `Optional`. See https://www.oracle.com/technetwork/articles/java/java8-optional-2175753.html for more information.
- Most operations are thread-safe. Thread safety is annotated using `javax.annotation.concurrent`.
- Top-level data classes are immutable, as annotated by `javax.annotation.concurrent.Immutable`.
- The builder pattern is used for non-trivial classes. Each builder has a copy constructor.
- Links to specifications are provided. Any choice made in an ambiguous specification is documented.
- Parsing and writing is moderately strict. Severe violations throw a `BadDataFormatException`, and milder violations are logged as SLF4J warnings. Not every aspect of a specification is validated.
- For specification-mandated escape sequences, encoding and decoding is automatic.
- Coordinates are always 0-based, even for 1-based formats. This is to ensure consistency and arithmetic simplicity.
- Never reuse a parser for a new stream. Some parsers need to track some metadata on the stream. For example, the multiline FASTQ parser needs to know the length of the last sequence. (Otherwise, it’s impossible to know where a score ends and a new header begins!)
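The last point can be illustrated with a toy sketch (not the library's internals; `LineNumberingParser` is a hypothetical stand-in): a parser that carries state from line to line, here just a running record index standing in for the FASTQ parser's remembered sequence length. Applying the same instance to a second stream leaks the first stream's state into it.

```java
import java.util.function.Function;

// Hypothetical stateful parser: its counter survives between lines,
// so a second stream would continue numbering where the first left off.
final class LineNumberingParser implements Function<String, String> {
    private long index = 0; // state carried across lines

    @Override
    public String apply(String line) {
        return (index++) + ":" + line;
    }
}
```

The fix is simply to give each stream a fresh instance, e.g. `lines.map(new LineNumberingParser())`.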
Licensed under the Mozilla Public License, version 2.0.
Copyright 2015–2024, the authors
Please refer to the contributing guide.
Credits:
- Douglas Myers-Turnbull (design and parsers)
- Mark Woon (bug fixes and code review)
- the Stanford University School of Medicine
- the Pharmacogenomics Knowledge Base at Stanford
- the University of California, San Francisco (UCSF)