Jonix is a robust, free, open-source Java library designed for data extraction from ONIX for Books sources.
It offers a variety of services for efficient ONIX processing, with a focus on:
- High-performance (speed and memory efficiency)
- Fluent, intuitive and type-safe APIs
- Easy Extensibility
Jonix is not just a simple XML-processing wrapper or an XPath tool in disguise. It is purpose-built for handling ONIX files and is regularly updated with each new ONIX schema release, which occurs four times a year.
With Jonix, every ONIX element is represented by a dedicated Java class (automatically generated from the official schema),
ensuring type-safe access to the data within that element. These Java classes come with a clear and intuitive API,
where methods never return `null`, while public fields (representing values at terminal nodes) may.
ONIX elements serve various roles: some are simple data elements containing a single value (and possibly attributes), others are "Composites," holding multiple elements (some of which may also be composites), and some are mere flags. The Jonix Object Model clearly distinguishes between these types and offers tailored APIs for each.
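This three-way distinction can be pictured schematically. The mini-classes below are illustrative only (they are NOT Jonix's actual generated classes), but they capture the stated design: composite accessors never return `null`, so method chains are always safe, and only the public terminal field may be `null`:

```java
public class OnixModelSketch {
    // Data element: its terminal value sits in a public field, which MAY be null
    static class TitleText {
        public final String value;
        TitleText(String value) { this.value = value; }
    }

    // Composite: holds child nodes; its accessor methods NEVER return null,
    // even when the underlying XML element is absent
    static class TitleDetail {
        private final TitleText titleText; // may be absent (null) internally
        TitleDetail(TitleText titleText) { this.titleText = titleText; }
        public TitleText titleText() {
            return (titleText != null) ? titleText : new TitleText(null);
        }
    }

    // Flag: carries no data; only its presence in the XML matters
    static class NoEdition {
        public final boolean exists;
        NoEdition(boolean exists) { this.exists = exists; }
    }

    public static void main(String[] args) {
        TitleDetail present = new TitleDetail(new TitleText("My Book"));
        TitleDetail absent = new TitleDetail(null);
        // chaining is always safe; only the terminal field can be null
        System.out.println(present.titleText().value); // My Book
        System.out.println(absent.titleText().value);  // null
    }
}
```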
Maven Version | Onix version | Codelist Issue |
---|---|---|
2024-10-fix | 3.1.01 | 67 |
2024-10-onix308-fix | 3.0.08 | 67 |
API documentation for the latest release can be found here.
Jonix features long backward compatibility:
- All versions of Jonix run on Java version 8 and above
- Although long deprecated, Onix-2 is still supported in all versions of Jonix
- Choose the version of Jonix based on the compatibility level of your sources with the Onix standard
NOTE: version `2023-05` features the most significant leap for Jonix in a decade. The API has been revised and extended (slightly breaking backward compatibility), resulting in a more expressive and fluent syntax than ever. In particular, two powerful APIs, `.firstOrEmpty()` and `.filter()`, were introduced for Lists of composites, eliminating many previously-unavoidable `null`/`exists()` checks. For streaming control (e.g. `Jonix.source()...stream()...`), the `JonixSource` passed by the framework now has `productCount()` and `productGlobalCount()`, as well as `skipSource()` to use inside `.onSourceStart()`. Additionally, the `JonixRecord` object passed by the stream now supports `breakStream()` and `breakCurrentSource()`. The `JonixRecords` object now offers `scanHeaders()` for a `Header`-only peek at the ONIX sources. It also has a `failOnInvalidFile()` method to replace a configuration flag with the same name. For convenience, `pair()` was added to all Codelist Enums for ease of unification, and - for distinction between ONIX versions 3.0 and 3.1 - `.onixRelease()` and `.onixVersion()` were added to the top-level `Product` and `Header` classes. See the newly-crafted examples below.
Maven Version | Onix version | Codelist Issue |
---|---|---|
2024-07 | 3.1.01 | 66 |
2024-07-onix308 | 3.0.08 | 66 |
2024-04 | 3.1.01 | 65 |
2024-04-onix308 | 3.0.08 | 65 |
2024-01 | 3.1.00 | 64 |
2024-01-onix308 | 3.0.08 | 64 |
2023-10 | 3.1.00 | 63 |
2023-10-onix308 | 3.0.08 | 63 |
2023-07 | 3.1.00 | 62 |
2023-07-onix308 | 3.0.08 | 62 |
2023-05 | 3.1.00 | 61 |
2023-05-onix308 | 3.0.08 | 61 |
2023-04 | 3.1.00 | 61 |
2023-01 | 3.0.08 | 60 |
2022-11 | 3.0.08 | 59 |
2022-08 | 3.0.08 | 58 |
Maven

```xml
<dependency>
    <groupId>com.tectonica</groupId>
    <artifactId>jonix</artifactId>
    <version>2024-10-fix</version>
</dependency>
```

Or, if you are NOT ready to switch to ONIX version 3.1, use the latest 3.0 implementation:

```xml
<dependency>
    <groupId>com.tectonica</groupId>
    <artifactId>jonix</artifactId>
    <version>2024-10-onix308-fix</version>
</dependency>
```
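If your build uses Gradle rather than Maven, the same coordinates translate to the standard Gradle dependency notation (sketch; the artifact coordinates are the ones listed above):

```groovy
dependencies {
    implementation 'com.tectonica:jonix:2024-10-fix'
    // or, for the ONIX 3.0 implementation:
    // implementation 'com.tectonica:jonix:2024-10-onix308-fix'
}
```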
To build locally from source:
```sh
# verify requirements: Maven-version >= 3.3.9 && JDK-version >= 9
mvn -version

# clone the repository
git clone https://github.com/zach-m/jonix.git

# build
cd jonix
mvn clean
mvn -P release install
```
Once completed, Jonix should be available as a Maven dependency in your local repository.
NOTE: Make sure to point the `pom.xml` of your local project to the Jonix coordinates of the version you just built (highlighted here: https://github.com/zach-m/jonix/blob/master/pom.xml#L7).
If you need to extract common fields from sources of mixed ONIX variants (ONIX-2 alongside ONIX-3, reference alongside short format; see here), the following example should help:
```java
Jonix.source(new File("/path/to/folder-with-onix-files"), "*.xml", false)
    .source(new File("/path/to/file-with-short-style-onix-2.xml"))
    .source(new File("/path/to/file-with-reference-style-onix-3.onx"))
    .onSourceStart(src -> {
        // after the <Header> of the current source was processed, we look at the source's properties
        System.out.printf(">> Opening %s (ONIX release %s)%n", src.sourceName(), src.onixRelease());
        src.header().map(Jonix::toBaseHeader)
            .ifPresent(header -> System.out.printf(">> Sent from: %s%n", header.senderName));
    })
    .onSourceEnd(src -> {
        // we finalize the processing of the ONIX source
        System.out.printf("<< Processed %d products (total %d) %n",
            src.productCount(), src.productGlobalCount());
    })
    .stream() // iterates over all the products contained in all ONIX sources
    .map(Jonix::toBaseProduct) // transforms ONIX-2/3 product into a unified version-agnostic object
    .forEach(product -> {
        String ref = product.info.recordReference;
        String isbn13 = product.info.findProductId(ProductIdentifierTypes.ISBN_13);
        String title = product.titles.findTitleText(TitleTypes.Distinctive_title_book);
        List<String> authors = product.contributors.getDisplayNames(ContributorRoles.By_author);
        System.out.println("ref     = " + ref);
        System.out.println("isbn13  = " + isbn13);
        System.out.println("title   = " + title);
        System.out.println("authors = " + authors);
        System.out.println("----------------------------------------------------------");
    });
```
The example above uses the `BaseProduct` class, which processes ONIX-2 and ONIX-3 sources differently, each according to its schema, and extracts the most essential information into its public fields, such as `info`, `description`, `subjects`, etc.
If, however, you need a more complicated extraction, specific to your needs and sources, this "one-size-fits-all" approach may not be right for you. Instead, you may want to process the raw fields by yourself, as follows:
```java
Jonix.source(new File("/path/to/folder-with-mixed-onix-files"), "*.xml", false)
    .stream()
    .forEach(record -> {
        if (record.product.onixVersion() == OnixVersion.ONIX2) {
            com.tectonica.jonix.onix2.Product product = Jonix.toProduct2(record);
            // TODO: write ONIX-2 specific code
        } else {
            com.tectonica.jonix.onix3.Product product = Jonix.toProduct3(record);
            // TODO: write ONIX-3 specific code
        }
    });
```
The next example shows how to process a folder in which ALL sources are ONIX-3, with some non-standard logic. In particular, the `authors` are extracted in a more elaborate way compared to `BaseProduct.contributors`, and the `frontCoverImageLink`, which doesn't exist at all in `BaseProduct`, is extracted here as well.

Pay careful attention to the usage of `.firstOrEmpty()`, `orElse()` and `flatMap()`, especially in the extraction of `title`, `authors` and `frontCoverImageLink`. They demonstrate the Jonix fluent API, where if a certain element doesn't exist in the ONIX XML source (and certainly not its children elements), we still apply the same logic as if it does (counting on `null` terminal values if it doesn't). This syntax eliminates the need for cumbersome `if-else` blocks (testing for the existence of elements), and leaves us with concise and clean expressions.
```java
Jonix.source(new File("/path/to/all-onix3-folder"), "*.xml", false)
    .onSourceStart(src -> {
        // safeguard: we skip non-ONIX-3 files
        if (src.onixVersion() != OnixVersion.ONIX3) {
            src.skipSource();
        }
    })
    .onSourceEnd(src -> {
        System.out.printf("<< Processed %d products from %s %n", src.productCount(), src.sourceName());
    })
    .stream() // iterate over the products contained in all ONIX sources
    .map(Jonix::toProduct3)
    .forEach(product -> {
        // take the requested information from the current product
        String ref = product.recordReference().value;

        String isbn13 = product.productIdentifiers()
            .find(ProductIdentifierTypes.ISBN_13)
            .map(pi -> pi.idValue().value)
            .orElse(null);

        String title = product.descriptiveDetail().titleDetails()
            .filter(td -> td.titleType().value == TitleTypes.Distinctive_title_book)
            .firstOrEmpty()
            .titleElements()
            .firstOrEmpty()
            .titleWithoutPrefix().value;

        List<String> authors = product.descriptiveDetail().contributors()
            .filter(c -> c.contributorRoles().values().contains(ContributorRoles.By_author))
            .stream()
            .map(c -> c.personName().value().orElse(
                c.nameIdentifiers().find(NameIdentifierTypes.Proprietary)
                    .flatMap(ni -> ni.idTypeName().value())
                    .orElse("N/A")))
            .collect(Collectors.toList());

        String frontCoverImageLink = product.collateralDetail().supportingResources()
            .filter(sr -> sr.resourceContentType().value == ResourceContentTypes.Front_cover)
            .firstOrEmpty()
            .resourceVersions()
            .filter(rv -> rv.resourceForm().value == ResourceForms.Downloadable_file)
            .first()
            .map(rv -> rv.resourceLinks().firstValueOrNull())
            .orElse(null);

        System.out.println("ref                 = " + ref);
        System.out.println("isbn13              = " + isbn13);
        System.out.println("title               = " + title);
        System.out.println("authors             = " + authors);
        System.out.println("frontCoverImageLink = " + frontCoverImageLink);
        System.out.println("-----------------------------------------------------");
    });
```
If your project requires delicate handling of many ONIX fields, you may want to consider replacing the `BaseProduct` class with your own version altogether. This will allow you, or your team members, to write simple, version-agnostic streaming scripts, like the one at the top of this section, leaving the extraction details separate from the business logic.

This feature of Jonix is known as Custom Unification, and there are 3 examples included in the project:

- Extend the `BaseProduct` with some additional global fields, see MyCustomBaseUnifier1
- Extend individual members and sub-members of `BaseProduct` (such as `description`, `title`, etc.), see MyCustomBaseUnifier2
- Create a whole new replacement for `BaseProduct`, extracting only the fields you're interested in, see MyCustomUnifier
The following example, which converts a list of ONIX files into CSV files, demonstrates several features:

- use of `store()` and `retrieve()` of the `JonixSource` to pass variables between event handlers on the same source (the `csv` object in this case)
- use of `breakStream()`, `productCount()` and `productGlobalCount()` to monitor and control the streaming progress
- use of Custom Unification of `MyProduct` with `MyUnifier` (see previous section)
- ideas for error handling, including `JonixJson.toJson()` and `recordReferenceOf()`
```java
public static void onixToCsv(List<String> fileNames) {
    Jonix.source(fileNames.stream().map(File::new).toList())
        .onSourceStart(source -> {
            String csvFileName = source.sourceName();
            System.out.println("Creating " + csvFileName + "..");
            final CsvWriter csv = new CsvWriter(csvFileName);
            csv.writeCsvHeader();
            source.store("csv", csv);
        })
        .onSourceEnd(source -> {
            final CsvWriter csv = source.retrieve("csv");
            csv.close();
            System.out.printf("Processed %d / %d products%n",
                source.productCount(), source.productGlobalCount());
        })
        .stream()
        .forEach(rec -> {
            final OnixProduct product = rec.product;
            final CsvWriter csv = rec.source.retrieve("csv");
            try {
                MyProduct mp = JonixUnifier.unifyProduct(product, MyUnifier.unifier);
                csv.writeCsvLine(mp.toCsvColumns());
            } catch (Exception e) {
                // e.printStackTrace();
                // System.err.println(JonixJson.toJson(product));
                System.err.printf("ERROR in #REF [%s]: %s%n",
                    recordReferenceOf(product), e.getMessage());
                // don't re-throw, don't break source, just continue to the next product..
            }
            if (rec.source.productCount() == 50) {
                rec.breakStream();
            }
        });
}

public static String recordReferenceOf(OnixProduct product) {
    final String ref;
    if (product.onixVersion() == OnixVersion.ONIX2) {
        ref = Jonix.toProduct2(product).recordReference().value;
    } else if (product.onixVersion() == OnixVersion.ONIX3) {
        ref = Jonix.toProduct3(product).recordReference().value;
    } else {
        throw new RuntimeException("Unexpected type: " + product.getClass().getName());
    }
    return (ref == null) ? "N/A" : ref;
}
```
If the header of an ONIX file needs to be examined before the file is processed, use `scanHeaders()`. This is particularly useful when a bulk of ONIX files needs to be pre-scanned for display/sorting/filtering purposes. The following example provides a simple function that returns the (unified) `BaseHeader` of any ONIX file name. It can easily be extended to support multiple files as input.
```java
public static BaseHeader headerOf(String onixFileName) {
    List<BaseHeader> holder = new ArrayList<>(1);
    Jonix.source(new File(onixFileName))
        .onSourceStart(src -> src.header().map(Jonix::toBaseHeader).ifPresent(holder::add))
        .scanHeaders();
    return holder.isEmpty() ? null : holder.get(0);
}
```
The most fundamental function in Jonix is to transform ONIX sources (containing XML content) into Java objects. When an ONIX source is being read, each record is transformed into a Java object (with many nested objects inside it), letting the user manipulate it without having to deal with the intricacies of the raw XML.
With ONIX, dealing directly with the XML content could be quite complicated, for several reasons:
- the size of the source may be huge (ONIX files may contain thousands of records, easily weighing tens of MBs)
- there are two major versions, generally known as ONIX-2 (deprecated) and ONIX-3 (current)
- each version has two sub-schemas - Reference and Short - see here
- there are many Codelists, whose exact spelling and meaning is crucial for data extraction
- there are syntax rules, governing which tags are repeatable, which are mandatory, what's the relationship between them, etc.
Jonix provides solutions for all the above:

- Source size - Jonix uses XmlChunker internally, a service capable of processing arbitrarily large ONIX sources by reading them chunk-by-chunk.
- ONIX Versions - All versions and all sub-schemas of ONIX are mapped to a corresponding set of Java classes.
- Codelists - Each ONIX Codelist is mapped to a Jonix `Enum`, all listed here. Note that even though each ONIX version defines its own set of Codelists, the corresponding `Enum`s in Jonix were unified to avoid confusion.
- Schema Rules - These are accounted for in Jonix in several ways:
  - ONIX Tags that can be repeated are represented as Java `Set`s or `List`s
  - Tags with special traits (is-mandatory, data format, etc.) have a corresponding Java-doc comment in their definition
  - Coherent and descriptive data model with several interfaces used to categorize ONIX tags as either Composite, Element or Flag
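The chunk-by-chunk idea can be sketched with the JDK's built-in StAX pull parser (this is an illustration of the concept, not Jonix's actual XmlChunker implementation): the parser advances through the document one event at a time, so each `<Product>` chunk can be handled and discarded without ever holding the whole file in memory.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ChunkSketch {
    // Pulls the <RecordReference> of each <Product> chunk while streaming,
    // so memory usage stays flat no matter how large the source is.
    public static List<String> recordReferences(String onixXml) {
        List<String> refs = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newFactory()
                    .createXMLStreamReader(new StringReader(onixXml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("RecordReference")) {
                    refs.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return refs;
    }

    public static void main(String[] args) {
        String xml = "<ONIXMessage>"
                + "<Product><RecordReference>ref-1</RecordReference></Product>"
                + "<Product><RecordReference>ref-2</RecordReference></Product>"
                + "</ONIXMessage>";
        System.out.println(recordReferences(xml)); // [ref-1, ref-2]
    }
}
```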
Classes in Jonix that represent ONIX tags are generated automatically from the official schema (here and here). There are over 500 classes for ONIX-3 (and over 430 classes for ONIX-2) and almost 200 enumerators representing the Codelists.
On top of the low-level functions, Jonix offers an array of services for data manipulation, including:
- Unification. This is one of the most powerful features in Jonix, enabling the processing of mixed sources, i.e. a group of sources where each may have a different ONIX version (2 or 3) and sub-schema (Reference or Short). These sources are transformed into a single, common set of Java classes, on which version-agnostic operations can be performed (such as writing into a database, sorting, searching, etc.).
- Tabulation. While ONIX records are organized as trees (i.e. XML records), it is sometimes easier to analyze them as if they were rows in a table. Flattening a tree into a plain list of columns can't be done without loss of generality, but with proper knowledge of the ONIX content, it can be done at a reasonable compromise. Jonix offers a default tabulation scheme, which you can customize to your needs. For more information, see the documentation for Tabulation.
- Bulk Processing. Jonix provides methods for handling multiple ONIX sources scattered in the file system.
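The "reasonable compromise" behind tabulation can be sketched generically (this is an illustration of the flattening idea, not Jonix's actual Tabulation API): repeatable fields are capped at a fixed number of columns, so every tree-shaped record maps to a row of the same width.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TabulationSketch {
    /**
     * Flattens a record with a repeatable "authors" field into a fixed-width row:
     * one column for the title, plus exactly maxAuthors author columns (extra
     * authors are dropped, missing ones are padded with empty strings).
     */
    public static List<String> toRow(String title, List<String> authors, int maxAuthors) {
        List<String> row = new ArrayList<>();
        row.add(title == null ? "" : title);
        for (int i = 0; i < maxAuthors; i++) {
            row.add(i < authors.size() ? authors.get(i) : "");
        }
        return row;
    }

    public static void main(String[] args) {
        // three authors squeezed into two columns; row width is always 1 + maxAuthors
        System.out.println(toRow("My Book", Arrays.asList("Alice", "Bob", "Carol"), 2));
        // prints: [My Book, Alice, Bob]
    }
}
```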
One of Jonix's best facilities is the Unification framework, which simplifies the treatment of varied sources (Onix2 mixed with Onix3, Reference mixed with Short) and eliminates some of the intricacies of XML handling.
The method `streamUnified()` returns a Stream, but not of the low-level `JonixRecord`s. Instead, it streams out `BaseRecord`s, which contain a `BaseProduct` - a typed and unified representation of the most essential data within a typical ONIX source.

The following example demonstrates extraction of some fundamental ONIX fields from an ONIX source of any version and type using `streamUnified()`:

Note that calling `streamUnified()` is identical to `.stream().map(Jonix::toBaseProduct)`, which we used above.
```java
Set<PriceTypes> requestedPrices = JonixUtil.setOf(
    PriceTypes.RRP_including_tax,
    PriceTypes.RRP_excluding_tax
);

JonixRecords records = Jonix.source(new File("/path/to/folder-with-onix-files"), "onix*.xml", false);

records.streamUnified()
    .map(rec -> rec.product)
    .forEach(product -> {
        String title = product.titles.findTitleText(TitleTypes.Distinctive_title_book);
        List<String> authors = product.contributors.getDisplayNames(ContributorRoles.By_author);
        List<BasePrice> prices = product.supplyDetails.findPrices(requestedPrices);
        List<String> priceLabels =
            prices.stream().map(bp -> bp.priceAmountAsStr + " " + bp.currencyCode.code)
                .collect(Collectors.toList());
        System.out.printf("The book '%s' by %s costs: %s%n", title, authors, priceLabels);
    });
```
Another case is Unification of the raw `OnixHeader`, by using `Jonix.toBaseHeader()`, like this:

```java
// given a JonixRecords object
JonixRecords records = ...

// we can use the 'SourceStart' event to print the ONIX Header information
records.onSourceStart(src -> {
    src.header().map(Jonix::toBaseHeader).ifPresent(baseHeader -> System.out.println(baseHeader));
});
```
Jonix provides a generic framework for flattening ONIX Products and outputting them into a table-like structure (suitable for CSV or database export). Jonix offers a `Collector` that saves a stream into a CSV file:
```java
import static com.tectonica.jonix.tabulate.JonixDelimitedWriter.toDelimitedFile;

// prepare to read from various sources
JonixRecords records = Jonix
    .source(...)
    .onSourceStart(src -> ...)
    .onSourceEnd(src -> ...)
    .configure(...);

// save the most important fields of the streamed ONIX products into a CSV file
File targetFile = new File("/path/to/destination.csv");
int recordsWritten = records.streamUnified()
    .collect(toDelimitedFile(targetFile, ',', BaseTabulation.ALL));

// file is saved
System.out.println("Written " + recordsWritten + " records");
```
The procedure for defining which fields to output, and how, is described in Tabulation and in FieldTabulator.