From 7e23418d84d4513f2df75e0232525dddd9edf475 Mon Sep 17 00:00:00 2001
From: Adam Pocock
Date: Thu, 15 Sep 2022 14:44:24 -0400
Subject: [PATCH 1/8] Updating the docs to reference protobuf serialization.

---
 docs/Architecture.md    | 27 ++++++++++++++++++++++++++-
 docs/Internals.md       | 25 ++++++++++++++++++++++++-
 docs/PackageOverview.md | 33 ++++++++++++++++++++++-----------
 docs/Roadmap.md         | 15 +++++++++------
 docs/Security.md        |  4 ++++
 5 files changed, 85 insertions(+), 19 deletions(-)

diff --git a/docs/Architecture.md b/docs/Architecture.md
index 3bee56c57..fc9c7fce5 100644
--- a/docs/Architecture.md
+++ b/docs/Architecture.md
@@ -422,11 +422,36 @@ transparently hashes the inputs. The feature names tend to be particularly
 sensitive when working with NLP problems. For example, without such hashing,
 bigrams would appear in the feature domains.
 
+## Serialization
+
+Tribuo supports Java serialization (i.e., using `java.io.Serializable`) and from
+v4.3 it supports serializing objects to protobufs. Java serialization support is
+deprecated and will be removed in the next major version. When using Java
+serialization we recommend the use of a serialization filter; more information
+is given in our [Security documentation](Security.md).
+Classes which support protobuf serialization now implement
+`ProtoSerializable<T>`, where the type bound gives the type of the protobuf they
+serialize to. Tribuo's protobuf serialization supports all the types that Java
+serialization supports, with the exception of `Example` metadata values, which
+previously supported any `java.io.Serializable` type and now only support
+`String` values. Helper methods to deserialize objects from protobufs have been
+added to all the major interfaces, of the form `.deserialize(Proto)`.
+The protobuf definitions are packaged into Tribuo's jars, and the protobuf
+classes are compiled using protoc `v3.19.4`.
+Tribuo's protobuf support includes versioning
+of the protobufs to allow incremental modifications to the protobuf schemas as
+the types evolve. This flexibility should allow protobuf to remain the preferred
+serialization format for Tribuo without restricting the evolution of Tribuo's
+classes and interfaces.
+
+As Java's generics are erased at runtime, the objects returned from this
+serialization mechanism internally validate that the types are consistent, but
+users must validate that the `Model` is of the expected type using
+`Model.validate(Class<? extends Output<?>>)` or similar.
+
 ## ONNX Export
 
 From v4.2 Tribuo supports exporting some models in the [ONNX](https://onnx.ai)
 model format. The ONNX format is a cross-platform model exchange format which
-can be loaded in by many different machine learning libraries. Tribuo supports
+can be loaded in by many machine learning libraries. Tribuo supports
 inference on ONNX models via ONNX Runtime. Models which can be exported
 implement the `ONNXExportable` interface, which provides methods for
 constructing the ONNX protobuf and serializing it to disk. As of the release of
diff --git a/docs/Internals.md b/docs/Internals.md
index c5c189ce7..20f5fea2b 100644
--- a/docs/Internals.md
+++ b/docs/Internals.md
@@ -102,7 +102,7 @@ this `Feature` was observed.
 
 At this point the `Dataset` can be transformed, by a `TransformationMap`. This
 applies an independent sequence of transformation to each `Feature`, so it can
-perform rescaling or binning, but not Principle Component Analysis (PCA). The
+perform rescaling or binning, but not Principal Component Analysis (PCA). The
 `TransformationMap` gathers the necessary statistics about the features, and
 then rewrites each `Example` according to the transformation, generating a
 `TransformerMap` which can be used to apply that specific transformations to
@@ -166,3 +166,26 @@ classification, RMSE for regression etc).
Finally, the input data's `DataProvenance` and the `Model`'s `ModelProvenance` are queried, and the evaluation statistics, provenances and predictions are passed to the appropriate `Evaluation`'s constructor for storage. + +## Protobuf Serialization + +Tribuo's protobuf serialization is based around redirection and the `Any` packed +protobuf to simulate polymorphic behaviour. Each type is packaged into a top +level protobuf representing the interface it implements which has an integer +version field incrementing from 0, the class name of the class which can +deserialize this object, and a packed `Any` message which contains class specific serialization information. This protobuf is +unpacked using the deserialization mechanism in `org.tribuo.protos.ProtoUtil` and +then the method `deserializeFromProto(int version, String className, Any message)` +is called on the `className` specified in the proto. The class name is passed through +to allow redirection for Tribuo internal classes which may want to deserialize as a +different type as we evolve the library. That method then typically checks that the +version is supported by the current class, to prevent inaccurate deserialization of +protobufs written by newer versions of Tribuo when loaded into older versions, and +then the `Any` message is unpacked into a class specific protobuf, any necessary +validation is performed, the deserialized object is constructed and then returned. + +There are two helper classes, `ModelDataCarrier` and `DatasetDataCarrier` which +allow easy serialization/deserialization of shared fields in `Model` and +`Dataset` respectively (and the sequence variants thereof). These are considered +an implementation detail as they may change to incorporate new fields, and may +be converted into records when Tribuo moves to a newer version of Java. 
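The dispatch flow described above (an integer version field, the name of the deserializing class, and a packed payload handed to a reflectively invoked factory which checks the version) can be sketched in plain Java without a protobuf dependency. Everything below (`Envelope`, `Dispatcher`, `deserializeFromEnvelope`) is a hypothetical stand-in used for illustration, not Tribuo's actual `ProtoUtil` code:

```java
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Tribuo's top-level protos: a version number, the
// name of the class which can deserialize the payload, and the class-specific
// payload itself (an Any message in the real mechanism).
record Envelope(int version, String className, byte[] payload) {}

// Dispatcher in the style of ProtoUtil: resolve the class named in the
// envelope and reflectively invoke its static factory method.
final class Dispatcher {
    static Object deserialize(Envelope e) throws Exception {
        Class<?> clazz = Class.forName(e.className());
        Method factory = clazz.getDeclaredMethod("deserializeFromEnvelope",
                int.class, String.class, byte[].class);
        return factory.invoke(null, e.version(), e.className(), e.payload());
    }
}

final class Greeting {
    static final int CURRENT_VERSION = 0;
    final String text;
    Greeting(String text) { this.text = text; }

    Envelope serialize() {
        return new Envelope(CURRENT_VERSION, Greeting.class.getName(),
                text.getBytes(StandardCharsets.UTF_8));
    }

    // The factory checks the version so payloads written by a newer
    // implementation fail loudly instead of being silently misread.
    static Greeting deserializeFromEnvelope(int version, String className, byte[] payload) {
        if (version > CURRENT_VERSION) {
            throw new IllegalArgumentException("Unsupported version " + version);
        }
        return new Greeting(new String(payload, StandardCharsets.UTF_8));
    }
}
```

A production mechanism would additionally restrict which classes the dispatcher may resolve; this sketch omits that check.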
diff --git a/docs/PackageOverview.md b/docs/PackageOverview.md index b51e76647..c2916f87f 100644 --- a/docs/PackageOverview.md +++ b/docs/PackageOverview.md @@ -98,17 +98,18 @@ suitable for use with models like BERT. Multi-class classification is the act of assigning a single label from a set of labels to a test example. The classification module has several submodules: -| Folder | ArtifactID | Package root | Description | -| --- | --- | --- | --- | -| Core | `tribuo-classification-core` | `org.tribuo.classification` | Contains an Output subclass for use with multi-class classification tasks, evaluation code for checking model performance, and an implementation of Adaboost.SAMME. It also contains simple baseline classifiers. | -| DecisionTree | `tribuo-classification-tree` | `org.tribuo.classification.dtree` | An implementation of CART decision trees. | -| Experiments | `tribuo-classification-experiments` | `org.tribuo.classification.experiments` | A set of main functions for training & testing models on any supported dataset. This submodule depends on all the classifiers and allows easy comparison between them. It should not be imported into other projects since it is intended purely for development and testing. | -| Explanations | `tribuo-classification-experiments` | `org.tribuo.classification.explanations` | An implementation of LIME for classification tasks. If you use the columnar data loader, LIME can extract more information about the feature domain and provide better explanations. | -| LibLinear | `tribuo-classification-liblinear` | `org.tribuo.classification.liblinear` | A wrapper around the LibLinear-java library. This provides linear-SVMs and other l1 or l2 regularised linear classifiers. | -| LibSVM | `tribuo-classification-libsvm` | `org.tribuo.classification.libsvm` | A wrapper around the Java version of LibSVM. This provides linear & kernel SVMs with sigmoid, gaussian and polynomial kernels. 
| -| Multinomial Naive Bayes | `tribuo-classification-mnnaivebayes` | `org.tribuo.classification.mnb` | An implementation of a multinomial naive bayes classifier. Since it aims to store a compact in-memory representation of the model, it only keeps track of weights for observed feature/class pairs. | -| SGD | `tribuo-classification-sgd` | `org.tribuo.classification.sgd` | An implementation of stochastic gradient descent based classifiers. It includes a linear package for logistic regression and linear-SVM (using log and hinge losses, respectively), a kernel package for training a kernel-SVM using the Pegasos algorithm, a crf package for training a linear-chain CRF, and a fm package for training pairwise factorization machines. These implementations depend upon the stochastic gradient optimisers in the main Math package. The linear, fm, and crf packages can use any of the provided gradient optimisers, which enforce various different kinds of regularisation or convergence metrics. This is the preferred package for linear classification and for sequence classification due to the speed and scalability of the SGD approach. | -| XGBoost | `tribuo-classification-xgboost` | `org.tribuo.classification.xgboost` | A wrapper around the XGBoost Java API. XGBoost requires a C library accessed via JNI. XGBoost is a scalable implementation of gradient boosted trees. 
| +| Folder | ArtifactID | Package root | Description | +|-------------------------| --- | --- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Core | `tribuo-classification-core` | `org.tribuo.classification` | Contains an Output subclass for use with multi-class classification tasks, evaluation code for checking model performance, and an implementation of Adaboost.SAMME. It also contains simple baseline classifiers. | +| DecisionTree | `tribuo-classification-tree` | `org.tribuo.classification.dtree` | An implementation of CART decision trees. | +| Experiments | `tribuo-classification-experiments` | `org.tribuo.classification.experiments` | A set of main functions for training & testing models on any supported dataset. This submodule depends on all the classifiers and allows easy comparison between them. It should not be imported into other projects since it is intended purely for development and testing. | +| Explanations | `tribuo-classification-experiments` | `org.tribuo.classification.explanations` | An implementation of LIME for classification tasks. If you use the columnar data loader, LIME can extract more information about the feature domain and provide better explanations. 
| +| FeatureSelection | `tribuo-classification-fs` | `org.tribuo.classification.fs` | An implementation of several information theoretic feature selection algorithms for classification problems. | +| LibLinear | `tribuo-classification-liblinear` | `org.tribuo.classification.liblinear` | A wrapper around the LibLinear-java library. This provides linear-SVMs and other l1 or l2 regularised linear classifiers. | +| LibSVM | `tribuo-classification-libsvm` | `org.tribuo.classification.libsvm` | A wrapper around the Java version of LibSVM. This provides linear & kernel SVMs with sigmoid, gaussian and polynomial kernels. | +| Multinomial Naive Bayes | `tribuo-classification-mnnaivebayes` | `org.tribuo.classification.mnb` | An implementation of a multinomial naive bayes classifier. Since it aims to store a compact in-memory representation of the model, it only keeps track of weights for observed feature/class pairs. | +| SGD | `tribuo-classification-sgd` | `org.tribuo.classification.sgd` | An implementation of stochastic gradient descent based classifiers. It includes a linear package for logistic regression and linear-SVM (using log and hinge losses, respectively), a kernel package for training a kernel-SVM using the Pegasos algorithm, a crf package for training a linear-chain CRF, and a fm package for training pairwise factorization machines. These implementations depend upon the stochastic gradient optimisers in the main Math package. The linear, fm, and crf packages can use any of the provided gradient optimisers, which enforce various different kinds of regularisation or convergence metrics. This is the preferred package for linear classification and for sequence classification due to the speed and scalability of the SGD approach. | +| XGBoost | `tribuo-classification-xgboost` | `org.tribuo.classification.xgboost` | A wrapper around the XGBoost Java API. XGBoost requires a C library accessed via JNI. XGBoost is a scalable implementation of gradient boosted trees. 
|

## Multi-label Classification
@@ -220,3 +221,13 @@ TensorFlow not just for Tribuo but for the Java community as a whole.
 Tribuo demonstrates the TensorFlow interop by including an example config
 file, several example model generation functions and protobuf for an MNIST
 model graph.
+
+## Other modules
+
+Tribuo has a number of other modules:
+
+| Folder | ArtifactID | Package root | Description |
+|---------| --- | --- | --- |
+| Json | `tribuo-json` | `org.tribuo.json` | Contains support for reading and writing Json formatted data, along with a program for inspecting and removing provenance information from models. |
+| ModelCard | `tribuo-interop-modelcard` | `org.tribuo.interop.modelcard` | Contains support for reading and writing model cards in Json format, using the provenance information in Tribuo models to guide the card construction. |
+| Reproducibility | `tribuo-reproducibility` | `org.tribuo.reproducibility` | A utility for reproducing Tribuo models and datasets. |
diff --git a/docs/Roadmap.md b/docs/Roadmap.md
index 73dd4790a..c76591be9 100644
--- a/docs/Roadmap.md
+++ b/docs/Roadmap.md
@@ -37,14 +37,16 @@ specific operations (though this can be achieved today using `DatasetView` and p
 categorical and real valued features, and promotes the former to the latter
 when there are too many categories. This could be tied into the `RowProcessor`
 to give the user control over the feature types, which could filter down into
 algorithmic choices elsewhere in the package.
-- Serialization. We'd like to have alternate serialization mechanisms for models and datasets until
-Java's serialization mechanisms improve.
+- ~~Serialization. We'd like to have alternate serialization mechanisms for models and datasets until
+Java's serialization mechanisms improve.~~
+  - In 4.3 we added protobuf serialization to Tribuo and deprecated Java serialization.
 - Caching datasource.
Datasources may currently perform expensive feature extraction steps (I'm looking at you `RowProcessor`), and it would be useful to be able to cache the output of that locally, while maintaining the link to the original data. We don't have a firm design for this feature yet, but we're in need of it for some internal work. -- KMeans & Nearest Neighbour share very little code, but are conceptually very similar. We'd like -to refactor out the shared code (while maintaining serialization compatibility). +- ~~KMeans & Nearest Neighbour share very little code, but are conceptually very similar. We'd like +to refactor out the shared code (while maintaining serialization compatibility).~~ + - In 4.3 we added a distance querying interface and refactored KMeans, KNN and HDBSCAN to use it. - Allow `DatasetView` to regenerate its feature and output domains. Currently all views of a dataset share the same immutable feature domain, but in some cases this can leak information from test time to train (e.g., when using the unselected data as an out of bag sample). @@ -61,8 +63,9 @@ specify a minimum purity decrease requirement.~~ - Integrated in Tribuo 4.1. - Gaussian Processes. - Vowpal Wabbit interface. -- Feature selection. We already have several feature selection algorithms implemented -in a Tribuo compatible interface, but the codebase isn't quite ready for release. +- ~~Feature selection. We already have several feature selection algorithms implemented +in a Tribuo compatible interface, but the codebase isn't quite ready for release.~~ + - Feature selection for classification problems is integrated in Tribuo 4.3. - Support word embedding features. - ~~Support contextualised word embeddings (through the ONNX or TensorFlow interfaces).~~ - ONNX support for BERT embeddings is integrated in Tribuo 4.1. 
diff --git a/docs/Security.md b/docs/Security.md
index 8a7b0f641..309b236cd 100644
--- a/docs/Security.md
+++ b/docs/Security.md
@@ -24,6 +24,10 @@ Additionally, when running with a security manager, Tribuo will need access to
 the relevant filesystem locations to load or save model files. See the section
 on [Configuration](#Configuration) for more details.
 
+In Tribuo 4.3 we introduced protobuf-based serialization for all supported Java
+serializable types. This is the preferred serialization mechanism, and Java
+serialization support will be removed in the next major release of Tribuo.
+
 ## Database access
 
 Tribuo provides a SQL interface that can load data via a JDBC connection. As
 it's frequently necessary to load data via a joined query from an unknown

From 15e538fd16dca4a3a9d44a288be1aee81335e3c1 Mon Sep 17 00:00:00 2001
From: Adam Pocock
Date: Thu, 15 Sep 2022 14:45:33 -0400
Subject: [PATCH 2/8] Adding serde helpers to Model, Dataset, SequenceModel and
 SequenceDataset. Updating main methods to support protobuf.
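The serialization filter recommended for the legacy Java serialization path can be installed per stream using the JDK's `ObjectInputFilter` (Java 9+). This is a generic JDK sketch using an allow-list pattern; it involves no Tribuo classes, and the pattern shown is only an example:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

public class FilterDemo {
    // Round-trips a value through Java serialization, applying a JEP 290
    // filter pattern on the deserializing stream.
    static Object roundTrip(Object value, String pattern) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            ois.setObjectInputFilter(ObjectInputFilter.Config.createFilter(pattern));
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Allow-list: permit java.util and java.lang classes, reject all else.
        Object o = roundTrip(new ArrayList<>(List.of("a", "b")),
                "java.util.*;java.lang.*;!*");
        System.out.println(o); // prints [a, b]
    }
}
```

Classes outside the allow-list cause `readObject` to fail with an `InvalidClassException`, which is the behaviour wanted when loading models from untrusted sources.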
---
 .../classification/sequence/SeqTrainTest.java | 42 +++++---
 .../classification/experiments/RunAll.java    | 16 ++-
 .../classification/experiments/Test.java      | 34 ++++--
 .../explanations/lime/LIMETextCLI.java        | 52 +++++----
 .../classification/sgd/crf/SeqTest.java       | 50 ++++++---
 Core/src/main/java/org/tribuo/Dataset.java    | 53 +++++++++
 Core/src/main/java/org/tribuo/Model.java      | 55 ++++++++++
 .../main/java/org/tribuo/ModelExplorer.java   | 47 +++++---
 .../org/tribuo/sequence/SequenceDataset.java  | 101 ++++++++++++++++++
 .../org/tribuo/sequence/SequenceModel.java    | 65 +++++++++++
 .../sequence/SequenceModelExplorer.java       | 49 ++++++---
 .../java/org/tribuo/data/DataOptions.java     | 52 ++++++++-
 .../java/org/tribuo/data/DatasetExplorer.java | 48 ++++++---
 .../tribuo/data/PreprocessAndSerialize.java   | 29 +++--
 .../org/tribuo/interop/oci/OCIModelCLI.java   | 28 ++++-
 .../tribuo/interop/tensorflow/TrainTest.java  | 17 ++-
 16 files changed, 609 insertions(+), 129 deletions(-)

diff --git a/Classification/Core/src/main/java/org/tribuo/classification/sequence/SeqTrainTest.java b/Classification/Core/src/main/java/org/tribuo/classification/sequence/SeqTrainTest.java
index d3bd0ce89..8b9797299 100644
--- a/Classification/Core/src/main/java/org/tribuo/classification/sequence/SeqTrainTest.java
+++ b/Classification/Core/src/main/java/org/tribuo/classification/sequence/SeqTrainTest.java
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2015-2020, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2015, 2022, Oracle and/or its affiliates. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
@@ -29,11 +29,10 @@ import org.tribuo.util.Util;
 
 import java.io.BufferedInputStream;
-import java.io.FileInputStream;
-import java.io.FileOutputStream;
 import java.io.IOException;
 import java.io.ObjectInputStream;
 import java.io.ObjectOutputStream;
+import java.nio.file.Files;
 import java.nio.file.Path;
 import java.util.logging.Logger;
@@ -78,6 +77,9 @@ public String getOptionsDescription() {
     */
    @Option(charName = 't', longName = "trainer-name", usage = "Name of the trainer in the configuration file.")
    public SequenceTrainer