
add algebraic logging #2965

Open
wants to merge 104 commits into master
Conversation

@patins1 (Contributor) commented on Jan 26, 2024

Description

Supported by a dedicated training listener, algebraic operations executed during training can be recorded and stored as a Python program.

  • If this change is a backward incompatible change, why must this change be made?

To avoid recording the concrete batch size used during training, -1 is now used in some places in the existing Java code as the value for the batch dimension. This is backward compatible, as the underlying engines infer the correct value from the size of the array (a short sketch of this appears after the edge-case notes below).

  • Interesting edge cases to note here

If different epochs, or even different batches within an epoch, use different prediction/loss functions, multiple prediction/loss functions are generated (a Python comment indicates how often each is "used"). The MNIST and ResNet examples each generated only one prediction/loss function, which is unit-tested and which I also verified in a TensorFlow program to yield the same results as the original DJL model. It will be interesting to test other models in the future.
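
Regarding the batch-dimension change above, a minimal sketch (my illustration using the standard DJL NDArray API, not code from this PR) of how -1 lets the engine infer the batch size:

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;

try (NDManager manager = NDManager.newBaseManager()) {
    // A batch of 32 MNIST-sized images: (batch, channels, height, width).
    NDArray batch = manager.ones(new Shape(32, 1, 28, 28));
    // -1 for the batch dimension: the engine infers 32 from the total size,
    // so the recorded program does not hard-code the concrete batch size.
    NDArray flattened = batch.reshape(-1, 28 * 28);
    System.out.println(flattened.getShape()); // (32, 784)
}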

Algebraic logging currently works only with MXNet, as the PyTorch engine does not build up a data structure describing the executed operations and their arguments.
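
For context, attaching the listener would presumably follow the usual DJL TrainingListener pattern; a sketch under that assumption (the AlgebraicListener constructor argument shown here is my guess, not taken from this PR):

import ai.djl.training.DefaultTrainingConfig;
import ai.djl.training.listener.AlgebraicListener;
import ai.djl.training.loss.Loss;

// Sketch only: the constructor argument (an output path for the generated
// Python program) is assumed, not confirmed by this PR.
DefaultTrainingConfig config =
        new DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
                .addTrainingListeners(new AlgebraicListener("build/algebraic_log.py"));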

SidneyLann and others added 30 commits September 19, 2023 17:36
@patins1 requested review from zachgk, frankfliu and a team as code owners on January 26, 2024 06:15
@codecov-commenter commented on Feb 2, 2024

Codecov Report

Attention: Patch coverage is 75.13514%, with 92 lines in your changes missing coverage. Please review.

Project coverage is 72.33%. Comparing base (bb5073f) to head (5d65575).
Report is 1002 commits behind head on master.

Files Patch % Lines
...i/src/main/java/ai/djl/training/listener/Node.java 59.29% 75 Missing and 6 partials ⚠️
...va/ai/djl/training/listener/AlgebraicListener.java 92.71% 3 Missing and 8 partials ⚠️


Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2965      +/-   ##
============================================
+ Coverage     72.08%   72.33%   +0.24%     
- Complexity     5126     7381    +2255     
============================================
  Files           473      724     +251     
  Lines         21970    32886   +10916     
  Branches       2351     3438    +1087     
============================================
+ Hits          15838    23789    +7951     
- Misses         4925     7460    +2535     
- Partials       1207     1637     +430     


@zachgk (Contributor) commented on Feb 27, 2024

Hi @patins1. I think this PR is going in a useful direction, but we may need to make some changes. Let me start by putting it into the context I am approaching it from.

A number of other imperative deep learning frameworks (PyTorch, MXNet, etc.) eventually reached a stage where they wanted to convert models from imperative (embedded in Python code) to symbolic (a standalone data structure). From the symbolic format, you can do many useful things, such as easier importing/exporting or full compiler-style optimizations.

There are two major ways this is done: tracing or scripting. In tracing, you run a forward pass on your model and observe which operations are run; from the trace you can then reconstruct the model in symbolic form. The other approach is to do static analysis and look at the Python/Java code itself to convert it into the equivalent data structure format. For example, see TorchScript.

In that sense, this algebraic logging PR seems to be a tracing method that exports into a Python Keras model. I have a few large concerns. The first is that it only works on MXNet. The main MXNet project is abandoned, so we want to focus development on the maintained engines; ideally, it should be engine-agnostic rather than targeted to a particular engine. The other is that we want to design an implementation that could expand to other output formats (Python with PyTorch, TorchScript, maybe a DJL custom format, etc.).

So using the global record is probably not going to work. Not all engines support a generic invoke. My thought is that we could have a TracingNDArray that wraps around an NDArray and will execute the wrapped NDArray operations while also recording the operations executed.

Then, we probably want to do a two-step recording. The first step would record the operation name and args into some standard DJL format. In the second step, that format would be converted into the desired target (python keras). So calling the core pieces would look something like:

TracedNDArray result = myOperation(new TracedNDArray(input1), new TracedNDArray(input2));
Symbolic symbolicMyOperation = result.getTrace();
PyKerasExporter.export(symbolicMyOperation, path);

A solution like this would work with all engines because it just uses the NDArray class itself. We can add some helpers onto Trainer to simplify tracing, such as Symbolic symbolicMyOperation = trainer.trace(). We could add other exporter classes based on Symbolic. And finally, we could even try to build a scripting-based strategy (perhaps leveraging known Blocks) later on. As long as that scripting targets the same Symbolic class, it could share the same pool of exporters.
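
To make the wrapper idea concrete, here is a minimal sketch of my own (not part of this PR or the DJL API): a TracedNDArray-style wrapper that delegates to a real NDArray and records each operation. A real implementation would cover the full ai.djl.ndarray.NDArray interface and record arguments as well; only two operations are shown.

import ai.djl.ndarray.NDArray;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed tracing wrapper; it does not implement
// the full ai.djl.ndarray.NDArray interface, only two ops for brevity.
public class TracedNDArray {
    private final NDArray delegate;
    private final List<String> trace;

    public TracedNDArray(NDArray delegate) {
        this(delegate, new ArrayList<>());
    }

    private TracedNDArray(NDArray delegate, List<String> trace) {
        this.delegate = delegate;
        this.trace = trace;
    }

    public TracedNDArray add(TracedNDArray other) {
        // Step 1: record the operation name (a fuller version would also record arguments).
        trace.add("add");
        return new TracedNDArray(delegate.add(other.delegate), trace);
    }

    public TracedNDArray matMul(TracedNDArray other) {
        trace.add("matMul");
        return new TracedNDArray(delegate.matMul(other.delegate), trace);
    }

    public List<String> getTrace() {
        // Step 2 would convert this intermediate trace into a target format such as Python/Keras.
        return trace;
    }
}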

Does this make sense? Also, feel free to share any concerns or alternative suggestions to my proposal.

@patins1 (Contributor, Author) commented on Mar 2, 2024

Hi @zachgk, thanks for your thoughts.

You sketched a class Symbolic, which I assume would capture the DJL custom format you mentioned.
MXNet has a similar concept of a symbolic model, which you can activate using the symbolic-model option of ai.djl.examples.training.util.Arguments:

Use symbolic model, use imperative model if false

So I would assume that the symbolic model that can be loaded for MXNet also contains control flow statements, and this would be the difference from your Symbolic class, which I guess is the equivalent of my Node class.
Given that we would only support imperative models/graphs, I don't see that it is worth the effort to go in the direction you propose; it offers too little value, as imperative models are not representative of the whole model (they capture only single traces by nature). Still, it is interesting to just log the graphs as I did, for transparency/QA reasons.

On the other hand, DJL already supports a symbolic format at the block level, and it would be an interesting extension to DJL to write converters from it to the block-level equivalent in TensorFlow (Keras layers) or in PyTorch (the torch.nn package). I might look into the former transformation at some point in the future.

I had no idea MXNet is abandoned; personally I use the PyTorch engine when working with DJL, but for this logging feature I had to use MXNet. And that's the beauty of DJL: I can switch easily to MXNet without changing my code. Awesome!

From my discoveries implementing this feature, I realized that PyTorch and MXNet are quite alike, while TensorFlow showed major differences:

  1. For the convolution operation, the channels have a different dimensional order, which makes an additional transpose operation necessary (see the sketch after this list). I moved this transpose operation from the forward computation graph to the weight-initialization section in my last commit (it doesn't really make a difference, but now the corresponding model weight parameters have a different shape).
  2. TensorFlow needs additional tf.nn.bias_add calls for various operations, while PyTorch/MXNet have this already incorporated, e.g. in the bias parameter of torch.nn.Conv2d.
  3. My next big challenge is to log RNN operations. I thought I could use tf.compat.v1.nn.dynamic_rnn, but it is deprecated and TensorFlow wants you to use tf.keras.layers.RNN, which is a layer concept that I deliberately didn't want to use. To be continued.
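
Regarding point 1, the layout difference is NCHW (PyTorch/MXNet) versus NHWC (TensorFlow's default); a small sketch of the kind of transpose involved, using the standard DJL NDArray API (illustration only, not code from this PR):

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;

try (NDManager manager = NDManager.newBaseManager()) {
    // NCHW layout as used by PyTorch/MXNet: (batch, channels, height, width).
    NDArray nchw = manager.ones(new Shape(1, 3, 28, 28));
    // Reorder the axes to NHWC, the layout TensorFlow's conv layers expect by default.
    NDArray nhwc = nchw.transpose(0, 2, 3, 1); // (1, 28, 28, 3)
}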

@patins1 (Contributor, Author) commented on Mar 2, 2024

To give an example of how the block-level model built by TrainMnist.java would be converted to TensorFlow:

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(64, activation='relu'),
  tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

@zachgk (Contributor) commented on Mar 4, 2024

Yeah. I borrowed the name Symbolic from MXNet (which I used to work on). But they all have it: see the blog post "What are Symbolic and Imperative APIs in TensorFlow 2.0?".

So control flow is a tricky part of the story. Symbolic formats can be viewed almost like programming languages and can contain control flow. But this is where the tracing/scripting methodologies differ the most. Tracing can't detect control flow; instead, it interprets the paths taken by the control flow as if they were hard-coded. This can work fine if the paths are fixed, such as a for loop over all of the layers in the model. For paths that vary, such as those based on the input arguments, tracing simply won't work for those model designs. So, even if a Symbolic format has control flow capability, the tracing methodology can't make use of it.
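
As a concrete illustration of that limitation (my example, not from the discussion): a forward function with input-dependent branching traces as whichever single path the example input happens to take.

import ai.djl.ndarray.NDArray;

// Input-dependent control flow: a trace records only the branch taken for the
// particular input used during tracing, so the other branch is silently lost.
NDArray forward(NDArray x) {
    if (x.sum().getFloat() > 0) { // data-dependent condition
        return x.mul(2);
    } else {
        return x.neg();
    }
}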

This is where some of the goal for scripting comes in. Scripting can recognize control flow and treat it appropriately (assuming the Symbolic format can express the necessary control flow logic). However, scripting must also deal with other logic in the source programming language (Python/Java), such as classes, function calls, recursion, and other data types. It also needs some avenue through which it is called and has access to the source code. This is less of a problem in dynamic Python, but in Java it would require either running before the Java compiler or working with the compiled Java bytecode. Overall, it is a more difficult but more powerful path.

Now, DJL blocks are not actually a symbolic format. Imperative formats still use features of their source programming languages, like class hierarchies. As an example, imperative PyTorch includes the Module class.

There are two major differences that separate DJL blocks from a symbolic format. The first is the treatment of primitive vs. compound features. In DJL, you can think of blocks as either primitive blocks that call the actual engine operators or compound blocks that only call other blocks. If it were properly symbolic, a converter would only require defining the conversion for primitive blocks. As an analogy, a language like Java has primitives (defined in the Java language spec) and compounds (code written in the language). Tools like the Java compiler require custom handling for all primitives but work on any arbitrary Java code. A DJL block converter, however, would never be finished: it would require implementations for every block any user might create.

The second difference comes from LambdaBlocks. These are blocks that can contain arbitrary Java code, so there is no way to write a converter that works for LambdaBlocks without going back to the methodologies of tracing or scripting to convert the arbitrary Java code into Symbolic. This is a fairly big issue, as we try to use LambdaBlocks whenever no parameters are necessary, including most activations and pooling, in addition to other arbitrary code like reshape, transpose, flatten, etc.
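
For example (my illustration, not code from the DJL codebase), a typical parameter-free LambdaBlock wraps arbitrary Java code that a block-level converter cannot look inside:

import ai.djl.ndarray.NDList;
import ai.djl.nn.LambdaBlock;

// The lambda body is opaque Java code: a block-level converter sees only a
// LambdaBlock, not the flatten it performs, so tracing or scripting is needed.
LambdaBlock flatten =
        new LambdaBlock(list -> new NDList(list.singletonOrThrow().flatten()));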

@patins1 (Contributor, Author) commented on Mar 16, 2024

Similar to your abstraction, it makes sense to divide all blocks in DJL into primitive blocks on the one side and compound/lambda blocks on the other. If we apply this division to other engines as well, we could postulate that primitive blocks can be converted easily across engines, while the other blocks can't be, or we don't care about them for now. The question is then to find the set of primitive blocks that should be supported; hopefully 90% of networks can then be transformed easily between engines that share this common set of primitive blocks. I'll have a look at whether this is feasible for MNIST and ResNet, which are the study objects of this pull request.
