add algebraic logging #2965
base: master
Conversation
Codecov Report

Attention: Patch coverage is …

```
@@             Coverage Diff              @@
##             master    #2965      +/-   ##
============================================
+ Coverage     72.08%   72.33%    +0.24%
- Complexity     5126     7381     +2255
============================================
  Files           473      724      +251
  Lines         21970    32886    +10916
  Branches       2351     3438     +1087
============================================
+ Hits          15838    23789     +7951
- Misses         4925     7460     +2535
- Partials       1207     1637      +430
```

View full report in Codecov by Sentry.
Hi @patins1. I think this PR is going in a useful direction, but we may need to make some changes. Let me start by putting it into the context I am approaching it from.

A number of the other imperative deep learning frameworks (PyTorch, MXNet, etc.) eventually reached a stage where they wanted to convert models from being imperative (embedded into Python code) into symbolic (a standalone data structure). From the symbolic format, you can do lots of useful things such as easier importing/exporting or full compiler-style optimizations. There are two major ways this is done: tracing or scripting. In tracing, you run a forward pass on your model and observe which operations are run; from the trace you can then reconstruct it as your symbolic model. The other approach is to do static analysis and look at the Python/Java code itself to convert it into the equivalent data structure format. For example, see TorchScript. In that sense, this algebraic logging PR is a tracing method that exports into a Python Keras model.

I have a few large concerns. The first is that it only works on MXNet. The main MXNet project is abandoned, so we want to focus development on the maintained engines. Or, ideally, it should be engine agnostic rather than targeted to a particular engine. The other is that we want to design an implementation that could expand to other output formats (Python with PyTorch, TorchScript, maybe a DJL custom format, etc.). So using the global record is probably not going to work. Not all engines support a generic invoke.

My thought is that we could have a `TracedNDArray` that wraps an `NDArray` and records the operations called on it. Then, we probably want to do a two-step recording. The first step would record the operation name and args into some standard DJL format. In the second step, that format would be converted into the desired target (Python Keras). So calling the core pieces would look something like:

```java
TracedNDArray result = myOperation(new TracedNDArray(input1), new TracedNDArray(input2));
Symbolic symbolicMyOperation = result.getTrace();
PyKerasExporter.export(symbolicMyOperation, path);
```

A solution like this would work with all engines because it just uses the `NDArray` interface. Does this make sense? Also, feel free to share any concerns or alternative suggestions to my proposal.
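A minimal sketch of what such a two-step recording could look like, assuming hypothetical `TracedNDArray`, `Symbolic`, and exporter classes (none of these exist in DJL today; trace merging and most operators are omitted):

```java
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the two-step recording idea; not part of DJL. */
public class TracingSketch {

    /** Step 1 target: an engine-agnostic record of the operations that were run. */
    static final class Symbolic {
        final List<String> ops = new ArrayList<>();
    }

    /** Wraps a real NDArray and records every operation applied to it. */
    static final class TracedNDArray {
        final NDArray array;   // the real computation still runs on the engine
        final Symbolic trace;  // engine-independent record of operations

        TracedNDArray(NDArray array) {
            this(array, new Symbolic());
        }

        private TracedNDArray(NDArray array, Symbolic trace) {
            this.array = array;
            this.trace = trace;
        }

        TracedNDArray add(TracedNDArray other) {
            trace.ops.add("add"); // trace merging between operands omitted for brevity
            return new TracedNDArray(array.add(other.array), trace);
        }

        TracedNDArray matMul(TracedNDArray other) {
            trace.ops.add("matMul");
            return new TracedNDArray(array.matMul(other.array), trace);
        }

        Symbolic getTrace() {
            return trace;
        }
    }

    /** Step 2: convert the standard trace into a target format (here just pseudo-Python). */
    static String exportToPython(Symbolic symbolic) {
        StringBuilder py = new StringBuilder("# generated from trace\n");
        for (String op : symbolic.ops) {
            py.append("# op: ").append(op).append('\n');
        }
        return py.toString();
    }

    public static void main(String[] args) {
        try (NDManager manager = NDManager.newBaseManager()) {
            TracedNDArray a = new TracedNDArray(manager.ones(new Shape(2, 2)));
            TracedNDArray b = new TracedNDArray(manager.ones(new Shape(2, 2)));
            TracedNDArray result = a.matMul(b).add(b);
            System.out.println(exportToPython(result.getTrace()));
        }
    }
}
```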
Hi @zachgk, thanks for your thoughts. You sketched a class `Symbolic`. So I would assume that the symbolic model that can be loaded for MXNet also contains control flow statements, and this would be the difference to your `Symbolic` class.

On the other hand, DJL already supports a symbolic format at the block level, and it would be an interesting extension to DJL to write converters from it to the block-level equivalent of TensorFlow (Keras layers) or to the equivalent of PyTorch (the torch.nn package). I might look into the former transformation at some time in the future.

I had no idea MXNet is abandoned; personally I use the PyTorch engine when working with DJL, but for this logging feature I had to use MXNet. And that's the beauty of DJL: I can switch easily to MXNet without changing my code, awesome!

From my discoveries implementing this feature, I realized that PyTorch and MXNet are quite alike, while TensorFlow showed major differences.
To give an example of how the block-level model built by TrainMnist.java would be converted to TensorFlow:
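As a rough stand-in for that example: the block-level model TrainMnist.java builds is the standard DJL `Mlp`, shown here with an assumed Keras counterpart noted for each block (the unit sizes follow the DJL example defaults; the mapping comments are assumptions, not output of this PR):

```java
import ai.djl.nn.Activation;
import ai.djl.nn.Blocks;
import ai.djl.nn.SequentialBlock;
import ai.djl.nn.core.Linear;

/** Block-level structure of the MNIST Mlp, annotated with assumed Keras equivalents. */
public class MnistBlockSketch {
    public static SequentialBlock buildMlp() {
        return new SequentialBlock()
                // batch flatten 28x28 -> 784    ~ tf.keras.layers.Flatten()
                .add(Blocks.batchFlattenBlock(28 * 28))
                // Linear(128) + relu            ~ tf.keras.layers.Dense(128, activation="relu")
                .add(Linear.builder().setUnits(128).build())
                .add(Activation::relu)
                // Linear(64) + relu             ~ tf.keras.layers.Dense(64, activation="relu")
                .add(Linear.builder().setUnits(64).build())
                .add(Activation::relu)
                // Linear(10) output logits      ~ tf.keras.layers.Dense(10)
                .add(Linear.builder().setUnits(10).build());
    }
}
```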
Yeah. I borrowed the name of `Symbolic`.

So control flow is a tricky part of the story. Symbolic formats can be viewed almost like programming languages and can contain control flow. But this is where the tracing/scripting methodologies differ the most. With tracing, it can't detect the control flow; instead, it ends up interpreting the paths taken by the control flow as if they were hard-coded. This can work fine if the paths are fixed, such as a for loop through all of the layers in the model. For paths that vary, such as based on the input arguments, tracing simply won't work for those model designs. So, even if the symbolic format has control flow capability, the tracing methodology can't make use of it.

This is where some of the goal for scripting comes in. Using scripting, it can recognize control flow and treat it appropriately (assuming the symbolic format can express the necessary control flow logic). However, scripting must also deal with other logic in the source programming language (Python/Java) such as classes, function calls, recursion, other data types, etc. It also needs some avenue to be called from where it has access to the source code. This is less of a problem in dynamic Python, but in Java it would require either running before the Java compiler or using the compiled Java byte code. Overall, it is a more difficult but more powerful path.

Now, DJL blocks are not actually a symbolic format. Imperative formats still use features of their source programming languages like class hierarchies; as an example, the imperative PyTorch includes the Module class. There are two major differences that separate the DJL blocks from a symbolic format.

The first is its treatment of primitive vs. compound features. In DJL, you can think of blocks as either being primitive blocks that call the actual engine operators or compound blocks that only call other blocks. If it were properly symbolic, a converter would require only defining the conversion for primitive blocks. As an analogy, a language like Java has primitives (defined in the Java language spec) and compounds (code written in the language). Tools like the Java compiler require custom handling for all primitives but work on any arbitrary Java code. However, no DJL block converter would ever be finished: it would require implementations for every block any user might create.

The second difference comes from LambdaBlocks. These are blocks that can contain arbitrary Java code. So there is no way to write a converter that works for LambdaBlocks without going back to the methodologies of tracing or scripting to convert the arbitrary Java code into symbolic form. This is a fairly big issue, as we try to use LambdaBlocks whenever no parameters are necessary, including most activations and pooling, in addition to other arbitrary code like reshapes, transpose, flatten, etc.
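To make the LambdaBlock point concrete, here is a small compound block mixing a primitive `Linear` block with a `LambdaBlock` (a minimal illustration, not code from this PR): the `Linear` can be mapped by a block-level converter, but the lambda body is arbitrary Java that only tracing or scripting could recover.

```java
import ai.djl.ndarray.NDList;
import ai.djl.nn.LambdaBlock;
import ai.djl.nn.SequentialBlock;
import ai.djl.nn.core.Linear;

/** Why LambdaBlocks stop a purely block-level converter. */
public class LambdaBlockExample {
    public static SequentialBlock build() {
        return new SequentialBlock()
                // primitive, parameterized block: straightforward to map to Dense / nn.Linear
                .add(Linear.builder().setUnits(64).build())
                // arbitrary Java code wrapped in a LambdaBlock: opaque to a block-level converter
                .add(new LambdaBlock(list ->
                        new NDList(list.singletonOrThrow().transpose().flatten())));
    }
}
```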
Similar to your abstraction, it makes sense to divide all blocks in DJL into primitive blocks on the one side and compound/lambda blocks on the other. If we apply this division to other engines as well, we could postulate that primitive blocks can be converted across engines easily, while the other blocks can't (or we don't care for now). The question is then to find the set of primitive blocks that shall be supported; hopefully 90% of networks could then be transformed easily between the engines that share this common set of primitive blocks. I'll have a look at whether this is feasible for MNIST and ResNet, which are the study objects of this pull request.
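To sketch what such a shared set could look like, a simple lookup from DJL primitive block types to rough counterparts in the other engines might be enough as a starting point (the entries below are illustrative guesses, not a vetted list):

```java
import java.util.Map;

/** Illustrative, non-exhaustive mapping of candidate primitive blocks across engines. */
public final class PrimitiveBlockMap {
    /** DJL block class -> {assumed Keras equivalent, assumed torch.nn equivalent}. */
    public static final Map<String, String[]> PRIMITIVES = Map.of(
            "ai.djl.nn.core.Linear",
            new String[] {"tf.keras.layers.Dense", "torch.nn.Linear"},
            "ai.djl.nn.convolutional.Conv2d",
            new String[] {"tf.keras.layers.Conv2D", "torch.nn.Conv2d"},
            "ai.djl.nn.norm.BatchNorm",
            new String[] {"tf.keras.layers.BatchNormalization", "torch.nn.BatchNorm2d"},
            "ai.djl.nn.recurrent.LSTM",
            new String[] {"tf.keras.layers.LSTM", "torch.nn.LSTM"});

    private PrimitiveBlockMap() {}
}
```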
Description
Supported by a dedicated training listener, algebraic operations executed during training can be recorded and stored as a Python program.
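For context, usage would presumably look like any other DJL training listener. In the sketch below, the listener name and constructor shown in the comment are hypothetical placeholders, not the actual API added by this PR:

```java
import ai.djl.Model;
import ai.djl.basicdataset.cv.classification.Mnist;
import ai.djl.basicmodelzoo.basic.Mlp;
import ai.djl.ndarray.types.Shape;
import ai.djl.training.DefaultTrainingConfig;
import ai.djl.training.EasyTrain;
import ai.djl.training.Trainer;
import ai.djl.training.listener.TrainingListener;
import ai.djl.training.loss.Loss;

/** Hedged usage sketch: training MNIST with a recording listener attached. */
public class AlgebraicLoggingUsage {
    public static void main(String[] args) throws Exception {
        Mnist mnist = Mnist.builder().setSampling(32, true).build();
        mnist.prepare();

        DefaultTrainingConfig config =
                new DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
                        .addTrainingListeners(TrainingListener.Defaults.logging());
        // The listener from this PR would be registered the same way, e.g.
        //     .addTrainingListeners(new AlgebraicListener("build/model.py"))
        // (class name and constructor above are hypothetical placeholders).

        try (Model model = Model.newInstance("mlp")) {
            model.setBlock(new Mlp(28 * 28, 10, new int[] {128, 64}));
            try (Trainer trainer = model.newTrainer(config)) {
                trainer.initialize(new Shape(1, 28 * 28));
                EasyTrain.fit(trainer, 2, mnist, null);
            }
        }
    }
}
```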
In order not to record the concrete batch size used during training, -1 is now used in some places in the existing Java code as the value for the batch dimension. This is backwards compatible, as the underlying engines infer the right value from the size of the array.
In case different epochs, or even different batches within an epoch, use different prediction/loss functions, multiple prediction/loss functions are generated (a Python comment will indicate how often each is "used"). The MNIST and ResNet examples each generated only one prediction/loss function, which is unit-tested and which I also tested in a TensorFlow program to yield the same results as the original DJL model. It will be interesting to test other models in the future.
The algebraic logging currently works only with MXNet, because the PyTorch engine doesn't build up a data structure describing the executed operation and its arguments.