Reproductions of basic modern deep learning models:
- Attention
- Multi-Head Attention
- GPT-2
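
As a rough illustration of the first item in the list above, here is a minimal sketch of scaled dot-product attention in plain Python. It is not the code from this repository: the function names `softmax` and `attention` and the `causal` flag are assumptions made for this example. Multi-head attention applies the same operation to several projected slices of the input, and GPT-2 uses the causal (masked) variant.

```python
import math

def softmax(xs):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention on lists of row vectors.

    Q: n_q rows of size d_k, K: n_k rows of size d_k, V: n_k rows of size d_v.
    With causal=True, query i may only attend to keys j <= i.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                # Masked positions get zero weight after the softmax.
                scores.append(float("-inf"))
            else:
                dot = sum(qi * ki for qi, ki in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        # Each output row is the attention-weighted sum of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)])
    return out

K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
Q = [[0.0, 0.0], [0.0, 0.0]]  # zero queries give uniform attention weights

plain = attention(Q, K, V)
masked = attention(Q, K, V, causal=True)
```

With zero queries every key receives the same score, so each row of `plain` is the average of the value rows. In `masked`, the first query can only see the first key, so its output is the first value row.
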
The following custom optimizers are implemented according to my own understanding. Issues and questions are welcome!
- SGD
- Momentum SGD
- Nesterov SGD
- Adam
- Nadam
- AdamW (may still contain bugs)

Weight decay is not supported yet. It will be added soon!
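
For reference, the update rules behind several of the optimizers listed above can be sketched in a few lines of plain Python. This is a simplified illustration, not this repository's implementation: the function names `sgd_step`, `momentum_step`, and `adamw_step` are hypothetical, and the momentum convention used here (`v = mu * v + g`, with the Nesterov step `g + mu * v`) is one common formulation among several. The `weight_decay` argument shows one way decoupled weight decay could be added; with `weight_decay=0.0` the last function reduces to plain Adam.

```python
import math

def sgd_step(w, g, lr=0.1):
    # Plain SGD: move against the gradient.
    return [wi - lr * gi for wi, gi in zip(w, g)]

def momentum_step(w, g, v, lr=0.1, mu=0.9, nesterov=False):
    # Velocity accumulates past gradients (assumed convention: v = mu * v + g).
    v = [mu * vi + gi for vi, gi in zip(v, g)]
    if nesterov:
        # Nesterov variant: look ahead along the updated velocity.
        step = [gi + mu * vi for gi, vi in zip(g, v)]
    else:
        step = v
    w = [wi - lr * si for wi, si in zip(w, step)]
    return w, v

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    # t is the 1-based step count, used for bias correction.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    new_w = []
    for wi, mi, vi in zip(w, m, v):
        m_hat = mi / (1 - beta1 ** t)
        v_hat = vi / (1 - beta2 ** t)
        # Decoupled weight decay: applied to the weight directly,
        # not mixed into the gradient.
        update = m_hat / (math.sqrt(v_hat) + eps) + weight_decay * wi
        new_w.append(wi - lr * update)
    return new_w, m, v

# One step of each rule on a single scalar parameter.
w_sgd = sgd_step([1.0], [1.0])
w_mom, v_mom = momentum_step([1.0], [1.0], [0.0])
w_nag, v_nag = momentum_step([1.0], [1.0], [0.0], nesterov=True)
w_adam, m1, v1 = adamw_step([0.0], [1.0], [0.0], [0.0], t=1)
w_decay, _, _ = adamw_step([1.0], [0.0], [0.0], [0.0], t=1,
                           lr=0.1, weight_decay=0.1)
```

On the first step, bias correction makes the Adam update roughly equal to `lr` in magnitude regardless of the gradient scale. With a zero gradient, the decoupled decay term alone shrinks the weight by `lr * weight_decay * w`.
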