
Explicit convmixer cifar10 #374

Open: mcabbott wants to merge 7 commits into master from explicit_convmixer_cifar10

Conversation

@mcabbott (Member) commented Dec 9, 2022:

This previously didn't have a manifest.

Needs a correction to the translation rule for Optimisers.ClipNorm in Flux; fixed in FluxML/Flux.jl#2145.

Also, a simple test showed increasing loss, so perhaps something else is broken?

@ToucheSir (Member):

The original README does have some notes about achievable accuracy, but I don't believe anyone verified those numbers. One thing that could help is using the built-in ConvMixer model in Metalhead, because that has undergone basic convergence testing.

@mcabbott added the "update (making a model work with latest Flux, etc)" label on Dec 11, 2022
@mcabbott force-pushed the explicit_convmixer_cifar10 branch 2 times, most recently from 1333b75 to 7f13463 on February 5, 2023 15:39
Comment on lines -54 to -56
function create_loss_function(dataloader, device)

function loss(model)
@mcabbott (Member, Author):

Functions which create functions seem like an anti-pattern.
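
A minimal sketch of the alternative, assuming Flux is loaded and (x, y) is one batch from the loader: a plain loss function that takes the model and the data directly, so no closure factory is needed.

using Flux

loss(model, x, y) = Flux.logitcrossentropy(model(x), y)

# gradients are then taken with respect to the model itself, explicit-style
grads = gradient(m -> loss(m, x, y), model)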

Comment on lines -93 to -96
if use_cuda
    device = gpu
    @info "Training on GPU"
else
@mcabbott (Member, Author):

This @info will lie to you unless you edit the file to change use_cuda. It's simpler just to call gpu, which does nothing if you don't have a GPU.
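
A minimal sketch of the simplification, assuming model and one batch (x1, y1) already exist; gpu returns its argument unchanged on machines without a working GPU backend, so no if/else or @info is needed.

model = gpu(model)        # moves the model to the GPU if one is available, otherwise a no-op
x, y = gpu(x1), gpu(y1)   # the same call works for each batch inside the training loop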

BatchNorm(dim),
[
Chain(
SkipConnection(Chain(Conv((kernel_size,kernel_size), dim=>dim, gelu; pad=SamePad(), groups=dim), BatchNorm(dim)), +),
@mcabbott (Member, Author):

No functional change to the model, just indents etc.
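
For context, a sketch of how this block sits inside a ConvMixer-style Chain. The patch-embedding and classifier layers, and the names dim, depth, kernel_size, patch_size and nclasses, are assumptions following the standard ConvMixer recipe, not necessarily the exact code in this file.

Chain(
    Conv((patch_size, patch_size), 3 => dim, gelu; stride = patch_size),  # patch embedding
    BatchNorm(dim),
    [Chain(
        SkipConnection(
            Chain(Conv((kernel_size, kernel_size), dim => dim, gelu; pad = SamePad(), groups = dim),
                  BatchNorm(dim)),
            +),                             # depthwise spatial mixing with a residual connection
        Conv((1, 1), dim => dim, gelu),     # pointwise channel mixing
        BatchNorm(dim),
     ) for _ in 1:depth]...,
    AdaptiveMeanPool((1, 1)),
    Flux.flatten,
    Dense(dim => nclasses),
)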

Comment on lines 97 to 104
  opt = OptimiserChain(
      WeightDecay(1f-3),
-     ClipNorm(1.0),
-     ADAM(η)
+     ClipNorm(1f0),
+     Adam(η),
  )
@mcabbott (Member, Author):

This exposes a real error: composing ClipNorm after another optimiser means it tries to take norm(::Broadcasted) which fails:

julia> train(epochs=10, images=128)
ERROR: Scalar indexing is disallowed.
...
Stacktrace:
...
 [12] norm(itr::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, NTuple{4, Base.OneTo{Int64}}, typeof(+), Tuple{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(*), Tuple{Float32, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}}}}}, p::Float32)
    @ LinearAlgebra ~/julia-9ded051e9f/share/julia/stdlib/v1.10/LinearAlgebra/src/generic.jl:596
 [13] apply!(o::Optimisers.ClipNorm{Float32}, state::Nothing, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, dx::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, NTuple{4, Base.OneTo{Int64}}, typeof(+), Tuple{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(*), Tuple{Float32, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}}}}})
    @ Optimisers ~/.julia/packages/Optimisers/kPdJV/src/rules.jl:584
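
For reference, a minimal sketch of how the chained rules are applied with explicit Optimisers.jl state, which is where the norm call above is reached; model, x and y are assumed to come from the training script.

using Flux, Optimisers

opt = OptimiserChain(WeightDecay(1f-3), ClipNorm(1f0), Adam(3f-4))
state = Optimisers.setup(opt, model)

grads = gradient(m -> Flux.logitcrossentropy(m(x), y), model)
state, model = Optimisers.update!(state, model, grads[1])  # ClipNorm receives WeightDecay's output here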

@mcabbott (Member, Author):

Should be fixed, but not yet tested here.

@mcabbott force-pushed the explicit_convmixer_cifar10 branch from 7f13463 to 3a2580b on July 25, 2023 14:10
  end

- model = model |> cpu
- @save "model.bson" model
+ @save "model.bson" cpu(model)
@mcabbott (Member, Author):

Should really be updated to Flux.state too
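
A minimal sketch of the Flux.state + JLD2 approach, with assumed file and variable names; the model has to be re-built with the same code and sizes before the state can be loaded back in.

using Flux, JLD2

model_state = Flux.state(cpu(model))          # nested NamedTuples of plain arrays, no Flux layer types
jldsave("convmixer_state.jld2"; model_state)

# later: rebuild the model with the same hyperparameters, then restore the trained parameters
# model = build_model(...)                    # hypothetical reconstruction step
Flux.loadmodel!(model, JLD2.load("convmixer_state.jld2", "model_state"))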

@mcabbott force-pushed the explicit_convmixer_cifar10 branch from 3a2580b to ca9b934 on March 20, 2024 16:06
@mcabbott marked this pull request as ready for review on March 20, 2024 22:17
@mcabbott requested a review from CarloLucibello on March 21, 2024 02:25
opt = OptimiserChain(
    WeightDecay(1f-3), # L2 regularisation
    ClipNorm(1f0),
    Adam(3f-4), # learning rate
Reviewer (Member):

Current practice would be to use AdamW and set L2 regularization there
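
A sketch of that suggestion, assuming the Optimisers.jl AdamW(η, β, λ) positional argument order; AdamW is Adam with decoupled weight decay, so the separate WeightDecay rule goes away.

opt = OptimiserChain(
    ClipNorm(1f0),
    AdamW(3f-4, (0.9, 0.999), 1f-3),  # learning rate, momentum pair, weight decay
)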

@mcabbott (Member, Author):

Yes, but maybe it's good to show how to chain rules?

for (x1, y1) in train_loader
    # move one batch at a time to GPU; gpu(train_loader) would be another way
    x, y = gpu(x1), gpu(y1)
    grads = gradient(m -> Flux.logitcrossentropy(m(x), y; agg=sum), model)
Reviewer (Member):

why agg=sum?

@mcabbott (Member, Author):

I just left this, as it probably doesn't hurt. No idea why it's there; maybe it was like that in whatever someone followed. Showing what the keyword is doesn't seem awful.
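
For reference, agg controls how the per-example losses are combined: the default mean keeps the loss (and gradient) scale independent of the batch size, while sum scales with it.

Flux.logitcrossentropy(m(x), y)             # default agg = mean
Flux.logitcrossentropy(m(x), y; agg = sum)  # what this script uses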

cpu_model = cpu(model)
BSON.@save "model.bson" cpu_model
BSON.@save "losses.bson" train_save test_save
# it's generally more robust to save just the state, like this, but
Reviewer (Member):

but what? I would remove the model saving with BSON entirely in favor of state saving with JLD2.

@mcabbott (Member, Author):

I reverted this because maybe it's not a terrible idea to show some variety of approaches, rather than repeating exactly the same thing in every model?

The way this model is set up, you'd have to copy-paste the code with the sizes etc. to re-build the model before you could load the state back in. So perhaps leaving this is OK?

Comment on lines +122 to +123
train_save[epoch, :] = [train_loss, train_acc]
test_save[epoch, :] = [test_loss, test_acc]
Reviewer (Member):

Can we remove this history saving entirely? It seems useless.

@mcabbott (Member, Author):

It's not a crazy way to collect some numbers; maybe this pattern makes sense to some readers where others don't?
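
The pattern under discussion, assuming epochs is already defined: pre-allocate one row per epoch and fill it in as training runs.

train_save = zeros(Float32, epochs, 2)   # columns: loss, accuracy
test_save  = zeros(Float32, epochs, 2)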

Labels: update (making a model work with latest Flux, etc)
3 participants