
update for Cuda 2 #1345

Merged
merged 11 commits into from
Nov 3, 2020

Conversation

CarloLucibello
Member

and mark as broken a CUDA test on huber_loss that was already failing on master (for reasons I do not understand)

@CarloLucibello
Member Author

@DhairyaLGandhi I had to remove the Manifest since CUDA 2.0 is not compatible with Julia 1.4. I suggest we finally get rid of it (as discussed in #500, #1164, #1234).

@CarloLucibello
Member Author

@DhairyaLGandhi can we fix or remove the GitLab pipeline?

Member

@DhairyaLGandhi DhairyaLGandhi left a comment

We need the GitLab pipeline precisely for the GPU tests; best is to merge the Zygote fixes for now and clean up CI.

@@ -36,7 +36,11 @@ end
y = rand(Float32, 3,3)

for loss in ALL_LOSSES
gpu_gradtest(loss, x, y)
if loss == Flux.Losses.huber_loss
@test_broken 1 == 2
Member

Why are we adding a fake test here?

Member Author

Just wanted to remark that huber_loss is broken on GPU and that we should fix the test at some point.

Member

Why not just @test_broken the Huber loss test?

Member Author

Because it is not failing the test; it is throwing an error.

Member

@test_throws then?

Collaborator

What is the error? A CUDA-related error?

Member Author

@test_throws then?

That's supposed to test cases where you actually want an error to be thrown.
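For reference, the distinction being debated, as a minimal sketch using only the Test stdlib (not code from the PR):

```julia
using Test

@testset "broken vs throws" begin
    # @test_broken is for a test that is *expected* to fail: it records a
    # Broken result, and turns into a failure once the expression passes.
    @test_broken 1 == 2

    # @test_throws is for calls where throwing is the *desired* behavior.
    @test_throws DivideError div(1, 0)
end

# Neither fits a call that errors when it should succeed, which is why the
# PR resorts to a placeholder @test_broken inside the loss loop.
```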

@CarloLucibello
Member Author

We need the GitLab pipeline precisely for the GPU tests; best is to merge the Zygote fixes for now and clean up CI.

Doesn't seem to be a Zygote problem; it doesn't find CUDNN. When running the GPU tests with bors r+ or bors try it worked instead, last time I checked.

@CarloLucibello
Member Author

bors try

bors bot added a commit that referenced this pull request Oct 7, 2020
@bors
Contributor

bors bot commented Oct 7, 2020

try

Build failed:

@DhairyaLGandhi
Member

it doesn't find CUDNN

That was separate: some of the runners had old versions of CUDNN installed. That has since been rectified.

@DhairyaLGandhi
Member

bors try

bors bot added a commit that referenced this pull request Oct 13, 2020
@bors
Contributor

bors bot commented Oct 13, 2020

try

Build failed:

@maleadt
Collaborator

maleadt commented Oct 13, 2020

FWIW, I think committed Manifests are very useful: to reproduce the CI environment locally, to reliably bisect issues, and to make it possible to depend on unreleased dependencies.
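As an aside on the first point, reproducing a committed-Manifest environment locally is the standard Pkg workflow (a sketch; it is demonstrated on a throwaway environment here, whereas in a real checkout you would activate the repository directory):

```julia
using Pkg

# In a checkout with Project.toml and a committed Manifest.toml,
# activate + instantiate installs the exact recorded versions, reproducing CI.
# Demonstrated on a temporary (empty) environment to stay side-effect free.
Pkg.activate(temp=true)
Pkg.instantiate()  # resolves the Manifest when present; a no-op for an empty env
println("active project: ", Base.active_project())
```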

@DhairyaLGandhi
Member

DhairyaLGandhi commented Oct 14, 2020 via email

@IanButterworth
Contributor

Is there a lot of stuff holding this up? I just ask because there's a bug on nightly that makes v1 CUDA not work on macOS systems with no CUDA GPUs, so Flux doesn't work on nightly.

If there is a lot, it may make sense to backport the fix to CUDA v1.

@DhairyaLGandhi
Member

Bump :)

@CarloLucibello
Member Author

Agreed, we should hold off on any Manifest-related changes for this patch.

If I add back the Manifest, CI will fail on Julia 1.4, since it will try to use CUDA 2.0, which is not Julia 1.4 compatible.

@CarloLucibello
Member Author

bors try

bors bot added a commit that referenced this pull request Oct 16, 2020
@bors
Contributor

bors bot commented Oct 16, 2020

try

Build failed:

@IanButterworth
Contributor

If I add back the Manifest CI will fail on julia 1.4

Unless I'm missing something, 1.4 shouldn't need to be supported given 1.5 is out?

@CarloLucibello
Member Author

CarloLucibello commented Oct 17, 2020

Unless I'm missing something, 1.4 shouldn't need to be supported given 1.5 is out?

Yes, we can drop 1.4 support if we consider keeping the Manifest more important than supporting recent Julia versions (BTW, I do not, and I dislike some other aspects of keeping the Manifest in general, but I'm probably in the minority here).

@CarloLucibello
Member Author

Even with the newly released Zygote, I'm experiencing all sorts of errors on my machine, especially with RNNs. Let's see what bors says.

bors try

bors bot added a commit that referenced this pull request Oct 17, 2020
@bors
Contributor

bors bot commented Oct 17, 2020

try

Build failed:

@maleadt
Collaborator

maleadt commented Oct 17, 2020

if we consider keeping the Manifest more important than supporting recent julia versions

How are these related? Manifests aren't used by users. And for CI, we could jerry-rig something to have multiple Manifests for each Julia version.

@CarloLucibello
Member Author

CarloLucibello commented Oct 17, 2020

And for CI, we could jerry-rig something to have multiple Manifests for each Julia version.

That would be cool. Can it be done?
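One conceivable jerry-rig, sketched here purely as an assumption (file names like `Manifest-v1.3.toml` are illustrative, not something the repo has): keep one manifest per supported Julia version and copy the matching one into place in a CI step before instantiating.

```julia
# Hypothetical CI step: pick a per-Julia-version manifest.
# File names such as "Manifest-v1.3.toml" are illustrative.
ver = "v$(VERSION.major).$(VERSION.minor)"
manifest = "Manifest-$(ver).toml"
if isfile(manifest)
    cp(manifest, "Manifest.toml"; force=true)  # then Pkg.instantiate() as usual
end
println("would use ", manifest, " for Julia ", ver)
```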

@DhairyaLGandhi
Member

Could you elaborate on the errors?

@CarloLucibello
Member Author

This is what I see on my machine (where tests passed last week, I don't know why):

[ Info: Testing GPU Support
LayerNorm GPU grad test: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/layers.jl:38
  Expression: gs[first(ps)] isa Flux.CUDA.CuArray
   Evaluated: Float32[1.2636185f-5] isa CuArray
Stacktrace:
 [1] macro expansion at /home/carlo/.julia/dev/Flux/test/cuda/layers.jl:38 [inlined]
 [2] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
 [3] macro expansion at /home/carlo/.julia/dev/Flux/test/cuda/layers.jl:26 [inlined]
 [4] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
 [5] gradtest(::String, ::Array{UnionAll,1}, ::Array{Float32,4}, ::Int64) at /home/carlo/.julia/dev/Flux/test/cuda/layers.jl:24
[ Info: Testing Flux/CUDNN
R = RNN, batch_size = 1: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:30
  Expression: y ≈ collect(cuy)
   Evaluated: [-0.019431099902958646, -0.11841413968698347, 0.4722386310464857, -0.14604490552473026, -0.64691227373787] ≈ Float32[0.0, 0.0, 0.0, 0.0, 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:30
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = RNN, batch_size = 1: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:38
  Expression: x̄ ≈ collect(cux̄)
   Evaluated: [-0.830899605563262, -0.6662040633027664, -2.1067316948003136, 1.8660190087509037, -0.22096769547077244, 0.6238499990006453, 0.9966454260407338, -1.722732776318945, 1.6458662195438152, 0.86747077146643] ≈ Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:38
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = RNN, batch_size = 1: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:47
  Expression: (m̄[]).state ≈ collect((cum̄[]).state)
   Evaluated: [-0.8405158956849136, 2.061215493843378, 0.14061960905985366, -1.8565078327587845, 0.4424841142453512] ≈ Float32[0.0, 0.0, 0.0, 0.0, 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:47
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = RNN, batch_size = 5: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:30
  Expression: y ≈ collect(cuy)
   Evaluated: [-0.019431099902958646 0.1618381490680678 … -0.22441343007056594 0.047527729285282455; -0.11841413968698347 0.748564964186107 … 0.25762538051276473 0.565018971059919; … ; -0.14604490552473026 0.44967100141219984 … 0.11898398784283139 -0.20242204874078795; -0.64691227373787 -0.8838368682898494 … -0.8729884936098729 -0.8352227805469519] ≈ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:30
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = RNN, batch_size = 5: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:38
  Expression: x̄ ≈ collect(cux̄)
   Evaluated: [-1.2563565767280167 -0.32043449857558765 … 0.6598185823329246 -0.8743064105599022; 0.4139337826036863 -0.0489961935370902 … 0.7104068283356213 -0.21005488704762065; … ; -1.1329831875497842 0.09261050769960308 … -0.3669623959899771 0.11156129322481852; -0.0898901569255448 -0.3161775390020851 … 0.21676706154085665 -0.09266956050321835] ≈ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:38
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = RNN, batch_size = 5: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:47
  Expression: (m̄[]).state ≈ collect((cum̄[]).state)
   Evaluated: [1.1450011499435349, -3.070202557803463, -1.1615876545036332, -0.04242987457808446, 0.2762399538473071] ≈ Float32[0.0, 0.0, 0.0, 0.0, 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:47
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = GRU, batch_size = 1: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:30
  Expression: y ≈ collect(cuy)
   Evaluated: [-0.4275777643268079, -0.014423148979310548, 0.007237418485216347, -0.5282777904512245, -0.15764599247777863] ≈ Float32[0.0, 0.0, 0.0, 0.0, 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:30
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = GRU, batch_size = 1: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:38
  Expression: x̄ ≈ collect(cux̄)
   Evaluated: [-0.22776741748701154, 0.3209347956882381, 0.10693373763558503, -0.41107076672434856, -0.32476707616330297, 0.352650092060774, -0.04047323530121069, 0.1934077904424144, -0.2822926974842711, 0.19332311343347433] ≈ Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:38
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = GRU, batch_size = 1: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:47
  Expression: (m̄[]).state ≈ collect((cum̄[]).state)
   Evaluated: [-0.5141135637065525, -0.5711118387205082, -0.39127310148295213, 0.36650655519300074, -0.14387137441365852] ≈ Float32[-0.7155977, -0.22475724, -0.44065005, 0.53868115, -0.03074057]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:47
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = GRU, batch_size = 5: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:30
  Expression: y ≈ collect(cuy)
   Evaluated: [-0.4275777643268079 -0.3869860053459997 … -0.2683960454492888 -0.3939344448830501; -0.014423148979310548 0.0825856908074926 … 0.08759317579596015 0.11528816778371748; … ; -0.5282777904512245 -0.5154974736821768 … -0.38026836831507604 -0.38486394257549705; -0.15764599247777863 -0.1989142030500093 … -0.08267492009419829 -0.10008267859310813] ≈ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:30
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = GRU, batch_size = 5: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:38
  Expression: x̄ ≈ collect(cux̄)
   Evaluated: [-0.167915828645572 0.10664332868172638 … -0.13235601271948444 -0.0846844520822104; 0.36443219445413677 -0.024176883724530952 … 0.14256752722756888 0.01371924757029691; … ; -0.28075880626172206 0.03838631885980828 … -0.24944711109191708 -0.0435247184088787; 0.3739027735371024 -0.05672318160495745 … 0.014029365321461928 0.10232388257206129] ≈ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:38
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = GRU, batch_size = 5: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:47
  Expression: (m̄[]).state ≈ collect((cum̄[]).state)
   Evaluated: [-0.9164938006621115, 1.0211574132258705, -0.3580377108098198, -0.08689351690247682, 0.5065986969549852] ≈ Float32[-0.97730887, 0.828195, -0.37877595, 0.085035026, 0.79239774]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:47
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = LSTM, batch_size = 1: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:30
  Expression: y ≈ collect(cuy)
   Evaluated: [0.17133454661393993, 0.08491448663403546, 0.1581792194746248, 0.10928057416312513, -0.00353936869665635] ≈ Float32[0.0, 0.0, 0.0, 0.0, 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:30
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = LSTM, batch_size = 1: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:38
  Expression: x̄ ≈ collect(cux̄)
   Evaluated: [-0.030229336215151753, 0.12456343964566355, -0.01523602769955032, 0.03882078681763432, -0.19705709591272808, -0.05689166506030991, 0.09940324503612268, -0.018838281636579312, 0.09927881419095592, -0.014974470640134938] ≈ Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:38
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = LSTM, batch_size = 1: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:44
  Expression: x ≈ collect(cx)
   Evaluated: [-0.08668788044236499, -0.09631148376162414, 0.16450188507074665, 0.16419487349568404, -0.10264999946485072] ≈ Float32[0.0, 0.0, 0.0, 0.0, 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:44
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = LSTM, batch_size = 1: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:44
  Expression: x ≈ collect(cx)
   Evaluated: [0.19790952785346314, -0.14148924286137277, 0.20355174555572256, -0.23772413178250634, -0.43214533482284356] ≈ Float32[0.15913527, -0.111981004, 0.090632685, -0.26090252, -0.4198485]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:44
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = LSTM, batch_size = 5: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:30
  Expression: y ≈ collect(cuy)
   Evaluated: [0.17133454661393993 0.11182878930195422 … 0.1548314885630126 0.16292665019358618; 0.08491448663403546 0.06320487690104416 … 0.08889681684473064 0.12589660868354857; … ; 0.10928057416312513 0.06698992894957341 … 0.07453009457766963 0.06410766606551206; -0.00353936869665635 -0.027621339369615545 … -0.013245722894946247 0.008094052575241577] ≈ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:30
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = LSTM, batch_size = 5: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:38
  Expression: x̄ ≈ collect(cux̄)
   Evaluated: [-0.10268096525608542 0.00994827058360619 … 0.16947322399116943 -0.20729710678265328; 0.08206973619000309 -0.01609505278780965 … -0.12169657342862912 0.10296755973293965; … ; 0.058128216231219505 0.16542800217809261 … -0.1275825994643608 0.10090241858739234; -0.030771661228154282 0.0030305073944870728 … -0.045480045439165784 0.12628198308549338] ≈ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:38
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = LSTM, batch_size = 5: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:44
  Expression: x ≈ collect(cx)
   Evaluated: [-0.15736053657821605, -0.1298820980694558, 0.18058347703021838, 0.20431763798467564, -0.17769890918058903] ≈ Float32[0.0, 0.0, 0.0, 0.0, 0.0]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:44
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
R = LSTM, batch_size = 5: Test Failed at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:44
  Expression: x ≈ collect(cx)
   Evaluated: [0.05868754018291389, -0.12972053368132985, 0.16920241016444965, -0.18323024718071512, -0.6215401990328058] ≈ Float32[-0.0030328631, -0.15407962, 0.08194515, -0.21456608, -0.546141]
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:44
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1190
 [3] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
 [5] top-level scope at /home/carlo/.julia/dev/Flux/test/cuda/curnn.jl:16
Test Summary:                    | Pass  Fail  Broken  Total
CUDA                             |   92    21      35    148
  CUDA                           |    9                    9
  onecold gpu                    |    2                    2
  restructure gpu                |    1                    1
  GPU functors                   |    2                    2
  Losses                         |   44             1     45
  Basic GPU Movement             |    2                    2
  Conv GPU grad tests            |    6             1      7
  Pooling GPU grad tests         |    2                    2
  AdaptivePooling GPU grad tests |    2                    2
  Dropout GPU grad tests         |    1             1      2
  Normalising GPU grad tests     |    3     1              4
    LayerNorm GPU grad test      |    1     1              2
    BatchNorm GPU grad test      |    2                    2
  InstanceNorm GPU grad tests    |                  1      1
  GroupNorm GPU grad tests       |                  1      1
  Stateless GPU grad tests       |    1                    1
  CUDNN BatchNorm                |    8                    8
  R = RNN                        |    1             2      3
  R = GRU                        |    1             2      3
  R = LSTM                       |    1             2      3
  RNN                            |    6    20      24     50
    R = RNN, batch_size = 1      |    1     3       4      8
    R = RNN, batch_size = 5      |    1     3       4      8
    R = GRU, batch_size = 1      |    1     3       4      8
    R = GRU, batch_size = 5      |    1     3       4      8
    R = LSTM, batch_size = 1     |    1     4       4      9
    R = LSTM, batch_size = 5     |    1     4       4      9
ERROR: LoadError: Some tests did not pass: 92 passed, 21 failed, 0 errored, 35 broken.
in expression starting at /home/carlo/.julia/dev/Flux/test/runtests.jl:36
ERROR: Package Flux errored during testing

@CarloLucibello
Member Author

Bors instead still has problems finding CUDNN: https://gitlab.com/JuliaGPU/Flux.jl/-/jobs/796597125

@maleadt
Collaborator

maleadt commented Oct 19, 2020

Bors instead still has problems finding CUDNN: https://gitlab.com/JuliaGPU/Flux.jl/-/jobs/796597125

You're using an explicit image where you shouldn't (I thought I mentioned this somewhere already, but I can't find it...).

@maleadt
Collaborator

maleadt commented Oct 29, 2020

@maleadt can we have a new CUDA.jl release?

Yeah I'll have a look.

@maleadt
Collaborator

maleadt commented Oct 30, 2020

JuliaGPU/CUDA.jl@602c549

@CarloLucibello
Member Author

Since I removed the Manifest I also added Julia 1.3 to CI, as it should be since we support it. I have yet to see a single situation where having a Manifest in Flux provided an advantage, while I have seen several problems it induced. I strongly suggest we give Manifest removal a try and see how it goes; we can reinstate it at any point, also temporarily while working on a PR.

@CarloLucibello
Member Author

bors try

bors bot added a commit that referenced this pull request Oct 31, 2020
@bors
Contributor

bors bot commented Oct 31, 2020

try

Build succeeded:

if loss == Flux.Losses.huber_loss # remove when fixed
try
gpu_gradtest(loss, x, y)
@test 1 == 2
Member

Still not super sure about this one; maybe we can have a different dispatch on huber_loss.

Member Author

This is no big deal, we shouldn't overthink it too much. Hopefully we will fix it soon.
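The workaround under review can be stated generically (a sketch; `check_known_broken` is a hypothetical helper, not something in the PR):

```julia
using Test

# Run a call that is known to throw, recording it as Broken rather than as an
# error, and fail loudly if it unexpectedly starts working.
function check_known_broken(f)
    try
        f()
        @test 1 == 2          # reached only once the call is fixed: remove the workaround
    catch
        @test_broken 1 == 2   # known breakage, recorded without failing CI
    end
end

@testset "huber_loss-style workaround" begin
    check_known_broken(() -> error("simulated GPU failure"))
end
```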

.travis.yml Outdated
.gitignore Outdated
docs/build/
docs/site/
deps
Manifest.toml
Member

Can we re-add the manifest for this PR?

Member

I just want to understand if the environment needed to be updated

Member Author

It did, since we moved to CUDA 2.1.

@maleadt
Collaborator

maleadt commented Nov 2, 2020

@maleadt can we have a new CUDA.jl release?

...

I have yet to see a single situation where having a Manifest in Flux provided an advantage

By adding a dependency on a specific revision to the Manifest you would have been able to test the updated CUDA.jl days ago, without requiring me to tag a release. And this has been done in the past in Flux.jl:

1c0e9ac#diff-6d1205e03b7a43c01ed2ac8b8c83eb58c6f3a3b1360769e18644245abee8c565R95

@CarloLucibello
Member Author

By adding a dependency on a specific revision to the Manifest you would have been able to test the updated CUDA.jl days ago, without requiring me to tag a release. And this has been done in the past in Flux.jl:

This can be easily done by temporarily reinstating the Manifest in a PR.
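For context, tracking an unreleased revision looks roughly like this (a sketch; the rev is the CUDA.jl commit linked above, and the `Pkg.add`/`Pkg.free` calls are shown commented out since they are network operations):

```julia
using Pkg

# Pinning a dependency at a specific commit records that exact revision in
# Manifest.toml, so committing the Manifest lets CI reproduce it.
spec = PackageSpec(name="CUDA", rev="602c549")
# Pkg.add(spec)      # fetches and tracks the revision (network operation)
# Pkg.free("CUDA")   # later: return to following registry releases
println("would pin ", spec.name, " at a fixed revision")
```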

@CarloLucibello
Member Author

Can we merge this as it is? That is:

a) no Manifest → CI on Julia 1.3 + always test on an up-to-date ecosystem

Alternatives are:

b) broken CI
c) drop support for Julia < 1.5.

I strongly favor a). @DhairyaLGandhi, if you really dislike a), tell me whether I should go with b) or c); I'd like this to be merged as soon as possible.

@DhairyaLGandhi
Member

We often have cases (even with the updating of CUDNN) where we were incompatible with NNlib and CUDA at the same time, or even with LoopVectorization. I definitely feel having the Manifest is useful, and managing it on a PR-by-PR basis is untenable, since the dependencies would vary much more for every single PR, compared to having a set of known ones to go to.

It's not about disliking Julia 1.3 support (on the contrary, I dislike not supporting the latest versions of packages on LTS versions of the language); it's about moving forward, and this change is not critical to getting the CUDA updates in, which is a bigger deal to me. We used to `] up` on some maintenance PRs to keep the Manifest updated, and with CompatHelper that's somewhat managed as well.

Users on Julia 1.3 would still fetch the old dependencies when adding the package; the downside is we don't get to test it on CI with every commit.

.gitignore Outdated
docs/build/
docs/site/
deps
Manifest.toml
Member

Suggested change
Manifest.toml

Does this work to get the PR merged?

Member Author

If adding back the Manifest, we should also remove CI on Julia 1.3 from Travis; I don't really want to see red lights on all PRs.

@CarloLucibello
Member Author

These incompatibilities are very rare, surely not common enough to be worth dropping CI coverage or Julia 1.3 support.

Anyway, let's reinstate the Manifest. I wouldn't keep Julia 1.3 and 1.4 support without CI, it's not healthy, so I think we should just drop < 1.5 support. Do you agree?

@DhairyaLGandhi
Member

We can drop < 1.5 support for now, since we have some fixes on master that need to be tagged alongside CUDA 2 anyway.

@CarloLucibello
Member Author

bors r+

@bors
Contributor

bors bot commented Nov 3, 2020

Build succeeded:

@bors bors bot merged commit e00322f into master Nov 3, 2020
@DhairyaLGandhi DhairyaLGandhi deleted the cuda2 branch November 30, 2020 14:23