preparatory work for v0.12 #1527

Merged
merged 4 commits into master from cl/v012 on Mar 12, 2021

Conversation

CarloLucibello
Member

No description provided.

@DhairyaLGandhi
Member

DhairyaLGandhi commented Mar 11, 2021

There seem to be some regressions; this is not a complete list by any means, and it is pending further investigation of the ecosystem.

Metalhead should have benefited from the CUDA work but seems to have regressed in its backwards passes; this is for ResNet (by about 25%):

master

```julia
julia> include("ctest.jl")
BenchmarkTools.Trial:
  memory estimate:  3.54 MiB
  allocs estimate:  82438
  --------------
  minimum time:     357.666 ms (0.00% GC)
  median time:      413.510 ms (0.00% GC)
  mean time:        401.555 ms (0.00% GC)
  maximum time:     435.641 ms (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1
BenchmarkTools.Trial:
  memory estimate:  3.55 MiB
  allocs estimate:  80798
  --------------
  minimum time:     406.275 ms (0.00% GC)
  median time:      442.683 ms (0.00% GC)
  mean time:        461.172 ms (0.00% GC)
  maximum time:     517.030 ms (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1
BenchmarkTools.Trial:
  memory estimate:  3.56 MiB
  allocs estimate:  81026
  --------------
  minimum time:     422.131 ms (0.00% GC)
  median time:      562.444 ms (0.00% GC)
  mean time:        629.683 ms (0.00% GC)
  maximum time:     904.475 ms (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1
BenchmarkTools.Trial:
  memory estimate:  3.66 MiB
  allocs estimate:  86891
  --------------
  minimum time:     566.048 ms (2.05% GC)
  median time:      708.968 ms (1.80% GC)
  mean time:        688.524 ms (3.28% GC)
  maximum time:     770.113 ms (6.90% GC)
  --------------
  samples:          4
  evals/sample:     1
```

tag

```julia
julia> include("ctest.jl")
BenchmarkTools.Trial:
  memory estimate:  3.78 MiB
  allocs estimate:  88973
  --------------
  minimum time:     416.192 ms (0.00% GC)
  median time:      466.101 ms (0.00% GC)
  mean time:        456.617 ms (0.00% GC)
  maximum time:     493.553 ms (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1
BenchmarkTools.Trial:
  memory estimate:  3.79 MiB
  allocs estimate:  87399
  --------------
  minimum time:     486.490 ms (0.00% GC)
  median time:      513.770 ms (0.00% GC)
  mean time:        517.863 ms (0.00% GC)
  maximum time:     557.423 ms (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1
BenchmarkTools.Trial:
  memory estimate:  3.80 MiB
  allocs estimate:  87540
  --------------
  minimum time:     535.202 ms (0.00% GC)
  median time:      543.230 ms (0.00% GC)
  mean time:        549.727 ms (0.00% GC)
  maximum time:     577.248 ms (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1
BenchmarkTools.Trial:
  memory estimate:  3.90 MiB
  allocs estimate:  93363
  --------------
  minimum time:     707.908 ms (1.23% GC)
  median time:      771.742 ms (1.13% GC)
  mean time:        821.113 ms (1.23% GC)
  maximum time:     983.688 ms (1.32% GC)
  --------------
  samples:          3
  evals/sample:     1
```

- ObjectDetector used to be faster than Darknet; it now fails to load, and is slower after fixing that.
- Torch.jl uses Flux.batchnorm (the same as CUDA's until recently, or even still today), and there's no replacement.
- EquivariantNetworks can't use StaticArrays in some kernels now.

The changes made to CI are appreciated, but the Manifest change can be held off for the time being. We use it for consistency and development anyway, and we want our CI runs to be meaningful (see the grouped conv case, for example).

The change to the tests is curious, since we shouldn't be passing tests with that. The same was the case with the normalization PR, which still needs to pass tests before merging for 0.12.

@CarloLucibello
Member Author

It's hard to pinpoint the perf regressions if you don't share something reproducible. I suspect any regression is related to CUDA.jl, so there is little we can do here, but let's see when you share the script (or some reduced examples).

> ObjectDetector used to be faster than Darknet; it now fails to load, and is slower after fixing that.

Again, this should be pinpointed to some specific layer/function. The failure to load is probably due to using features at the end of their deprecation cycle (e.g. losses are not exported from Flux anymore and should be obtained from Flux.Losses). You should open an issue/PR there.
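
As a minimal sketch of that change (assuming a typical loss such as `mse`; the exact functions ObjectDetector uses may differ):

```julia
using Flux

ŷ = rand(Float32, 10)
y = rand(Float32, 10)

# previously the losses were reachable directly from Flux, e.g. Flux.mse(ŷ, y);
# they now live in the Flux.Losses module:
Flux.Losses.mse(ŷ, y)
```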

The model-zoo vision examples are working properly, and I could not see any noticeable perf regression.

> Torch.jl uses Flux.batchnorm (the same as CUDA's until recently, or even still today), and there's no replacement.

There is no such thing as Flux.batchnorm. If Torch.jl is using CUDA.CUDNN.batchnorm, then as long as it keeps its compat bound it will be fine (same as Flux); it won't be removed until CUDA 3.0.0 is released.

> EquivariantNetworks can't use StaticArrays in some kernels now.

You should open an issue with an MWE.

> The changes made to CI are appreciated, but the Manifest change can be held off for the time being. We use it for consistency and development anyway, and we want our CI runs to be meaningful (see the grouped conv case, for example).

The problem is that with a Manifest I cannot test with CUDA 2.4 and 2.6 on Julia 1.5 and 1.6 respectively.
CI runs are very meaningful: you test with the latest versions of the dependencies. The vast majority of Julia packages don't use a Manifest; I don't see why we should do something different when it has no pros and a few cons in our situation.
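
As a rough sketch (not the actual CI setup) of the kind of per-version testing this enables, a job without a checked-in Manifest can simply pin the dependency it wants before running the suite:

```julia
# hypothetical per-job setup step: resolve against a specific CUDA release
using Pkg
Pkg.add(name = "CUDA", version = "2.4")   # e.g. "2.6" on the Julia 1.6 job
Pkg.test("Flux")
```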

> The change to the tests is curious, since we shouldn't be passing tests with that.

Why not? initW has been deprecated in favour of init.
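
For illustration, a minimal sketch of the rename (keyword names as in the v0.12-style API):

```julia
using Flux

# pre-0.12 (deprecated): Dense(10, 5, relu; initW = Flux.glorot_uniform)
# v0.12 onwards:
Dense(10, 5, relu; init = Flux.glorot_uniform)
```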

> The same was the case with the normalization PR, which still needs to pass tests before merging for 0.12.

Since that PR has been open for more than a month, it would be nice if you could wrap it up and avoid derailing it with questionable changes (read: bugs) to the activation state and to the returned parameters. In any case, that shouldn't block v0.12, since it is an internal refactoring.

@DhairyaLGandhi
Member

DhairyaLGandhi commented Mar 11, 2021

The tests are pretty standard and can be found here: https://gist.github.com/DhairyaLGandhi/12d06b9389e13130cf4c03c67885be91

CUDA uses Manifests for exactly this reason as well.

@darsnack
Member

The PR seems fine to me, but let's address the regressions. Either we use ModelZooKeeper or we need to put together a temporary repo with all the tests (so that we can all debug).

@CarloLucibello
Member Author

CarloLucibello commented Mar 12, 2021

> The tests are pretty standard and can be found here: https://gist.github.com/DhairyaLGandhi/12d06b9389e13130cf4c03c67885be91

This script cannot be run as-is (e.g. Metalhead can't be installed). Also, there is a subtle bug: during the benchmarks new models are created, but gradients are computed with respect to the old params, so the gradient is nothing.

I fixed the script and cannot see any performance difference between master and the latest release.
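
For illustration, a minimal sketch of the kind of mistake described above (a hypothetical toy model, not the actual gist code), together with the fix:

```julia
using Flux, BenchmarkTools

make_model() = Dense(10, 2)
x  = rand(Float32, 10, 8)
m  = make_model()
ps = Flux.params(m)

# buggy: a *new* model is built inside the benchmark, so its parameters are
# not in `ps` and the recorded gradients are `nothing`
@btime gradient(() -> sum(make_model()($x)), $ps)

# fixed: differentiate the model whose parameters are actually tracked
@btime gradient(() -> sum($m($x)), $ps)
```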

> The PR seems fine to me, but let's address the regressions. Either we use ModelZooKeeper or we need to put together a temporary repo with all the tests (so that we can all debug).

Regressions should be addressed, but I cannot see any. If someone (@DhairyaLGandhi?) claims to have spotted a bug or perf regression, they should open a corresponding issue and provide a reproducible example to be helpful.

I'm opening a PR adding a benchmark script to a perf/ folder here, but we should definitely think of a longer-term solution.
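
For reference, a rough sketch of what such a perf script could look like (a small VGG-style model for brevity; the actual script in the perf PR benchmarks VGG16 and may differ):

```julia
using Flux, BenchmarkTools

# small VGG-style block: 3x3 conv + relu, then 2x2 max-pool
vgg_block(cin, cout) = Chain(Conv((3, 3), cin => cout, relu; pad = 1), MaxPool((2, 2)))

model = Chain(
    vgg_block(3, 16),
    vgg_block(16, 32),
    vgg_block(32, 64),
    Flux.flatten,
    Dense(64 * 4 * 4, 10),
)

x  = rand(Float32, 32, 32, 3, 16)   # batch of 16 RGB 32x32 inputs
ps = Flux.params(model)
loss(m, x) = sum(m(x))

println("forward");  @btime $model($x)
println("backward"); @btime gradient(() -> loss($model, $x), $ps)
```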

@CarloLucibello
Member Author

In any case, this PR can be merged; the release is not supposed to be tagged right now.

Member

@darsnack darsnack left a comment

If this isn't a PR to tag a release, then the changes are fine.

```diff
@@ -1,6 +1,6 @@
 name = "Flux"
 uuid = "587475ba-b771-5e3f-ad9e-33799f191a9c"
-version = "0.12.0-dev"
+version = "0.12.0"
```
Member

Do we leave this as-is? Or should we do a separate commit to change it and tag that commit for the release?

@DhairyaLGandhi
Member

Thanks for pointing out the bug. I still see numbers similar to what I reported earlier after fixing it. The script does require adding the appropriate compat bounds to Metalhead in a local clone (and possibly removing the -dev suffix from the version number in the Flux clone). The gradients wouldn't have affected the runtime anyway.
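
Roughly, that local setup would look something like this (the paths are placeholders):

```julia
using Pkg
Pkg.develop(path = "path/to/Flux")       # clone with the -dev suffix dropped from the version
Pkg.develop(path = "path/to/Metalhead")  # clone with a widened Flux compat bound
```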

@darsnack
Member

Even without #1535, can we put these benchmarks in a repo? It can just be quick and dirty, so that we can all commit/clone the same code for testing. It doesn't have to be a permanent solution, just something that makes us all more comfortable about the release and keeps us on the same page.

@CarloLucibello
Member Author

bors r+

@bors
Contributor

bors bot commented Mar 12, 2021

Build succeeded:

@bors bors bot merged commit d9d03db into master Mar 12, 2021
bors bot added a commit that referenced this pull request Mar 14, 2021
1535: add vgg16 performance script r=CarloLucibello a=CarloLucibello

We could start collecting a few (possibly more granular) perf scripts, as a first crude benchmarking facility. 

Related discussion in #1527 .

On my laptop, performance is the same for Flux master and v0.11.6.

```julia
# Julia 1.6-rc1, Flux v0.11.6, CUDA v2.6.1, Zygote v0.6.3, NNlib v0.7.16
CPU benchmark
  forward
  1.752 s (915 allocations: 288.05 MiB)
  backward
  4.166 s (20840 allocations: 1.01 GiB)
  forw and back
  6.117 s (51836 allocations: 1.63 GiB)
CUDA benchmark
  forward
  21.032 ms (4115 allocations: 156.30 KiB)
  backward
  139.990 ms (9218 allocations: 365.88 KiB)
  forw and back
  168.142 ms (19244 allocations: 2.57 MiB)
```
```julia
# Julia 1.6-rc1, Flux master, CUDA v2.6.1, Zygote v0.6.3, NNlib v0.7.16
CPU benchmark
  forward
  1.682 s (1123 allocations: 220.57 MiB)
  backward
  4.210 s (20153 allocations: 1.08 GiB)
  forw and back
  6.382 s (32100 allocations: 1.69 GiB)
CUDA benchmark
  forward
  21.704 ms (4052 allocations: 149.59 KiB)
  backward
  139.710 ms (9832 allocations: 377.77 KiB)
  forw and back
  169.053 ms (20759 allocations: 2.32 MiB)
```


Co-authored-by: Carlo Lucibello <[email protected]>
@DhairyaLGandhi
Member

Flux is also not compatible with CUDA#master on Julia 1.6.

@CarloLucibello
Member Author

Yeah, this is known and nothing that concerns v0.12. CUDA's next release will be breaking; we have to register NNlibCUDA.

@CarloLucibello CarloLucibello deleted the cl/v012 branch April 7, 2022 07:03