
Performance issue when calculating loss #1255

Closed
findmyway opened this issue Jul 1, 2020 · 9 comments

@findmyway
Contributor

I notice that many loss functions in this package are written like this:

mae(ŷ, y) = sum(abs.(ŷ .- y)) * 1 // length(y)

mse(ŷ, y) = sum((ŷ .- y).^2) * 1 // length(y)

msle(ŷ, y; ϵ=eps(eltype(ŷ))) = sum((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)).^2) * 1 // length(y)

They all multiply by 1 // length(y) instead of dividing. Originally I thought it was redundant, but I recently hit a performance issue and found that writing it this way avoids the slowdown. Can anyone explain why we need this?

@CarloLucibello
Member

Can you post some benchmarks?

@DhairyaLGandhi
Member

That was done to ensure type stability and to prevent the unnecessary type promotions that can occur when mixing Float64 and Float32, for example.
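
A minimal sketch of the promotion behaviour this comment alludes to (the variable names are illustrative, not from Flux; note that // parses with higher precedence than *, so sum(...) * 1 // length(y) means sum(...) * (1 // length(y))):

```julia
# The forward value has the same type either way:
s = 1.5f0            # a Float32 "sum", standing in for sum(abs.(ŷ .- y))
n = 4                # an Int, standing in for length(y)
@assert typeof(s / n) == Float32
@assert typeof(s * 1 // n) == Float32

# The difference shows up when gradients are scaled by the constant factor:
# 1 / n is a Float64, so multiplying Float32 gradients by it promotes them,
# while 1 // n is a Rational{Int}, which leaves them as Float32.
g = 1.0f0            # an illustrative Float32 gradient entry
@assert typeof(g * (1 / n)) == Float64
@assert typeof(g * (1 // n)) == Float32
```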

@findmyway
Contributor Author

> That was done to ensure type stability and prevent unnecessary type promotions that can occur by mixing Float64 and Float32 for example

I'm afraid it's not related to Float64 and Float32 here.

> Can you post some benchmarks?

Here's the example:

using Flux, BenchmarkTools

m = Dense(32, 32)
x = rand(Float32, 32, 32)
y = rand(Float32, 32)
@benchmark gradient(Flux.params(m)) do
    sum(abs.(m(x) .- y)) / length(y)
end
BenchmarkTools.Trial: 
  memory estimate:  114.19 KiB
  allocs estimate:  3197
  --------------
  minimum time:     136.582 μs (0.00% GC)
  median time:      140.018 μs (0.00% GC)
  mean time:        153.322 μs (7.62% GC)
  maximum time:     12.375 ms (95.82% GC)
  --------------
  samples:          10000
  evals/sample:     1
@benchmark gradient(Flux.params(m)) do
    sum(abs.(m(x) .- y)) * 1 // length(y) 
end
BenchmarkTools.Trial: 
  memory estimate:  97.72 KiB
  allocs estimate:  3196
  --------------
  minimum time:     78.068 μs (0.00% GC)
  median time:      82.311 μs (0.00% GC)
  mean time:        94.679 μs (11.36% GC)
  maximum time:     12.579 ms (96.22% GC)
  --------------
  samples:          10000
  evals/sample:     1

Note that as the model becomes larger, the performance difference quickly grows (in my case, it is about two orders of magnitude).

@CarloLucibello
Member

Hmm, I cannot reproduce:

julia> @btime gradient(Flux.params(m)) do
           sum(abs.(m(x) .- y)) / length(y) 
       end
  74.090 μs (3192 allocations: 97.61 KiB)
Grads(...)

julia> @btime gradient(Flux.params(m)) do
           sum(abs.(m(x) .- y)) * 1 // length(y) 
       end
  74.381 μs (3195 allocations: 97.73 KiB)
Grads(...)

@CarloLucibello
Member

Actually, my last measurement was on Zygote 0.4.20; on newer Zygote versions I can reproduce the performance difference. @oxinabox, could this be due to ChainRules?

@CarloLucibello
Member

This is quite relevant, since I removed all the 1 // length(y) factors in #1150.

@oxinabox
Member

oxinabox commented Jul 1, 2020

Hard to say without a lot more information.
It shouldn't be.

@CarloLucibello
Member

Using mean instead is fine. This is good, because that's now the default way we perform aggregation in losses:

julia> using Flux, BenchmarkTools

julia> m = Dense(32, 32)
Dense(32, 32)

julia> x = rand(Float32, 32, 32);

julia> y = rand(Float32, 32);

julia> @btime gradient(Flux.params(m)) do
           sum(abs.(m(x) .- y)) / length(y) 
       end
  124.031 μs (3197 allocations: 114.19 KiB)
Grads(...)

julia> @btime gradient(Flux.params(m)) do
           sum(abs.(m(x) .- y)) * 1 // length(y) 
       end
  73.434 μs (3196 allocations: 97.72 KiB)
Grads(...)

julia> @btime gradient(Flux.params(m)) do
           mean(abs.(m(x) .- y)) 
       end
  73.004 μs (3190 allocations: 105.67 KiB)
Grads(...)
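
A sketch of what mean-based aggregation looks like for the losses quoted at the top of the thread (these definitions are illustrative, not Flux's actual source; mean divides by the length internally, so no explicit 1 // length(y) is needed):

```julia
using Statistics  # provides mean

# Illustrative mean-based variants of mae and mse from the original post.
mae_mean(ŷ, y) = mean(abs.(ŷ .- y))
mse_mean(ŷ, y) = mean((ŷ .- y) .^ 2)

ŷ = Float32[1, 2, 3]
y = Float32[1, 4, 3]
@assert mae_mean(ŷ, y) == 2.0f0 / 3
@assert mse_mean(ŷ, y) == 4.0f0 / 3
@assert mae_mean(ŷ, y) isa Float32  # aggregation preserves the element type
```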

@CarloLucibello
Member

CarloLucibello commented Jul 11, 2020

The integer division problem has been fixed in ChainRules:

julia> using Flux, BenchmarkTools
[ Info: Precompiling Flux [587475ba-b771-5e3f-ad9e-33799f191a9c]

julia> m = Dense(32, 32)
Dense(32, 32)

julia> x = rand(Float32, 32, 32);

julia> y = rand(Float32, 32);

julia> @btime gradient(Flux.params(m)) do
           sum(abs.(m(x) .- y)) / length(y) 
       end
  75.177 μs (3193 allocations: 105.61 KiB)
Grads(...)

julia> @btime gradient(Flux.params(m)) do
           sum(abs.(m(x) .- y)) * 1 // length(y) 
       end
  75.825 μs (3196 allocations: 105.72 KiB)
Grads(...)

julia> @btime gradient(Flux.params(m)) do
           mean(abs.(m(x) .- y)) 
       end
  74.869 μs (3190 allocations: 113.67 KiB)
Grads(...)
