Fix some issues with Zeros option 2 #1379

Merged: 2 commits into FluxML:master, Nov 9, 2020

Conversation

DrChainsaw
Contributor

@DrChainsaw commented Nov 1, 2020

Fixes #1332, #1277

This is take two on fixing the above issues and is intended to be mutually exclusive to #1374.

In this version Zeros is no longer an AbstractArray, and is thus more like a there-is-no-such-parameter marker than a this-parameter-is-an-immutable-array-of-zeros type.

I made the change so that the advertised way of disabling bias is through bias=false rather than bias=Zeros().
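For illustration, this is roughly what it looks like from the user side (a sketch only; exact keyword coverage per layer is part of the discussion below):

Dense(3, 2, relu)                    # default: trainable bias
Dense(3, 2, relu; bias = false)      # no bias parameter at all
Conv((3, 3), 1 => 8; bias = false)   # the same switch for conv-style layers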

Zeros itself is a lot less capable here compared to the previous attempt, but that also keeps the number of moving parts down, i.e. there is no need to test that both the 0-dim version and the full-size version work the same. All in all there are fewer specializations compared to #1374.

I think that a rename could definitely be in order. Better names I can think of are bias=Off, bias=None, or, as suggested by @mcabbott, bias=False().

The best argument against renaming that I can think of is that keeping the name is a little bit less breaking.

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

@CarloLucibello
Member

What about the bias=false approach instead? Do you think it could provide further simplifications while being flexible enough?

@DrChainsaw
Contributor Author

@CarloLucibello I guess that would be an option 3 then :)

The reason I did not go for it is that it reintroduces the ambiguity as to whether the layer has a fixed constant bias or just doesn't have one at all.

For instance, allowing it would almost certainly mean that someone will put bias=true thinking that means "yes, I want my layer to have a trainable bias".

Allowing for scalar biases would increase the flexibility of course, but it would imo also increase the number of moving parts (a.k.a. the error surface). The question is whether non-zero scalar biases are useful enough to allow for this.

If we only allow scalar zeros then Bool has one too many possible values. I was toying with the idea of using I from LinearAlgebra, but I didn't pursue it (I don't even remember whether it acts as the identity for addition).

One example I ran into where it could be problematic is the destructure function, which returns all parameters of the model concatenated into a (mutable) array and also allows recreating the model from them. The question then arises whether scalars should somehow be special-cased here, while with the approach that the layer just doesn't have a bias parameter the choice of what to return is trivial.

As a detail which is, at least to me, not insignificant: having our own type means we can specialize arithmetic ops to be (essentially) zero cost. Broadcasted addition with false does not appear to be cheaper than broadcasted addition of a same-sized array of zeros.
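As a rough illustration of what I mean (a simplified stand-in, not the exact definitions in this PR):

struct Zeros end   # marker type: "this layer has no such parameter"

# Because we own the type, these specializations are not type piracy,
# and the "add nothing" case costs (essentially) nothing:
Base.:+(a::AbstractArray, ::Zeros) = a
Broadcast.broadcasted(::typeof(+), a::AbstractArray, ::Zeros) = a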

@DrChainsaw
Contributor Author

Just an addendum to the "there is no bias" story from above: I left the implementations of -, * and / in there basically for legacy reasons.

If we rename this version of Zeros to something which more clearly conveys the idea of "no bias parameter" then they can be removed.

@DhairyaLGandhi
Member

#873 tried a bunch of these approaches, and it was expressed that Zeros should be able to return true dimension errors, and participate in Base's broadcasting mechanism, hence the need to preserve dimensions etc. We also don't want to have to change every layer to specifically allow for Unions with Zeros (another thing raised in #873), since there are a lot of custom layers out there.

@mcabbott
Member

mcabbott commented Nov 2, 2020

Sure, I guess it's a trade-off: pretending to be an AbstractArray saves effort in some places, but adds it elsewhere. For most layers I would guess it's enough just not to constrain the type of the bias, and let broadcasting sort it out. That seems pretty Julian. Only if you do something more complicated (like perhaps Conv does) will you need to care (and probably enforce in the constructor that it's either false or an array of the right shape).

Here's a quick sketch to see how much works in the simplest false story; destructure, loadparams!, and fmap seem fine. Are there other things this misses? (Have not tried with a GPU.)

using Flux, Zygote, BenchmarkTools  # BenchmarkTools provides the @btime macro used below

struct Dense2{T,S,F}
    W::T
    b::S
    f::F
end

Flux.@functor Dense2

(l::Dense2)(x) = (l.f).(l.W * x .+ l.b)

Dcon = Dense2(rand(2,3), randn(2), tanh)
Dsin = Dense2(rand(2,3), false, tanh)

Dcon(rand(3))
Dsin(rand(3)) # ok

params(Dcon) # two parameter arrays, ignores tanh
params(Dsin) # one array
params(Dense2(rand(2,3), fill(10.0), tanh)) # trainable scalar as a 0-array

par1, re1 = Flux.destructure(Dcon); length(par1) == 8
par2, re2 = Flux.destructure(Dsin); length(par2) == 6

re2((1:6)/100)([1,2,3]) 
re1(vcat(1:6, 0,0)/100)([1,2,3]) # same output, but has trainable bias

fmap(f32, Dcon)
fmap(f32, Dsin) # ok, retains false & tanh

Flux.loadparams!(Dcon, [ones(2,3), ones(2)])
Flux.loadparams!(Dsin, [ones(2,3)])
Flux.loadparams!(Dsin, [ones(2,3), ones(2)]) # no error, because it just calls zip

Broadcast.broadcasted(::typeof(+), a::AbstractArray{T}, b::Bool) where {T<:AbstractFloat} = 
    b===false ? a : Broadcast.broadcasted(+, a, one(T)) # many BLAS things have a branch like this, low cost:

@btime sum(r .+ b)  setup=(r=randn(10,10); b=0.0)   # 104.829 ns (1 allocation: 896 bytes)
@btime sum(r .+ b)  setup=(r=randn(10,10); b=false) #  13.044 ns (0 allocations: 0 bytes)

Zygote.@adjoint Broadcast.broadcasted(::typeof(+), a::AbstractArray{<:Number}, b::Bool) = 
    (a .+ b), da -> (nothing, da, nothing) # elsewhere Zygote does assume Bools can't have gradients

gradient((x,b) -> sum(x .+ b), rand(2), rand(2))[2] == [1,1] # actually Fill
gradient((x,b) -> sum(x .+ b), rand(2), 0.0)[2] == 2
gradient((x,b) -> sum(x .+ b), rand(2), false)[2] === nothing

@btime gradient((x,b) -> sum(x .+ b), r, 0.0)  setup=(r=randn(100,100));   # 6.142 μs (2 allocations: 78.20 KiB)
@btime gradient((x,b) -> sum(x .+ b), r, false)  setup=(r=randn(100,100)); #   815.515 ns (0 allocations: 0 bytes)

For calling Conv layers, the change needed is super-simple. Is that the only place in which reshape(::Zeros) got used? The various constructors of Conv layers might need more thought, but could surely be made non-breaking for user code by allowing Zeros()=false.

function (c::Conv)(x::AbstractArray)
    cdims = DenseConvDims(x, c.weight; stride=c.stride, padding=c.pad, dilation=c.dilation)
    b = c.bias isa AbstractVector ? reshape(c.bias, map(_->1, c.stride)..., :, 1) : c.bias
    (c.σ).(conv(x, c.weight, cdims) .+ b)
end

@DrChainsaw
Contributor Author

But what advantage does using false give over a simple type of our own, then?

I think the (in my eyes) surprising behaviour of bias=true is pretty significant for giving a pleasant user experience. Short-circuiting the broadcasted addition for speed is also type piracy, right?

I also think there is something strange and undefined about the behaviour of destructure if bias=true. I mean, the parameter is there and impacts the result, so perhaps it needs to be included for a lot of the use cases the function is designed for, but otoh we can't accept a restructure where it is changed, or can we?

Pretty much everything done above can also be done for a type of our own, and it will work the same, no?

@mcabbott
Member

mcabbott commented Nov 2, 2020

I feel it's just simpler that there's no type you have to know about. It could all be done with some scalar Zero type, it's just one more thing which might not be necessary... we already have FillArrays.Zeros and ChainRules.Zero floating around.

Agree that constructors which take a keyword argument should certainly not take bias=true to mean literally saving true. Probably it should just be a signal to use the default? So from the perspective of a user who never looks at the internals, there's just a switch which enables/disables. Constant bias=1.0 isn't really something that needs to be supported, I think?

I guess my broadcast optimisation above is indeed piracy. If it's only done within Zygote.@adjoint then it's not, and that's where you'd care about performance anyway. (When not using Zygote, the .+ false should fuse with the σ.(...) broadcast, too.)

Can you explain a bit more what you mean about destructure? Is the concern that some number other than false (say 3.0) won't be treated as a scalar parameter? I thought all learnable parameters had to be in mutable containers; I presume Dense2(rand(2,3), fill(10.0), tanh) would work this way (although 0-arrays sometimes expose bad assumptions elsewhere). Scalars seem to be restored just fine:

Dtri = Dense2(rand(2,3), 3.0, tanh)
params(Dtri) # one array
par3, re3 = Flux.destructure(Dtri);
re3(ones(6)) # restores 3.0

Edit: here's a sketch of how activations could be applied: FluxML/NNlib.jl@master...mcabbott:activate
The function activate!! can re-use memory, and is a non-piratical way to send tanh.(A) to an @avx implementation. The simpler gradient for case b::Bool can be defined in Zygote without concern for whether the type Zero is available.

@DrChainsaw
Contributor Author

I feel it's just simpler that there's no type you have to know about. It could all be done with some scalar Zero type, it's just one more thing which might not be necessary...

I see your point here. To me the focal point of this argument is that the type must have a name and that name is basically going to have to be a synonym for something which already exists (e.g. False, Nothing, Zero, None etc.) which in turn could make it a bit risky to export.

Agree that constructors which take a keyword argument should certainly not take bias=true to mean literally saving true. Probably it should just be a signal to use the default?

That works of course from an API p.o.v., but then you'd end up having to remember to put this logic in every layer, or else there will be (typically silent) unintended results. Maybe not the end of the world, but it also seems unnecessary given the rich set of options the language provides.

One more (admittedly somewhat hyperbolic) argument for a type of our own is that it could provide a decent escape hatch through dispatch, in a way which would require type piracy if a Bool were used.
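To make that concrete, a sketch (names hypothetical; NoBias stands in for whatever the type ends up being called):

struct NoBias end   # hypothetical marker type, i.e. this PR's Zeros under another name

# Downstream code can specialize on it without pirating any methods on Bool:
affine(W, x, ::NoBias)          = W * x
affine(W, x, b::AbstractVector) = W * x .+ b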

I'm also thinking that the Zygote case is still piracy as it will redefine what an adjoint is for a Bool. If some other library needed to make another definition there would be a clash, no? Not asking rhetorically, btw, since the intricacies of things like this are not something I feel fully confident about.

I guess a fused broadcast will remove a lot of the cost of not specializing, assuming the cost is in allocating a new array. Is there a reason why it is not used today in Flux (I'm assuming here that fusing only happens when using the @. macro)?

Constant bias=1.0 isn't really something that needs to be supported, I think?

I agree!

I guess whatever point I was trying to make about destructure is not relevant if we agree that the only scalar value to be allowed for bias is the one which puts the bias out of play (i.e. 0, false, nothing, FluxsOwnNothing). It might not even have been relevant if non-zero scalars were allowed, so let's not dwell on it.

@DhairyaLGandhi
I read through the comments in that PR and I'm sorry if I reopened what was an already closed case. I see you basically went through all the iterations discussed in it :)

I like the array style from some kind of conceptual purity perspective for sure. Is there a way to signal that an array is immutable so that it could be made clear that params are all arrays which are mutable?

Anyways, I don't mean to sound dug in here. As a user I'm happy with whatever solution solves the issues in the end.

@mcabbott
Member

mcabbott commented Nov 2, 2020

I'm also thinking that the Zygote case is still piracy as it will redefine what an adjoint is for a Bool.

It's my understanding that Zygote regards most integers as being real numbers that happen to be human-friendly round numbers, but Bool is special and can be regarded as categorical. It's not 100% sorted, but that's the idea. This isn't enforced, e.g. gradient(sin, true) currently promotes, but is permitted.

julia> Zygote.gradient(sum, [true, false])
(nothing,)

julia> Zygote.gradient(sum, [1, 0])
(2-element FillArrays.Fill{Int64,1,Tuple{Base.OneTo{Int64}}} = 1,)

I guess a fused broadcast will remove a lot of the cost of not specializing, assuming the cost is in allocating a new array. Is there a reason why it is not used today in Flux (I'm assuming here that fusing only happens when using the @. macro)?

Julia normally fuses all broadcasts, independent of whether @. wrote the dots or you did. I think Zygote often un-fuses them, roughly because writing lots of un-fused special cases (like broadcasting tanh) was faster than its handling of the more complicated fused operation. I think. There was talk of replacing this backward-mode broadcast with a forward-mode one, but it hasn't happened.

But I would say it should be OK to test b === false and specialise the broadcast, within Zygote's forward pass. The reason Base can't make x .+ false === x is that someone may assume it's a fresh array that can be mutated without touching x, but Zygote forbids that anyway. Likewise identity.(x) must copy in Base, but need not within Zygote:

julia> @btime identity.(r)  setup=(r=rand(10^6)); # makes a copy, 7MB
  673.357 μs (2 allocations: 7.63 MiB)

julia> @btime identity.(r .+ 0)  setup=(r=rand(10^6)); # still one copy
  743.682 μs (2 allocations: 7.63 MiB)

julia> @btime sum(r)  setup=(r=rand(10^6)); # just reads
  357.805 μs (0 allocations: 0 bytes)

julia> @btime Zygote.gradient(x -> sum(x), r)  setup=(r=rand(10^6)); # makes a Fill
  359.569 μs (0 allocations: 0 bytes)

julia> @btime Zygote.gradient(x -> sum(identity.(x)), r)  setup=(r=rand(10^6)); # still just reads!
  359.039 μs (0 allocations: 0 bytes)

That works of course from an API p.o.v., but then you'd end up having to remember to put this logic in every layer, or else there will be (typically silent) unintended results.

Would be good perhaps to find examples of layers defined elsewhere, to see how big a deal this will be. If my custom layer never had a way to turn off bias then it doesn't care. If I want to add one, is it much more complicated than this?

Dense2(in::Int, out::Int, act::Function=identity; bias::Bool=true, init::Function=randn) = 
    Dense2(init(out,in), bias ? init(out) : false, act)  

@CarloLucibello
Member

If we keep the Zeros type, whether an array or not, I think we should not expose it to users but use a boolean flag as in

Dense2(in::Int, out::Int, act::Function=identity; bias::Bool=true, init::Function=randn)

suggested above. This way we will be able to change the implementation without further breaking changes.

As for the b=false approach, I find it very appealing, except for the piracy

Broadcast.broadcasted(::typeof(+), a::AbstractArray{T}, b::Bool) where {T<:AbstractFloat} = 
    b===false ? a : Broadcast.broadcasted(+, a, one(T)) # many BLAS things have a branch like this, low cost:

If I understand @mcabbott's comment, the piracy could be avoided by just specializing the Zygote adjoint:

Zygote.@adjoint function Broadcast.broadcasted(::typeof(+), a::AbstractArray{<:Number}, b::Bool)
    y = b === false ? a : a .+ b
    y, da -> (nothing, da, nothing)
end

This is good for the training phase, but we lose some performance at inference time.

Maybe having a slim type as in this PR is ok if we want to avoid this and other possible pitfalls of the b=false approach. So my suggestion is to proceed with this PR, but eliminate the use of Zeros in user space.

@mcabbott
Member

mcabbott commented Nov 3, 2020

Yes, I wasn't thinking clearly initially about piracy, that isn't safe at all. But the broadcast adjoint you write ought to be fine. I think the cost of not removing it during inference is really tiny:

julia> @btime identity.(W*x)  setup=(W=rand(64,64); x=rand(64,64););
  10.978 μs (4 allocations: 64.16 KiB)

julia> @btime identity.(W*x .+ false)  setup=(W=rand(64,64); x=rand(64,64););
  11.349 μs (4 allocations: 64.16 KiB)

julia> 11.349 / 10.978 # i.e. 3 percent, at this size, and no extra allocations
1.033794862452177

It might make sense to replace these broadcasts with muladd(W, x, b) or with some activate!!(identity, W*x, b), which can fuse things and reduce allocations. Those might be a little easier if there's no special type; muladd on master already works fine with false.
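For instance, for the Dense2 toy layer above, assuming a Julia version where the array methods of muladd exist (they were landing around now), a sketch could be:

# muladd can fuse the matrix multiplication and the bias addition;
# with l.b === false it degenerates to a plain l.W * x, no special-casing in the layer.
(l::Dense2)(x::AbstractVecOrMat) = (l.f).(muladd(l.W, x, l.b))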

@DhairyaLGandhi
Member

Keeping the Zeros type would be good, since it's fairly low overhead in terms of API, and retroactively also adds the functionality to most custom layers outside of the standard Flux layers. I agree that we need the adjoint for +/- and having the Zeros type also allows for neat dispatching.

@DrChainsaw
Contributor Author

Cool, are we somewhat converged then?

Should the name change?

There are a bunch of tests added in test/utils.jl, which I just copy-pasted from the other PR, testing that various permutations of Zeros have nothing as their gradient. It doesn't hurt to keep them I guess, but technically they test something which is not a strict requirement on this version of Zeros. The new tests for destructure and loadparams! in the same file are still relevant though.

@mcabbott
Member

mcabbott commented Nov 3, 2020

and retroactively also adds the functionality to most custom layers outside of the standard Flux layers

Have you got a link to some such layers? I'm not sure I follow how needing to allow a special type is more flexible than allowing Base types. Maybe you can say which variant of Zeros you mean here?

fairly low overhead in terms of API

Here too, do you mean the Zeros <: AbstractArray (whose overhead includes things like Adapt.jl) or Zeros <: Any (of this PR)?

@jeremiedb
Contributor

jeremiedb commented Nov 6, 2020

Could you clarify whether my understanding is wrong: would this proposal imply that a layer Dense(2, 3, initb=Zeros) means a Dense without bias (i.e. equivalent to non-trainable zeros), whereas Dense(2, 3, initb=zeros), which is the current default behavior, would mean a dense layer whose bias is initialized with zeros but trainable? If so, this would be fairly confusing from my perspective.
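In code, the contrast I mean is roughly:

Dense(2, 3, initb = Flux.Zeros)  # no bias parameter at all (fixed at zero, never trained)
Dense(2, 3, initb = zeros)       # current default: a trainable bias, merely initialised to zero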

Also, although this comes down to personal preference, given that common frameworks such as PyTorch or MXNet all use bias=false as the flag to indicate whether or not a bias is applied, I feel using a different convention is more likely to add than remove potential interpretation conflicts.

@DrChainsaw
Contributor Author

DrChainsaw commented Nov 6, 2020

@jeremiedb : This is fully correct. This is the same as how it is done on current master, btw, as I didn't change the name, just the implementation. In other words, this PR just fixes issues in the existing implementation, namely #1332 and #1277. The original PR which added Zeros is #873.

I also agree that a name change to make it clearer that it means "there is no bias parameter" would be helpful and perhaps worth the breaking change. Do you like any of the options mentioned in the first post?

@mcabbott
Member

mcabbott commented Nov 6, 2020

Good point about initb=Zeros vs initb=zeros. That's definitely an argument for changing the interface to bias=false, regardless of what types the layer ends up containing.

Would the change have to be breaking? If you leave behind a dummy function like Flux.Zeros(s...)=false (with a depwarn I guess) could that work in both Dense(2,d,initb=Zeros) and Conv((2,2), 2=>3, bias=Flux.Zeros())?
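Something like this, as a sketch of the dummy-function idea (assuming the struct itself is gone by then):

function Zeros(args...)
    Base.depwarn("Flux.Zeros is deprecated, please use `bias = false` instead", :Zeros)
    return false
end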

@DrChainsaw
Contributor Author

@mcabbott : I'm thinking that a breaking change maybe isn't so bad and it might be better to just rip off the bandaid, especially since there are already issues with the functionality.

It seems there are many in favour of bias=false. I can make the change in this PR, but it would be good to hear if anyone opposes, so I don't have to go back and forth a lot.

The main drawback imo is that maintainers need to stay alert in code reviews so that new layers implement handling for it. This is however to some extent the same for both current master's Zeros and the Zeros in this branch, due to type constraints.

While it is often good to stick to conventions, I hardly think that older frameworks' tendency to have about 10 (sometimes mutually exclusive) arguments for creating a layer, and about ten times as much if-else-ing in both the constructor and the actual forward pass code, is something to strive for. What got me hooked on Flux in the first place was definitely the absence of all that.

@jeremiedb
Contributor

jeremiedb commented Nov 6, 2020

Just to clarify, I think I'm still a bit puzzled by the approach taken with the Zeros. Is it normal that currently on master, the utility builder wouldn't be working? For example m_cpu = Chain(Dense(3, 1, σ, initb=Flux.Zeros)) would throw the following during training: InexactError: Bool(0.005822220444679259).

I agree that conforming to existing frameworks shouldn't be taken as a guideline. Nonetheless, I'm wondering if relying on type unions involving custom types just to handle whether a bias is included doesn't risk opening a can of worms.

Thinking of it from a plumbing perspective, what downside would come from handling the no-bias case as distinct structs? For example, a DenseNoBias:

struct Dense{F,S <: AbstractArray,T <: AbstractArray}
    W::S
    b::T
    σ::F
end

struct DenseNoBias{F,S <: AbstractArray}
    W::S
    σ::F
end
  
Dense(W, b) = Dense(W, b, identity)
Dense(W) = DenseNoBias(W, identity)

function Dense(in::Integer, out::Integer, σ=identity;
               bias=true, initW=glorot_uniform, initb=zeros)
    # bias=true (the default) gives a trainable bias; bias=false gives the DenseNoBias variant
    if bias
        return Dense(initW(out, in), initb(out), σ)
    else
        return DenseNoBias(initW(out, in), σ)
    end
end

@functor Dense
@functor DenseNoBias

function (a::Dense)(x::AbstractArray)
    W, b, σ = a.W, a.b, a.σ
    σ.(W * x .+ b)
end

function (a::DenseNoBias)(x::AbstractArray)
    W, σ = a.W, a.σ
    σ.(W * x)
end

My only pending concern would be about the behavior of Dense, for which I'm not sure how the constructor should be modified so as not to create confusion by specifying no-bias through initb. Did you have a take on this one?

@DrChainsaw
Contributor Author

Is it normal that currently on master, the utility builder wouldn't be working?

It is normal, unfortunately, and one of the issues fixed in this PR. Zygote defines adjoints for broadcasting with numerical arrays, so even if the Zeros is never used in the actual op, Zygote creates a gradient for it (which ofc can't be applied). Trying to overload all the dispatch stuff was something I attempted in #1374 and it didn't turn out too pretty. I'm not 100% sure this is what you are seeing, but I have added tests for training without bias in this PR.

I'm wondering if relying on type unions involving custom types just to handle whether a bias is included isn't risking to open a can of worms.

It might, but the attempt at avoiding unions is basically what's in #1374. I think other frameworks leave the bias as none or null, and that's also a type union, where a lot of the escape hatches Julia offers for such worm-cans would involve type piracy.

The distinct structs approach sure is clean cut, but it forces duplication. Dense{Array{Float32, 2}, NoBias} is also a distinct type from Dense{Array{Float32, 2}, Array{Float32, 1}}, right?

Coming from an OO background, the thought of something like

struct Biased{T, B<:AbstractArray}
  layer::T
  bias::B
end

(l::Biased)(x) = l.layer(x) .+ l.bias

has certainly struck me, but I reckon it's impractical for many reasons, specialized CuDNN operations being one and API smoothness being another.

My only pending concern would be about the behavior of Dense, for which I'm not sure how the constructor should be modified so as not to create confusion by specifying no-bias through initb. Did you have a take on this one?

Not sure I follow the concern here. Is this related to the distinct structs approach?

@jeremiedb
Contributor

(l::Biased)(x) = l.layer(x) .+ l.bias

Would have been a clever way to handle it! It reminds me a bit of the SkipConnection approach.

Dense{Array{Float32, 2}, NoBias}

Are you referring, for example, to DenseNoBias(W) = DenseNoBias(W, identity), which could have been added to the constructors? Otherwise, I'm not sure what the NoBias type would be referring to. Essentially, the idea was to have the Dense builder return either a Dense or a DenseNoBias depending on whether bias is set to false.

Not sure I follow the concern here. Is this related to the distinct structs approach?

Sorry for the confusion, I was referring to the current PR status, where the initb=zeros vs initb=Zeros syntax may bring some confusion. If we look for consistency with Conv, then initb could be dropped in favor of a bias argument taking the initialization values directly, rather than an initialization function? That would be breaking though. Or maybe FixedZeros would clarify?

@mcabbott
Member

mcabbott commented Nov 7, 2020

Isn't the problem with (l::Biased)(x) = l.layer(x) .+ l.bias that the bias needs to be added before the nonlinearity is applied?

Having DenseNoBias would be very explicit, but won't you then need to write, test, and maintain ConvNoBias and DepthwiseConvNoBias and so on?

bias=false seems like a super-simple interface to use and to read. Agree this really ought to be consistent between Conv and Dense, and that we could dump the initb keyword. If you want some non-standard initialisation it's simpler to just supply the array yourself, rather than work through some special API.
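i.e. something like this, as a sketch of the proposed interface rather than of what's currently on master:

Conv((3, 3), 1 => 8, relu)                        # default: trainable bias
Conv((3, 3), 1 => 8, relu; bias = false)          # no bias at all
Conv((3, 3), 1 => 8, relu; bias = fill(0.1f0, 8)) # supply your own initial bias vector, no initb needed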

I hardly think that older frameworks' tendency to have about 10 (sometimes mutually exclusive) arguments for creating a layer, and about ten times as much if-else-ing in both the constructor and the actual forward pass code, is something to strive for.

Yes, absolutely. Although I think it won't be so hard to juggle things to make all previous-version user-level syntax give a friendly warning (at construction), at least for a while. This isn't (IMO) so important that it should prevent us from ending up in the right place, but one of the big criticisms of Flux is that the model zoo is permanently broken, etc., so if possible it would be nice to print some warnings. (That said, Zeros doesn't appear in the model zoo, and the only initb in a layer definition is not in use, here.)

@DrChainsaw
Contributor Author

Isn't the problem with (l::Biased)(x) = l.layer(x) .+ l.bias that the bias needs to be added before the nonlinearity is applied?

It would indeed make it so that both bias and activation would need to be removed from every layer and activation would be something added separately, just as one today can do Chain(Dense(2,3), relu). I guess one could simplify it further and just do Chain(Dense(2,3), AddBias(3), relu). This was not really a serious proposal though.

While it has a nice DRY and SRP-ish touch to it, as I said, it seems far from practical. I can imagine there are more than a handful of "let's keep everything clean this time" frameworks out there which no one uses that take this approach. Maybe in some distant future, when something XLA-ish just optimizes whatever code you write for whatever compute resources you have available, it could turn out to be useful.

Otherwise, I'm not sure what the NoBias type would be referring to.

It would in this case be a potential new name for the type which in this PR is called Zeros. I was just trying to point out that everything which can be done to separate the behaviour of Dense and DenseNoBias can also be done to separate Dense{..., <:AbstractArray} and Dense{..., NoBias}. Not a major point in the grand scheme of things, as I think the main drawback of Dense and DenseNoBias is the doubled maintenance.

Agree this really ought to be consistent between Conv and Dense, and that we could dump the initb keyword.

I agree with this. I'm thinking about trying to keep PRs small, so perhaps we just focus on fixing the issues here and then do another round to smooth out the rough edges around the API. There obviously needs to be some kind of direction and I'm not opposing adding the bias=false API in this PR if that is what the majority wants and there is no showstopping argument against it.

@CarloLucibello
Member

So let's just switch to the bias=false API here and unexpose the Zeros type (i.e. remove any reference to it in the docs). We can implement @mcabbott's b === false suggestion in a later (non-breaking) PR if we choose to.

@DrChainsaw
Contributor Author

I think I'm all set as long as bors is happy.

@CarloLucibello
Member

nice, thanks!

bors r+

@bors
Contributor

bors bot commented Nov 9, 2020

Build succeeded:

@bors bot merged commit 09764f8 into FluxML:master Nov 9, 2020
@mcabbott mentioned this pull request Nov 9, 2020
@DhairyaLGandhi
Member

I'm not completely in favour of this API change here; we haven't particularly added conceptual clarity over the Boolean flag. And given that this restricts functionality, we need to be able to discuss bringing the generic function back.

@DrChainsaw
Contributor Author

@DhairyaLGandhi : Fwiw, bias=Flux.Zeros() still works for all layers, but maybe that is not what you meant by it restricting functionality.

@DhairyaLGandhi
Member

Using this with mismatched shapes, and saving and loading parameters, would cause the same issues, with fewer ways of justifying them too.

Having the necessary constructors and adjoints is well and good (necessary, even); I am trying to understand how this proposal supersedes the previous design.

@DhairyaLGandhi
Member

DhairyaLGandhi commented Nov 9, 2020

Really, decoupling the outer user-facing API from the implementation is something we haven't done much of, but it is common enough in development that I don't have strong opinions on it in this case, as long as it keeps bloat manageable while giving users the option to work with the abstractions they prefer or that are better suited to the work at hand.

@mcabbott
Member

mcabbott commented Nov 9, 2020

I got half-way with a PR re-coupling the internals to the API, but got distracted by tests & docstrings.

But I noticed the following error --- this works on Flux v0.11.2 but not on master, so I presume but have not proven that this PR is to blame:

julia> xs = rand(Float32, 100, 100, 3, 50); 

julia> ConvTranspose((5,5), 3 => 7, stride=2, bias=false)(xs) |> size
(203, 203, 7, 50)

julia> ConvTranspose((5,5), 3 => 7, stride=2)(xs) |> size 
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 7 and 3")
Stacktrace:
 [1] _bcs1 at ./broadcast.jl:501 [inlined]
...
 [9] materialize at ./broadcast.jl:837 [inlined]
 [10] (::ConvTranspose{2,4,typeof(identity),Array{Float32,4},Array{Float32,1}})(::Array{Float32,4}) at /Users/me/.julia/packages/Flux/pdPlo/src/layers/conv.jl:275

@mcabbott
Member

mcabbott commented Nov 9, 2020

@DhairyaLGandhi can you clarify what functionality is restricted? And what do you mean by "the generic function" here? I'm not sure I follow at all.

I think this PR is a step in the right direction, towards a simpler API and a more maintainable pile of objects. Especially around "saving and loading with parameters" which seemed to be one of the places the fake-array Zeros caused problems.

@DhairyaLGandhi
Member

This design makes no checks to ensure the operations it participates in are valid, which can open issues down the line.

"The generic function" I referred to might have been ambiguous; I meant the manner in which Zeros was implemented before this PR was merged.

Taking Zeros() as input actually means users get to opt out of the shape checking, while also having the option of a stricter check by providing valid dimensions.

This PR also doesn't catch the issue with loading model parameters with Zeros in them. This was an error in the previous implementation as well, but can be worked around by making the struct mutable - which isn't much of a problem with the design of the previous Zeros, so to speak, since adding the keyword there is an orthogonal change, and it has the advantage of holding the shape information so the model is actually reproducible once it is loaded.

As mentioned, a lighter API is desirable to me as well, but it should not sacrifice correctness, which would be a bigger concern to me.

@mcabbott
Member

mcabbott commented Nov 9, 2020

Thanks for expanding.

makes no checks to ensure the operations it participates in are valid

Can you point out some checks you'd like to see? Or point out what input is allowed by the current state, which gave a friendly error on construction before? We could always catch more things, especially if they are mistakes people are apt to make. (Often these are easier for someone besides the author to see.)

This PR also doesn't catch the issue with loading model parameters with Zeros in them. This was an error in the previous implementation as well

Do you mean loading a model saved from an earlier version? That might not be trivial to support, but if it is meant to be, we should start by saving some test cases somewhere.

Or are there particular cases of Flux.destructure or Flux.loadparams! round-tripping which now fail? I confess that I have not yet taken the tests apart in detail here. But the simpler the design we adopt, the easier this will be. The false design above seems to work automatically.

This was an error in the previous implementation as well, but can be worked around by making the struct mutable

Which struct are you referring to here?

the advantage of holding the shape information so the model is actually reproducible

Are there cases where the length of the bias vector cannot be inferred from the shape of the weight matrix? The constructors all seem to have no problem doing this. Or what other shape information are you concerned about reproducing?

@mcabbott
Member

mcabbott commented Nov 9, 2020

FWIW, this is how far I got with a PR storing false:

master...mcabbott:nobias

Here the function create_bias explicitly checks the size of a provided bias vector. Flux.Zeros(...) prints a warning but should produce valid code from its old usages. Lots of types are tightened from ::Union{Bool, Zeros, AbstractVector{T}} to ::Union{Bool, AbstractVector{T}}. But the tests aren't sorted out yet.
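Roughly, the helper looks something like this (paraphrased; the branch linked above has the details):

# Sketch of the create_bias idea: turn the user-facing bias argument into what the layer
# stores, checking the size of an explicitly supplied vector.
function create_bias(bias::Union{Bool, AbstractVector}, T::Type, dims::Integer...)
    bias === true  && return zeros(T, dims...)
    bias === false && return false
    size(bias) == dims ||
        throw(DimensionMismatch("expected bias of size $dims, got $(size(bias))"))
    return bias
end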

@DhairyaLGandhi
Member

DhairyaLGandhi commented Nov 9, 2020

Can you point out some checks you'd like to see?

The shape checking and broadcasting semantics that Julia expects to have the correct result when a mismatch does take place. This also requires being able to reshape consistently.

Do you mean loading a model saved from an earlier version?

Typically yes, but also equally importantly, consistently representing the information in the stored model such that it can be manifested back to the loaded model.

But the simpler the design we adopt, the easier this will be.

Well, it highly depends. Initially, my thought was that checking for a Bool in the forward pass isn't too bad for a start, but the implications of it are documented in #873. The complexity that control flow adds to AD, and the special handling in every layer, are examples.

The unions in the constructors and the forward passes were also not well received.

Which struct are you referring to here?

The struct Zeros <: AbstractArray ... end needs the mutable keyword there.

@DhairyaLGandhi
Member

On loading and saving - Yes it does get slightly tricky since we need to replace the m in loadparams!(m, ps) to account for the Zeros instead of accumulating the bias with literal 0s. But beyond that, we can preserve the strictness of shape checking via the dimensions.

@DhairyaLGandhi
Member

To be clear, the tests and adjoints from the PR are fine, and while I wish we could preserve the ease of loading with booleans, the tradeoff with correctness is what I'm trying to understand.

@mcabbott
Member

mcabbott commented Nov 9, 2020

The shape checking and broadcasting semantics that Julia expects to have the correct result when a mismatch does take place. This also requires being able to reshape consistently.

But beyond that, we can preserve the strictness of shape checking via the dimensions.

But if there is no bias, there is no shape, it cannot be wrong. Having an array of zeros which don't get trained, or a fake-array which is always zero, is a way to save writing a separate DenseNoBias. This array does have to have the right shape, of course. But with a DenseNoBias, there would obviously be no such correctness check, no chance of a mismatch. There is no "when". And the same is true of either the present scalar Zeros(), or simply false.

Initially, my thought was that checking for a Bool in the forward pass isn't too bad for a start, but the implications of it are documented in #873. The complexity that control flow adds to AD, and the special handling in every layer, are examples.

The only complexity is that, when Conv layers reshape the bias to be in the 3rd dimension etc., they should not do this for a scalar. This is an extremely minor problem: it results in exactly 4 places where you call conv_reshape_bias instead of writing out reshape(c.bias, map(_->1, c.stride)..., :, 1) four times. It might even save characters.
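i.e. something like this (a sketch; the real helper may differ slightly):

# Reshape a bias vector so it broadcasts along the channel dimension of the conv output;
# anything that isn't a vector (false, or a scalar marker type) passes through unchanged.
conv_reshape_bias(c) = c.bias isa AbstractVector ?
    reshape(c.bias, map(_ -> 1, c.stride)..., :, 1) :
    c.bias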

The forward pass isn't an issue; worst case we simply add false, which is just about free in a fused broadcast. Flux is not down to the last 1% of time savings, else it would be extremely careful never to broadcast identity.(x), which is a wasted copy. And that's OK, there is value in being concise and easy to extend.

I don't know what bit of #873 you are thinking of. But we also know more now than we did then, about just how complicated this fake array story ended up. That's what's driving us to simpler alternatives.

The unions in the constructors and the forward passes were also not well received

By whom? Before this PR, there was Conv(;weight::AbstractArray{T,N}, bias::Union{Zeros, AbstractVector{T}}, ...), still a union, seemingly accepted. I think it ought to be Conv(;weight::AbstractArray{T,N}, bias::Union{Bool, AbstractVector{T}}, ...), and while I agree unions are a little ugly, at least ones with Base types are better.

Not sure what you mean by unions in the forward passes.

The struct Zeros <: AbstractArray ... end needs the mutable keyword there

OK, thanks for clarifying. But it wasn't mutable before this PR, nor in the alternative one. Is this an approach you were thinking about exploring, or another PR somewhere else? This still seems like loading on ever more complication to compensate for other complication.

while I wish we could preserve the ease of loading with booleans

But this is exactly what Booleans shine at. The loading is trivially handled. It is preserving this with a complicated Zeros struct which requires jumping through all sorts of hoops.

tradeoff with correctness

Whether you have a FillArray, a custom version, a scalar Zero, false, or DenseNoBias, these can all be mathematically equivalent. So I'm not sure I see what you mean by tradeoff.

On loading and saving - Yes it does get slightly tricky since we need to replace the m in loadparams!(m, ps) to account for the Zeros instead of accumulating the bias with literal 0s.

I don't follow, I thought loadparams! wrote into an existing m, no replacement. Since params(m) contains Arrays, that's where the writing happens. If it does not contain a bias, then there is nothing to write, no problem.

Is your concern here about whether parameters from m1 = Chain(Dense(3,3)) can be loaded into m2 = Chain(Dense(3,3; bias=false)), or v-v? That seems to me like it ought to simply be an error, as they have different length(params(m)), and loadparams! should not attempt to guess what you wanted when mapping one model to a different one.
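To illustrate (hypothetical usage, with the bias=false constructor being discussed here):

m1 = Chain(Dense(3, 3))               # weight + bias -> 2 parameter arrays
m2 = Chain(Dense(3, 3; bias = false)) # weight only   -> 1 parameter array
length(params(m1)), length(params(m2))  # (2, 1): different lengths, so cross-loading should simply error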

Do you mean loading a model saved from an earlier version?
Typically yes, but

If this is a requirement, then it would be great to start setting up some kind of test which does this, and which the tagged version of Flux passes. Then we can build against that. But if not, then it doesn't matter, and shouldn't influence our design.

@DhairyaLGandhi
Member

DhairyaLGandhi commented Nov 9, 2020

Thank you for this!

I think it's important to make sure the argument about correctness is not lost.

But if there is no bias, there is no shape, it cannot be wrong.

It's not about there not being a bias to shape-check; it's about the model passing without the bias where it wouldn't pass had the bias array existed, or vice versa. Both cases should behave identically.

save writing a separate DenseNoBias

I don't think we are considering the DenseNoBias at this point, so that isn't much of a concern for this argument. But existing layers do need to be handled without removing the flexibility of using this type where the shape check may be necessary.

#873 (comment)
#873 (comment)
#873 (comment) are examples where it was expressed that Zeros should behave close to a normal array. Although there are also examples where restricting its use to "adding vectors of zeros" is all we take care of 😛, and the easier broadcasting rules are definitely part of the design consideration.
Having it be an array also helps with type hints for layers, needless to say.

I thought loadparams! wrote into an existing m, no replacement

Correct, but how would you handle loading an existing m with parameters that contain Zeros? It's mostly plumbing, so handling that case is a nice bonus.

But it wasn't mutable before this PR, nor in the alternative one

Regardless, it's a known and simple fix to an issue.

The concern is loading the models with multiple layers which have their bias off. And again, while a Boolean would make it easier, it comes at the cost of special casing. It might have only needed a special reshape in Conv, but extending it for other cases is equally easy in both designs.

So I'm not sure I see what you mean by tradeoff

These methods here

+(::Zeros, b::AbstractArray) = b

And

struct Dense{F,S<:AbstractArray,T<:Union{Zeros, AbstractVector}}
and friends in Conv

Conv(;weight::AbstractArray{T,N}, bias::Union{Zeros, AbstractVector{T}}, ...), still a union, seemingly accepted. I think it ought to be Conv(;weight::AbstractArray{T,N}, bias::Union{Bool, AbstractVector{T}}, ...)

Yes, those were, but now we can't assume that biases are a subset of arrays, which Zeros should be.

@DhairyaLGandhi
Member

The way I see it, the biggest complaints were:

  • Loading params - addressed by making Zeros mutable
  • Not having the adjoint for + (fixed by this PR)
  • GPU interactions (likely fixed by CUDA.Adaptor)

@DhairyaLGandhi
Member

DhairyaLGandhi commented Nov 9, 2020

I don't mean to come off as rigid, because there's a LOT of good in this PR, and my thanks and kudos to @DrChainsaw and yourself for validating the motivations behind it, but I also feel it necessary to address the cases which are important to cover.

@DrChainsaw
Contributor Author

DrChainsaw commented Nov 9, 2020

@DhairyaLGandhi No worries from my side, this is how good software is built. I hope I don't come off as rigid either. Just trying to help out here. :)

To me, one of the attractions of open source is that one (at least I, who am not on a payroll for this task) can afford to dwell on decisions like this.

Loading params - addressed by making Zeros mutable

I guess this also requires preventing the loading functionality from trying to mutate the Zeros. The use case of just copying over the parameters from one model with biased layers to another which doesn't have biases is probably going to be a bit ambiguous no matter what approach one takes. They are in essence different models imo.

Not sure how relevant, but in the case of destructure it seems like the contract is that the returned parameters can be modified and a new model can be built from them. So should modified zeros result in a trainable bias materializing, or should they be ignored? Or perhaps be replaced by a non-trainable array with the new values?

The only use case I could think of for that method is the adjoint method for backprop through an ODE solver, from the paper which popularized neural ODEs. For that use case I think it is preferable not to return non-trainable parameters. I don't know if there would be any harm in returning zeros and then ignoring whatever values come back, though (except wasted computation).

Not sure if worth mentioning, but I guess that a fixed AbstractArray type with the correct eltype and size does still fit in the type constraints, or?

@DhairyaLGandhi
Member

I reckon not returning the Zeros would be preferable for NODEs, although I'd check with Chris Rackauckas first to be sure.

guess that a fixed AbstractArray type with the correct eltype and size does still fit in the type constraints, or?

Not sure what you're pointing at here?


Successfully merging this pull request may close these issues.

GPU error when using Zeros() as bias in Conv layer