- Self-Attention is a great way to give attention to specific regions of an image. However, it is location-independent: it does not take the position of a feature into account.
- ReLU, GELU, Swish and LeakyReLU perform very well on real data, but share a significant issue pointed out in the paper that introduced SELU: none of them has a zero mean. In addition, all of the functions above suffer from severe gradient issues, just as tanh does. However, tanh, the zero-mean counterpart of sigmoid, performs significantly better than sigmoid. Therefore, I searched for a nonlinearity that has a zero mean and neither vanishing nor exploding gradients or values.
What it does
- Instead of using features to give attention to specific regions of an image, independent of their location, we use location-sensitive convolutions to give attention to specific features.
- It approximates the shape of the square root while avoiding the gradient issues of actually using a square root. It also keeps the shape of the gradient simple, while maintaining all of the above properties.
How I built it
- Feature-Attention is based on factorized convolutions, allowing a network to be fully-connected with the entire image without wasting a lot of resources.
- The first candidate was f(x) = x^(1/a) (initially with a = 3). Its gradient issues come from inputs between -1 and 1, so the input to f had to be kept >= 1. Therefore I chose the simplest base function that is close to abs(x) while having a stable gradient, b(x) = x^2. Adding 1 to the base function makes it a valid input for f, but the result is always positive, which loses the sign of x. To restore it, I multiplied the output of f by tanh, which acts as a continuous stand-in for the sign function. In the end, we have RootTanh(x, a) = (x^2+1)^(1/a)*tanh(x) where a >= 3. In my testing, 4 seemed to be the most stable.
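The resulting activation fits in a few lines of plain Python (a PyTorch version would just swap in the corresponding torch ops); `root_tanh` is my name for it here:

```python
import math

def root_tanh(x, a=4):
    """RootTanh(x, a) = (x^2 + 1)^(1/a) * tanh(x).

    (x^2 + 1) keeps the root's input >= 1, avoiding the unstable gradient
    of a bare root near zero, while tanh(x) restores the sign of x and
    keeps the output zero-centered.
    """
    return (x * x + 1) ** (1.0 / a) * math.tanh(x)
```

For large |x| this grows roughly like |x|^(2/a) * sign(x), i.e. like a square root for a = 4, so neither the values nor the gradients explode.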
Challenges I ran into
- Not enough GPU memory
- Inception slowing down the model and decreasing its performance, resulting in noise
- Factorization (Inception-3C) screwing up the gradients, resulting in noise
- BatchNorm right before adding the residual, resulting in noise
- Swish and ReLU having dying gradients, resulting in degrading performance after initial epochs
- LeakyReLU having exploding values with exploding inputs, resulting in unstable gradients (and noise)
- X-Attention being close to zero, resulting in unstable gradients, NaN and black outputs
- AdaBound slowly morphing into SGD, an algorithm unsuitable for training GANs
Accomplishments that I'm proud of
- Feature-Attention is the inverse of Self-Attention, and therefore complements it in building a seemingly fully-connected fully-convolutional network. Using this, counting and similar tasks become trivial, without spending millions in training.
- RootTanh significantly outperforms its competitors on this test dataset
- Fast convergence with tiny amounts of training data on low-end CPUs (no GPU).
- Decent images after an hour with a 1660 (see attachment)
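The "seemingly fully-connected" behavior of Feature-Attention can be illustrated with a toy sketch (my own simplification, not the project's actual layer): one circular 1D convolution over rows followed by one over columns gives every output pixel a receptive field covering the entire H x W image with only H + W weights instead of H * W, and a sigmoid of that mix gates the input features:

```python
import math

def feature_attention(img, w_v, w_h):
    """Toy factorized attention: img is an H x W grid, w_v has H weights,
    w_h has W weights. Two circular 1D convolutions mix every pixel with
    every other pixel; a sigmoid of that mix gates the original features."""
    H, W = len(img), len(img[0])
    # Vertical circular 1D conv: each output row mixes all input rows.
    v = [[sum(w_v[(r - i) % H] * img[r][j] for r in range(H)) for j in range(W)]
         for i in range(H)]
    # Horizontal circular 1D conv: each output column mixes all columns,
    # so every entry now depends on the whole image.
    mix = [[sum(w_h[(c - j) % W] * v[i][c] for c in range(W)) for j in range(W)]
           for i in range(H)]
    # Sigmoid gate applied to the input features.
    return [[img[i][j] / (1.0 + math.exp(-mix[i][j])) for j in range(W)]
            for i in range(H)]
```

With all weights nonzero, perturbing any single input pixel changes every output value, i.e. global connectivity at 1D-convolution cost.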
What I learned
During this time I've read an incredible number of papers, constantly trying to improve the outputs of my GAN. I've learned that there are several ways to make any convolutional network stable, which can and should be used together.
- BatchNorm will always make it more stable, but it can hurt the convergence. When applied sparingly, it can be very powerful.
- SpectralNorm is a requirement when trying to avoid gradient issues, which are bound to happen with attention layers.
- Consistency Regularization works better than gradient penalty, but both significantly improve the overall performance of a GAN.
What's next for LocAtE
- Implementing RevNet by tricking PyTorch's gradient computation.
- Waiting for the 128x128 test to finish
- Testing it on 1024x1024 images on a V100