Commit

Tutorial 4: Proofreading and finishing conclusion
phlippe committed Oct 25, 2020
1 parent ed5e090 commit 2123525
Showing 3 changed files with 1,495 additions and 21 deletions.
@@ -1063,7 +1063,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Each linear layer has a weight matrix of the shape `[output, input]`, and a bias of the shape `[output]`. The tanh activation function does not have any parameters."
"Each linear layer has a weight matrix of the shape `[output, input]`, and a bias of the shape `[output]`. The tanh activation function does not have any parameters. Note that parameters are only registered for `nn.Module` objects that are direct object attributes, i.e. `self.a = ...`. If you define a list of modules, the parameters of those are not registered for the outer module and can cause some issues when you try to optimize your module. There are alternatives, like `nn.ModuleList`, `nn.ModuleDict` and `nn.Sequential`, that allow you to have different data structures of modules. We will use them in a few later tutorials and explain them there. "
]
},
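To make the point about parameter registration concrete, here is a minimal sketch (not part of the notebook; the class names are made up for illustration) contrasting a plain Python list with `nn.ModuleList`:

```python
import torch.nn as nn

class BadNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: the layers are NOT registered as submodules,
        # so their parameters are invisible to .parameters() and to optimizers.
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

class GoodNet(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers every layer, so all parameters are tracked.
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

print(len(list(BadNet().parameters())))   # 0
print(len(list(GoodNet().parameters())))  # 6 (3 weight matrices + 3 biases)
```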
{
@@ -106,7 +106,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In the last part of the notebook, we will train models with different optimizers. The pretrained models for those are downloaded below."
"In the last part of the notebook, we will train models using three different optimizers. The pretrained models for those are downloaded below."
]
},
{
@@ -149,8 +149,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Throughout this notebook, we will use a deep fully connected network, similar to our previous tutorial. We will also again apply the network to FashionMNIST, so you can relate to the results of Tutorial 2. \n",
"\n",
"Throughout this notebook, we will use a deep fully connected network, similar to our previous tutorial. We will also again apply the network to FashionMNIST, so you can relate to the results of Tutorial 3. \n",
"We start by loading the FashionMNIST dataset:"
]
},
@@ -165,7 +164,7 @@
"\n",
"# Transformations applied on each image => first make them a tensor, then normalize them with mean 0 and std 1\n",
"transform = transforms.Compose([transforms.ToTensor(),\n",
" transforms.Normalize((0.2861,), (0.3530))\n",
" transforms.Normalize((0.2861,), (0.3530,))\n",
" ])\n",
"\n",
"# Loading the training dataset. We need to split it into a training and validation part\n",
@@ -187,7 +186,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The normalization transformation `transforms.Normalize` is designed to give us an expected mean of 0 and standard deviation of 1 across pixels. You can calculate these parameters by determining the mean and standard deviation on the original images:"
"In comparison to the previous tutorial, we have changed the parameters of the normalization transformation `transforms.Normalize`. The normalization is now designed to give us an expected mean of 0 and standard deviation of 1 across pixels. This will be particularly relevant for the disucssion about initialization we will look at below, and hence we change it here. It should be noted that in most classification tasks, both normalization techniques (between -1 and 1 or mean 0 and stdev 1) have shown to work well.\n",
"We can calculate the normalization parameters by determining the mean and standard deviation on the original images:"
]
},
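As a rough sketch of where the values 0.2861 and 0.3530 come from (assuming FashionMNIST is stored under a hypothetical `../data` root; the notebook's own dataset path and code may differ), the statistics are computed on the raw training images:

```python
import torch
from torchvision import datasets, transforms

# Load FashionMNIST with only ToTensor, i.e. pixel values in [0, 1]
train_dataset = datasets.FashionMNIST(root="../data", train=True, download=True,
                                      transform=transforms.ToTensor())
imgs = torch.stack([img for img, _ in train_dataset], dim=0)
print("Mean of the input data:", imgs.mean().item())  # approx. 0.2861
print("Std of the input data:", imgs.std().item())    # approx. 0.3530
```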
{
@@ -274,7 +274,7 @@
" layers += [nn.Linear(layer_sizes[layer_index-1], layer_sizes[layer_index]),\n",
" act_fn]\n",
" layers += [nn.Linear(layer_sizes[-1], num_classes)]\n",
" self.layers = nn.ModuleList(layers)\n",
" self.layers = nn.ModuleList(layers) # A module list registers a list of modules as submodules (e.g. for parameters)\n",
" \n",
" self.config = {\"act_fn\": act_fn.__class__.__name__, \"input_size\": input_size, \"num_classes\": num_classes, \"hidden_sizes\": hidden_sizes} \n",
" \n",
@@ -371,7 +371,7 @@
" # Pass one batch through the network, and calculate the gradients for the weights\n",
" model.zero_grad()\n",
" preds = model(imgs)\n",
" loss = F.cross_entropy(preds, labels)\n",
" loss = F.cross_entropy(preds, labels) # Same as nn.CrossEntropyLoss, but as a function instead of module\n",
" loss.backward()\n",
" # We limit our visualization to the weight parameters and exclude the bias to reduce the number of plots\n",
" grads = {name: params.grad.view(-1).cpu().clone().numpy() for name, params in model.named_parameters() if \"weight\" in name}\n",
@@ -426,7 +426,7 @@
"\n",
"Before starting our discussion about initialization, it should be noted that there exist many very good blog posts about the topic of neural network initialization (for example [deeplearning.ai](https://www.deeplearning.ai/ai-notes/initialization/)). In case something remains unclear after this tutorial, we recommend skimming through these blog posts as well.\n",
"\n",
"When initializing a neural network, there are a few properties we would like to have. First, the variance of the input should be propagated through the model to the last layer, so that we have a similar standard deviation for the output neurons. If the variance would vanish the deeper we go in our model, the harder it becomes to optimize the model as the input to the next layer is basically a single constant value. Similarly, if the variance increases, it is likely to explode (i.e. head to infinity) the deeper we design our model. The second property we look out for in initialization techniques is a gradient distribution with equal variance across layers. If the first layer receives much smaller gradients than the last layer, we will have difficulties on choosing an appropriate learning rate. \n",
"When initializing a neural network, there are a few properties we would like to have. First, the variance of the input should be propagated through the model to the last layer, so that we have a similar standard deviation for the output neurons. If the variance would vanish the deeper we go in our model, it becomes much harder to optimize the model as the input to the next layer is basically a single constant value. Similarly, if the variance increases, it is likely to explode (i.e. head to infinity) the deeper we design our model. The second property we look out for in initialization techniques is a gradient distribution with equal variance across layers. If the first layer receives much smaller gradients than the last layer, we will have difficulties on choosing an appropriate learning rate. \n",
"\n",
"As a starting point for finding a good method, we will analyse different initialization based on our linear neural network with no activation function (i.e. an identity). We do this because initializations depend on the specific activation function used in the network, and we can adjust the initialization schemes later on for our specific choice."
]
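As a small, self-contained sketch of such an analysis (not the notebook's visualization code; the helper name is made up), the snippet below pushes random inputs through a stack of linear layers without activation and reports how the activation variance evolves for a chosen initialization scale:

```python
import torch
import torch.nn as nn

def activation_variances(std, num_layers=6, width=512, batch_size=1000):
    """Initialize all weights from N(0, std^2) and return the activation variance after each layer."""
    torch.manual_seed(42)
    layers = [nn.Linear(width, width, bias=False) for _ in range(num_layers)]
    for layer in layers:
        nn.init.normal_(layer.weight, std=std)
    x = torch.randn(batch_size, width)  # inputs with mean 0 and variance 1
    variances = []
    with torch.no_grad():
        for layer in layers:
            x = layer(x)
            variances.append(x.var().item())
    return variances

print(activation_variances(std=0.1))                 # variance explodes (512 * 0.1^2 > 1)
print(activation_variances(std=0.01))                # variance vanishes (512 * 0.01^2 < 1)
print(activation_variances(std=(1.0 / 512) ** 0.5))  # variance stays close to 1
```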
@@ -24901,9 +24901,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we expected, the variance stays indeed constant across layers. Note that our initialization does not restrict us to a normal distribution, but allows any other distribution with a mean of 0 and variance of $1/d$. You often see a uniform distribution being used for initialization. A small benefit of using a uniform instead normal distribution is that we can exclude the chance of initializing very large or small weights.\n",
"As we expected, the variance stays indeed constant across layers. Note that our initialization does not restrict us to a normal distribution, but allows any other distribution with a mean of 0 and variance of $1/d_x$. You often see a uniform distribution being used for initialization. A small benefit of using a uniform instead normal distribution is that we can exclude the chance of initializing very large or small weights.\n",
"\n",
"Besides the variance of the activations, another variance we would like to stabilize is the one of the gradients. This ensures a stable optimization for deep networks. It turns out that we can do the same calculation as above, and come to the conclusion that we should initialize our layers with $1/d_x$ where $d_y$ is the number of input neurons (you can do the calculation as a practice, it can be done in the exact same way as above). As a compromise between both constraints, [Glorot and Bengio (2010)](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc_location=ufi) proposed to use the harmonic mean of both values. This leads us to the well-known Xavier initialization:\n",
"Besides the variance of the activations, another variance we would like to stabilize is the one of the gradients. This ensures a stable optimization for deep networks. It turns out that we can do the same calculation as above, and come to the conclusion that we should initialize our layers with $1/d_y$ where $d_y$ is the number of input neurons (you can do the calculation as a practice, it can be done in the exact same way as above). As a compromise between both constraints, [Glorot and Bengio (2010)](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc_location=ufi) proposed to use the harmonic mean of both values. This leads us to the well-known Xavier initialization:\n",
"\n",
"$$W\\sim \\mathcal{N}\\left(0,\\frac{2}{d_x+d_y}\\right)$$\n",
"\n",
@@ -41164,7 +41164,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Although the variance decreases over depth, it is apparent that the activation distribution becomes more focused on the low values. Therefore, our variance will stabilize around 0.1 if we would go even deeper. Hence, Xavier initialization works well for Tanh networks. But what about ReLU networks? Here, we cannot take the previous assumption of the non-linearity becoming linear for small values. The ReLU activation function sets (in expectation) half of the inputs to 0, so that also the expectation of the input is not zero. However, as long as the expectation of $W$ is zero and $b=0$, the expectation of the output is zero. The variance is slightly changed, namely to $2/d_x$, which gives us the Kaiming initialization (see [He, K. et al. (2015)](https://arxiv.org/pdf/1502.01852.pdf)). Note that the Kaiming initialization does not use the harmonic mean between input and output size. In their paper (Section 2.2, Backward Propagation, last paragraph), they argue that using $d_x$ or $d_y$ both lead to stable gradients throughout the network, and only depend on the overall input and output size of the network. Hence, we can use here only the input $d_x$:"
"Although the variance decreases over depth, it is apparent that the activation distribution becomes more focused on the low values. Therefore, our variance will stabilize around 0.25 if we would go even deeper. Hence, we can conclude that the Xavier initialization works well for Tanh networks. But what about ReLU networks? Here, we cannot take the previous assumption of the non-linearity becoming linear for small values. The ReLU activation function sets (in expectation) half of the inputs to 0, so that also the expectation of the input is not zero. However, as long as the expectation of $W$ is zero and $b=0$, the expectation of the output is zero. The variance is slightly changed, namely to $2/d_x$, which gives us the Kaiming initialization (see [He, K. et al. (2015)](https://arxiv.org/pdf/1502.01852.pdf)). Note that the Kaiming initialization does not use the harmonic mean between input and output size. In their paper (Section 2.2, Backward Propagation, last paragraph), they argue that using $d_x$ or $d_y$ both lead to stable gradients throughout the network, and only depend on the overall input and output size of the network. Hence, we can use here only the input $d_x$:"
]
},
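Again as a sketch using PyTorch's built-in helper (not necessarily the notebook's own code), Kaiming initialization for a ReLU layer only uses the number of input neurons:

```python
import torch.nn as nn

layer = nn.Linear(256, 128)

# Built-in Kaiming/He initialization: std = sqrt(2 / d_x) with the default mode='fan_in'
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)

# Manual equivalent
d_x = layer.weight.shape[1]
nn.init.normal_(layer.weight, mean=0.0, std=(2.0 / d_x) ** 0.5)
```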
{
@@ -49209,7 +49209,7 @@
" ## Training ##\n",
" ##############\n",
" net.train()\n",
" TP, count = 0., 0\n",
" true_preds, count = 0., 0\n",
" t = tqdm(train_loader_local, leave=False)\n",
" for imgs, labels in t:\n",
" imgs, labels = imgs.to(device), labels.to(device)\n",
@@ -49219,11 +49219,11 @@
" loss.backward()\n",
" optimizer.step()\n",
" # Record statistics during training\n",
" TP += (preds.argmax(dim=-1) == labels).sum().item()\n",
" true_preds += (preds.argmax(dim=-1) == labels).sum().item()\n",
" count += labels.shape[0]\n",
" t.set_description(\"Epoch %i: loss=%4.2f\" % (epoch+1, loss.item()))\n",
" train_losses.append(loss.item())\n",
" train_acc = TP / count\n",
" train_acc = true_preds / count\n",
" train_scores.append(train_acc)\n",
"\n",
" ################\n",
@@ -49270,14 +49270,14 @@
" data_loader - DataLoader object of the dataset to test on (validation or test)\n",
" \"\"\"\n",
" net.eval()\n",
" TP, count = 0., 0\n",
" true_preds, count = 0., 0\n",
" for imgs, labels in data_loader:\n",
" imgs, labels = imgs.to(device), labels.to(device)\n",
" with torch.no_grad():\n",
" preds = net(imgs).argmax(dim=-1)\n",
" TP += (preds == labels).sum().item()\n",
" true_preds += (preds == labels).sum().item()\n",
" count += labels.shape[0]\n",
" test_acc = TP / count\n",
" test_acc = true_preds / count\n",
" return test_acc "
]
},
@@ -108410,9 +108410,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### What optimizer to take (TODO)\n",
"### What optimizer to take\n",
"\n",
"After seeing the results on optimization, what is our conclusion? Should we always use Adam and never look at SGD anymore? The short answer: no. There are many papers saying that in certain situations, SGD (with momentum) generalizes better where Adam often tends to overfit [5,6]. This is related to the idea of finding wider optima. For instance, see the illustration of different optima below (credit: [Keskar et al., 2017](https://arxiv.org/pdf/1609.04836.pdf)):\n",
"\n",
"<center width=\"100%\"><img src=\"flat_vs_sharp_minima.svg\" width=\"500px\"></center>\n",
"\n",
"What is our conclusion on optimization? Should we always use Adam and never look at SGD anymore? The short answer: no. There are more than enough work saying SGD generalizes better, and especially for simpler models/problems, Adam often overfits. The reason: the goal is to find a wide optima instead of narrow, as wide optima often tend to generalize better. (Show example plot?)"
"The black line represents the training loss surface, while the dotted red line is the test loss. Finding sharp, narrow minima can be helpful for finding the minimal training loss. However, this doesn't mean that it also minimizes the test loss as especially flat minima have shown to generalize better. You can imagine that the test dataset has a slightly shifted loss surfaces due to the different examples than in the training set. A small change can have a significant influence for sharp minima, while flat minima are generally more robust to this change. \n",
"\n",
"In the next tutorial, we will see that some network types can still be better optimized with SGD and learning rate scheduling than Adam. Nevertheless, Adam is the most commonly used optimizer in Deep Learning as it usually performs better than other optimizers, especially for deep networks."
]
},
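For reference, a short sketch of how the three optimizers compared in this section would be set up with `torch.optim` (assuming the built-in implementations are used; the model here is a hypothetical stand-in just to have parameters to optimize):

```python
import torch.nn as nn
import torch.optim as optim

# Toy model for illustration only
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

sgd          = optim.SGD(model.parameters(), lr=1e-2)                # plain SGD
sgd_momentum = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # SGD with momentum
adam         = optim.Adam(model.parameters(), lr=1e-3)               # Adam [3]
```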
{
@@ -108421,10 +108427,22 @@
"source": [
"## Conclusion\n",
"\n",
"In this tutorial, we have looked at initialization and optimization techniques for neural networks. We have seen that initialization have to balance the preservation of the gradient variance as well as the activation variance. This can be achieved with the Xavier initialization for tanh-based networks, and the Kaiming initialization for ReLU-based networks. In optimization, concepts like momentum and adaptive learning rate can help with challenging loss surface, but don't guarantee an increase in performance for neural networks.\n",
"\n",
"\n",
"## References\n",
"\n",
"[1] Glorot, Xavier, and Yoshua Bengio. \"Understanding the difficulty of training deep feedforward neural networks.\" Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010. [link](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)\n",
"\n",
"[2] He, Kaiming, et al. \"Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.\" Proceedings of the IEEE international conference on computer vision. 2015. [link](https://www.cv-foundation.org/openaccess/content_iccv_2015/html/He_Delving_Deep_into_ICCV_2015_paper.html)\n",
"\n",
"[3] Kingma, Diederik P. & Ba, Jimmy. \"Adam: A Method for Stochastic Optimization.\" Proceedings of the third international conference for learning representations (ICLR). 2015. [link](https://arxiv.org/abs/1412.6980)\n",
"\n",
"[4] Keskar, Nitish Shirish, et al. \"On large-batch training for deep learning: Generalization gap and sharp minima.\" Proceedings of the fifth international conference for learning representations (ICLR). 2017. [link](https://arxiv.org/abs/1609.04836)\n",
"\n",
"[5] Wilson, Ashia C., et al. \"The Marginal Value of Adaptive Gradient Methods in Machine Learning.\" Advances in neural information processing systems. 2017. [link](https://papers.nips.cc/paper/7003-the-marginal-value-of-adaptive-gradient-methods-in-machine-learning.pdf)\n",
"\n",
"## References"
"[6] Ruder, Sebastian. \"An overview of gradient descent optimization algorithms.\" arXiv preprint. 2017. [link](https://arxiv.org/abs/1609.04747)"
]
}
],