Commit

Tutorial 6 (JAX): Clarify initialization of qkv
phlippe committed Apr 2, 2023
1 parent fa80c4d commit 01b1d43
Showing 1 changed file with 2 additions and 1 deletion.
@@ -340,7 +340,8 @@
"\n",
"<center width=\"100%\"><img src=\"../../tutorial6/multihead_attention.svg\" width=\"230px\"></center>\n",
"\n",
"How are we applying a Multi-Head Attention layer in a neural network, where we don't have an arbitrary query, key, and value vector as input? Looking at the computation graph above, a simple but effective implementation is to set the current feature map in a NN, $X\\in\\mathbb{R}^{B\\times T\\times d_{\\text{model}}}$, as $Q$, $K$ and $V$ ($B$ being the batch size, $T$ the sequence length, $d_{\\text{model}}$ the hidden dimensionality of $X$). The consecutive weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$ can transform $X$ to the corresponding feature vectors that represent the queries, keys, and values of the input. Using this approach, we can implement the Multi-Head Attention module below."
"How are we applying a Multi-Head Attention layer in a neural network, where we don't have an arbitrary query, key, and value vector as input? Looking at the computation graph above, a simple but effective implementation is to set the current feature map in a NN, $X\\in\\mathbb{R}^{B\\times T\\times d_{\\text{model}}}$, as $Q$, $K$ and $V$ ($B$ being the batch size, $T$ the sequence length, $d_{\\text{model}}$ the hidden dimensionality of $X$). The consecutive weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$ can transform $X$ to the corresponding feature vectors that represent the queries, keys, and values of the input. Note that commonly, these weight matrices are initialized with the Xavier initialization. However, the layer is usually not too sensitive to the initialization, as long as the variance of $Q$ and $K$ do not become too large.\n",
"With this in mind, we can implement the Multi-Head Attention module below."
]
},
{
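The paragraph changed in this commit describes projecting the feature map $X$ into queries, keys, and values via the weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$, with Xavier initialization for those weights. Below is a minimal Flax sketch of that idea, assuming a single fused qkv projection and a scaled dot-product helper; the names `MultiheadAttention` and `scaled_dot_product` and the example shapes are illustrative and not necessarily identical to the tutorial's own cells.

```python
# Sketch: Multi-Head Attention with a fused, Xavier-initialized qkv projection.
# Assumes embed_dim is divisible by num_heads.
import jax
import jax.numpy as jnp
import flax.linen as nn


def scaled_dot_product(q, k, v):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = q.shape[-1]
    attn_logits = jnp.matmul(q, jnp.swapaxes(k, -2, -1)) / jnp.sqrt(d_k)
    attention = nn.softmax(attn_logits, axis=-1)
    values = jnp.matmul(attention, v)
    return values, attention


class MultiheadAttention(nn.Module):
    embed_dim: int   # output dimensionality d_model
    num_heads: int   # number of parallel attention heads

    @nn.compact
    def __call__(self, x):
        batch_size, seq_len, _ = x.shape
        # One Dense layer computes W^Q, W^K, and W^V jointly; Xavier (Glorot)
        # initialization keeps the variance of Q and K moderate, as noted above.
        qkv = nn.Dense(3 * self.embed_dim,
                       kernel_init=nn.initializers.xavier_uniform(),
                       bias_init=nn.initializers.zeros)(x)
        qkv = qkv.reshape(batch_size, seq_len, self.num_heads, -1)
        qkv = qkv.transpose(0, 2, 1, 3)              # [B, heads, T, 3*head_dim]
        q, k, v = jnp.array_split(qkv, 3, axis=-1)
        values, attention = scaled_dot_product(q, k, v)
        values = values.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, self.embed_dim)
        # Final output projection W^O, also Xavier-initialized.
        out = nn.Dense(self.embed_dim,
                       kernel_init=nn.initializers.xavier_uniform())(values)
        return out, attention


# Example usage with hypothetical shapes: B=2, T=16, d_model=128, 4 heads.
x = jnp.ones((2, 16, 128))
mha = MultiheadAttention(embed_dim=128, num_heads=4)
params = mha.init(jax.random.PRNGKey(0), x)
out, attn = mha.apply(params, x)   # out: [2, 16, 128], attn: [2, 4, 16, 16]
```

Fusing the three projections into one Dense layer of size $3 \cdot d_{\text{model}}$ is equivalent to applying $W^{Q}$, $W^{K}$, and $W^{V}$ separately, but computes all of them in a single matrix multiplication.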
