Tutorial 6: Fixing minor typo in WO dimension
phlippe committed Jan 3, 2023
1 parent 55d7996 commit ffea03e
Showing 2 changed files with 2 additions and 2 deletions.
@@ -337,7 +337,7 @@
"\\end{split}\n",
"$$\n",
"\n",
"We refer to this as Multi-Head Attention layer with the learnable parameters $W_{1...h}^{Q}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{K}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{V}\\in\\mathbb{R}^{D\\times d_v}$, and $W^{O}\\in\\mathbb{R}^{h\\cdot d_k\\times d_{out}}$ ($D$ being the input dimensionality). Expressed in a computational graph, we can visualize it as below (figure credit - [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)).\n",
"We refer to this as Multi-Head Attention layer with the learnable parameters $W_{1...h}^{Q}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{K}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{V}\\in\\mathbb{R}^{D\\times d_v}$, and $W^{O}\\in\\mathbb{R}^{h\\cdot d_v\\times d_{out}}$ ($D$ being the input dimensionality). Expressed in a computational graph, we can visualize it as below (figure credit - [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)).\n",
"\n",
"<center width=\"100%\"><img src=\"../../tutorial6/multihead_attention.svg\" width=\"230px\"></center>\n",
"\n",
@@ -319,7 +319,7 @@
"\\end{split}\n",
"$$\n",
"\n",
"We refer to this as Multi-Head Attention layer with the learnable parameters $W_{1...h}^{Q}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{K}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{V}\\in\\mathbb{R}^{D\\times d_v}$, and $W^{O}\\in\\mathbb{R}^{h\\cdot d_k\\times d_{out}}$ ($D$ being the input dimensionality). Expressed in a computational graph, we can visualize it as below (figure credit - [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)).\n",
"We refer to this as Multi-Head Attention layer with the learnable parameters $W_{1...h}^{Q}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{K}\\in\\mathbb{R}^{D\\times d_k}$, $W_{1...h}^{V}\\in\\mathbb{R}^{D\\times d_v}$, and $W^{O}\\in\\mathbb{R}^{h\\cdot d_v\\times d_{out}}$ ($D$ being the input dimensionality). Expressed in a computational graph, we can visualize it as below (figure credit - [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)).\n",
"\n",
"<center width=\"100%\"><img src=\"multihead_attention.svg\" width=\"230px\"></center>\n",
"\n",
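For context on the dimension being corrected: each attention head produces a $d_v$-dimensional value vector, and the $h$ head outputs are concatenated before the final output projection, so $W^{O}$ must map from $h\cdot d_v$ (not $h\cdot d_k$) to $d_{out}$. Below is a minimal PyTorch-style sketch, independent of the notebook's own implementation, that makes the shapes explicit; all names and sizes are illustrative assumptions.

```python
# Minimal sketch (not the notebook's code) showing why W^O has shape (h*d_v, d_out):
# each head emits d_v features, and the h head outputs are concatenated before
# the final output projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, h, d_k, d_v, d_out = 64, 8, 16, 24, 64    # hypothetical sizes; d_k != d_v on purpose

W_q = nn.Linear(D, h * d_k, bias=False)      # W^Q_{1..h} stacked
W_k = nn.Linear(D, h * d_k, bias=False)      # W^K_{1..h} stacked
W_v = nn.Linear(D, h * d_v, bias=False)      # W^V_{1..h} stacked
W_o = nn.Linear(h * d_v, d_out, bias=False)  # W^O in R^{h*d_v x d_out}, as corrected

x = torch.randn(2, 10, D)                    # (batch, sequence length, D)
B, T, _ = x.shape

q = W_q(x).view(B, T, h, d_k).transpose(1, 2)   # (B, h, T, d_k)
k = W_k(x).view(B, T, h, d_k).transpose(1, 2)   # (B, h, T, d_k)
v = W_v(x).view(B, T, h, d_v).transpose(1, 2)   # (B, h, T, d_v)

attn = F.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)  # (B, h, T, T)
heads = attn @ v                                                # (B, h, T, d_v)
concat = heads.transpose(1, 2).reshape(B, T, h * d_v)           # (B, T, h*d_v)
out = W_o(concat)                                               # (B, T, d_out)
print(out.shape)  # torch.Size([2, 10, 64])
```

With the old $h\cdot d_k\times d_{out}$ shape, the final projection only matches by coincidence when $d_k = d_v$; the sketch deliberately uses $d_k \neq d_v$ so that only the corrected shape produces a valid matrix product.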
