Fixed some small typos (#233)
* Fix small typo

* Fix typo

* Fix more typos

* Fix typos in xai

* Fix typos in attention

* Run pre-commit
RaulPPelaez committed Jan 16, 2023
1 parent 1f95ef0 commit 21138a6
Showing 16 changed files with 32 additions and 27 deletions.
1 change: 1 addition & 0 deletions applied/QM9.ipynb
@@ -430,6 +430,7 @@
"node_feature_len = 16\n",
"msg_feature_len = 16\n",
"\n",
"\n",
"# make our weights\n",
"def init_weights(g, n, m):\n",
" we = np.random.normal(size=(n, m), scale=1e-1)\n",
1 change: 1 addition & 0 deletions dl/Equivariant.ipynb
@@ -971,6 +971,7 @@
"\n",
"def lift(f):\n",
" \"\"\"lift f into group\"\"\"\n",
"\n",
" # create new function from original\n",
" # that is f(gx_0)\n",
" @np_cache(maxsize=W**3)\n",
1 change: 0 additions & 1 deletion dl/Hyperparameter_tuning.ipynb
@@ -376,7 +376,6 @@
"def train_model(\n",
" model, lr=1e-3, Reduced_LR=False, Early_stop=False, batch_size=32, epochs=20\n",
"):\n",
"\n",
" tf.keras.backend.clear_session()\n",
" callbacks = []\n",
"\n",
1 change: 1 addition & 0 deletions dl/VAE.ipynb
@@ -997,6 +997,7 @@
"source": [
"import numpy as np\n",
"\n",
"\n",
"###---------Transformation Functions----###\n",
"def center_com(paths):\n",
" \"\"\"Align paths to COM at each frame\"\"\"\n",
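(Aside, not part of this diff: a minimal sketch of what a COM-alignment routine like the `center_com` above could look like, assuming equal masses and `paths` shaped `(frames, particles, 3)`; the notebook's actual implementation may differ.)

```python
import numpy as np


def center_com_sketch(paths):
    """Subtract each frame's center of mass (equal masses assumed)."""
    com = paths.mean(axis=1, keepdims=True)  # (frames, 1, 3)
    return paths - com
```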
4 changes: 2 additions & 2 deletions dl/attention.ipynb
@@ -6,7 +6,7 @@
"source": [
"# Attention Layers\n",
"\n",
"Attention is a concept in machine learning and AI that goes back many years, especially in computer vision{cite}`BALUJA1997329`. Like the word \"neural network\", attention was inspired by the idea of attention in how human brains deal with the massive amount of visual and audio input{cite}`treisman1980feature`. **Attention layers** are deep learning layers that evoke the idea of attention. You can read more about attention in deep learning in Luong et al. {cite}`luong2015effective` and get a practical [overview here](http://d2l.ai/chapter_attention-mechanisms/index.html). Attention layers have been empirically shown to be so effective in modeling sequences, like language, that they have become indispensible{cite}`vaswani2017attention`. The most common place you'll see attention layers is in [**transformer**](http://d2l.ai/chapter_attention-mechanisms/transformer.html) neural networks that model sequences. We'll also sometimes see attention in graph neural networks.\n",
"Attention is a concept in machine learning and AI that goes back many years, especially in computer vision{cite}`BALUJA1997329`. Like the word \"neural network\", attention was inspired by the idea of attention in how human brains deal with the massive amount of visual and audio input{cite}`treisman1980feature`. **Attention layers** are deep learning layers that evoke the idea of attention. You can read more about attention in deep learning in Luong et al. {cite}`luong2015effective` and get a practical [overview here](http://d2l.ai/chapter_attention-mechanisms/index.html). Attention layers have been empirically shown to be so effective in modeling sequences, like language, that they have become indispensable{cite}`vaswani2017attention`. The most common place you'll see attention layers is in [**transformer**](http://d2l.ai/chapter_attention-mechanisms/transformer.html) neural networks that model sequences. We'll also sometimes see attention in graph neural networks.\n",
"\n",
"\n",
"```{margin}\n",
@@ -89,7 +89,7 @@
"source": [
"## Attention Mechanism Equation\n",
"\n",
"The attention mechanism equation uses query and keys arguments only. It outputs a tensor one rank less than the keys, giving a scalar for each key corresponding to the attention the query should have for the key. This attention vector should be normalized. The most common attention mechanism a dot product and softmax:\n",
"The attention mechanism equation uses query and keys arguments only. It outputs a tensor one rank less than the keys, giving a scalar for each key corresponding to the attention the query should have for the key. This attention vector should be normalized. The most common attention mechanism is a dot product and softmax:\n",
"\n",
"\\begin{equation}\n",
"\\vec{b} = \\mathrm{softmax}\\left(\\vec{q}\\cdot \\mathbf{K}\\right) = \\mathrm{softmax}\\left(\\sum_j q_j k_{ij}\\right)\n",
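(Aside, not part of this diff: a minimal NumPy sketch of the dot-product-and-softmax mechanism in the equation above; the toy shapes are assumptions for illustration.)

```python
import numpy as np


def softmax(x):
    # shift by the max for numerical stability
    e = np.exp(x - np.max(x))
    return e / np.sum(e)


def dot_attention(q, K):
    """Return one normalized attention weight per key: softmax(q . K)."""
    return softmax(K @ q)  # K is (n_keys, d), q is (d,)


q = np.array([1.0, 0.5])
K = np.random.normal(size=(4, 2))
print(dot_attention(q, K))  # four weights summing to 1
```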
2 changes: 1 addition & 1 deletion dl/data.ipynb
@@ -749,7 +749,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see how points far away on the chain from 0 have much more variance in the point 0 align, whereas the COM alignment looks better spread. Remember, to apply these methods you must do them to your both your training data and any prediction points. Thus, they should be viewed as part of your neural network. We can now check that rotating has no effect on these. The plots below have the trajectory rotated by 1 radian and you can see that both alignment methods have no change (the lines are overlapping)."
"You can see how points far away on the chain from 0 have much more variance in the point 0 align, whereas the COM alignment looks better spread. Remember, to apply these methods you must do them to both your training data and any prediction points. Thus, they should be viewed as part of your neural network. We can now check that rotating has no effect on these. The plots below have the trajectory rotated by 1 radian and you can see that both alignment methods have no change (the lines are overlapping)."
]
},
{
2 changes: 2 additions & 0 deletions dl/flows.ipynb
@@ -384,6 +384,8 @@
"# use input (feature) and output (log prob)\n",
"# to make model\n",
"model = tf.keras.Model(x, log_prob)\n",
"\n",
"\n",
"# define a loss\n",
"def neg_loglik(yhat, log_prob):\n",
" # losses always take in label, prediction\n",
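(Aside, not part of this diff: one way the `neg_loglik` loss started above could be written, assuming the second argument is already a log-probability; the notebook's own definition may differ.)

```python
import tensorflow as tf


def neg_loglik(yhat, log_prob):
    # labels are ignored; training just maximizes the model's log-probability
    return -tf.reduce_mean(log_prob)
```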
7 changes: 4 additions & 3 deletions dl/gnn.ipynb
@@ -1053,7 +1053,7 @@
"A common piece of wisdom is if you want to solve a real problem with deep learning, you should read the most recent popular paper in an area and use the baseline they compare against instead of their proposed model. The reason is that a baseline model usually must be easy, fast, and well-tested, which is generally more important than being the most accurate\n",
"```\n",
"\n",
"SchNet is for atoms represented as xyz coordinates (points) -- not as a molecular graph. All our previous examples used the underlying molecular graph as the input. In SchNet we will convert our xyz coodinates into a graph, so that we can apply a GNNN. SchNet was developed for predicting energies and forces from atom configurations without bond information. Thus, we need to first see how a set of atoms and their positions is converted into a graph. To get the nodes, we do a similar process as above and the atomic number is passed through an embedding layer, which is just means we assign a trainable vector to each atomic number (See {doc}`layers` for a review of embeddings). \n",
"SchNet is for atoms represented as xyz coordinates (points) -- not as a molecular graph. All our previous examples used the underlying molecular graph as the input. In SchNet we will convert our xyz coodinates into a graph, so that we can apply a GNN. SchNet was developed for predicting energies and forces from atom configurations without bond information. Thus, we need to first see how a set of atoms and their positions is converted into a graph. To get the nodes, we do a similar process as above and the atomic number is passed through an embedding layer, which just means we assign a trainable vector to each atomic number (See {doc}`layers` for a review of embeddings). \n",
"\n",
"Getting the adjacency matrix is simple too: we just make every atom be connected to every atom. It might seem confusing what the point of using a GNN is, if we're just connecting everything. *It is because GNNs are permutation equivariant.* If we tried to do learning on the atoms as xyz coordinates, we would have weights depending on the ordering of atoms and probably fail to handle different numbers of atoms.\n",
"\n",
@@ -1220,6 +1220,7 @@
"\n",
"label_str = list(set([k.split(\"-\")[0] for k in trajs]))\n",
"\n",
"\n",
"# now build dataset\n",
"def generator():\n",
" for k, v in trajs.items():\n",
@@ -1553,7 +1554,7 @@
"\n",
"---\n",
"\n",
"Let's give now use the model on some data."
"Let's now use the model on some data."
]
},
{
@@ -1680,7 +1681,7 @@
"\n",
"### Common Architecture Motifs and Comparisons\n",
"\n",
"We've now seen message passing layer GNNs, GCNs, GGNs, and the generalized Battaglia equations. You'll find common motifs in the architectures, like gating, {doc}`attention`, and pooling strategies. For example, Gated GNNS (GGNs) can be combined with attention pooling to create Gated Attention GNNs (GAANs){cite}`zhang2018gaan`. GraphSAGE is a similar to a GCN but it samples when pooling, making the neighbor-updates of fixed dimension{cite}`hamilton2017inductive`. So you'll see the suffix \"sage\" when you sample over neighbors while pooling. These can all be represented in the Battaglia equations, but you should be aware of these names. \n",
"We've now seen message passing layer GNNs, GCNs, GGNs, and the generalized Battaglia equations. You'll find common motifs in the architectures, like gating, {doc}`attention`, and pooling strategies. For example, Gated GNNS (GGNs) can be combined with attention pooling to create Gated Attention GNNs (GAANs){cite}`zhang2018gaan`. GraphSAGE is similar to a GCN but it samples when pooling, making the neighbor-updates of fixed dimension{cite}`hamilton2017inductive`. So you'll see the suffix \"sage\" when you sample over neighbors while pooling. These can all be represented in the Battaglia equations, but you should be aware of these names. \n",
"\n",
"The enormous variety of architectures has led to work on identifying the \"best\" or most general GNN architecture {cite}`dwivedi2020benchmarking,errica2019fair,shchur2018pitfalls`. Unfortunately, the question of which GNN architecture is best is as difficult as \"what benchmark problems are best?\" Thus there are no agreed-upon conclusions on the best architecture. However, those papers are great resources on training, hyperparameters, and reasonable starting guesses and I highly recommend reading them before designing your own GNN. There has been some theoretical work to show that simple architectures, like GCNs, cannot distinguish between certain simple graphs {cite}`xu2018powerful`. How much this practically matters depends on your data. Ultimately, there is so much variety in hyperparameters, data equivariances, and training decisions that you should think carefully about how much the GNN architecture matters before exploring it with too much depth. "
]
2 changes: 1 addition & 1 deletion dl/layers.ipynb
@@ -346,7 +346,7 @@
"\n",
"#### Layer Normalization\n",
"\n",
"Batch normalization depends on there being a constant batch size. Some kinds of data, like text or a graphs, have different sizes and so the batch mean/variance can change significantly. **Layer normalization** avoids this problem by normalizing across the *features* (the non-batch axis/channel axis) instead of the batch. This has a similar effect of making the layer output features behave well-centered at 0 but without having highly variable means/variances because of batch to batch variation. You'll see these in graph neural networks and recurrent neural networks, with both take variable sized inputs. \n",
"Batch normalization depends on there being a constant batch size. Some kinds of data, like text or graphs, have different sizes and so the batch mean/variance can change significantly. **Layer normalization** avoids this problem by normalizing across the *features* (the non-batch axis/channel axis) instead of the batch. This has a similar effect of making the layer output features behave well-centered at 0 but without having highly variable means/variances because of batch to batch variation. You'll see these in graph neural networks and recurrent neural networks, with both take variable sized inputs. \n",
"\n",
"### Dropout\n",
"\n",
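(Aside, not part of this diff: a minimal sketch of the layer-normalization idea in the hunk above -- normalize each sample over its feature axis instead of over the batch; shapes are assumptions.)

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Center and scale each sample across its last (feature) axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


x = np.random.normal(loc=2.0, scale=3.0, size=(3, 4))  # 3 samples, 4 features
print(layer_norm(x).mean(axis=-1))  # each sample now centered near 0
```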