Fall bug fixes 22 (#223)
whitead committed Dec 14, 2022
1 parent 84b1a20 commit 24762c1
Showing 9 changed files with 64 additions and 26 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/check-book.yml
@@ -2,9 +2,9 @@ name: check-book

on:
push:
branches: [ master ]
branches: [ main ]
pull_request:
branches: [ master ]
branches: [ main ]

jobs:
check-build-book:
1 change: 0 additions & 1 deletion .github/workflows/deploy-jupyter-book.yml
@@ -4,7 +4,6 @@ name: deploy-book
on:
push:
branches:
- master
- main

workflow_dispatch:
19 changes: 12 additions & 7 deletions dl/Equivariant.ipynb
@@ -396,7 +396,7 @@
"\n",
"As you can see from the theorem, we must introduce more new concepts. The first important detail is that all our functions are over our group elements, not our space. This should seem strange. We will easily fix this because there is a way to assign one group element to each point in the space. The second detail is the $f \\uparrow^G$. The order of the group $G$ is greater than or equal to the number of points in our space, so if the function is defined on our space, we must \"lift\" it up to the group $G$ which has more elements. The last detail is the point about **quotient spaces**. Quotient spaces are how we cut-up our group $G$ into subgroups so that one has the same order as the number of points in our space. Below I detail these new concepts just enough so that we can implement and understand these convolutions.\n",
"\n",
"There are some interesting notes about this definition. The first is that everything is scalar valued. The weights, which may be called a convolution filter, are coming out of a scalar valued function $\\omega(g)$. The output of the neural network $\\psi(f)$ is a scalar valued function of the *group*. Thus when we go to the next layer, we do not have to lift --- we can just have $U = V = G$. On our final layer we can choose $V = G / H$ and we can get out a function over the group that maps neatly into a function over the space (see **projecting** below). Finally, because our weights are a scalar valued function we cannot change the number of trainable parameters in an obvious way. We can do the same approach as what is done for image convolutional neural networks and create multiple $omega_k(g)$s and call them *channels.* Then after our first input layer, we'll have a new channel axes. In the SO(3) example below we'll formalize channels a bit more and show how to mix data between channels. \n",
"There are some interesting notes about this definition. The first is that everything is scalar valued. The weights, which may be called a convolution filter, are coming out of a scalar valued function $\\omega(g)$. The output of the neural network $\\psi(f)$ is a scalar valued function of the *group*. Thus when we go to the next layer, we do not have to lift --- we can just have $U = V = G$. The capability to do learning on functions over the whole group without lifting in the hidden layers is actually a major reason for the effectiveness of G-Equivariant convolutions. On our final layer we can choose $V = G / H$ and we can get out a function over the group that maps neatly into a function over the space (see **projecting** below). Finally, because our weights are a scalar valued function we cannot change the number of trainable parameters in an obvious way. We can do the same approach as what is done for image convolutional neural networks and create multiple $omega_k(g)$s and call them *channels.* Then after our first input layer, we'll have a new channel axes. In the SO(3) example below we'll formalize channels a bit more and show how to mix data between channels. \n",
"\n",
"```{warning}\n",
"To actually learn, you need to put in a nonlinearity after the convolution. A simple (and often used) case is to just use a standard non-linear function like ReLU pointwise (applied to the output $u \\in G$). We'll look at more complex examples below for the continuous case.\n",
@@ -463,7 +463,7 @@
"source": [
"```{tabbed} ⬡ Finite Group $Z_6$ \n",
"\n",
"Our function is the color of the vertices in our picture {glue:}`hex-0` $f(x) = (r, g, b)$ where $r,g,b$ are fractions of the color red, blue green. If we define the vertices to start at the line pointing up, we can label them $0,\\ldots,5$. So for example $f(0) =(0.11, 0.74, 0.61)$, which is the color of the top vertex. \n",
"Our function is the color of the vertices in our picture {glue:}`hex-0` $f(x) = (s_r, s_g, s_b)$ where $s_r,s_g,_bb$ are fractions of the color red, blue, green. If we define the vertices to start at the line pointing up, we can label them $0,\\ldots,5$. So for example $f(0) =(0.11, 0.74, 0.61)$, which is the color of the top vertex. \n",
"\n",
"We can define the origin as $x_0 = 0$. $|G| = |\\mathcal{X}|$ for this finite group and thus our stabilizer subgroup only contains the identity $H_0 = \\{e\\}$. Our cosets and their associated points will be $(eH_0, x = 0), (rH_0, x = 1), (r^2H_0, x = 2), (r^3H_0, x = 3), (r^4H_0, x = 4), (r^5H_0, x = 5)$. The lifted $f\\uparrow^G(g)$ can be easily defined using these cosets. \n",
"\n",
@@ -2096,10 +2096,10 @@
"\n",
"1. Does the picture {glue:}`hex-1` represent a point in the space or function in the space? Justify your answer\n",
"2. In the $Z_6$ examples, our stabilizer group is the identity -- $|G| = |\\mathcal{X}|$. Now consider including rotations up to $r^{11}$ but keep the space the same so that $|G| = 2|\\mathcal{X}|$. What would the stabilizer group be?\n",
"3. Is the standard representation always faithful?\n",
"3. Is the defining representation always faithful?\n",
"4. Let's redefine our space for p4m to have $c$ channels like $\\mathcal{R}^{d\\times c}$. Can we construct a group action that makes this space homogeneous?\n",
"5. Revise the p4m example code to use multiple channels\n",
"6. The output from a G-equivariant neural network is a scalar valued function. Would taking the value at a specific point of the function be equivariant, invariant, or neither? What about the integral?\n",
"5. Explain how the code example above deals with channels and compare with your answer to 4.\n",
"6. The output from a G-equivariant neural network is a scalar valued function. Would taking the value at a specific point of the function be equivariant, invariant, or neither? What about a definite integral over the function?\n",
"7. You can represent permutations as a group. For example, given a sequence a,b,c you can represent swapping position 1 and 2 as an element from a group. Write the Caley table for such a group. What is the space for this example and what needs to be true for it to be homogeneous.\n",
"8. In the G-equivariant neural network layer definition, the output space can be different. Does it have to be homogeneous for the definition to hold? Why or why not? What if the input space is points and the output space is a multi-class probability vector -- can you have equivariance? Why or why not?\n",
"9. Could we make a scale equivariant neural network? A scale being some constant $s$ and s acting on $\\vec{r}$ is $s\\vec{r}$. Try to construct a group where each element is a scaling. What is the action, is it homogeneous, and are there any special considerations when building a G-equivariant neural network layer? Do things change if we have discrete scalings (e.g., $s = {\\frac{1}{10},\\frac{1}{2},1,2,10}$).\n",
@@ -2274,7 +2274,7 @@
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3.7.8 64-bit",
"language": "python",
"name": "python3"
},
@@ -2288,7 +2288,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
"version": "3.7.8"
},
"vscode": {
"interpreter": {
"hash": "3e5a039a7a113538395a7d74f5574b0c5900118222149a18efb009bf03645fce"
}
}
},
"nbformat": 4,
6 changes: 5 additions & 1 deletion dl/NLP.ipynb
@@ -57,14 +57,18 @@
"Recent work from Krenn et al. developed an alternative approach to SMILES called SELF-referencIng Embedded Strings (SELFIES){cite}`Krenn_2020`. Every string is a valid molecule. Note that the characters in SELFIES are not all ASCII characters, so it's not like every sentence encodes a molecule (would be cool though). SELFIES is an excellent choice for generative models because any SELFIES string automatically decodes to a valid molecule. SELFIES, as of 2021, is not directly canonicalized though and thus is not permutation invariant by itself. However, if you add canonical SMILES as an intermediate step, then SELFIES are canonical. It seems that models which output a molecule (generative or supervised) benefit from using SELFIES instead of SMILES because the model does not need to learn how to make valid strings -- all strings are already valid SELFIES {cite}`rajan2020decimer`. This benefit is less clear in supervised learning and no difference has been observed empirically{cite}`chithrananda2020chemberta`. Here's a blog post giving an [overview of SELFIES and its applications](https://aspuru.substack.com/p/molecular-graph-representations-and).\n",
"\n",
"\n",
"### Demo\n",
"#### Demo\n",
"\n",
"You can get a sense for SMILES and SELFIES in this [demo page](https://whitead.github.io/molecule-dream/) that uses a RNN (discussed below) to generate SMILES and SELFIES strings.\n",
"\n",
"### Stereochemistry\n",
"\n",
"SMILES and SELFIES can treat stereoisomers, but there are a few complications. `rdkit`, the dominant Python package, [cannot treat non-tetrahedral chiral centers with SMILES](https://github.com/rdkit/rdkit/issues/3220) as of 2022. For example, even though SMILES according to its specification can correctly distinguish cisplatin and transplatin, the implementation of SMILES in `rdkit` cannot. Other examples of chirality that are present in the SMILES specification but not implementations are planar and axial chirality. SELFIES relies on SMILES (most often the `rdkit` implementation) and thus is also susceptible to this problem. This is an issue for any organometallic compounds. In organic chemistry though, most chirality is tetrahedral and correctly treated by `rdkit`.\n",
"\n",
"### Other Ideas\n",
"\n",
"Recent work by Kim et al. {cite}`kim2022pure` has shown that we may actually be able to directly insert the graph as a sequence into a sequence neural network without needing to make a decision like using SMILES or SELFIES. They basically add a special character/embedding for noting if a piece of a graph is a node or edge.\n",
"\n",
"### What is a chemical bond?\n",
"\n",
"More broadly, the idea of a chemical bond is a concept created by chemists {cite}`ball2011beyond`. You cannot measure the existence of a chemical bond in the lab and it is not some quantum mechanical operator with an observable. There are certain molecules which cannot be represented by classic single,double,triple,aromatic bonded representations, like ferrocene or diborane. This bleeds over to text encoding of a molecule where the bonding topology doesn't map neatly to bond order. The specific issue this can cause is that multiple unique molecules may appear to have the same encoding (non-injective). In situations like this, it is probably better to just work with the exact 3D coordinates and then bond order or type is less important than distance between atoms."
2 changes: 1 addition & 1 deletion dl/flows.ipynb
@@ -15,7 +15,7 @@
"\n",
"A **normalizing flow** is similar to a VAE in that we try to build up $P(x)$ by starting from a simple known distribution $P(z)$. We use functions, like the decoder from a VAE, to go from $x$ to $z$. However, we make sure that the functions we choose keep the probability mass normalized ($\\sum P(x) = 1$) and can be used forward (to sample from x) and backward (to compute $P(x)$). We call these functions **bijectors** because they are bijective (surjective and injective). Recall surjective (onto) means every output has a corresponding input and injective (onto) means each output has exactly one corresponding input.\n",
"\n",
"An example of a bijector is an element-wise cosine $y_i = \\cos x_i$ (assuming $x_i$ is between 0 and $\\pi$). A non-bijective function would be $y_i = \\cos x_i$ on the interval from $0$ to $2\\pi$, because it outputs all values from $[0,1]$ twice and hence is not injective. Any function which changes the number of elements is automatically not bijective (see margin note). A consequence of using only bijectors in constructing our normalizing flow is that the size of the latent space must be equal to the size of the feature space. Remember the VAE used a smaller latent space than the feature space. \n",
"An example of a bijector is an element-wise cosine $y_i = \\cos x_i$ (assuming $x_i$ is between $0$ and $\\pi$). A non-bijective function would be $y_i = \\cos x_i$ on the interval from $0$ to $2\\pi$, because it outputs all values from $[0,1]$ twice and hence is not injective. Any function which changes the number of elements is automatically not bijective (see margin note). A consequence of using only bijectors in constructing our normalizing flow is that the size of the latent space must be equal to the size of the feature space. Remember the VAE used a smaller latent space than the feature space. \n",
"\n",
"\n",
"```{admonition} Audience & Objectives\n",
11 changes: 8 additions & 3 deletions dl/gnn.ipynb
@@ -250,7 +250,7 @@
"The input to a GCN layer is $\\mathbf{V}$, $\\mathbf{E}$ and it outputs an updated $\\mathbf{V}'$. Each node feature vector is updated. The way it updates a node feature vector is by averaging the feature vectors of its neighbors, as determined by $\\mathbf{E}$. The choice of averaging over neighbors is what makes a GCN layer permutation equivariant. Averaging over neighbors is not trainable, so we must add trainable parameters. We multiply the neighbor features by a trainable matrix before the averaging, which gives the GCN the ability to learn. In Einstein notation, this process is:\n",
"\n",
"$$\n",
"v_{il} = \\sigma\\left(\\frac{1}{d_i}e_{ij}v_{jk}w_{lk}\\right)\n",
"v_{il} = \\sigma\\left(\\frac{1}{d_i}e_{ij}v_{jk}w_{kl}\\right)\n",
"$$ (gcn)\n",
"\n",
"where $i$ is the node we're considering, $j$ is the neighbor index, $k$ is the node input feature, $l$ is the output node feature, $d_i$ is the degree of node i (which makes it an average instead of sum), $e_{ij}$ isolates neighbors so that all non-neighbor $v_{jk}$s are zero, $\\sigma$ is our activation, and $w_{lk}$ is the trainable weights. This equation is a mouthful, but it truly just is the average over neighbors with a trainable matrix thrown in. One common modification is to make all nodes neighbors of themselves. This is so that the output node features $v_{il}$ depends on the input features $v_{ik}$. We do not need to change our equation, just make the adjacency matrix have $1$s on the diagonal instead of $0$ by adding the identity matrix during pre-processing.\n",
@@ -1816,7 +1816,7 @@
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3.7.8 64-bit",
"language": "python",
"name": "python3"
},
@@ -1830,7 +1830,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
"version": "3.7.8"
},
"vscode": {
"interpreter": {
"hash": "3e5a039a7a113538395a7d74f5574b0c5900118222149a18efb009bf03645fce"
}
}
},
"nbformat": 4,
11 changes: 8 additions & 3 deletions dl/xai.ipynb
@@ -6,7 +6,7 @@
"source": [
"# Explaining Predictions\n",
"\n",
"Neural network predictions are not interpretable in general. In this chapter, we explore how to explain predictions. This is part of the broader topic of explainable AI (XAI). These explanations should help us understand why particular predictions are made. This is a critical topic because being able to understand model predictions is justified from a practical, theoretical, and increasingly a regulatory stand-point. It is practical because it has been shown that people are more likely to use predictions of a model if they can understand the rationale {cite}`lee2004trust`. Another practical concern is that correctly implementing methods is much easier when one can understand how a model arrived at a prediction. A theoretical justification for transparency is that it can help identify incompleteness in model domains (i.e., covariate shift){cite}`doshi2017towards`. It is now becoming a compliance problem because both the European Union {cite}`goodman2017european` and the G20 {cite}`Development2019` have recently adopted guidelines that recommend or require explanations for machine predictions. The European Union is considering going further with more [strict draft legislation](https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-laying-down-harmonised-rules-artificial-intelligence-artificial-intelligence) being considered. \n",
"Neural network predictions are not interpretable in general. In this chapter, we explore how to explain predictions. This is part of the broader topic of explainable AI (XAI). These explanations should help us understand why particular predictions are made. This is a critical topic because being able to understand model predictions is justified from a practical, theoretical, and increasingly a regulatory stand-point. It is practical because it has been shown that people are more likely to use predictions of a model if they can understand the rationale {cite}`lee2004trust`. Another practical concern is that correctly implementing methods is much easier when one can understand how a model arrived at a prediction. A theoretical justification for transparency is that it can help identify incompleteness in model domains (i.e., covariate shift){cite}`doshi2017towards`. It is now becoming a compliance problem because both the European Union {cite}`goodman2017european` and the G20 {cite}`Development2019` have recently adopted guidelines that recommend or require explanations for machine predictions. The US and EU are also considering going further with more [strict draft legislation](https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-laying-down-harmonised-rules-artificial-intelligence-artificial-intelligence) and a so-called White House AI Bill of Rights {cite}`blumenthal2022ai`.\n",
"\n",
"\n",
"```{admonition} Audience & Objectives\n",
@@ -1338,7 +1338,7 @@
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3.7.8 64-bit",
"language": "python",
"name": "python3"
},
@@ -1352,7 +1352,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
"version": "3.7.8"
},
"vscode": {
"interpreter": {
"hash": "3e5a039a7a113538395a7d74f5574b0c5900118222149a18efb009bf03645fce"
}
}
},
"nbformat": 4,
8 changes: 6 additions & 2 deletions ml/introduction.ipynb
@@ -33,7 +33,7 @@
"source": [
"## The Ingredients \n",
"\n",
"Machine learning is about constructing models by fitting them to data. Firstly, definitions:\n",
"Machine learning the fitting of models $\\hat{f}(\\vec{x})$ to data $\\vec{x}, y$ that we know came from some ``data generation'' process $f(x)$ . Firstly, definitions:\n",
"\n",
"**Features** \n",
"\n",
@@ -51,9 +51,13 @@
"\n",
"    set of $N$ features $\\{\\vec{x}_i\\}$ that may have unknown $y$ labels\n",
"\n",
"**Data generation process**\n",
"\n",
"    The unseen process $f(\\vec{x})$ that takes a given feature vector in and returns a real label $y$ (what we're trying to model)\n",
"\n",
"**Model**\n",
"\n",
"    A function $f(\\vec{x})$ that takes a given feature vector in and returns a predicted $\\hat{y}$\n",
"    A function $\\hat{f}(\\vec{x})$ that takes a given feature vector in and returns a predicted $\\hat{y}$\n",
"\n",
"**Predictions**\n",
"\n",
