Commit

Update nlp_apps.ipynb (aimacode#888)
antmarakis authored and norvig committed Mar 25, 2018
1 parent 62080b6 commit ab2377b
Showing 1 changed file with 36 additions and 110 deletions.
146 changes: 36 additions & 110 deletions nlp_apps.ipynb
@@ -571,7 +571,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it is time to build our new Naive Bayes Learner. It is very similar to the one found in `learning.py`, but with an important difference: it doesn't classify an example, but instead returns the probability of the example belonging to each class. This will allow us to not only see to whom a paper belongs to, but also the probability of authorship as well."
"Now it is time to build our new Naive Bayes Learner. It is very similar to the one found in `learning.py`, but with an important difference: it doesn't classify an example, but instead returns the probability of the example belonging to each class. This will allow us to not only see to whom a paper belongs to, but also the probability of authorship as well.\n",
"\n",
"Finally, since we are dealing with long text and the string of probability multiplications is long, we will end up with the results being rounded to 0 due to floating point underflow. To work around this problem we will use the built-in Python library `decimal`, which allows as to set decimal precision to much larger than normal."
]
},
{
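To see why this matters, here is a quick standalone illustration of the underflow (a sketch with invented numbers, independent of the notebook's data):

    import decimal
    from decimal import Decimal

    p = 1e-5                         # a typically tiny per-word probability
    print(p ** 100)                  # 0.0 -- the product underflows a float

    decimal.getcontext().prec = 100  # raise the working precision
    print(Decimal(p) ** 100)         # a tiny but nonzero value near 1E-500

Multiplying a few hundred such factors is exactly what the classifier below does, which is why the plain-float version collapses to 0.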
@@ -581,7 +583,16 @@
"outputs": [],
"source": [
"import random\n",
"from utils import product\n",
"import decimal\n",
"from decimal import Decimal\n",
"\n",
"decimal.getcontext().prec = 100\n",
"\n",
"def precise_product(numbers):\n",
" result = 1\n",
" for x in numbers:\n",
" result *= Decimal(x)\n",
" return result\n",
"\n",
"\n",
"def NaiveBayesLearner(dist):\n",
@@ -596,20 +607,13 @@
" \"\"\"Predict the probabilities for each class.\"\"\"\n",
" def class_prob(target, e):\n",
" attr = attr_dist[target]\n",
" return product([attr[a] for a in e])\n",
" return precise_product([attr[a] for a in e])\n",
"\n",
" pred = {t: class_prob(t, example) for t in dist.keys()}\n",
"\n",
" total = sum(pred.values())\n",
" if total == 0:\n",
" # Since there are a lot of multiplications of very small numbers,\n",
" # we end up with values equal to 0. To combat that, we keep\n",
" # dividing the example until the sum of the values is not 0.\n",
" random_words_count = max([int(3*len(example)/4), 100])\n",
" pred = predict(random.sample(example, random_words_count))\n",
" else:\n",
" for k, v in pred.items():\n",
" pred[k] = v / total\n",
" for k, v in pred.items():\n",
" pred[k] = v / total\n",
"\n",
" return pred\n",
"\n",
@@ -637,7 +641,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As usual, the `recognize` function will take as input a string and after removing capitalization and splitting it into words, will feed it into the Naive Bayes Classifier. Since though the classifier is probabilistic (it randomly picks words from the example to evaluate) it is better if we run the experiment a lot of times and averaged the results."
"As usual, the `recognize` function will take as input a string and after removing capitalization and splitting it into words, will feed it into the Naive Bayes Classifier."
]
},
{
Expand All @@ -646,22 +650,8 @@
"metadata": {},
"outputs": [],
"source": [
"def avg_preds(preds):\n",
" d = {}\n",
" for k in preds[0].keys():\n",
" d[k] = 0\n",
" for p in preds:\n",
" d[k] += p[k]\n",
" \n",
" return {k: d[k] / len(preds)\n",
" for k in preds[0].keys()}\n",
"\n",
"\n",
"def recognize(sentence, nBS):\n",
" sentence = sentence.lower()\n",
" sentence_words = words(sentence)\n",
" \n",
" return avg_preds([nBS(sentence_words) for i in range(25)])"
" return nBS(words(sentence.lower()))"
]
},
{
@@ -680,101 +670,37 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Paper No. 49\n",
"Hamilton: 0.18218476722264856\n",
"Madison : 0.8178151126501306\n",
"Jay : 1.2012722099721584e-07\n",
"----------------------\n",
"Paper No. 50\n",
"Hamilton: 0.006340777113564324\n",
"Madison : 0.9935600714606485\n",
"Jay : 9.915142578703363e-05\n",
"----------------------\n",
"Paper No. 51\n",
"Hamilton: 0.10807398451170964\n",
"Madison : 0.8919260093780947\n",
"Jay : 6.11019566801153e-09\n",
"----------------------\n",
"Paper No. 52\n",
"Hamilton: 0.015755507847563528\n",
"Madison : 0.9842245750173423\n",
"Jay : 1.9917135094100632e-05\n",
"----------------------\n",
"Paper No. 53\n",
"Hamilton: 0.16148149622286845\n",
"Madison : 0.8385181396174793\n",
"Jay : 3.641596521788814e-07\n",
"----------------------\n",
"Paper No. 54\n",
"Hamilton: 0.1202445807489968\n",
"Madison : 0.8797554191935693\n",
"Jay : 5.743394071176045e-11\n",
"----------------------\n",
"Paper No. 55\n",
"Hamilton: 0.10014174623125195\n",
"Madison : 0.8998582478040609\n",
"Jay : 5.964687179083329e-09\n",
"----------------------\n",
"Paper No. 56\n",
"Hamilton: 0.15930217913525455\n",
"Madison : 0.8406948696158869\n",
"Jay : 2.9512488585096405e-06\n",
"----------------------\n",
"Paper No. 57\n",
"Hamilton: 0.3106575736716812\n",
"Madison : 0.6893423580295986\n",
"Jay : 6.829872019646261e-08\n",
"----------------------\n",
"Paper No. 58\n",
"Hamilton: 0.08144023779669217\n",
"Madison : 0.9185597621646735\n",
"Jay : 3.8634360540381284e-11\n",
"----------------------\n",
"Paper No. 18\n",
"Hamilton: 7.762932414823314e-06\n",
"Madison : 0.5114716240007965\n",
"Jay : 0.4885206130667886\n",
"----------------------\n",
"Paper No. 19\n",
"Hamilton: 0.011570316420346522\n",
"Madison : 0.5281730401297515\n",
"Jay : 0.4602566434499019\n",
"----------------------\n",
"Paper No. 20\n",
"Hamilton: 0.14651509965391551\n",
"Madison : 0.5342142523806944\n",
"Jay : 0.31927064796538995\n",
"----------------------\n",
"Paper No. 64\n",
"Hamilton: 0.5756065218890194\n",
"Madison : 0.3648418106830272\n",
"Jay : 0.059551667427953384\n",
"----------------------\n"
"Paper No. 49: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 50: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 51: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 52: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 53: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 54: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 55: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 56: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 57: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 58: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 18: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 19: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 20: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
"Paper No. 64: Hamilton: 1.00 Madison: 0.00 Jay: 0.00\n"
]
}
],
"source": [
"for d in disputed:\n",
" print(\"Paper No. {}\".format(d))\n",
" probs = recognize(papers[d], nBS)\n",
" h = probs[('Hamilton', 1)]\n",
" m = probs[('Madison', 1)]\n",
" j = probs[('Jay', 1)]\n",
" print(\"Hamilton: {}\".format(h))\n",
" print(\"Madison : {}\".format(m))\n",
" print(\"Jay : {}\".format(j))\n",
" print(\"----------------------\")"
" results = ['{}: {:.2f}'.format(name, probs[(name, 1)]) for name in 'Hamilton Madison Jay'.split()]\n",
" print('Paper No. {}: {}'.format(d, ' '.join(results)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NOTE: Since the algorithm has an element of random, it will show different results on each run. Generally, the more the experiments, the stabler the results.\n",
"\n",
"This is a simple approach to the problem and thankfully researchers are fairly certain that papers 49-58 were all written by Madison, while 18-20 were written in collaboration between Hamilton and Madison, with Madison being credited for most of the work. Our classifier is not that far off. It should correctly classify all (or most of) the papers by Madison, even though on some occasions the classifier is not that sure. For the collaboration papers between Hamilton and Madison the classifier shows some peculiar results: most of the time it correctly implies that Madison did a lot of the work but instead of Hamilton helping him, it usually shows Jay. This might be because the collaboration between Madison and Hamilton produced some results uncharacteristic to either of them. Without further investigation it is hard to pinpoint the issue.\n",
"This is a simple approach to the problem and thankfully researchers are fairly certain that papers 49-58 were all written by Madison, while 18-20 were written in collaboration between Hamilton and Madison, with Madison being credited for most of the work. Our classifier is not that far off. It correctly identifies the papers written by Madison, even the ones in collaboration with Hamilton.\n",
"\n",
"Unfortunately, it misses paper 64. Consensus is that the paper was written by John Jay, while our classifier believes it was written by Hamilton. The classifier went wrong there because it did not have much information on Jay's writing; only 4 papers. This is one of the problems with using unbalanced datasets such as this one, where information on some classes is sparser than information on the rest. To avoid this, we can add more writings for Jay and Madison to end up with an equal amount of data for each author."
"Unfortunately, it misses paper 64. Consensus is that the paper was written by John Jay, while our classifier believes it was written by Hamilton. The classifier is wrong there because it does not have much information on Jay's writing; only 4 papers. This is one of the problems with using unbalanced datasets such as this one, where information on some classes is sparser than information on the rest. To avoid this, we can add more writings for Jay and Madison to end up with an equal amount of data for each author."
]
}
],
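A hedged sketch of that fix, assuming the per-author word lists from earlier in the notebook are available under the hypothetical names `hamilton_words`, `madison_words` and `jay_words`, is to trim every corpus to the size of the smallest one before building the distributions:

    # Hypothetical balancing step: truncate each corpus to the smallest
    # size so that no author dominates the training data by sheer volume.
    smallest = min(len(hamilton_words), len(madison_words), len(jay_words))

    dist = {('Hamilton', 1): CountingProbDist(hamilton_words[:smallest]),
            ('Madison', 1): CountingProbDist(madison_words[:smallest]),
            ('Jay', 1): CountingProbDist(jay_words[:smallest])}

    nBS = NaiveBayesLearner(dist)

Truncation throws information away, so collecting more of Jay's writing, as suggested above, remains the better long-term fix.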
