- en: Building a Basic Machine Learning Model in Python
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 在Python中构建基础机器学习模型
- en: 原文:[https://towardsdatascience.com/building-a-basic-machine-learning-model-in-python-d7cca929ee62?source=collection_archive---------2-----------------------#2023-01-02](https://towardsdatascience.com/building-a-basic-machine-learning-model-in-python-d7cca929ee62?source=collection_archive---------2-----------------------#2023-01-02)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://towardsdatascience.com/building-a-basic-machine-learning-model-in-python-d7cca929ee62?source=collection_archive---------2-----------------------#2023-01-02](https://towardsdatascience.com/building-a-basic-machine-learning-model-in-python-d7cca929ee62?source=collection_archive---------2-----------------------#2023-01-02)
- en: '*Extensive essay on how to pick the right problem and how to develop a basic
classifier*'
id: totrans-2
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: '*关于如何选择合适问题和如何开发基础分类器的详细论文*'
- en: '[](https://medium.com/@juras.jursenas?source=post_page-----d7cca929ee62--------------------------------)[![Juras
Juršėnas](../Images/eb2ca720f2c8688dbf8079879c028d12.png)](https://medium.com/@juras.jursenas?source=post_page-----d7cca929ee62--------------------------------)[](https://towardsdatascience.com/?source=post_page-----d7cca929ee62--------------------------------)[![Towards
Data Science](../Images/a6ff2676ffcc0c7aad8aaf1d79379785.png)](https://towardsdatascience.com/?source=post_page-----d7cca929ee62--------------------------------)
[Juras Juršėnas](https://medium.com/@juras.jursenas?source=post_page-----d7cca929ee62--------------------------------)'
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: '[](https://medium.com/@juras.jursenas?source=post_page-----d7cca929ee62--------------------------------)[![Juras
Juršėnas](../Images/eb2ca720f2c8688dbf8079879c028d12.png)](https://medium.com/@juras.jursenas?source=post_page-----d7cca929ee62--------------------------------)[](https://towardsdatascience.com/?source=post_page-----d7cca929ee62--------------------------------)[![Towards
Data Science](../Images/a6ff2676ffcc0c7aad8aaf1d79379785.png)](https://towardsdatascience.com/?source=post_page-----d7cca929ee62--------------------------------)
[Juras Juršėnas](https://medium.com/@juras.jursenas?source=post_page-----d7cca929ee62--------------------------------)'
- en: '[Follow](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fsubscribe%2Fuser%2F3041473d9e3c&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-basic-machine-learning-model-in-python-d7cca929ee62&user=Juras+Jur%C5%A1%C4%97nas&userId=3041473d9e3c&source=post_page-3041473d9e3c----d7cca929ee62---------------------post_header-----------)
Published in [Towards Data Science](https://towardsdatascience.com/?source=post_page-----d7cca929ee62--------------------------------)
·20 min read·Jan 2, 2023[](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fvote%2Ftowards-data-science%2Fd7cca929ee62&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-basic-machine-learning-model-in-python-d7cca929ee62&user=Juras+Jur%C5%A1%C4%97nas&userId=3041473d9e3c&source=-----d7cca929ee62---------------------clap_footer-----------)'
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: '[点击查看](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fsubscribe%2Fuser%2F3041473d9e3c&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-basic-machine-learning-model-in-python-d7cca929ee62&user=Juras+Jur%C5%A1%C4%97nas&userId=3041473d9e3c&source=post_page-3041473d9e3c----d7cca929ee62---------------------post_header-----------)
发布于 [Towards Data Science](https://towardsdatascience.com/?source=post_page-----d7cca929ee62--------------------------------)
·20 min 阅读·2023年1月2日[](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fvote%2Ftowards-data-science%2Fd7cca929ee62&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-basic-machine-learning-model-in-python-d7cca929ee62&user=Juras+Jur%C5%A1%C4%97nas&userId=3041473d9e3c&source=-----d7cca929ee62---------------------clap_footer-----------)'
- en: '[](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fbookmark%2Fp%2Fd7cca929ee62&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-basic-machine-learning-model-in-python-d7cca929ee62&source=-----d7cca929ee62---------------------bookmark_footer-----------)![](../Images/01ff8323628648cdfec674b9023fa9f2.png)'
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: '[](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fbookmark%2Fp%2Fd7cca929ee62&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-basic-machine-learning-model-in-python-d7cca929ee62&source=-----d7cca929ee62---------------------bookmark_footer-----------)![](../Images/01ff8323628648cdfec674b9023fa9f2.png)'
- en: Photo by [charlesdeluvio](https://unsplash.com/@charlesdeluvio?utm_source=medium&utm_medium=referral)
on [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral)
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: 照片由 [charlesdeluvio](https://unsplash.com/@charlesdeluvio?utm_source=medium&utm_medium=referral)
提供,来源于 [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral)
- en: By now, all of us have seen the results of various basic machine learning (ML)
models. The internet is rife with images, videos, and articles showing off how
a computer identifies, correctly or not, various animals.
id: totrans-9
prefs: []
type: TYPE_NORMAL
zh: 目前,我们都见过各种基础机器学习(ML)模型的结果。互联网充斥着展示计算机如何识别各种动物的图像、视频和文章,无论识别是否正确。
- en: While we have moved towards more intricate machine learning models, such as
ones that generate or upscale images, those basic ones still form the foundation
of those efforts. Mastering the basics can become a launchpad for much greater
future endeavors.
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 尽管我们已经朝着更复杂的机器学习模型迈进,例如生成或提升图像的模型,但这些基础模型仍然构成了这些努力的基础。掌握基础知识可以成为未来更大事业的跳板。
- en: So, I decided to revisit the basics myself and build a basic machine learning
model with several caveats — it must be somewhat useful, as simplistic as possible,
and return reasonably accurate results.
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: 所以,我决定自己重新审视基础知识,并构建一个具有几个警告的基本机器学习模型——它必须具有一定的实用性,尽可能简单,并返回合理准确的结果。
- en: Unlike many other tutorials on the internet, however, I want to present my entire
thought process from beginning to end. As such, the coding part will begin quite
a bit later as problem selection in both the theoretical and practical realm is
equally important. In the end, I believe that understanding *why* will go further
than *how to*.
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: 然而,与互联网上的许多其他教程不同,我想从头到尾展示我的整个思考过程。因此,编码部分将会稍晚开始,因为理论和实践领域中的问题选择同样重要。最后,我相信理解*为什么*比*如何*更为重要。
- en: Picking the correct problem for ML
id: totrans-13
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 选择适合机器学习的问题
- en: Although machine learning can solve a wide range of challenges, it’s not a one-size-fits-all
  approach. Even if we were to temporarily forget about the financial, temporal,
  and other resource costs, ML models would still be great at some things and terrible
  at others.
id: totrans-14
prefs: []
type: TYPE_NORMAL
zh: 尽管机器学习可以解决许多挑战,但它并不是一种万能的解决方案。即使我们暂时忽略财务、时间和其他资源成本,机器学习模型在某些方面仍然表现出色,而在其他方面则表现糟糕。
- en: Categorization is a great example of where machine learning may shine. Whenever
we deal with real world data (i.e., we’re not dealing with categories created
within the code itself), figuring out all possible rules that define a phenomenon
is nearly impossible.
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: 分类是机器学习可能发挥作用的一个很好的例子。每当我们处理真实世界的数据(即我们不处理代码中创建的类别)时,找出定义现象的所有可能规则几乎是不可能的。
- en: As I’ve written previously, if we were to attempt a rule-based approach to
  categorizing whether an object is a cat or not, we’d quickly run into issues.
  There seems to be no single defining quality that makes any physical object what
  it is: there are cats without tails, fur, or ears, cats with one eye or a different
  number of legs, and so on, but all of them still fall within the same category.
id: totrans-16
prefs: []
type: TYPE_NORMAL
zh: 正如我之前所写的,如果我们尝试使用基于规则的方法来分类一个物体是否是猫,我们会很快遇到问题。似乎没有定义任何物理对象的特征——有些猫没有尾巴、毛发、耳朵、一只眼睛、不同数量的腿等等,但它们仍然都属于同一类别。
- en: Enumerating all of the possible rules and their exceptions is likely impossible;
  perhaps there isn’t even some eternal list, and we make them up as we go. Machine
  learning, in some sense, mimics our thinking by consuming an enormous amount of
  data to make predictions.
id: totrans-17
prefs: []
type: TYPE_NORMAL
zh: 列举所有可能的规则及其例外可能是不可能的,也许甚至没有某种永恒的清单,我们只能在过程中逐步制定。机器学习在某种程度上通过消耗大量数据来进行预测,模仿了我们的思维。
- en: In other words, we should carefully consider the problem we’re trying to solve
before trying to figure out which model would fit best, how much data we’ll need,
and many other things we concern ourselves with once we start the task.
id: totrans-18
prefs: []
type: TYPE_NORMAL
zh: 换句话说,我们应该在尝试确定哪种模型最合适、需要多少数据以及开始任务后关注的其他事项之前,仔细考虑我们要解决的问题。
- en: In search of practical application
id: totrans-19
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 寻求实际应用
- en: Making models that differentiate between dogs and cats is certainly interesting
and fun but unlikely to net any benefit, even if we scale up the operation to
immense levels. Additionally, there have been millions of tutorials for such models
created online.
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: 制作区分狗和猫的模型确实有趣且有趣,但即使我们将操作规模扩大到巨大的程度,也不太可能获得任何好处。此外,已经有数以百万计的此类模型教程在网上创建。
- en: 'I decided to pick word categorization, as it hasn’t been as frequently written
about, and it has some practical application. Our SEO team had an interesting
proposition — they needed to categorize keywords according to three types:'
id: totrans-21
prefs: []
type: TYPE_NORMAL
zh: 我决定选择词汇分类,因为它相对较少被写到,并且具有一定的实际应用。我们的SEO团队提出了一个有趣的提议——他们需要根据三种类型来分类关键词:
- en: '**Informational** — users searching for knowledge about a topic (e.g., “what
is a proxy”)'
id: totrans-22
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '**信息型** — 寻找关于某个主题的知识的用户(例如,“什么是代理”)'
- en: '**Transactional** — users looking for a product or service (e.g., “best proxies”)'
id: totrans-23
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '**交易型** — 寻找产品或服务的用户(例如,“最佳代理”)'
- en: '**Navigational** — users looking for a specific brand or an offshoot of it
  (e.g., “Oxylabs”)'
id: totrans-24
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '**导航型**——用户寻找特定品牌或其分支(例如,“Oxylabs”)'
- en: Categorizing thousands of keywords manually is a bit of a pain. Such a task
seems (almost) perfect for machine learning, although there’s an inherent issue
that is nearly impossible to solve, which I will expand upon later.
id: totrans-25
prefs: []
type: TYPE_NORMAL
zh: 手动分类成千上万的关键词有点麻烦。这样的任务(几乎)完美适合机器学习,尽管存在一个几乎无法解决的固有问题,我将在后面详细说明。
- en: Finally, it made data collection and management a significantly easier task
than it would otherwise have been. SEO specialists use a variety of tools to track
keywords, most of which can export thousands of them into a CSV sheet. All that
needs to be done is to assign categories to the keywords.
id: totrans-26
prefs: []
type: TYPE_NORMAL
zh: 最终,它使数据收集和管理变得比其他情况下要简单得多。SEO 专家使用各种工具来跟踪关键词,其中大多数可以将它们导出到 CSV 表中。只需将类别分配给关键词即可。
- en: Building a pre-MVP
id: totrans-27
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 构建一个预 MVP
- en: Deciding how many data points you’ll need before building a model is nearly
  impossible. It depends somewhat on the stated goal (e.g., more or fewer categories);
  however, calculating the required amount with precision is a fool’s errand. Picking
  a sufficiently large number (e.g., 1000 entries) is a good starting point.
id: totrans-28
prefs: []
type: TYPE_NORMAL
zh: 在建立模型之前决定需要多少数据点几乎是不可能的。虽然有一些依赖于既定目标(即,更多或更少的类别),但精确计算这些数据几乎是不可能的。选择一个足够大的数字(例如,1000
条记录)是一个好的起点。
- en: One thing I’d caution against is working with the entire dataset first. Since
  it’s likely the first time you’re developing a model, a lot of things can
  go wrong. In general, you’re better off writing the code and running it on a small
  sample (e.g., 10% of the total) just to ensure there are no semantic errors or
  any other horrors.
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: 我建议不要一开始就处理整个数据集。由于这是你第一次开发模型,很多事情可能会出错。一般来说,最好先编写代码并在小样本(例如总数据的10%)上运行,以确保没有语义错误或其他问题。
- en: Once you get the desired result, start working with the entire dataset. While
  it’s unlikely that you’ll have to throw out the project entirely, you don’t want
  to end up spending hours of (boring) work and have nothing to show for it.
id: totrans-30
prefs: []
type: TYPE_NORMAL
zh: 一旦你得到所需的结果,就开始处理整个数据集。虽然你可能不会完全放弃项目,但你不希望花费几个小时(枯燥)的工作却没有任何成果。
- en: Regardless, with some samples in hand, we can begin the development experience
properly. I’ve chosen Python as it’s a fairly common language with decent support
for machine learning through its numerous libraries.
id: totrans-31
prefs: []
type: TYPE_NORMAL
zh: 无论如何,有了一些样本,我们可以正式开始开发过程。我选择了 Python,因为它是一种相当常见的语言,并且通过众多库为机器学习提供了不错的支持。
- en: Libraries
id: totrans-32
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 库
- en: '[Pandas](https://pypi.org/project/pandas/). While not strictly necessary, reading
and exporting to CSV is going to make our lives significantly easier.'
id: totrans-33
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[Pandas](https://pypi.org/project/pandas/)。虽然不是绝对必要,但读取和导出 CSV 文件将大大简化我们的工作。'
- en: '[SciKit-Learn](https://pypi.org/project/scikit-learn/). A fairly powerful and
flexible machine learning library, which will form the foundation for our classification
model. We’ll be using various *sklearn* features throughout the tutorial.'
id: totrans-34
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[SciKit-Learn](https://pypi.org/project/scikit-learn/)。这是一个相当强大且灵活的机器学习库,它将成为我们分类模型的基础。在整个教程中,我们将使用各种
*sklearn* 功能。'
- en: '[NLTK](https://pypi.org/project/nltk/) (Natural Language Toolkit). As we’ll
be processing natural language, NLTK does the job perfectly. *Stopwords* will
be absolutely necessary from the package.'
id: totrans-35
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[NLTK](https://pypi.org/project/nltk/)(自然语言工具包)。由于我们将处理自然语言,NLTK 完美地完成了这个任务。*停用词*
是包中绝对必要的内容。'
- en: Imports
id: totrans-36
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 导入
- en: '[PRE0]'
id: totrans-37
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
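The original code block is referenced above only as a placeholder, so here is a minimal sketch of the six imports that the following “Line 1” to “Line 6” notes walk through (pieced together from the descriptions, not copied from the article):

```python
import pandas as pd                                            # Line 1
from sklearn.feature_extraction.text import TfidfVectorizer   # Line 2
from sklearn.linear_model import LogisticRegression           # Line 3
from sklearn.pipeline import Pipeline                         # Line 4
from sklearn.feature_selection import SelectKBest, chi2       # Line 5
from nltk.corpus import stopwords                             # Line 6
```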
- en: Line 1
id: totrans-38
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第1行
- en: Fairly self-explanatory. *Pandas* allows us to read and write CSV and other
spreadsheet files by creating data frames. Since we’ll be dealing with keywords,
most SEO tools export lists of them in CSV, which will reduce the data processing
we need to do manually.
id: totrans-39
prefs: []
type: TYPE_NORMAL
zh: 相当自解释。*Pandas* 允许我们通过创建数据框来读取和写入 CSV 以及其他电子表格文件。由于我们将处理关键词,大多数 SEO 工具会将它们导出为
CSV,这将减少我们需要手动处理的数据。
- en: Line 2
id: totrans-40
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第2行
- en: From the SciKit-Learn library, we’ll pick up several things, *TfidfVectorizer*
being our first choice.
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: 从 SciKit-Learn 库中,我们将挑选几个东西,*TfidfVectorizer* 是我们的首选。
- en: Vectorizers convert our strings into feature vectors, which results in two important
  changes. First, strings are converted into numerical representations: each unique
  term is assigned an index, and each document is then turned into a vector (a row
  of the resulting matrix).
id: totrans-42
prefs: []
type: TYPE_NORMAL
zh: 向量化器将我们的字符串转换为特征向量,这会导致两个重要的变化。首先,字符串被转换为数值表示。每个唯一的字符串被转换为一个索引,然后转化为向量(矩阵的衍生物)。
- en: '**Sentence #1**: “The dog is brown.”'
id: totrans-43
prefs: []
type: TYPE_NORMAL
zh: '**句子 #1**:“狗是棕色的。”'
- en: '**Sentence #2**: “The dog is black.”'
id: totrans-44
prefs: []
type: TYPE_NORMAL
zh: '**句子 #2**:“狗是黑色的。”'
- en: 'Vectorization would take both sentences and create an index of:'
id: totrans-45
prefs: []
type: TYPE_NORMAL
zh: 向量化将处理这两个句子并创建一个索引:
- en: '[PRE1]'
id: totrans-46
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
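Since the index itself is only shown as a placeholder above, here is a small sketch of how a vectorizer builds that index for the two sentences (CountVectorizer is used here purely for readability; the article’s model uses TfidfVectorizer, and the exact integer assignments may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The dog is brown.", "The dog is black."]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(sentences)

# Each unique (lowercased) term gets an index...
print(vectorizer.vocabulary_)   # e.g. {'the': 4, 'dog': 2, 'is': 3, 'brown': 1, 'black': 0}

# ...and each sentence becomes a vector of term counts over that index.
print(matrix.toarray())
# [[0 1 1 1 1]
#  [1 0 1 1 1]]
```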
- en: Outside of turning strings into numerical values, vectorization also optimizes
data processing. Instead of having to go through identical strings several times,
the same index is used akin to compressing files.
id: totrans-47
prefs: []
type: TYPE_NORMAL
zh: 除了将字符串转换为数值外,向量化还优化了数据处理。与其多次处理相同的字符串,不如使用相同的索引,类似于文件压缩。
- en: Finally, TF-IDF (term frequency-inverse document frequency) is one of the ways
  to weigh term importance across documents. In simple terms, it scores each term
  by how often it appears within a document (relative to the document’s length) and
  discounts that score by how common the term is across all documents. As a result,
  words that appear frequently in a document but rarely elsewhere are considered more important.
id: totrans-48
prefs: []
type: TYPE_NORMAL
zh: 最后,TFIDF(词频-逆文档频率)是衡量文档中词语重要性的一种方法。简单来说,它对每个词进行处理,评估其频率与文档长度的比值,并分配一个加权值。因此,重复出现的词语被认为更重要。
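For reference, the classic formulation of the weight is shown below; scikit-learn’s implementation adds smoothing and normalization on top of this, so treat it as the idea rather than the library’s exact formula:

```latex
% tf(t, d): frequency of term t in document d
% N: number of documents, df(t): number of documents containing t
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}
```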
- en: Line 3
id: totrans-49
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第三行
- en: '*LogisticRegression* is one of the ways to discover relationships between variables.
  Since our task is a classic example of classification, logistic regression works
  perfectly, as it takes some input variable *x* (keyword) and assigns it a value
  of *y* (informational/transactional/navigational).'
id: totrans-50
prefs: []
type: TYPE_NORMAL
zh: '*LogisticRegression* 是发现变量之间关系的一种方法。由于我们的任务是经典的分类问题,逻辑回归非常适合,因为它接受某些输入变量 *x*(关键字),并将其分配一个值
*y*(信息性/交易性/导航性)。'
- en: There are other options, such as *LinearSVC,* which involves significantly more
complicated mathematics. In extremely simplistic terms, SVC takes several clusters
of data points and finds the values in each that are closest to the opposing cluster(s).
These are called support vectors.
id: totrans-51
prefs: []
type: TYPE_NORMAL
zh: 还有其他选项,例如 *LinearSVC*,它涉及到更复杂的数学运算。极其简单地说,SVC 会对多个数据点簇进行处理,找到每个簇中最接近对方簇的值。这些值称为支持向量。
- en: A hyperplane (i.e., an *n-dimensional* geometrical object in an *n+1-dimensional*
  space) is drawn in such a way that the distance between it and each support vector
  is maximized.
id: totrans-52
prefs: []
type: TYPE_NORMAL
zh: 一个超平面(即在 *n+1维* 空间中的 *n维* 几何对象)被绘制成使其与每个支持向量的距离最大化。
- en: '![](../Images/eb722b44930e8f5e8c3e04443201fa5c.png)'
id: totrans-53
prefs: []
type: TYPE_IMG
zh: '![](../Images/eb722b44930e8f5e8c3e04443201fa5c.png)'
- en: Image by author
id: totrans-54
prefs: []
type: TYPE_NORMAL
zh: 作者提供的图片
- en: '[There is research suggesting that using Support Vector Machines](https://link.springer.com/chapter/10.1007/BFb0026683)
  might produce better results in text classification; however, those gains mostly
  appear on significantly more complicated tasks. These advantages aren’t entirely
  relevant in our case, as they surface when feature counts reach inordinately
  high numbers, so logistic regression should work just fine.'
id: totrans-55
prefs: []
type: TYPE_NORMAL
zh: '[已有研究表明使用支持向量机](https://link.springer.com/chapter/10.1007/BFb0026683) 可能在文本分类中产生更好的结果,但这可能是由于任务的复杂性显著增加。这些优势在我们的情况下并不完全相关,因为它们在特征数量达到极高水平时才会显现,因此线性回归应该也能很好地工作。'
- en: Line 4
id: totrans-56
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第四行
- en: '*Pipeline* is a flexible machine learning tool that lets you create an object
that assembles several steps of the entire process into one. It has numerous benefits
— from helping you write neater code to preventing data leakage.'
id: totrans-57
prefs: []
type: TYPE_NORMAL
zh: '*Pipeline* 是一个灵活的机器学习工具,它让你创建一个对象,将整个过程的多个步骤组合成一个。它有许多好处——从帮助你编写更整洁的代码到防止数据泄漏。'
- en: Line 5
id: totrans-58
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第五行
- en: While not entirely necessary in our case, *SelectKBest* and *chi2* help optimize
models by improving accuracy and reducing training time. *SelectKBest* allows
us to set a maximum number of features that are used.
id: totrans-59
prefs: []
type: TYPE_NORMAL
zh: 虽然在我们的情况下并不是绝对必要的,*SelectKBest* 和 *chi2* 通过提高准确性和减少训练时间来优化模型。*SelectKBest* 允许我们设置最大特征数量。
- en: '*Chi2* (or *chi-squared*) is a statistical test for the independence of variables
that helps us select the best features (hence, *SelectKBest)* for training:'
id: totrans-60
prefs: []
type: TYPE_NORMAL
zh: '*Chi2*(或*卡方检验*)是一种用于变量独立性的统计测试,有助于我们选择最佳特征(因此,*SelectKBest*)进行训练:'
- en: '![](../Images/706e58897a99f3e0263bb1a57a7447b6.png)'
id: totrans-61
prefs: []
type: TYPE_IMG
zh: '![](../Images/706e58897a99f3e0263bb1a57a7447b6.png)'
- en: Image by author.
id: totrans-62
prefs: []
type: TYPE_NORMAL
zh: 作者提供的图片。
- en: '[PRE2]'
id: totrans-63
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
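The test statistic behind the placeholder above is the standard chi-squared sum, where O_i are the observed counts and E_i the counts expected under the independence assumption:

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```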
- en: Expected values are calculated by assuming the null hypothesis (that the variables
  are independent). These are then compared against our observed values. If the observed
  values deviate by a significant margin from the expected ones, we can reject the
  null hypothesis and accept that the variables are dependent.
id: totrans-64
prefs: []
type: TYPE_NORMAL
zh: 期望值是通过接受原假设(变量独立)来计算的。这些值然后与我们的观测值进行对比。如果观测值与期望值有显著偏差,我们可以拒绝原假设,这迫使我们接受变量之间的依赖关系。
- en: If variables are dependent, they are acceptable for the machine learning model
as that’s exactly what we’re looking for — relations between objects. In turn,
*SelectKBest* takes all *chi2* results and selects those that have the strongest
relationships.
id: totrans-65
prefs: []
type: TYPE_NORMAL
zh: 如果变量是相关的,它们对于机器学习模型是可接受的,因为这正是我们所寻找的 —— 对象之间的关系。反过来,*SelectKBest* 获取所有 *chi2*
结果,并选择那些具有最强关系的结果。
- en: In our case, since our number of features is relatively small, *SelectKBest*
might not bring the optimization we’d be interested in, but it becomes essential
once the numbers start rising.
id: totrans-66
prefs: []
type: TYPE_NORMAL
zh: 在我们的情况下,由于特征数量相对较少,*SelectKBest* 可能无法带来我们感兴趣的优化,但一旦数量开始增加,它就变得至关重要。
- en: Line 6
id: totrans-67
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 行 6
- en: Our final import is from NLTK, which we will only use for the *stopwords* list.
Unfortunately, the default list isn’t suitable for our task at hand. Most such
lists include words like “how,” “what,” “why,” and many others that, while useless
in regular categorization, indicate search intent.
id: totrans-68
prefs: []
type: TYPE_NORMAL
zh: 我们最终的导入来自 NLTK,我们将仅将其用于 *stopwords* 列表。不幸的是,默认列表不适合我们当前的任务。大多数这样的列表包含像“how”,“what”,“why”等词,这些词在常规分类中无用,但能指示搜索意图。
- en: In fact, there’s a case to be made that these words are more important than
  the rest of a keyword like “how to build a web scraper.” Since we’re interested
  in the category of the query rather than any other value, these *stopwords* give
  us the best shot at deciding what it might be.
id: totrans-69
prefs: []
type: TYPE_NORMAL
zh: 事实上,可以说这些词比“如何构建网页抓取器”这样的关键词中的任何剩余词更重要。由于我们对句子的类别感兴趣而非其他值,*stopwords* 是决定它可能是什么的最佳途径。
- en: As such, removing some of the entries from the stopwords list is vital. Luckily,
NLTK stopwords are just text files which you can edit with any word processor.
id: totrans-70
prefs: []
type: TYPE_NORMAL
zh: 因此,删除一些停用词列表中的条目是至关重要的。幸运的是,NLTK 的停用词只是文本文件,你可以使用任何文字处理器进行编辑。
- en: '[NLTK downloads are stored in user directories by default](https://sites.pitt.edu/~naraehan/python3/faq.html#Q-where-nltk-data)
  but can be changed if necessary through the *download_dir* parameter.'
id: totrans-71
prefs: []
type: TYPE_NORMAL
zh: '[NLTK 下载默认存储在用户目录](https://sites.pitt.edu/~naraehan/python3/faq.html#Q-where-nltk-data)中,但可以通过使用
*download_dir=* 进行更改(如果需要的话)。'
- en: Dataframes and stopwords
id: totrans-72
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 数据框和停用词
- en: All machine learning models begin with data preparation and processing. Since
we’re working with SEO keywords, these can be easily exported through CSV from
popular tools that measure performance.
id: totrans-73
prefs: []
type: TYPE_NORMAL
zh: 所有机器学习模型都从数据准备和处理开始。由于我们处理的是 SEO 关键词,这些关键词可以通过流行的性能测量工具轻松导出为 CSV。
- en: There is something to be said about picking a random sample that should include
close to equal amounts of our categories. As we’re producing a pre-MVP, that shouldn’t
be a concern, as data can be added as we go if the model delivers the results
we need.
id: totrans-74
prefs: []
type: TYPE_NORMAL
zh: 选择一个随机样本,其中应包括接近相等数量的各类别,这一点是值得注意的。由于我们正在制作一个前期 MVP,这不应该成为问题,因为如果模型提供了我们需要的结果,可以随时添加数据。
- en: Before proceeding onwards, it would be wise to select a few dozen keywords out
of a CSV file and label them. Once we get to a working model, we can label the
rest. Since *Pandas* creates data frames in a tabular format, the easiest way
is to simply add a new column, “Category” or “Label,” and assign each keyword
row with *Informational, Transactional, or Navigational.*
id: totrans-75
prefs: []
type: TYPE_NORMAL
zh: 在继续之前,明智的做法是从 CSV 文件中选择几打关键词并进行标注。一旦我们得到一个有效的模型,就可以标注其余的。由于 *Pandas* 以表格格式创建数据框,最简单的方法是添加一个新列,“Category”
或 “Label”,并将每个关键词行标记为 *Informational, Transactional, or Navigational*。
- en: '[PRE3]'
id: totrans-76
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
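As the original block is only a placeholder, here is a minimal sketch of the three lines discussed below; the file name “keywords.csv” is a stand-in for whatever your SEO tool exports, and “english_adjusted” is the edited copy of NLTK’s stopword list described in a moment:

```python
import pandas as pd
from nltk.corpus import stopwords

# Lines 1-2: read the labeled keyword export and build a dataframe from it.
keyword_list = pd.read_csv("keywords.csv")
train_df = pd.DataFrame(keyword_list)

# Line 3: load the edited copy of NLTK's English stopword list.
stop_words = stopwords.words("english_adjusted")
```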
- en: Line 1 & 2
id: totrans-77
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 行 1 和 2
- en: Whenever we have a CSV of any sort, *Pandas* requires us to create a dataframe.
First, we’ll read the keyword list supplied by SEO tools. Remember that the CSV
files should already have some keyword categorization involved, otherwise there
will be nothing to train the model on.
id: totrans-78
prefs: []
type: TYPE_NORMAL
zh: 每当我们有任何形式的 CSV 时,*Pandas* 要求我们创建一个数据框。首先,我们将读取由 SEO 工具提供的关键词列表。请记住,CSV 文件应该已经包含一些关键词分类,否则将没有东西可以用于训练模型。
- en: After reading the file, we create a dataframe object from our CSV.
id: totrans-79
prefs: []
type: TYPE_NORMAL
zh: 阅读文件后,我们从 CSV 文件创建一个数据框对象。
- en: Line 3
id: totrans-80
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 行 3
- en: We’ll use NLTK to grab the stopwords file, however, we can’t use it as it is.
NLTK’s default includes many words we consider essential for keyword categorization
(e.g., “what,” “how,” “where,” etc.). As such, it will have to be adjusted to
fit our purposes.
id: totrans-81
prefs: []
type: TYPE_NORMAL
zh: 我们将使用 NLTK 获取停用词文件,不过我们不能直接使用它。NLTK 的默认列表包含许多我们认为对关键词分类至关重要的词(例如,“what”,“how”,“where”等)。因此,它必须调整以适应我们的目的。
- en: While there are no hard and fast rules in such a case, indefinite and definite
  articles (e.g., “a,” “an,” “the”) can stay in the stopwords list, as they provide
  no information about intent. Everything that could potentially show user intention,
  however, will have to be removed from the default file.
id: totrans-82
prefs: []
type: TYPE_NORMAL
zh: 虽然在这种情况下没有硬性规定,但不定冠词和定冠词可以保留(例如,“a”,“an”,“the”等),因为它们不提供信息。然而,所有可能显示用户意图的内容都必须从默认文件中删除。
- en: I created a copy called ‘english_adjusted’ to make things easier for myself.
Additionally, in case I need the original version for whatever reason, it will
always be available without redownload.
id: totrans-83
prefs: []
type: TYPE_NORMAL
zh: 我创建了一个名为‘english_adjusted’的副本,以便于操作。此外,以防万一我需要原始版本,它将始终可用,无需重新下载。
- en: Finally, you’ll likely need to run NLTK once with the regular parameter ‘english’
to download the files, which can be done at any stage. Otherwise, you’ll receive
an error.
id: totrans-84
prefs: []
type: TYPE_NORMAL
zh: 最后,你可能需要运行一次NLTK,使用常规参数‘english’来下载文件,这可以在任何阶段完成。否则,你会收到错误。
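That one-time download is a single call; the custom directory shown in the comment is optional and purely illustrative:

```python
import nltk

# Fetch the stopwords corpus once; afterwards it is read from disk.
nltk.download("stopwords")

# Optionally place the data somewhere else (hypothetical path):
# nltk.download("stopwords", download_dir="/opt/nltk_data")
```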
- en: Setting up the pipeline
id: totrans-85
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 设置管道
- en: After all of these preparatory steps, we finally get to move on to the actual
machine learning bit. These are the most important bits and pieces of the model.
It’s likely that you’ll spend quite a bit of time tinkering with these parameters
to find out the best options.
id: totrans-86
prefs: []
type: TYPE_NORMAL
zh: 在所有这些准备步骤之后,我们终于可以进入实际的机器学习部分。这些是模型中最重要的部分。你可能会花费相当多的时间调整这些参数,以找到最佳选项。
- en: Unfortunately, there isn’t a lot of guidance that would apply in all cases.
Some experimentation and reasoning will be required to reduce the amount of testing
that’s needed, but eliminating it completely is impossible.
id: totrans-87
prefs: []
type: TYPE_NORMAL
zh: 不幸的是,没有很多指导方针适用于所有情况。需要进行一些实验和推理,以减少所需的测试量,但完全消除测试是不可能的。
- en: '[PRE4]'
id: totrans-88
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
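Pieced together from the parameter discussion below, the pipeline might look roughly like this; the step names, the C value, and max_iter are placeholders to tune rather than the article’s exact figures:

```python
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

stop_words = stopwords.words("english_adjusted")  # edited list from earlier

pipeline = Pipeline([
    # Turn keywords into weighted n-gram features (unigrams up to trigrams).
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 3), stop_words=stop_words)),
    # Keep the features most dependent on the label (here, all of them).
    ("feature_selection", SelectKBest(chi2, k="all")),
    # Classify into informational / transactional / navigational.
    ("classifier", LogisticRegression(C=1.0, penalty="l2", max_iter=1000)),
])
```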
- en: Some may notice that I’m not splitting the dataset into train and test sets
  through *scikit-learn*. Again, that is a luxury afforded by the nature of the problem.
  SEO tools can export thousands of (unlabeled) keywords in less than a minute,
  meaning you can procure a test set separately without much effort.
id: totrans-89
prefs: []
type: TYPE_NORMAL
zh: 有些人可能会注意到我没有通过*scikit-learn*将数据集拆分为训练集和测试集。这是问题的性质所赋予的奢侈。SEO工具可以在不到一分钟的时间内导出数千个(未标记的)关键字,这意味着你可以单独采购测试集而不费吹灰之力。
- en: So, due to optimization reasons, I’ll be simply using a second dataset that
has no labels as our testing grounds. Since, however, the *train_test_split* is
so ubiquitous, I’ll show a version of the same model using it in the addendum
at the bottom of the article.
id: totrans-90
prefs: []
type: TYPE_NORMAL
zh: 因此,出于优化原因,我将使用没有标签的第二个数据集作为我们的测试基础。然而,由于*train_test_split* 非常普遍,我将在文章末尾的附录中展示一个使用它的相同模型版本。
- en: Line 1
id: totrans-91
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第1行
- en: Pipeline allows us to truncate and simplify long processes into a single object,
making it a lot easier to work with the settings of the model. It will also reduce
the likelihood of making errors.
id: totrans-92
prefs: []
type: TYPE_NORMAL
zh: 管道允许我们将长时间的过程简化为一个对象,使处理模型设置变得更加容易。它还将减少出错的可能性。
- en: We’ll start by defining our vectorizer. I’ve noted above that we’ll be using
*TFIDFVectorizer* as it produces better results due to the way it weighs words
found in documents. *CountVectorizer* is an option, however, you’d have to import
it, and the results may vary.
id: totrans-93
prefs: []
type: TYPE_NORMAL
zh: 我们将从定义我们的向量化器开始。我在上面提到过我们将使用*TFIDFVectorizer*,因为它根据文档中单词的权重来产生更好的结果。*CountVectorizer*
是一个选项,但你需要导入它,结果可能会有所不同。
- en: '*Ngram_range* is an interesting reasoning challenge. To get the best results,
  you have to decide how many tokens (in our case, words) should be combined into
  a single feature. An *ngram_range* of (1, 1) takes only single words (unigrams),
  (1, 2) takes single words plus pairs of adjacent words (bigrams), and (1, 3) adds
  runs of three adjacent words (trigrams) as well.'
id: totrans-94
prefs: []
type: TYPE_NORMAL
zh: '*Ngram_range* 是一个有趣的推理挑战。为了获得最佳结果,你必须决定要计算多少个词元(在我们的情况下是单词)。*Ngram_range* 为
(1, 1) 会计算单个词(单词),(1, 2) 会计算单个词和两个相邻的词(双词组)的组合,(1, 3) 会计算单个词、两个词和三个词(三词组)的组合。'
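To make those ranges concrete, here is what the vectorizer’s analyzer emits for a short query once stopwords such as “to” are removed; the commented output is what I would expect, not something taken from the article:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(1, 3): emit unigrams, bigrams, and trigrams as features.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
analyze = vectorizer.build_analyzer()

print(analyze("how get proxies"))
# ['how', 'get', 'proxies', 'how get', 'get proxies', 'how get proxies']
```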
- en: I chose *ngram_range(1, 3)* for several reasons. First, since the model is relatively
simple and performance is not an issue, I can afford to run a larger range of
ngrams, so the lower bound can be set to be minimal.
id: totrans-95
prefs: []
type: TYPE_NORMAL
zh: 我选择了*ngram_range(1, 3)*,有几个原因。首先,由于模型相对简单,性能不是问题,我可以运行更大范围的n-gram,因此下限可以设置为最小。
- en: On the other hand, once we remove stopwords, we should think about what ngram
upper end would be enough to glean meaning from the keywords. If possible, I find
it easier to pick the hardest and easiest examples out of the dataset. In our
case, the easiest examples are any question (“how to get proxies”), and the hardest
are nouns (“web scraper”) or names (“Oxylabs”)
id: totrans-96
prefs: []
type: TYPE_NORMAL
zh: 另一方面,一旦我们去除停用词,我们应该考虑什么样的 ngram 上限足以从关键词中提取意义。如果可能,我发现从数据集中选择最难和最简单的例子更容易。在我们的情况下,最简单的例子是任何问题(“如何获取代理”),最难的是名词(“网络爬虫”)或名称(“Oxylabs”)。
- en: Since we’ll be removing words like “to”, we get a trigram in question cases
(“how get proxies”), which is completely clear. In fact, you could make the argument
that a bigram (“how get”) is enough as the intention is still clear.
id: totrans-97
prefs: []
type: TYPE_NORMAL
zh: 由于我们将移除像“to”这样的词,我们会在问题案例中得到三元组(“how get proxies”),这是完全清晰的。事实上,你可以认为二元组(“how
get”)也足够,因为意图仍然清晰。
- en: Hardest examples, however, will usually be shorter than a trigram as the ease
of understanding search intent correlates with query length. Therefore, *ngram_range
(1, 3)* should strike a decent balance for performance and accuracy.
id: totrans-98
prefs: []
type: TYPE_NORMAL
zh: 然而,最难的例子通常会比三元组短,因为理解搜索意图的难易程度与查询长度相关。因此,*ngram_range (1, 3)* 应该在性能和准确性之间取得一个不错的平衡。
- en: 'Finally, there’s an argument to be made for *sublinear_tf*, which is a modification
of the regular TF-IDF calculations. If set to *True,* weight is calculated through
a logarithmic function: *1 + log(tf)*. In other words, term frequency gains diminishing
returns.'
id: totrans-99
prefs: []
type: TYPE_NORMAL
zh: 最后,对于 *sublinear_tf* 有一个论点,即它是常规 TF-IDF 计算的一个修改。如果设置为 *True*,权重通过对数函数计算:*1 +
log(tf)*。换句话说,词频会获得递减的回报。
- en: With *sublinear_tf,* words that appear frequently and in many documents would
not be weighed as heavily. Since we have a collection of somewhat random keywords,
we never know which ones get preferential treatment, however, these could often
be terms such as “how,” “what,” etc., which are ones we’d like to be weighed heavily.
id: totrans-100
prefs: []
type: TYPE_NORMAL
zh: 使用 *sublinear_tf* 时,频繁出现且出现在多个文档中的词语不会被赋予过重的权重。由于我们有一组相对随机的关键词,我们无法知道哪些会得到优待,但这些通常是像“how”,“what”等我们希望被赋予较重权重的词。
- en: Throughout testing, I found that the model performed better without *sublinear_tf*,
but I recommend tinkering a bit to see whether it would grant any benefits.
id: totrans-101
prefs: []
type: TYPE_NORMAL
zh: 在测试过程中,我发现模型在没有 *sublinear_tf* 的情况下表现更好,但我建议稍微调整一下,看看是否会带来任何好处。
- en: The *Stopwords* parameter is, by now, self-explanatory, as we’ve discussed it previously.
id: totrans-102
prefs: []
type: TYPE_NORMAL
zh: '*Stopwords* 参数现在已经不言自明,因为我们之前已经讨论过了。'
- en: Line 2
id: totrans-103
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第 2 行
- en: While not technically a new line, I’ll be separating these out for clarity and
brevity purposes. We’ll be now invoking *SelectKBest*, which I’ve written fairly
extensively about above. Our point of interest is the *k* value.
id: totrans-104
prefs: []
type: TYPE_NORMAL
zh: 虽然不严格来说是新的一行,但我将为清晰和简洁的目的将其分开。我们现在将调用 *SelectKBest*,我在上面已经对其进行了相当详细的描述。我们的关注点是
*k* 值。
- en: These will be different, depending on the size of your dataset. *SelectKBest*
is intended to optimize performance and accuracy. In my case, sending in ‘all’
works, but you’ll usually have to pick some large enough *N* that matches your
own dataset.
id: totrans-105
prefs: []
type: TYPE_NORMAL
zh: 这些会有所不同,具体取决于你的数据集的大小。*SelectKBest* 旨在优化性能和准确性。在我的情况下,使用“all”是有效的,但你通常需要选择一个足够大的
*N* 来匹配你自己的数据集。
- en: Line 3
id: totrans-106
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第 3 行
- en: Finally, we get to the method that will be used for the model. *LogisticRegression*
is our choice, as mentioned previously, but there’s a lot of tinkering to be done
with the parameters.
id: totrans-107
prefs: []
type: TYPE_NORMAL
zh: 最后,我们来到将用于模型的方法。*LogisticRegression* 是我们的选择,如前所述,但需要对参数进行大量的调整。
- en: The *C* value is a hyperparameter, which is a parameter that tells the model
  how its own parameters should be picked. Hyperparameters can be complicated to
  set and have a tremendous impact on the end results.
id: totrans-108
prefs: []
type: TYPE_NORMAL
zh: “C”值是一个超参数,它告诉模型应该选择哪些参数。超参数是模型中非常复杂的部分,对最终结果有着巨大的影响。
- en: In extremely simple terms, the *C* value is the trust score for your training
data. A high *C* value means that a higher weight, when fitting, will be placed
on training data and a lower weight on penalties. Low C values place higher emphasis
on penalties and lower weight on training data.
id: totrans-109
prefs: []
type: TYPE_NORMAL
zh: 从极其简单的角度来看,*C* 值是你训练数据的信任分数。较高的 *C* 值意味着在拟合时,对训练数据的权重会较高,而对惩罚的权重较低。较低的 C 值则将更多强调惩罚,训练数据的权重较低。
- en: There should always be some penalty in place as training will never fully represent
real world values (due to being a small subset of it, regardless of how much you
collect). Additionally, having outliers and not penalizing them means the model
[will inch closer to being overfit](https://medium.com/p/7aeef64755d2).
id: totrans-110
prefs: []
type: TYPE_NORMAL
zh: 应始终存在一定的惩罚,因为训练永远无法完全代表现实世界的值(因为它只是一个小的子集,无论你收集多少)。此外,如果存在异常值而不进行惩罚,模型[将会越来越贴近过拟合](https://medium.com/p/7aeef64755d2)。
- en: The *penalty* parameter sets the type of regularization penalty applied when
  fitting. There are three types of penalties offered by *SciKit-Learn*: *‘l1’*, *‘l2’*,
  and *‘elasticnet’*. ‘*None*’ is also an option, but it should be used sparingly,
  if ever.
id: totrans-111
prefs: []
type: TYPE_NORMAL
zh: '*penalty* 参数是用于超参数的操作。*SciKit-Learn* 提供了三种类型的惩罚——*‘l1’*、*‘l2’*和*‘elasticnet’*。‘*None*’也是一个选项,但如果使用的话应该尽量少。'
- en: ‘*L1*’ penalizes the sum of the absolute values of all coefficients. In simple terms,
  it pulls all coefficients towards zero. If large penalties are applied,
  some coefficients can become exactly zero (i.e., the corresponding features are eliminated).
id: totrans-112
prefs: []
type: TYPE_NORMAL
zh: ‘*L1*’ 是所有系数的绝对值之和。简单来说,它将所有系数拉向某个中心点。如果施加了大的惩罚,一些数据点可能会变成零(即被消除)。
- en: ‘*L1*’ should be used in cases where there is either multicollinearity (several
  variables are correlated) or when you want to simplify the model. Since *l1* eliminates
  some features, models nearly always become simpler. It doesn’t work as well,
  however, when you already have a relatively simple set of features.
id: totrans-113
prefs: []
type: TYPE_NORMAL
zh: 在存在多重共线性(多个变量相关)或需要简化模型的情况下,应该使用‘*L1*’。由于*L1*会消除一些数据点,因此模型几乎总是变得更简单。然而,当数据点的分布已经相对简单时,它的效果不如预期。
- en: ‘*L2*’ is a different version of a similar process. Instead of the sum of absolute
  values, it penalizes the sum of the squares of all coefficient values. As such, all
  coefficients are shrunk towards zero, but none are eliminated. ‘*L2*’ is the default
  setting as it’s the most flexible and rarely causes issues.
id: totrans-114
prefs: []
type: TYPE_NORMAL
zh: ‘*L2*’ 是类似过程的不同版本。它不是绝对和,而是所有系数值的平方和。因此,所有系数都按相同的值缩小,但没有被消除。‘*L2*’ 是默认设置,因为它最灵活且很少引发问题。
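As a rough sketch (ignoring scikit-learn’s exact scaling and the intercept), the two penalties and the regularized objective look like this, with C controlling how heavily the fit to the training data outweighs the penalty:

```latex
\ell_1(w) = \sum_i |w_i|, \qquad \ell_2(w) = \sum_i w_i^2, \qquad
\min_w \; \mathrm{logloss}(w) + \frac{1}{C}\,\mathrm{penalty}(w)
```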
- en: ‘*Elasticnet*’ is a combination of both of the above methods. There [has been
quite an extensive commentary](https://stats.stackexchange.com/questions/184029/what-is-elastic-net-regularization-and-how-does-it-solve-the-drawbacks-of-ridge)
written on whether ‘*elasticnet*’ should be the default approach, however, not
all solvers support it. In our case, we’d need to switch to the “saga” solver,
which is intended for large datasets.
id: totrans-115
prefs: []
type: TYPE_NORMAL
zh: ‘*Elasticnet*’ 是上述两种方法的结合。关于是否应该将‘*elasticnet*’作为默认方法,[已有相当广泛的评论](https://stats.stackexchange.com/questions/184029/what-is-elastic-net-regularization-and-how-does-it-solve-the-drawbacks-of-ridge),然而,并不是所有的求解器都支持它。在我们的情况下,我们需要切换到“saga”求解器,它是为大型数据集设计的。
- en: There would likely be little benefit to using ‘*elasticnet*’ in a tutorial-level
machine learning model. Just keep in mind that it may be beneficial in the future.
id: totrans-116
prefs: []
type: TYPE_NORMAL
zh: 在教程级别的机器学习模型中使用‘*elasticnet*’可能收益甚微。只需记住,将来它可能会有益。
- en: Moving on to *‘max_iter’*, this parameter sets the maximum number of iterations
  the solver will perform while trying to converge. In simple terms, convergence is
  the point at which further iterations would no longer meaningfully change the solution,
  and it serves as the stopping point.
id: totrans-117
prefs: []
type: TYPE_NORMAL
zh: 继续讨论*‘max_iter’*,该参数将设置模型在收敛之前执行的最大迭代次数。简单来说,收敛是指进一步迭代不太可能发生的点,作为停止点。
- en: Higher values increase computational complexity but may result in better overall
behavior. In cases where the datasets are relatively simplistic, *‘max_iter’*
can be set to thousands and above as it won’t be too taxing on the system.
id: totrans-118
prefs: []
type: TYPE_NORMAL
zh: 较高的值会增加计算复杂性,但可能会导致更好的整体表现。在数据集相对简单的情况下,*‘max_iter’* 可以设置为数千及以上,因为这对系统的负担不会太大。
- en: If the values are too low and convergence fails, a warning message will be displayed.
As such, it’s not that difficult to find the lowest possible value and to work
up from there.
id: totrans-119
prefs: []
type: TYPE_NORMAL
zh: 如果值过低且收敛失败,将显示警告消息。因此,找到最低可能的值并从中开始并不困难。
- en: Fitting the model and outputting data
id: totrans-120
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 拟合模型并输出数据
- en: We’re nearing the end of the tutorial as we finally get to fitting the model
and receiving the output.
id: totrans-121
prefs: []
type: TYPE_NORMAL
zh: 我们接近教程的结束,最终进入模型拟合和接收输出的阶段。
- en: '[PRE5]'
id: totrans-122
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
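Again reconstructed from the explanation below; the column names (“Keyword”, “Label”, “type”) and file names are placeholders, and “pipeline” plus “train_df” come from the earlier sketches:

```python
import pandas as pd

# Lines 1-3: fit the pipeline on the labeled training data.
model = pipeline.fit(train_df["Keyword"], train_df["Label"])

# Lines 4-8: load an unlabeled keyword export, predict a category for each
# keyword, and write the result to the local directory.
predict_df = pd.DataFrame(pd.read_csv("unlabeled_keywords.csv"))
predict_df["type"] = model.predict(predict_df["Keyword"])
predict_df.to_csv("output.csv", index=False)

# Optional sanity check on a labeled hold-out set (mean accuracy):
# print(model.score(test_df["Keyword"], test_df["Label"]))
```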
- en: Line 1–3
id: totrans-123
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第 1-3 行
- en: Within line 1, we use our established pipeline to fit the model to the training
data. In case some debugging or additional analysis is needed, the pipeline enables
us to create named steps, which can be called later on.
id: totrans-124
prefs: []
type: TYPE_NORMAL
zh: 在第1行中,我们使用我们建立的管道将模型拟合到训练数据中。如果需要进行调试或额外的分析,管道允许我们创建命名的步骤,这些步骤可以在后续调用。
- en: Line 4–8
id: totrans-125
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第4–8行
- en: We create another dataframe from a CSV file that holds only the keywords. We
will be using our newly created model to predict each keyword and its category.
id: totrans-126
prefs: []
type: TYPE_NORMAL
zh: 我们从一个只包含关键词的CSV文件中创建另一个数据框。我们将使用新创建的模型来预测每个关键词及其类别。
- en: Since our dataframe contains only keywords, we add a new column “type” and run
*model.predict* to provide us with an output.
id: totrans-127
prefs: []
type: TYPE_NORMAL
zh: 由于我们的数据框仅包含关键词,我们添加了一个新的列“类型”,并运行*model.predict*以提供输出结果。
- en: Finally, all of it is moved to an output CSV file, which will be created in
the local directory. Usually, you’d like to set some destination, but for testing
purposes, there’s often no need to do so.
id: totrans-128
prefs: []
type: TYPE_NORMAL
zh: 最终,所有结果被移动到一个输出的CSV文件中,该文件将在本地目录中创建。通常,你会想设置一些目标,但为了测试目的,通常没有必要这样做。
- en: There’s a commented-out line that I’d like to mention that calls the *score*
function. *SciKit* provides us with numerous ways to estimate the predictive power
of our model. These shouldn’t be understood as gospel, as predicted accuracy and
real world accuracy can often diverge.
id: totrans-129
prefs: []
type: TYPE_NORMAL
zh: 有一行被注释掉的代码我想提一下,它调用了*score*函数。*SciKit*为我们提供了多种方法来估计模型的预测能力。这些方法不应被视为绝对真理,因为预测准确度与实际准确度通常可能有所偏差。
- en: Scores, however, are useful as a rule of thumb and as a quick way to evaluate
  whether the parameters have had some influence on the model. While there are plenty
  of scoring methods, the basic *model.score* for a classifier returns mean accuracy,
  which is helpful in most cases whenever we’re tuning parameters.
id: totrans-130
prefs: []
type: TYPE_NORMAL
zh: 然而,得分作为经验法则和快速评估参数对模型的影响是有用的。虽然有很多评分方法,但基本的*model.score*使用*R平方*,在调整参数时通常很有帮助。
- en: Examining the results
id: totrans-131
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 结果分析
- en: My training data had a mere 1300 entries with three distinct categories, which
I have mentioned above. Even with such a small set, the model managed to arrive
at a decent accuracy score of about 80%.
id: totrans-132
prefs: []
type: TYPE_NORMAL
zh: 我的训练数据仅有1300条条目,包含三种不同的类别,如上所述。即使在这样的小数据集中,模型仍然达到了约80%的不错准确度。
- en: Some of these, as one would expect, are debatable, and even Google thinks so.
  For example, “web scrape” was a keyword frequently searched for. There’s no clear
  indication of whether the query is transactional or informational. Google’s SERPs
  reflect the same ambiguity, with both product pages and informational articles
  among the top 5 results.
id: totrans-133
prefs: []
type: TYPE_NORMAL
zh: 其中一些,如预期的那样,是有争议的,甚至Google也这么认为。例如,“网页抓取”是一个经常被搜索的关键词。是否查询是交易性的还是信息性的没有明确的指示。Google的搜索结果页面显示,前5条结果中有产品和信息文章。
- en: There’s one area the model struggled with — navigational keywords. If I were
to guess, the model predicted the category correctly about 5–10% of the time.
There are several reasons for such an occurrence.
id: totrans-134
prefs: []
type: TYPE_NORMAL
zh: 模型在一个领域遇到了困难——导航关键词。如果我要猜测,模型大约5-10%的时间能正确预测类别。出现这种情况有几个原因。
- en: 'The distribution of the dataset could be blamed as it’s heavily imbalanced:'
id: totrans-135
prefs: []
type: TYPE_NORMAL
zh: 数据集的分布可能是一个问题,因为它严重不平衡:
- en: '**Transactional** — 45.7353%'
id: totrans-136
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '**交易型** — 45.7353%'
- en: '**Informational** — 45.0735%'
id: totrans-137
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '**信息型** — 45.0735%'
- en: '**Navigational** — 9.1912%'
id: totrans-138
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '**导航型** — 9.1912%'
- en: While real world scenarios would present a similar distribution (due to the
inherent rarity of Navigational keywords), the training data is too sparse for
proper fitting. Additionally, the frequency of navigational keywords is so low
that the model would produce greater accuracy by always assigning the other two.
id: totrans-139
prefs: []
type: TYPE_NORMAL
zh: 虽然实际世界场景会呈现出类似的分布(由于导航关键词的固有稀有性),但训练数据过于稀疏,无法进行适当的拟合。此外,导航关键词的频率非常低,以至于模型通过总是分配其他两个类别可以获得更高的准确性。
- en: I don’t think, however, that presenting the training data with more navigational
keywords would produce a much better result. It’s a problem that is extremely
difficult to solve through textual analysis, whatever kind we choose.
id: totrans-140
prefs: []
type: TYPE_NORMAL
zh: 然而,我认为展示更多的导航关键词的训练数据不会产生更好的结果。这是一个通过文本分析解决的极其困难的问题,无论我们选择哪种方法。
- en: Navigational keywords consist mostly of brand names, which are neologisms or
other newly produced words. Nothing within them follows the natural language,
and, as such, connections between them can only be discovered *a posteriori*.
In other words, we’d have to first know it’s a brand name, from other data sources,
to assign the category correctly.
id: totrans-141
prefs: []
type: TYPE_NORMAL
zh: 导航关键词主要由品牌名称组成,这些名称是新造词或其他新产生的词。它们中没有任何内容遵循自然语言,因此,它们之间的联系只能*事后*发现。换句话说,我们必须首先从其他数据源知道这是一个品牌名称,才能正确分配类别。
- en: If I had to guess, Google and other search engines discover brand names through
the way users act when they query a new word. They might look for domain matches
or other data, but predicting that something is a navigational keyword without