- en: Building a Basic Machine Learning Model in Python
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 在Python中构建基础机器学习模型
- en: 原文:[https://towardsdatascience.com/building-a-basic-machine-learning-model-in-python-d7cca929ee62?source=collection_archive---------2-----------------------#2023-01-02](https://towardsdatascience.com/building-a-basic-machine-learning-model-in-python-d7cca929ee62?source=collection_archive---------2-----------------------#2023-01-02)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://towardsdatascience.com/building-a-basic-machine-learning-model-in-python-d7cca929ee62?source=collection_archive---------2-----------------------#2023-01-02](https://towardsdatascience.com/building-a-basic-machine-learning-model-in-python-d7cca929ee62?source=collection_archive---------2-----------------------#2023-01-02)
- en: '*Extensive essay on how to pick the right problem and how to develop a basic
classifier*'
id: totrans-2
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: '*关于如何选择合适问题和如何开发基础分类器的详细论文*'
- en: '[](https://medium.com/@juras.jursenas?source=post_page-----d7cca929ee62--------------------------------)[![Juras
Juršėnas](../Images/eb2ca720f2c8688dbf8079879c028d12.png)](https://medium.com/@juras.jursenas?source=post_page-----d7cca929ee62--------------------------------)[](https://towardsdatascience.com/?source=post_page-----d7cca929ee62--------------------------------)[![Towards
Data Science](../Images/a6ff2676ffcc0c7aad8aaf1d79379785.png)](https://towardsdatascience.com/?source=post_page-----d7cca929ee62--------------------------------)
[Juras Juršėnas](https://medium.com/@juras.jursenas?source=post_page-----d7cca929ee62--------------------------------)'
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: '[](https://medium.com/@juras.jursenas?source=post_page-----d7cca929ee62--------------------------------)[![Juras
Juršėnas](../Images/eb2ca720f2c8688dbf8079879c028d12.png)](https://medium.com/@juras.jursenas?source=post_page-----d7cca929ee62--------------------------------)[](https://towardsdatascience.com/?source=post_page-----d7cca929ee62--------------------------------)[![Towards
Data Science](../Images/a6ff2676ffcc0c7aad8aaf1d79379785.png)](https://towardsdatascience.com/?source=post_page-----d7cca929ee62--------------------------------)
[Juras Juršėnas](https://medium.com/@juras.jursenas?source=post_page-----d7cca929ee62--------------------------------)'
- en: '[Follow](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fsubscribe%2Fuser%2F3041473d9e3c&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-basic-machine-learning-model-in-python-d7cca929ee62&user=Juras+Jur%C5%A1%C4%97nas&userId=3041473d9e3c&source=post_page-3041473d9e3c----d7cca929ee62---------------------post_header-----------)
Published in [Towards Data Science](https://towardsdatascience.com/?source=post_page-----d7cca929ee62--------------------------------)
·20 min read·Jan 2, 2023[](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fvote%2Ftowards-data-science%2Fd7cca929ee62&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-basic-machine-learning-model-in-python-d7cca929ee62&user=Juras+Jur%C5%A1%C4%97nas&userId=3041473d9e3c&source=-----d7cca929ee62---------------------clap_footer-----------)'
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: '[点击查看](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fsubscribe%2Fuser%2F3041473d9e3c&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-basic-machine-learning-model-in-python-d7cca929ee62&user=Juras+Jur%C5%A1%C4%97nas&userId=3041473d9e3c&source=post_page-3041473d9e3c----d7cca929ee62---------------------post_header-----------)
发布于 [Towards Data Science](https://towardsdatascience.com/?source=post_page-----d7cca929ee62--------------------------------)
·20 min 阅读·2023年1月2日[](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fvote%2Ftowards-data-science%2Fd7cca929ee62&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-basic-machine-learning-model-in-python-d7cca929ee62&user=Juras+Jur%C5%A1%C4%97nas&userId=3041473d9e3c&source=-----d7cca929ee62---------------------clap_footer-----------)'
- en: '[](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fbookmark%2Fp%2Fd7cca929ee62&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-basic-machine-learning-model-in-python-d7cca929ee62&source=-----d7cca929ee62---------------------bookmark_footer-----------)![](../Images/01ff8323628648cdfec674b9023fa9f2.png)'
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: '[](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fbookmark%2Fp%2Fd7cca929ee62&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuilding-a-basic-machine-learning-model-in-python-d7cca929ee62&source=-----d7cca929ee62---------------------bookmark_footer-----------)![](../Images/01ff8323628648cdfec674b9023fa9f2.png)'
- en: Photo by [charlesdeluvio](https://unsplash.com/@charlesdeluvio?utm_source=medium&utm_medium=referral)
on [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral)
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: 照片由 [charlesdeluvio](https://unsplash.com/@charlesdeluvio?utm_source=medium&utm_medium=referral)
提供,来源于 [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral)
- en: By now, all of us have seen the results of various basic machine learning (ML)
models. The internet is rife with images, videos, and articles showing off how
a computer identifies, correctly or not, various animals.
id: totrans-9
prefs: []
type: TYPE_NORMAL
zh: 目前,我们都见过各种基础机器学习(ML)模型的结果。互联网充斥着展示计算机如何识别各种动物的图像、视频和文章,无论识别是否正确。
- en: While we have moved towards more intricate machine learning models, such as
ones that generate or upscale images, those basic ones still form the foundation
of those efforts. Mastering the basics can become a launchpad for much greater
future endeavors.
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 尽管我们已经朝着更复杂的机器学习模型迈进,例如生成或提升图像的模型,但这些基础模型仍然构成了这些努力的基础。掌握基础知识可以成为未来更大事业的跳板。
- en: So, I decided to revisit the basics myself and build a basic machine learning
model with several caveats — it must be somewhat useful, as simplistic as possible,
and return reasonably accurate results.
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: 所以,我决定自己重新审视基础知识,并构建一个具有几个警告的基本机器学习模型——它必须具有一定的实用性,尽可能简单,并返回合理准确的结果。
- en: Unlike many other tutorials on the internet, however, I want to present my entire
thought process from beginning to end. As such, the coding part will begin quite
a bit later as problem selection in both the theoretical and practical realm is
equally important. In the end, I believe that understanding *why* will go further
than *how to*.
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: 然而,与互联网上的许多其他教程不同,我想从头到尾展示我的整个思考过程。因此,编码部分将会稍晚开始,因为理论和实践领域中的问题选择同样重要。最后,我相信理解*为什么*比*如何*更为重要。
- en: Picking the correct problem for ML
id: totrans-13
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 选择适合机器学习的问题
- en: Although machine learning can solve a wide range of challenges, it’s not a one-size-fits-all
  approach. Even if we were to temporarily forget about the financial, temporal,
  and other resource costs, ML models would still be great at some things and terrible
  at others.
id: totrans-14
prefs: []
type: TYPE_NORMAL
zh: 尽管机器学习可以解决许多挑战,但它并不是一种万能的解决方案。即使我们暂时忽略财务、时间和其他资源成本,机器学习模型在某些方面仍然表现出色,而在其他方面则表现糟糕。
- en: Categorization is a great example of where machine learning may shine. Whenever
we deal with real world data (i.e., we’re not dealing with categories created
within the code itself), figuring out all possible rules that define a phenomenon
is nearly impossible.
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: 分类是机器学习可能发挥作用的一个很好的例子。每当我们处理真实世界的数据(即我们不处理代码中创建的类别)时,找出定义现象的所有可能规则几乎是不可能的。
- en: As I’ve written previously, if we were to attempt a rule-based approach to
  categorizing whether an object is a cat or not, we’d quickly run into issues.
  There seems to be no single defining quality that makes any physical object what
  it is: there are cats without tails, fur, or ears, cats with one eye or a different
  number of legs, and so on, but all of them still fall within the same category.
id: totrans-16
prefs: []
type: TYPE_NORMAL
zh: 正如我之前所写的,如果我们尝试使用基于规则的方法来分类一个物体是否是猫,我们会很快遇到问题。似乎没有定义任何物理对象的特征——有些猫没有尾巴、毛发、耳朵、一只眼睛、不同数量的腿等等,但它们仍然都属于同一类别。
- en: Enumerating all of the possible rules and their exceptions is likely impossible;
  perhaps there isn’t even some eternal list, and we make them up as we go. Machine
  learning, in some sense, mimics our thinking by consuming an enormous amount of
  data to make predictions.
id: totrans-17
prefs: []
type: TYPE_NORMAL
zh: 列举所有可能的规则及其例外可能是不可能的,也许甚至没有某种永恒的清单,我们只能在过程中逐步制定。机器学习在某种程度上通过消耗大量数据来进行预测,模仿了我们的思维。
- en: In other words, we should carefully consider the problem we’re trying to solve
before trying to figure out which model would fit best, how much data we’ll need,
and many other things we concern ourselves with once we start the task.
id: totrans-18
prefs: []
type: TYPE_NORMAL
zh: 换句话说,我们应该在尝试确定哪种模型最合适、需要多少数据以及开始任务后关注的其他事项之前,仔细考虑我们要解决的问题。
- en: In search of practical application
id: totrans-19
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 寻求实际应用
- en: Making models that differentiate between dogs and cats is certainly interesting
and fun but unlikely to net any benefit, even if we scale up the operation to
immense levels. Additionally, there have been millions of tutorials for such models
created online.
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: 制作区分狗和猫的模型确实有趣且有趣,但即使我们将操作规模扩大到巨大的程度,也不太可能获得任何好处。此外,已经有数以百万计的此类模型教程在网上创建。
- en: 'I decided to pick word categorization, as it hasn’t been as frequently written
about, and it has some practical application. Our SEO team had an interesting
proposition — they needed to categorize keywords according to three types:'
id: totrans-21
prefs: []
type: TYPE_NORMAL
zh: 我决定选择词汇分类,因为它相对较少被写到,并且具有一定的实际应用。我们的SEO团队提出了一个有趣的提议——他们需要根据三种类型来分类关键词:
- en: '**Informational** — users searching for knowledge about a topic (e.g., “what
is a proxy”)'
id: totrans-22
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '**信息型** — 寻找关于某个主题的知识的用户(例如,“什么是代理”)'
- en: '**Transactional** — users looking for a product or service (e.g., “best proxies”)'
id: totrans-23
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '**交易型** — 寻找产品或服务的用户(例如,“最佳代理”)'
- en: '**Navigational** — users looking for a specific brand or an offshoot of it
  (e.g., “Oxylabs”)'
id: totrans-24
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '**导航型**——用户寻找特定品牌或其分支(例如,“Oxylabs”)'
- en: Categorizing thousands of keywords manually is a bit of a pain. Such a task
seems (almost) perfect for machine learning, although there’s an inherent issue
that is nearly impossible to solve, which I will expand upon later.
id: totrans-25
prefs: []
type: TYPE_NORMAL
zh: 手动分类成千上万的关键词有点麻烦。这样的任务(几乎)完美适合机器学习,尽管存在一个几乎无法解决的固有问题,我将在后面详细说明。
- en: Finally, it made data collection and management a significantly easier task
than it would otherwise have been. SEO specialists use a variety of tools to track
keywords, most of which can export thousands of them into a CSV sheet. All that
needs to be done is to assign categories to the keywords.
id: totrans-26
prefs: []
type: TYPE_NORMAL
zh: 最终,它使数据收集和管理变得比其他情况下要简单得多。SEO 专家使用各种工具来跟踪关键词,其中大多数可以将它们导出到 CSV 表中。只需将类别分配给关键词即可。
- en: Building a pre-MVP
id: totrans-27
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 构建一个预 MVP
- en: Deciding how many data points you’ll need before building a model is nearly
  impossible. It depends somewhat on the stated goal (e.g., more or fewer categories);
  however, calculating the required amount with precision is a fool’s errand. Picking
  a sufficiently large number (e.g., 1000 entries) is a good starting point.
id: totrans-28
prefs: []
type: TYPE_NORMAL
zh: 在建立模型之前决定需要多少数据点几乎是不可能的。虽然有一些依赖于既定目标(即,更多或更少的类别),但精确计算这些数据几乎是不可能的。选择一个足够大的数字(例如,1000
条记录)是一个好的起点。
- en: One thing I’d caution against is working with the entire dataset first. Since
  it’s likely the first time you’re developing a model, a lot of things can
  go wrong. In general, you’re better off writing the code and running it on a small
  sample (e.g., 10% of the total) just to ensure there are no semantic errors or
  any other horrors.
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: 我建议不要一开始就处理整个数据集。由于这是你第一次开发模型,很多事情可能会出错。一般来说,最好先编写代码并在小样本(例如总数据的10%)上运行,以确保没有语义错误或其他问题。
- en: Once you get the desired result, start working with the entire dataset. While
  it’s unlikely that you’ll have to throw out the project entirely, you don’t want
  to end up spending hours of (boring) work and have nothing to show for it.
id: totrans-30
prefs: []
type: TYPE_NORMAL
zh: 一旦你得到所需的结果,就开始处理整个数据集。虽然你可能不会完全放弃项目,但你不希望花费几个小时(枯燥)的工作却没有任何成果。
- en: Regardless, with some samples in hand, we can begin the development experience
properly. I’ve chosen Python as it’s a fairly common language with decent support
for machine learning through its numerous libraries.
id: totrans-31
prefs: []
type: TYPE_NORMAL
zh: 无论如何,有了一些样本,我们可以正式开始开发过程。我选择了 Python,因为它是一种相当常见的语言,并且通过众多库为机器学习提供了不错的支持。
- en: Libraries
id: totrans-32
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 库
- en: '[Pandas](https://pypi.org/project/pandas/). While not strictly necessary, reading
and exporting to CSV is going to make our lives significantly easier.'
id: totrans-33
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[Pandas](https://pypi.org/project/pandas/)。虽然不是绝对必要,但读取和导出 CSV 文件将大大简化我们的工作。'
- en: '[SciKit-Learn](https://pypi.org/project/scikit-learn/). A fairly powerful and
flexible machine learning library, which will form the foundation for our classification
model. We’ll be using various *sklearn* features throughout the tutorial.'
id: totrans-34
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[SciKit-Learn](https://pypi.org/project/scikit-learn/)。这是一个相当强大且灵活的机器学习库,它将成为我们分类模型的基础。在整个教程中,我们将使用各种
*sklearn* 功能。'
- en: '[NLTK](https://pypi.org/project/nltk/) (Natural Language Toolkit). As we’ll
be processing natural language, NLTK does the job perfectly. *Stopwords* will
be absolutely necessary from the package.'
id: totrans-35
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[NLTK](https://pypi.org/project/nltk/)(自然语言工具包)。由于我们将处理自然语言,NLTK 完美地完成了这个任务。*停用词*
是包中绝对必要的内容。'
- en: Imports
id: totrans-36
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 导入
- en: '[PRE0]'
id: totrans-37
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
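The original code block is referenced above only as a placeholder, so here is a minimal sketch of the six imports that the following “Line 1” to “Line 6” notes walk through (pieced together from the descriptions, not copied from the article):

```python
import pandas as pd                                            # Line 1
from sklearn.feature_extraction.text import TfidfVectorizer   # Line 2
from sklearn.linear_model import LogisticRegression           # Line 3
from sklearn.pipeline import Pipeline                         # Line 4
from sklearn.feature_selection import SelectKBest, chi2       # Line 5
from nltk.corpus import stopwords                             # Line 6
```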
- en: Line 1
id: totrans-38
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第1行
- en: Fairly self-explanatory. *Pandas* allows us to read and write CSV and other
spreadsheet files by creating data frames. Since we’ll be dealing with keywords,
most SEO tools export lists of them in CSV, which will reduce the data processing
we need to do manually.
id: totrans-39
prefs: []
type: TYPE_NORMAL
zh: 相当自解释。*Pandas* 允许我们通过创建数据框来读取和写入 CSV 以及其他电子表格文件。由于我们将处理关键词,大多数 SEO 工具会将它们导出为
CSV,这将减少我们需要手动处理的数据。
- en: Line 2
id: totrans-40
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第2行
- en: From the SciKit-Learn library, we’ll pick up several things, *TfidfVectorizer*
being our first choice.
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: 从 SciKit-Learn 库中,我们将挑选几个东西,*TfidfVectorizer* 是我们的首选。
- en: Vectorizers convert our strings into feature vectors, which results in two important
  changes. First, strings are converted into numerical representations: each unique
  term is assigned an index, and each document is then turned into a vector (a row
  of the resulting matrix).
id: totrans-42
prefs: []
type: TYPE_NORMAL
zh: 向量化器将我们的字符串转换为特征向量,这会导致两个重要的变化。首先,字符串被转换为数值表示。每个唯一的字符串被转换为一个索引,然后转化为向量(矩阵的衍生物)。
- en: '**Sentence #1**: “The dog is brown.”'
id: totrans-43
prefs: []
type: TYPE_NORMAL
zh: '**句子 #1**:“狗是棕色的。”'
- en: '**Sentence #2**: “The dog is black.”'
id: totrans-44
prefs: []
type: TYPE_NORMAL
zh: '**句子 #2**:“狗是黑色的。”'
- en: 'Vectorization would take both sentences and create an index of:'
id: totrans-45
prefs: []
type: TYPE_NORMAL
zh: 向量化将处理这两个句子并创建一个索引:
- en: '[PRE1]'
id: totrans-46
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
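Since the index itself is only shown as a placeholder above, here is a small sketch of how a vectorizer builds that index for the two sentences (CountVectorizer is used here purely for readability; the article’s model uses TfidfVectorizer, and the exact integer assignments may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The dog is brown.", "The dog is black."]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(sentences)

# Each unique (lowercased) term gets an index...
print(vectorizer.vocabulary_)   # e.g. {'the': 4, 'dog': 2, 'is': 3, 'brown': 1, 'black': 0}

# ...and each sentence becomes a vector of term counts over that index.
print(matrix.toarray())
# [[0 1 1 1 1]
#  [1 0 1 1 1]]
```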
- en: Outside of turning strings into numerical values, vectorization also optimizes
data processing. Instead of having to go through identical strings several times,
the same index is used akin to compressing files.
id: totrans-47
prefs: []
type: TYPE_NORMAL
zh: 除了将字符串转换为数值外,向量化还优化了数据处理。与其多次处理相同的字符串,不如使用相同的索引,类似于文件压缩。
- en: Finally, TF-IDF (term frequency-inverse document frequency) is one of the ways
  to weigh term importance across documents. In simple terms, it scores each term
  by how often it appears within a document (relative to the document’s length) and
  discounts that score by how common the term is across all documents. As a result,
  words that appear frequently in a document but rarely elsewhere are considered more important.
id: totrans-48
prefs: []
type: TYPE_NORMAL
zh: 最后,TFIDF(词频-逆文档频率)是衡量文档中词语重要性的一种方法。简单来说,它对每个词进行处理,评估其频率与文档长度的比值,并分配一个加权值。因此,重复出现的词语被认为更重要。
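For reference, the classic formulation of the weight is shown below; scikit-learn’s implementation adds smoothing and normalization on top of this, so treat it as the idea rather than the library’s exact formula:

```latex
% tf(t, d): frequency of term t in document d
% N: number of documents, df(t): number of documents containing t
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}
```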
- en: Line 3
id: totrans-49
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第三行
- en: '*LogisticRegression* is one of the ways to discover relationships between variables.
  Since our task is a classic example of classification, logistic regression works
  perfectly, as it takes some input variable *x* (keyword) and assigns it a value
  of *y* (informational/transactional/navigational).'
id: totrans-50
prefs: []
type: TYPE_NORMAL
zh: '*LogisticRegression* 是发现变量之间关系的一种方法。由于我们的任务是经典的分类问题,逻辑回归非常适合,因为它接受某些输入变量 *x*(关键字),并将其分配一个值
*y*(信息性/交易性/导航性)。'
- en: There are other options, such as *LinearSVC,* which involves significantly more
complicated mathematics. In extremely simplistic terms, SVC takes several clusters
of data points and finds the values in each that are closest to the opposing cluster(s).
These are called support vectors.
id: totrans-51
prefs: []
type: TYPE_NORMAL
zh: 还有其他选项,例如 *LinearSVC*,它涉及到更复杂的数学运算。极其简单地说,SVC 会对多个数据点簇进行处理,找到每个簇中最接近对方簇的值。这些值称为支持向量。
- en: A hyperplane (i.e., an *n-dimensional* geometrical object in an *n+1-dimensional*
  space) is drawn in such a way that the distance between it and each support vector
  is maximized.
id: totrans-52
prefs: []
type: TYPE_NORMAL
zh: 一个超平面(即在 *n+1维* 空间中的 *n维* 几何对象)被绘制成使其与每个支持向量的距离最大化。
- en: '![](../Images/eb722b44930e8f5e8c3e04443201fa5c.png)'
id: totrans-53
prefs: []
type: TYPE_IMG
zh: '![](../Images/eb722b44930e8f5e8c3e04443201fa5c.png)'
- en: Image by author
id: totrans-54
prefs: []
type: TYPE_NORMAL
zh: 作者提供的图片
- en: '[There is research suggesting that using Support Vector Machines](https://link.springer.com/chapter/10.1007/BFb0026683)
  might produce better results in text classification; however, those gains mostly
  appear on significantly more complicated tasks. These advantages aren’t entirely
  relevant in our case, as they surface when feature counts reach inordinately
  high numbers, so logistic regression should work just fine.'
id: totrans-55
prefs: []
type: TYPE_NORMAL
zh: '[已有研究表明使用支持向量机](https://link.springer.com/chapter/10.1007/BFb0026683) 可能在文本分类中产生更好的结果,但这可能是由于任务的复杂性显著增加。这些优势在我们的情况下并不完全相关,因为它们在特征数量达到极高水平时才会显现,因此线性回归应该也能很好地工作。'
- en: Line 4
id: totrans-56
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第四行
- en: '*Pipeline* is a flexible machine learning tool that lets you create an object
that assembles several steps of the entire process into one. It has numerous benefits
— from helping you write neater code to preventing data leakage.'
id: totrans-57
prefs: []
type: TYPE_NORMAL
zh: '*Pipeline* 是一个灵活的机器学习工具,它让你创建一个对象,将整个过程的多个步骤组合成一个。它有许多好处——从帮助你编写更整洁的代码到防止数据泄漏。'
- en: Line 5
id: totrans-58
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第五行
- en: While not entirely necessary in our case, *SelectKBest* and *chi2* help optimize
models by improving accuracy and reducing training time. *SelectKBest* allows
us to set a maximum number of features that are used.
id: totrans-59
prefs: []
type: TYPE_NORMAL
zh: 虽然在我们的情况下并不是绝对必要的,*SelectKBest* 和 *chi2* 通过提高准确性和减少训练时间来优化模型。*SelectKBest* 允许我们设置最大特征数量。
- en: '*Chi2* (or *chi-squared*) is a statistical test for the independence of variables
that helps us select the best features (hence, *SelectKBest)* for training:'
id: totrans-60
prefs: []
type: TYPE_NORMAL
zh: '*Chi2*(或*卡方检验*)是一种用于变量独立性的统计测试,有助于我们选择最佳特征(因此,*SelectKBest*)进行训练:'
- en: '![](../Images/706e58897a99f3e0263bb1a57a7447b6.png)'
id: totrans-61
prefs: []
type: TYPE_IMG
zh: '![](../Images/706e58897a99f3e0263bb1a57a7447b6.png)'
- en: Image by author.
id: totrans-62
prefs: []
type: TYPE_NORMAL
zh: 作者提供的图片。
- en: '[PRE2]'
id: totrans-63
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
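The test statistic behind the placeholder above is the standard chi-squared sum, where O_i are the observed counts and E_i the counts expected under the independence assumption:

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```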
- en: Expected values are calculated by assuming the null hypothesis (that the variables
  are independent). These are then compared against our observed values. If the observed
  values deviate by a significant margin from the expected ones, we can reject the
  null hypothesis and accept that the variables are dependent.
id: totrans-64
prefs: []
type: TYPE_NORMAL
zh: 期望值是通过接受原假设(变量独立)来计算的。这些值然后与我们的观测值进行对比。如果观测值与期望值有显著偏差,我们可以拒绝原假设,这迫使我们接受变量之间的依赖关系。
- en: If variables are dependent, they are acceptable for the machine learning model
as that’s exactly what we’re looking for — relations between objects. In turn,
*SelectKBest* takes all *chi2* results and selects those that have the strongest
relationships.
id: totrans-65
prefs: []
type: TYPE_NORMAL
zh: 如果变量是相关的,它们对于机器学习模型是可接受的,因为这正是我们所寻找的 —— 对象之间的关系。反过来,*SelectKBest* 获取所有 *chi2*
结果,并选择那些具有最强关系的结果。
- en: In our case, since our number of features is relatively small, *SelectKBest*
might not bring the optimization we’d be interested in, but it becomes essential
once the numbers start rising.
id: totrans-66
prefs: []
type: TYPE_NORMAL
zh: 在我们的情况下,由于特征数量相对较少,*SelectKBest* 可能无法带来我们感兴趣的优化,但一旦数量开始增加,它就变得至关重要。
- en: Line 6
id: totrans-67
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 行 6
- en: Our final import is from NLTK, which we will only use for the *stopwords* list.
Unfortunately, the default list isn’t suitable for our task at hand. Most such
lists include words like “how,” “what,” “why,” and many others that, while useless
in regular categorization, indicate search intent.
id: totrans-68
prefs: []
type: TYPE_NORMAL
zh: 我们最终的导入来自 NLTK,我们将仅将其用于 *stopwords* 列表。不幸的是,默认列表不适合我们当前的任务。大多数这样的列表包含像“how”,“what”,“why”等词,这些词在常规分类中无用,但能指示搜索意图。
- en: In fact, there’s a case to be made that these words are more important than
  the rest of a keyword like “how to build a web scraper.” Since we’re interested
  in the category of the query rather than any other value, these *stopwords* give
  us the best shot at deciding what it might be.
id: totrans-69
prefs: []
type: TYPE_NORMAL
zh: 事实上,可以说这些词比“如何构建网页抓取器”这样的关键词中的任何剩余词更重要。由于我们对句子的类别感兴趣而非其他值,*stopwords* 是决定它可能是什么的最佳途径。
- en: As such, removing some of the entries from the stopwords list is vital. Luckily,
NLTK stopwords are just text files which you can edit with any word processor.
id: totrans-70
prefs: []
type: TYPE_NORMAL
zh: 因此,删除一些停用词列表中的条目是至关重要的。幸运的是,NLTK 的停用词只是文本文件,你可以使用任何文字处理器进行编辑。
- en: '[NLTK downloads are stored in user directories by default](https://sites.pitt.edu/~naraehan/python3/faq.html#Q-where-nltk-data)
  but can be changed if necessary through the *download_dir* parameter.'
id: totrans-71
prefs: []
type: TYPE_NORMAL
zh: '[NLTK 下载默认存储在用户目录](https://sites.pitt.edu/~naraehan/python3/faq.html#Q-where-nltk-data)中,但可以通过使用
*download_dir=* 进行更改(如果需要的话)。'
- en: Dataframes and stopwords
id: totrans-72
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 数据框和停用词
- en: All machine learning models begin with data preparation and processing. Since
we’re working with SEO keywords, these can be easily exported through CSV from
popular tools that measure performance.
id: totrans-73
prefs: []
type: TYPE_NORMAL
zh: 所有机器学习模型都从数据准备和处理开始。由于我们处理的是 SEO 关键词,这些关键词可以通过流行的性能测量工具轻松导出为 CSV。
- en: There is something to be said about picking a random sample that should include
close to equal amounts of our categories. As we’re producing a pre-MVP, that shouldn’t
be a concern, as data can be added as we go if the model delivers the results
we need.
id: totrans-74
prefs: []
type: TYPE_NORMAL
zh: 选择一个随机样本,其中应包括接近相等数量的各类别,这一点是值得注意的。由于我们正在制作一个前期 MVP,这不应该成为问题,因为如果模型提供了我们需要的结果,可以随时添加数据。
- en: Before proceeding onwards, it would be wise to select a few dozen keywords out
of a CSV file and label them. Once we get to a working model, we can label the
rest. Since *Pandas* creates data frames in a tabular format, the easiest way
is to simply add a new column, “Category” or “Label,” and assign each keyword
row with *Informational, Transactional, or Navigational.*
id: totrans-75
prefs: []
type: TYPE_NORMAL
zh: 在继续之前,明智的做法是从 CSV 文件中选择几打关键词并进行标注。一旦我们得到一个有效的模型,就可以标注其余的。由于 *Pandas* 以表格格式创建数据框,最简单的方法是添加一个新列,“Category”
或 “Label”,并将每个关键词行标记为 *Informational, Transactional, or Navigational*。
- en: '[PRE3]'
id: totrans-76
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
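As the original block is only a placeholder, here is a minimal sketch of the three lines discussed below; the file name “keywords.csv” is a stand-in for whatever your SEO tool exports, and “english_adjusted” is the edited copy of NLTK’s stopword list described in a moment:

```python
import pandas as pd
from nltk.corpus import stopwords

# Lines 1-2: read the labeled keyword export and build a dataframe from it.
keyword_list = pd.read_csv("keywords.csv")
train_df = pd.DataFrame(keyword_list)

# Line 3: load the edited copy of NLTK's English stopword list.
stop_words = stopwords.words("english_adjusted")
```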
- en: Line 1 & 2
id: totrans-77
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 行 1 和 2
- en: Whenever we have a CSV of any sort, *Pandas* requires us to create a dataframe.
First, we’ll read the keyword list supplied by SEO tools. Remember that the CSV
files should already have some keyword categorization involved, otherwise there
will be nothing to train the model on.
id: totrans-78
prefs: []
type: TYPE_NORMAL
zh: 每当我们有任何形式的 CSV 时,*Pandas* 要求我们创建一个数据框。首先,我们将读取由 SEO 工具提供的关键词列表。请记住,CSV 文件应该已经包含一些关键词分类,否则将没有东西可以用于训练模型。
- en: After reading the file, we create a dataframe object from our CSV.
id: totrans-79
prefs: []
type: TYPE_NORMAL
zh: 阅读文件后,我们从 CSV 文件创建一个数据框对象。
- en: Line 3
id: totrans-80
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 行 3
- en: We’ll use NLTK to grab the stopwords file, however, we can’t use it as it is.
NLTK’s default includes many words we consider essential for keyword categorization
(e.g., “what,” “how,” “where,” etc.). As such, it will have to be adjusted to
fit our purposes.
id: totrans-81
prefs: []
type: TYPE_NORMAL
zh: 我们将使用 NLTK 获取停用词文件,不过我们不能直接使用它。NLTK 的默认列表包含许多我们认为对关键词分类至关重要的词(例如,“what”,“how”,“where”等)。因此,它必须调整以适应我们的目的。
- en: While there are no hard and fast rules in such a case, indefinite and definite
  articles (e.g., “a,” “an,” “the”) can stay in the stopwords list, as they provide
  no information about intent. Everything that could potentially show user intention,
  however, will have to be removed from the default file.
id: totrans-82
prefs: []
type: TYPE_NORMAL
zh: 虽然在这种情况下没有硬性规定,但不定冠词和定冠词可以保留(例如,“a”,“an”,“the”等),因为它们不提供信息。然而,所有可能显示用户意图的内容都必须从默认文件中删除。
- en: I created a copy called ‘english_adjusted’ to make things easier for myself.
Additionally, in case I need the original version for whatever reason, it will
always be available without redownload.
id: totrans-83
prefs: []
type: TYPE_NORMAL
zh: 我创建了一个名为‘english_adjusted’的副本,以便于操作。此外,以防万一我需要原始版本,它将始终可用,无需重新下载。
- en: Finally, you’ll likely need to run NLTK once with the regular parameter ‘english’
to download the files, which can be done at any stage. Otherwise, you’ll receive
an error.
id: totrans-84
prefs: []
type: TYPE_NORMAL
zh: 最后,你可能需要运行一次NLTK,使用常规参数‘english’来下载文件,这可以在任何阶段完成。否则,你会收到错误。
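That one-time download is a single call; the custom directory shown in the comment is optional and purely illustrative:

```python
import nltk

# Fetch the stopwords corpus once; afterwards it is read from disk.
nltk.download("stopwords")

# Optionally place the data somewhere else (hypothetical path):
# nltk.download("stopwords", download_dir="/opt/nltk_data")
```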
- en: Setting up the pipeline
id: totrans-85
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 设置管道
- en: After all of these preparatory steps, we finally get to move on to the actual
machine learning bit. These are the most important bits and pieces of the model.
It’s likely that you’ll spend quite a bit of time tinkering with these parameters
to find out the best options.
id: totrans-86
prefs: []
type: TYPE_NORMAL
zh: 在所有这些准备步骤之后,我们终于可以进入实际的机器学习部分。这些是模型中最重要的部分。你可能会花费相当多的时间调整这些参数,以找到最佳选项。
- en: Unfortunately, there isn’t a lot of guidance that would apply in all cases.
Some experimentation and reasoning will be required to reduce the amount of testing
that’s needed, but eliminating it completely is impossible.
id: totrans-87
prefs: []
type: TYPE_NORMAL
zh: 不幸的是,没有很多指导方针适用于所有情况。需要进行一些实验和推理,以减少所需的测试量,但完全消除测试是不可能的。
- en: '[PRE4]'
id: totrans-88
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
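Pieced together from the parameter discussion below, the pipeline might look roughly like this; the step names, the C value, and max_iter are placeholders to tune rather than the article’s exact figures:

```python
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

stop_words = stopwords.words("english_adjusted")  # edited list from earlier

pipeline = Pipeline([
    # Turn keywords into weighted n-gram features (unigrams up to trigrams).
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 3), stop_words=stop_words)),
    # Keep the features most dependent on the label (here, all of them).
    ("feature_selection", SelectKBest(chi2, k="all")),
    # Classify into informational / transactional / navigational.
    ("classifier", LogisticRegression(C=1.0, penalty="l2", max_iter=1000)),
])
```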
- en: Some may notice that I’m not splitting the dataset into train and test sets
  through *scikit-learn*. Again, that is a luxury afforded by the nature of the problem.
  SEO tools can export thousands of (unlabeled) keywords in less than a minute,
  meaning you can procure a test set separately without much effort.
id: totrans-89
prefs: []
type: TYPE_NORMAL
zh: 有些人可能会注意到我没有通过*scikit-learn*将数据集拆分为训练集和测试集。这是问题的性质所赋予的奢侈。SEO工具可以在不到一分钟的时间内导出数千个(未标记的)关键字,这意味着你可以单独采购测试集而不费吹灰之力。
- en: So, due to optimization reasons, I’ll be simply using a second dataset that
has no labels as our testing grounds. Since, however, the *train_test_split* is
so ubiquitous, I’ll show a version of the same model using it in the addendum
at the bottom of the article.
id: totrans-90
prefs: []
type: TYPE_NORMAL
zh: 因此,出于优化原因,我将使用没有标签的第二个数据集作为我们的测试基础。然而,由于*train_test_split* 非常普遍,我将在文章末尾的附录中展示一个使用它的相同模型版本。
- en: Line 1
id: totrans-91
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第1行
- en: Pipeline allows us to truncate and simplify long processes into a single object,
making it a lot easier to work with the settings of the model. It will also reduce
the likelihood of making errors.
id: totrans-92
prefs: []
type: TYPE_NORMAL
zh: 管道允许我们将长时间的过程简化为一个对象,使处理模型设置变得更加容易。它还将减少出错的可能性。
- en: We’ll start by defining our vectorizer. I’ve noted above that we’ll be using
*TFIDFVectorizer* as it produces better results due to the way it weighs words
found in documents. *CountVectorizer* is an option, however, you’d have to import
it, and the results may vary.
id: totrans-93
prefs: []
type: TYPE_NORMAL
zh: 我们将从定义我们的向量化器开始。我在上面提到过我们将使用*TFIDFVectorizer*,因为它根据文档中单词的权重来产生更好的结果。*CountVectorizer*
是一个选项,但你需要导入它,结果可能会有所不同。
- en: '*Ngram_range* is an interesting reasoning challenge. To get the best results,
  you have to decide how many tokens (in our case, words) should be combined into
  a single feature. An *ngram_range* of (1, 1) takes only single words (unigrams),
  (1, 2) takes single words plus pairs of adjacent words (bigrams), and (1, 3) adds
  runs of three adjacent words (trigrams) as well.'
id: totrans-94
prefs: []
type: TYPE_NORMAL
zh: '*Ngram_range* 是一个有趣的推理挑战。为了获得最佳结果,你必须决定要计算多少个词元(在我们的情况下是单词)。*Ngram_range* 为
(1, 1) 会计算单个词(单词),(1, 2) 会计算单个词和两个相邻的词(双词组)的组合,(1, 3) 会计算单个词、两个词和三个词(三词组)的组合。'
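To make those ranges concrete, here is what the vectorizer’s analyzer emits for a short query once stopwords such as “to” are removed; the commented output is what I would expect, not something taken from the article:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(1, 3): emit unigrams, bigrams, and trigrams as features.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
analyze = vectorizer.build_analyzer()

print(analyze("how get proxies"))
# ['how', 'get', 'proxies', 'how get', 'get proxies', 'how get proxies']
```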
- en: I chose *ngram_range(1, 3)* for several reasons. First, since the model is relatively
simple and performance is not an issue, I can afford to run a larger range of
ngrams, so the lower bound can be set to be minimal.
id: totrans-95
prefs: []
type: TYPE_NORMAL
zh: 我选择了*ngram_range(1, 3)*,有几个原因。首先,由于模型相对简单,性能不是问题,我可以运行更大范围的n-gram,因此下限可以设置为最小。
- en: On the other hand, once we remove stopwords, we should think about what ngram
upper end would be enough to glean meaning from the keywords. If possible, I find
it easier to pick the hardest and easiest examples out of the dataset. In our
case, the easiest examples are any question (“how to get proxies”), and the hardest
are nouns (“web scraper”) or names (“Oxylabs”)
id: totrans-96
prefs: []
type: TYPE_NORMAL
zh: 另一方面,一旦我们去除停用词,我们应该考虑什么样的 ngram 上限足以从关键词中提取意义。如果可能,我发现从数据集中选择最难和最简单的例子更容易。在我们的情况下,最简单的例子是任何问题(“如何获取代理”),最难的是名词(“网络爬虫”)或名称(“Oxylabs”)。
- en: Since we’ll be removing words like “to”, we get a trigram in question cases
(“how get proxies”), which is completely clear. In fact, you could make the argument
that a bigram (“how get”) is enough as the intention is still clear.
id: totrans-97
prefs: []
type: TYPE_NORMAL
zh: 由于我们将移除像“to”这样的词,我们会在问题案例中得到三元组(“how get proxies”),这是完全清晰的。事实上,你可以认为二元组(“how
get”)也足够,因为意图仍然清晰。
- en: Hardest examples, however, will usually be shorter than a trigram as the ease
of understanding search intent correlates with query length. Therefore, *ngram_range
(1, 3)* should strike a decent balance for performance and accuracy.
id: totrans-98
prefs: []
type: TYPE_NORMAL
zh: 然而,最难的例子通常会比三元组短,因为理解搜索意图的难易程度与查询长度相关。因此,*ngram_range (1, 3)* 应该在性能和准确性之间取得一个不错的平衡。
- en: 'Finally, there’s an argument to be made for *sublinear_tf*, which is a modification
of the regular TF-IDF calculations. If set to *True,* weight is calculated through
a logarithmic function: *1 + log(tf)*. In other words, term frequency gains diminishing
returns.'
id: totrans-99
prefs: []
type: TYPE_NORMAL
zh: 最后,对于 *sublinear_tf* 有一个论点,即它是常规 TF-IDF 计算的一个修改。如果设置为 *True*,权重通过对数函数计算:*1 +
log(tf)*。换句话说,词频会获得递减的回报。
- en: With *sublinear_tf,* words that appear frequently and in many documents would
not be weighed as heavily. Since we have a collection of somewhat random keywords,
we never know which ones get preferential treatment, however, these could often
be terms such as “how,” “what,” etc., which are ones we’d like to be weighed heavily.
id: totrans-100
prefs: []
type: TYPE_NORMAL
zh: 使用 *sublinear_tf* 时,频繁出现且出现在多个文档中的词语不会被赋予过重的权重。由于我们有一组相对随机的关键词,我们无法知道哪些会得到优待,但这些通常是像“how”,“what”等我们希望被赋予较重权重的词。
- en: Throughout testing, I found that the model performed better without *sublinear_tf*,
but I recommend tinkering a bit to see whether it would grant any benefits.
id: totrans-101
prefs: []
type: TYPE_NORMAL
zh: 在测试过程中,我发现模型在没有 *sublinear_tf* 的情况下表现更好,但我建议稍微调整一下,看看是否会带来任何好处。
- en: The *Stopwords* parameter is, by now, self-explanatory, as we’ve discussed it previously.
id: totrans-102
prefs: []
type: TYPE_NORMAL
zh: '*Stopwords* 参数现在已经不言自明,因为我们之前已经讨论过了。'
- en: Line 2
id: totrans-103
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第 2 行
- en: While not technically a new line, I’ll be separating these out for clarity and
brevity purposes. We’ll be now invoking *SelectKBest*, which I’ve written fairly
extensively about above. Our point of interest is the *k* value.
id: totrans-104
prefs: []
type: TYPE_NORMAL
zh: 虽然不严格来说是新的一行,但我将为清晰和简洁的目的将其分开。我们现在将调用 *SelectKBest*,我在上面已经对其进行了相当详细的描述。我们的关注点是
*k* 值。
- en: These will be different, depending on the size of your dataset. *SelectKBest*
is intended to optimize performance and accuracy. In my case, sending in ‘all’
works, but you’ll usually have to pick some large enough *N* that matches your
own dataset.
id: totrans-105
prefs: []
type: TYPE_NORMAL
zh: 这些会有所不同,具体取决于你的数据集的大小。*SelectKBest* 旨在优化性能和准确性。在我的情况下,使用“all”是有效的,但你通常需要选择一个足够大的
*N* 来匹配你自己的数据集。
- en: Line 3
id: totrans-106
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第 3 行
- en: Finally, we get to the method that will be used for the model. *LogisticRegression*
is our choice, as mentioned previously, but there’s a lot of tinkering to be done
with the parameters.
id: totrans-107
prefs: []
type: TYPE_NORMAL
zh: 最后,我们来到将用于模型的方法。*LogisticRegression* 是我们的选择,如前所述,但需要对参数进行大量的调整。
- en: The *C* value is a hyperparameter, which is a parameter that tells the model
  how its own parameters should be picked. Hyperparameters can be complicated to
  set and have a tremendous impact on the end results.
id: totrans-108
prefs: []
type: TYPE_NORMAL
zh: “C”值是一个超参数,它告诉模型应该选择哪些参数。超参数是模型中非常复杂的部分,对最终结果有着巨大的影响。
- en: In extremely simple terms, the *C* value is the trust score for your training
data. A high *C* value means that a higher weight, when fitting, will be placed
on training data and a lower weight on penalties. Low C values place higher emphasis
on penalties and lower weight on training data.
id: totrans-109
prefs: []
type: TYPE_NORMAL
zh: 从极其简单的角度来看,*C* 值是你训练数据的信任分数。较高的 *C* 值意味着在拟合时,对训练数据的权重会较高,而对惩罚的权重较低。较低的 C 值则将更多强调惩罚,训练数据的权重较低。
- en: There should always be some penalty in place as training will never fully represent
real world values (due to being a small subset of it, regardless of how much you
collect). Additionally, having outliers and not penalizing them means the model
[will inch closer to being overfit](https://medium.com/p/7aeef64755d2).
id: totrans-110
prefs: []
type: TYPE_NORMAL
zh: 应始终存在一定的惩罚,因为训练永远无法完全代表现实世界的值(因为它只是一个小的子集,无论你收集多少)。此外,如果存在异常值而不进行惩罚,模型[将会越来越贴近过拟合](https://medium.com/p/7aeef64755d2)。
- en: The *penalty* parameter sets the type of regularization penalty applied when
  fitting. There are three types of penalties offered by *SciKit-Learn*: *‘l1’*, *‘l2’*,
  and *‘elasticnet’*. ‘*None*’ is also an option, but it should be used sparingly,
  if ever.
id: totrans-111
prefs: []
type: TYPE_NORMAL
zh: '*penalty* 参数是用于超参数的操作。*SciKit-Learn* 提供了三种类型的惩罚——*‘l1’*、*‘l2’*和*‘elasticnet’*。‘*None*’也是一个选项,但如果使用的话应该尽量少。'
- en: ‘*L1*’ penalizes the sum of the absolute values of all coefficients. In simple terms,
  it pulls all coefficients towards zero. If large penalties are applied,
  some coefficients can become exactly zero (i.e., the corresponding features are eliminated).
id: totrans-112
prefs: []
type: TYPE_NORMAL
zh: ‘*L1*’ 是所有系数的绝对值之和。简单来说,它将所有系数拉向某个中心点。如果施加了大的惩罚,一些数据点可能会变成零(即被消除)。
- en: ‘*L1*’ should be used in cases where there is either multicollinearity (several
  variables are correlated) or when you want to simplify the model. Since *l1* eliminates
  some features, models nearly always become simpler. It doesn’t work as well,
  however, when you already have a relatively simple set of features.
id: totrans-113
prefs: []
type: TYPE_NORMAL
zh: 在存在多重共线性(多个变量相关)或需要简化模型的情况下,应该使用‘*L1*’。由于*L1*会消除一些数据点,因此模型几乎总是变得更简单。然而,当数据点的分布已经相对简单时,它的效果不如预期。
- en: ‘*L2*’ is a different version of a similar process. Instead of the sum of absolute
  values, it penalizes the sum of the squares of all coefficient values. As such, all
  coefficients are shrunk towards zero, but none are eliminated. ‘*L2*’ is the default
  setting as it’s the most flexible and rarely causes issues.
id: totrans-114
prefs: []
type: TYPE_NORMAL
zh: ‘*L2*’ 是类似过程的不同版本。它不是绝对和,而是所有系数值的平方和。因此,所有系数都按相同的值缩小,但没有被消除。‘*L2*’ 是默认设置,因为它最灵活且很少引发问题。
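As a rough sketch (ignoring scikit-learn’s exact scaling and the intercept), the two penalties and the regularized objective look like this, with C controlling how heavily the fit to the training data outweighs the penalty:

```latex
\ell_1(w) = \sum_i |w_i|, \qquad \ell_2(w) = \sum_i w_i^2, \qquad
\min_w \; \mathrm{logloss}(w) + \frac{1}{C}\,\mathrm{penalty}(w)
```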
- en: ‘*Elasticnet*’ is a combination of both of the above methods. There [has been
quite an extensive commentary](https://stats.stackexchange.com/questions/184029/what-is-elastic-net-regularization-and-how-does-it-solve-the-drawbacks-of-ridge)
written on whether ‘*elasticnet*’ should be the default approach, however, not
all solvers support it. In our case, we’d need to switch to the “saga” solver,
which is intended for large datasets.
id: totrans-115
prefs: []
type: TYPE_NORMAL
zh: ‘*Elasticnet*’ 是上述两种方法的结合。关于是否应该将‘*elasticnet*’作为默认方法,[已有相当广泛的评论](https://stats.stackexchange.com/questions/184029/what-is-elastic-net-regularization-and-how-does-it-solve-the-drawbacks-of-ridge),然而,并不是所有的求解器都支持它。在我们的情况下,我们需要切换到“saga”求解器,它是为大型数据集设计的。
- en: There would likely be little benefit to using ‘*elasticnet*’ in a tutorial-level
machine learning model. Just keep in mind that it may be beneficial in the future.
id: totrans-116
prefs: []
type: TYPE_NORMAL
zh: 在教程级别的机器学习模型中使用‘*elasticnet*’可能收益甚微。只需记住,将来它可能会有益。
- en: Moving on to *‘max_iter’*, this parameter sets the maximum number of iterations
  the solver will perform while trying to converge. In simple terms, convergence is
  the point at which further iterations would no longer meaningfully change the solution,
  and it serves as the stopping point.
id: totrans-117
prefs: []
type: TYPE_NORMAL
zh: 继续讨论*‘max_iter’*,该参数将设置模型在收敛之前执行的最大迭代次数。简单来说,收敛是指进一步迭代不太可能发生的点,作为停止点。
- en: Higher values increase computational complexity but may result in better overall
behavior. In cases where the datasets are relatively simplistic, *‘max_iter’*
can be set to thousands and above as it won’t be too taxing on the system.
id: totrans-118
prefs: []
type: TYPE_NORMAL
zh: 较高的值会增加计算复杂性,但可能会导致更好的整体表现。在数据集相对简单的情况下,*‘max_iter’* 可以设置为数千及以上,因为这对系统的负担不会太大。
- en: If the values are too low and convergence fails, a warning message will be displayed.
As such, it’s not that difficult to find the lowest possible value and to work
up from there.
id: totrans-119
prefs: []
type: TYPE_NORMAL
zh: 如果值过低且收敛失败,将显示警告消息。因此,找到最低可能的值并从中开始并不困难。
- en: Fitting the model and outputting data
id: totrans-120
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 拟合模型并输出数据
- en: We’re nearing the end of the tutorial as we finally get to fitting the model
and receiving the output.
id: totrans-121
prefs: []
type: TYPE_NORMAL
zh: 我们接近教程的结束,最终进入模型拟合和接收输出的阶段。
- en: '[PRE5]'
id: totrans-122
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
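Again reconstructed from the explanation below; the column names (“Keyword”, “Label”, “type”) and file names are placeholders, and “pipeline” plus “train_df” come from the earlier sketches:

```python
import pandas as pd

# Lines 1-3: fit the pipeline on the labeled training data.
model = pipeline.fit(train_df["Keyword"], train_df["Label"])

# Lines 4-8: load an unlabeled keyword export, predict a category for each
# keyword, and write the result to the local directory.
predict_df = pd.DataFrame(pd.read_csv("unlabeled_keywords.csv"))
predict_df["type"] = model.predict(predict_df["Keyword"])
predict_df.to_csv("output.csv", index=False)

# Optional sanity check on a labeled hold-out set (mean accuracy):
# print(model.score(test_df["Keyword"], test_df["Label"]))
```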
- en: Line 1–3
id: totrans-123
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第 1-3 行
- en: Within line 1, we use our established pipeline to fit the model to the training
data. In case some debugging or additional analysis is needed, the pipeline enables
us to create named steps, which can be called later on.
id: totrans-124
prefs: []
type: TYPE_NORMAL
zh: 在第1行中,我们使用我们建立的管道将模型拟合到训练数据中。如果需要进行调试或额外的分析,管道允许我们创建命名的步骤,这些步骤可以在后续调用。
- en: Line 4–8
id: totrans-125
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 第4–8行
- en: We create another dataframe from a CSV file that holds only the keywords. We
will be using our newly created model to predict each keyword and its category.
id: totrans-126
prefs: []
type: TYPE_NORMAL
zh: 我们从一个只包含关键词的CSV文件中创建另一个数据框。我们将使用新创建的模型来预测每个关键词及其类别。
- en: Since our dataframe contains only keywords, we add a new column “type” and run
*model.predict* to provide us with an output.
id: totrans-127
prefs: []
type: TYPE_NORMAL
zh: 由于我们的数据框仅包含关键词,我们添加了一个新的列“类型”,并运行*model.predict*以提供输出结果。
- en: Finally, all of it is moved to an output CSV file, which will be created in
the local directory. Usually, you’d like to set some destination, but for testing
purposes, there’s often no need to do so.
id: totrans-128
prefs: []
type: TYPE_NORMAL
zh: 最终,所有结果被移动到一个输出的CSV文件中,该文件将在本地目录中创建。通常,你会想设置一些目标,但为了测试目的,通常没有必要这样做。
- en: There’s a commented-out line that I’d like to mention that calls the *score*
function. *SciKit* provides us with numerous ways to estimate the predictive power
of our model. These shouldn’t be understood as gospel, as predicted accuracy and
real world accuracy can often diverge.
id: totrans-129
prefs: []
type: TYPE_NORMAL
zh: 有一行被注释掉的代码我想提一下,它调用了*score*函数。*SciKit*为我们提供了多种方法来估计模型的预测能力。这些方法不应被视为绝对真理,因为预测准确度与实际准确度通常可能有所偏差。
- en: Scores, however, are useful as a rule of thumb and as a quick way to evaluate
  whether the parameters have had some influence on the model. While there are plenty
  of scoring methods, the basic *model.score* for a classifier returns mean accuracy,
  which is helpful in most cases whenever we’re tuning parameters.
id: totrans-130
prefs: []
type: TYPE_NORMAL
zh: 然而,得分作为经验法则和快速评估参数对模型的影响是有用的。虽然有很多评分方法,但基本的*model.score*使用*R平方*,在调整参数时通常很有帮助。
- en: Examining the results
id: totrans-131
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 结果分析
- en: My training data had a mere 1300 entries with three distinct categories, which
I have mentioned above. Even with such a small set, the model managed to arrive
at a decent accuracy score of about 80%.
id: totrans-132
prefs: []
type: TYPE_NORMAL
zh: 我的训练数据仅有1300条条目,包含三种不同的类别,如上所述。即使在这样的小数据集中,模型仍然达到了约80%的不错准确度。
- en: Some of these, as one would expect, are debatable, and even Google thinks so.
  For example, “web scrape” was a keyword frequently searched for. There’s no clear
  indication of whether the query is transactional or informational. Google’s SERPs
  reflect the same ambiguity, with both product pages and informational articles
  among the top 5 results.
id: totrans-133
prefs: []
type: TYPE_NORMAL
zh: 其中一些,如预期的那样,是有争议的,甚至Google也这么认为。例如,“网页抓取”是一个经常被搜索的关键词。是否查询是交易性的还是信息性的没有明确的指示。Google的搜索结果页面显示,前5条结果中有产品和信息文章。
- en: There’s one area the model struggled with — navigational keywords. If I were
to guess, the model predicted the category correctly about 5–10% of the time.
There are several reasons for such an occurrence.
id: totrans-134
prefs: []
type: TYPE_NORMAL
zh: 模型在一个领域遇到了困难——导航关键词。如果我要猜测,模型大约5-10%的时间能正确预测类别。出现这种情况有几个原因。
- en: 'The distribution of the dataset could be blamed as it’s heavily imbalanced:'
id: totrans-135
prefs: []
type: TYPE_NORMAL
zh: 数据集的分布可能是一个问题,因为它严重不平衡:
- en: '**Transactional** — 45.7353%'
id: totrans-136
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '**交易型** — 45.7353%'
- en: '**Informational** — 45.0735%'
id: totrans-137
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '**信息型** — 45.0735%'
- en: '**Navigational** — 9.1912%'
id: totrans-138
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '**导航型** — 9.1912%'
- en: While real world scenarios would present a similar distribution (due to the
inherent rarity of Navigational keywords), the training data is too sparse for
proper fitting. Additionally, the frequency of navigational keywords is so low
that the model would produce greater accuracy by always assigning the other two.
id: totrans-139
prefs: []
type: TYPE_NORMAL
zh: 虽然实际世界场景会呈现出类似的分布(由于导航关键词的固有稀有性),但训练数据过于稀疏,无法进行适当的拟合。此外,导航关键词的频率非常低,以至于模型通过总是分配其他两个类别可以获得更高的准确性。
- en: I don’t think, however, that presenting the training data with more navigational
keywords would produce a much better result. It’s a problem that is extremely
difficult to solve through textual analysis, whatever kind we choose.
id: totrans-140
prefs: []
type: TYPE_NORMAL
zh: 然而,我认为展示更多的导航关键词的训练数据不会产生更好的结果。这是一个通过文本分析解决的极其困难的问题,无论我们选择哪种方法。
- en: Navigational keywords consist mostly of brand names, which are neologisms or
other newly produced words. Nothing within them follows the natural language,
and, as such, connections between them can only be discovered *a posteriori*.
In other words, we’d have to first know it’s a brand name, from other data sources,
to assign the category correctly.
id: totrans-141
prefs: []
type: TYPE_NORMAL
zh: 导航关键词主要由品牌名称组成,这些名称是新造词或其他新产生的词。它们中没有任何内容遵循自然语言,因此,它们之间的联系只能*事后*发现。换句话说,我们必须首先从其他数据源知道这是一个品牌名称,才能正确分配类别。
- en: If I had to guess, Google and other search engines discover brand names through
the way users act when they query a new word. They might look for domain matches
or other data, but predicting that something is a navigational keyword without