generated from OpenDocCN/doc-template
-
Notifications
You must be signed in to change notification settings - Fork 0
/
0054.yaml
692 lines (692 loc) · 36.5 KB
/
0054.yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
- en: Build Trail Recommender for TrailForks
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 为 TrailForks 构建推荐系统
- en: 原文:[https://towardsdatascience.com/build-trail-recommender-for-trailforks-8ea64b1a2fe4?source=collection_archive---------13-----------------------#2023-01-04](https://towardsdatascience.com/build-trail-recommender-for-trailforks-8ea64b1a2fe4?source=collection_archive---------13-----------------------#2023-01-04)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://towardsdatascience.com/build-trail-recommender-for-trailforks-8ea64b1a2fe4?source=collection_archive---------13-----------------------#2023-01-04](https://towardsdatascience.com/build-trail-recommender-for-trailforks-8ea64b1a2fe4?source=collection_archive---------13-----------------------#2023-01-04)
- en: How I won the Outside 2022 Innovation Days Award
id: totrans-2
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 我是如何赢得 Outside 2022 创新日奖的
- en: '[](https://medium.com/@wen_yang?source=post_page-----8ea64b1a2fe4--------------------------------)[![Wen
Yang](../Images/5eac438762d015a0ab128757cc951967.png)](https://medium.com/@wen_yang?source=post_page-----8ea64b1a2fe4--------------------------------)[](https://towardsdatascience.com/?source=post_page-----8ea64b1a2fe4--------------------------------)[![Towards
Data Science](../Images/a6ff2676ffcc0c7aad8aaf1d79379785.png)](https://towardsdatascience.com/?source=post_page-----8ea64b1a2fe4--------------------------------)
[Wen Yang](https://medium.com/@wen_yang?source=post_page-----8ea64b1a2fe4--------------------------------)'
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: '[](https://medium.com/@wen_yang?source=post_page-----8ea64b1a2fe4--------------------------------)[![Wen
Yang](../Images/5eac438762d015a0ab128757cc951967.png)](https://medium.com/@wen_yang?source=post_page-----8ea64b1a2fe4--------------------------------)[](https://towardsdatascience.com/?source=post_page-----8ea64b1a2fe4--------------------------------)[![Towards
Data Science](../Images/a6ff2676ffcc0c7aad8aaf1d79379785.png)](https://towardsdatascience.com/?source=post_page-----8ea64b1a2fe4--------------------------------)
[Wen Yang](https://medium.com/@wen_yang?source=post_page-----8ea64b1a2fe4--------------------------------)'
- en: ·
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: ·
- en: '[Follow](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fsubscribe%2Fuser%2Fcbb5383bd438&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuild-trail-recommender-for-trailforks-8ea64b1a2fe4&user=Wen+Yang&userId=cbb5383bd438&source=post_page-cbb5383bd438----8ea64b1a2fe4---------------------post_header-----------)
Published in [Towards Data Science](https://towardsdatascience.com/?source=post_page-----8ea64b1a2fe4--------------------------------)
·11 min read·Jan 4, 2023[](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fvote%2Ftowards-data-science%2F8ea64b1a2fe4&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuild-trail-recommender-for-trailforks-8ea64b1a2fe4&user=Wen+Yang&userId=cbb5383bd438&source=-----8ea64b1a2fe4---------------------clap_footer-----------)'
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: '[关注](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fsubscribe%2Fuser%2Fcbb5383bd438&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuild-trail-recommender-for-trailforks-8ea64b1a2fe4&user=Wen+Yang&userId=cbb5383bd438&source=post_page-cbb5383bd438----8ea64b1a2fe4---------------------post_header-----------)
发表在 [Towards Data Science](https://towardsdatascience.com/?source=post_page-----8ea64b1a2fe4--------------------------------)
· 11 分钟阅读 · 2023 年 1 月 4 日 [](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fvote%2Ftowards-data-science%2F8ea64b1a2fe4&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuild-trail-recommender-for-trailforks-8ea64b1a2fe4&user=Wen+Yang&userId=cbb5383bd438&source=-----8ea64b1a2fe4---------------------clap_footer-----------)'
- en: --
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: --
- en: '[](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fbookmark%2Fp%2F8ea64b1a2fe4&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuild-trail-recommender-for-trailforks-8ea64b1a2fe4&source=-----8ea64b1a2fe4---------------------bookmark_footer-----------)![](../Images/8c1b04655799b55dc2ea6e187f82f7a4.png)'
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: '[](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fbookmark%2Fp%2F8ea64b1a2fe4&operation=register&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fbuild-trail-recommender-for-trailforks-8ea64b1a2fe4&source=-----8ea64b1a2fe4---------------------bookmark_footer-----------)![](../Images/8c1b04655799b55dc2ea6e187f82f7a4.png)'
- en: Photo by [Kristin Snippe](https://unsplash.com/@frausnippe?utm_source=medium&utm_medium=referral)
on [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral)
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: 摄影:[Kristin Snippe](https://unsplash.com/@frausnippe?utm_source=medium&utm_medium=referral)
摄于 [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral)
- en: Long Background Story
id: totrans-9
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 长背景故事
- en: When I first joined [Outside Inc](https://www.outsideinc.com/) in May 2021,
my job was about building a personalized recommender system to power Outside Feed,
which comprises a wide range of outdoor and active lifestyle content in a mixed
medium format such as articles, videos, Outside films, and podcasts. Our goal
is to inspire people to go outside.
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 当我在 2021 年 5 月首次加入 [Outside Inc](https://www.outsideinc.com/) 时,我的工作是构建一个个性化推荐系统,以驱动
Outside Feed。Outside Feed 包含各种户外和活跃生活方式的内容,以文章、视频、Outside 电影和播客等混合媒介格式呈现。我们的目标是激励人们走出户外。
- en: 'Fast-forward to 17 months later, we have made two major transformations for
Outside Feed’s recommender system:'
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: 快进到 17 个月后,我们对 Outside Feed 的推荐系统进行了两次重大改造:
- en: 'From reverse chronological feed to a hybrid-recommender powered feed, which
includes three classic recsys functionalities: Collaborative Filtering, Content-based
Filtering, and Hot & Trending. Model training and inference are batch-based. To
use the 80–20 rule, this kind of system can get you from 0 to 80% in terms of
providing personalized recommendations with relatively low effort. “Low” effort
here is mainly meant in the sense of algorithm complexity.'
id: totrans-12
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 从逆时间顺序的动态信息流到由混合推荐器驱动的动态信息流,其中包括三种经典的推荐系统功能:协同过滤、基于内容的过滤和热门与趋势。模型训练和推理是基于批次的。使用
80–20 法则,这种系统可以让你在提供个性化推荐时,借助相对较少的努力达到 80%。这里的“低”努力主要是指算法复杂性方面。
- en: From a batch-based hybrid recommendation system to a real-time recommender system.
We started to integrate [Miso.ai](https://miso.ai/) — a third-party tool to make
the best use of clickstream events data collected by MetaRouter, and generate
real-time recommendations. Since we use GraphQL, there are many lessons learned
about wrapping rest API in Apollo as well as monitoring in DataDog, which deserves
its own post for another day.
id: totrans-13
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 从基于批次的混合推荐系统到实时推荐系统。我们开始集成 [Miso.ai](https://miso.ai/) —— 一个第三方工具,用来充分利用 MetaRouter
收集的点击流事件数据,并生成实时推荐。由于我们使用 GraphQL,关于在 Apollo 中封装 REST API 以及在 DataDog 中进行监控,我们学到了很多经验,这些经验值得另写一篇文章。
- en: 'For this post, I documented my experience participating in Outside Innovation
Day, in which I built a prototype trail recommender using trailforks data within
one day. (PS: yes trailforks is part of the Outside family now!) Perhaps it is
because a lot of my coworkers are avid hikers, mountain bikers, and outdoor enthusiasts,
and they know firsthand the joy and relaxation of having a trail recommender to
help discover trails to explore. They all kindly voted for my project and I won
the Outside Innovation Day Award for Best embodying our mission to get people
outside! ( Very fortunate and happy 😁 ~~~).'
id: totrans-14
prefs: []
type: TYPE_NORMAL
zh: 在这篇文章中,我记录了我参加 Outside Innovation Day 的经验,在一天内使用 Trailforks 数据构建了一个原型小径推荐系统。(PS:是的,Trailforks
现在是 Outside 家族的一部分!)也许因为我的很多同事都是热衷的徒步旅行者、山地自行车爱好者和户外运动爱好者,他们亲身体验了有一个小径推荐系统来帮助发现探索小径的乐趣和放松。他们都友好地为我的项目投了票,我赢得了
Outside Innovation Day 最能体现我们使命的奖项!(非常幸运和开心 😁 ~~~)。
- en: 'Below I will share three things:'
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: 以下是我将分享的三件事:
- en: 🏔️ 1\. How to build an 80% personalized recommender system with 20% effort (with
Trailforks data)?
id: totrans-16
prefs: []
type: TYPE_NORMAL
zh: 🏔️ 1\. 如何以 20% 的努力构建 80% 个性化的推荐系统(使用 Trailforks 数据)?
- en: 🏔️ 2\. Things to consider for preparing hackathon type of projects
id: totrans-17
prefs: []
type: TYPE_NORMAL
zh: 🏔️ 2\. 为准备黑客马拉松类型的项目考虑的事项
- en: 🏔️ 3\. Further considerations for productionize recommender system
id: totrans-18
prefs: []
type: TYPE_NORMAL
zh: 🏔️ 3\. 生产化推荐系统的进一步考虑
- en: 🏔️ 1\. The 80–20 Rule of Trail Recommender
id: totrans-19
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 🏔️ 1\. Trail Recommender 的 80–20 法则
- en: There are so many advanced recommender systems, but a hybrid of Collaborative
filtering + Content-based recommenders is really the sweet spot of personalized
recsys. It can cover 80% of the ground with about 20% of engineering effort.
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: 有很多高级推荐系统,但协同过滤 + 基于内容的推荐系统的混合真的是个性化推荐系统的最佳选择。它可以以约 20% 的工程努力覆盖 80% 的内容。
- en: Collaborative filtering involves analyzing the preferences of other users who
have similar tastes to the current user and recommending trails based on those
users’ preferences.
id: totrans-21
prefs: []
type: TYPE_NORMAL
zh: 协同过滤涉及分析其他具有相似口味的用户的偏好,并基于这些用户的偏好推荐小径。
- en: '![](../Images/886a3fc481dc5d4448af5d0f4e0240f7.png)'
id: totrans-22
prefs: []
type: TYPE_IMG
zh: '![](../Images/886a3fc481dc5d4448af5d0f4e0240f7.png)'
- en: image created by Wen Yang
id: totrans-23
prefs: []
type: TYPE_NORMAL
zh: 由 Wen Yang 创建的图像
- en: Content-based filtering, on the other hand, involves analyzing the characteristics
of the trails themselves and recommending trails with similar characteristics
to the user’s preferences.
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: 另一方面,基于内容的过滤涉及分析小径本身的特征,并推荐与用户偏好相似的小径。
- en: '![](../Images/04c0fdaa440136b442203dbff7fe2393.png)'
id: totrans-25
prefs: []
type: TYPE_IMG
zh: '![](../Images/04c0fdaa440136b442203dbff7fe2393.png)'
- en: image by Wen Yang
id: totrans-26
prefs: []
type: TYPE_NORMAL
zh: Wen Yang 制作的图像
- en: Building an 80–20 trail recommender involves three steps.
id: totrans-27
prefs: []
type: TYPE_NORMAL
zh: 构建一个 80–20 小径推荐系统涉及三个步骤。
- en: '**Step 1: First, you need to gather data.**'
id: totrans-28
prefs: []
type: TYPE_NORMAL
zh: '**步骤 1:首先,你需要收集数据。**'
- en: There are two types of data most useful for building recommender systems.
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: 有两种数据对于构建推荐系统特别有用。
- en: '**Interaction dataset**: includes `userid` , `trailid` , and `activitytype`
. The first two are most essential, and the third feature `activitytype` is not
used in my project.'
id: totrans-30
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '**交互数据集**:包括 `userid`、`trailid` 和 `activitytype`。前两个最为重要,而第三个特征 `activitytype`
在我的项目中未使用。'
- en: '![](../Images/53860b726aa3889cc1812f4f65fa82dc.png)'
id: totrans-31
prefs: []
type: TYPE_IMG
zh: '![](../Images/53860b726aa3889cc1812f4f65fa82dc.png)'
- en: Preview Trailforks Interaction dataset
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: 预览 Trailforks 互动数据集
- en: '**Catalog dataset**: includes selected trail information such as the trail’s
location, length, difficulty level, global rank score and etc.'
id: totrans-33
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '**目录数据集**: 包括选定小径的信息,如小径的位置、长度、难度等级、全球排名分数等。'
- en: '![](../Images/8fbb63a0c303d99b4264c46df888e589.png)'
id: totrans-34
prefs: []
type: TYPE_IMG
zh: '![](../Images/8fbb63a0c303d99b4264c46df888e589.png)'
- en: '**Step 2: Data Visualization with Altair**'
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: '**步骤 2: 使用 Altair 进行数据可视化**'
- en: From the interaction dataset, there are only 10 users, and one user alone (userid
= 454369) explored 2524 unique trails. (PS. His name is Trevor and he is our founding
engineer at TrailForks — totally makes sense!)
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: 从互动数据集中,只有 10 个用户,其中一个用户(userid = 454369)探索了 2524 条独特的小径。(PS:他的名字是 Trevor,他是
TrailForks 的创始工程师——这完全合理!)
- en: '[PRE0]'
id: totrans-37
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: '![](../Images/cf1c1f11f32a3e3927c4e614e18975a9.png)'
id: totrans-38
prefs: []
type: TYPE_IMG
zh: '![](../Images/cf1c1f11f32a3e3927c4e614e18975a9.png)'
- en: 'The below code can make a bar chart per variable name:'
id: totrans-39
prefs: []
type: TYPE_NORMAL
zh: 以下代码可以为每个变量名生成一个柱状图:
- en: '[PRE1]'
id: totrans-40
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
- en: 'For example, let’s look at trail direction, physical rating difficulty title,
and country name:'
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: 比如,我们来看看小径方向、物理评分难度标题和国家名称:
- en: '![](../Images/ce682760565f644e135762906c6d43b6.png)'
id: totrans-42
prefs: []
type: TYPE_IMG
zh: '![](../Images/ce682760565f644e135762906c6d43b6.png)'
- en: Initially, I was trying to use `global_rank_score` as the “popularity” feature,
but there are too many records missing this feature. Fortunately, `rating` is
positively correlated to `global_rank_score` . Below is one of my favorite types
of visualizations, which can be done with `altair` fairly easily.
id: totrans-43
prefs: []
type: TYPE_NORMAL
zh: 起初,我尝试使用 `global_rank_score` 作为“流行度”特征,但有太多记录缺少这个特征。幸运的是,`rating` 与 `global_rank_score`
存在正相关。下面是我最喜欢的可视化类型之一,使用 `altair` 可以很容易地完成。
- en: '[PRE2]'
id: totrans-44
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
- en: '![](../Images/ba5b73d62b18134ec20350beb2406f28.png)'
id: totrans-45
prefs: []
type: TYPE_IMG
zh: '![](../Images/ba5b73d62b18134ec20350beb2406f28.png)'
- en: Similarly, we can use `difficulty_title` instead of `physical_rating` to represent
the difficulty level, since the former has fewer missing values than the latter.
id: totrans-46
prefs: []
type: TYPE_NORMAL
zh: 同样,我们可以使用 `difficulty_title` 替代 `physical_rating` 来表示难度等级,因为前者的缺失值比后者少。
- en: Moderate → Blue, Black Diamond
id: totrans-47
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 中等 → 蓝色,黑钻
- en: Hard → Black Diamond, Blue
id: totrans-48
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 困难 → 黑钻,蓝色
- en: Extreme → Double Black Diamond
id: totrans-49
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 极端 → 双黑钻
- en: Easy → Green
id: totrans-50
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 简单 → 绿色
- en: '[PRE3]'
id: totrans-51
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: '![](../Images/0ead52bffcd14b16db778c925d14fdfa.png)'
id: totrans-52
prefs: []
type: TYPE_IMG
zh: '![](../Images/0ead52bffcd14b16db778c925d14fdfa.png)'
- en: '**Step 3a: Build a collaborative filtering model using** `**implicit**`'
id: totrans-53
prefs: []
type: TYPE_NORMAL
zh: '**步骤 3a: 使用** `**implicit**` **构建协同过滤模型**'
- en: 'For collaborative filtering, the only dataset you need is the interaction dataset.
Here I calculated three other features:'
id: totrans-54
prefs: []
type: TYPE_NORMAL
zh: 对于协同过滤,你所需要的唯一数据集是互动数据集。这里我计算了三个其他特征:
- en: '`n_times_interacted` : how many times has this user interacted with this trail'
id: totrans-55
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '`n_times_interacted` : 该用户与此小径互动了多少次'
- en: '`n_trails_interacted` : how many trails were explored by this user'
id: totrans-56
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '`n_trails_interacted` : 该用户探索了多少条小径'
- en: '`n_users_interacted` : how many unique users explored this trail'
id: totrans-57
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '`n_users_interacted` : 多少独特用户探索了这条小径'
- en: 'The idea is the use `n_times_interacted` as an implicit feedback feature: the
more times a user explored a certain trail, the more confidence that the user
likes this trail.'
id: totrans-58
prefs: []
type: TYPE_NORMAL
zh: 思路是使用 `n_times_interacted` 作为隐式反馈特征:用户探索某条小径的次数越多,就越能确定用户喜欢这条小径。
- en: '![](../Images/c773975dfc96646a4944a445c1ff8d27.png)'
id: totrans-59
prefs: []
type: TYPE_IMG
zh: '![](../Images/c773975dfc96646a4944a445c1ff8d27.png)'
- en: 'Model Training:'
id: totrans-60
prefs: []
type: TYPE_NORMAL
zh: 模型训练:
- en: '[PRE4]'
id: totrans-61
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
- en: 'Useful functions for inference:'
id: totrans-62
prefs: []
type: TYPE_NORMAL
zh: 推理的有用函数:
- en: '[PRE5]'
id: totrans-63
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
- en: '🌄 DEMO: recommend trails for Trevor using Collaborative Filtering'
id: totrans-64
prefs: []
type: TYPE_NORMAL
zh: '🌄 DEMO: 使用协同过滤为 Trevor 推荐小径'
- en: 'I used `pandas-profiling` to find out Trevor’s preference on trails, and here’s
the observation:'
id: totrans-65
prefs: []
type: TYPE_NORMAL
zh: 我使用了 `pandas-profiling` 来了解 Trevor 对小径的偏好,以下是观察结果:
- en: 'physical_rating: moderate, hard'
id: totrans-66
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 'physical_rating: 中等,困难'
- en: 'difficulty_title: Blue, Black Diamond'
id: totrans-67
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 'difficulty_title: 蓝色,黑钻'
- en: 'trailtype: Singletrack'
id: totrans-68
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 'trailtype: 单轨'
- en: 'direction: Downhill Only'
id: totrans-69
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 方向:仅限下坡
- en: 'most explored trail: Fitzsimmons Connector (319 times!)'
id: totrans-70
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 最多被探索的小径:Fitzsimmons Connector(319 次!)
- en: '![](../Images/076484d011994423377d05dba4643b31.png)![](../Images/f47ad8752446f8cdccdf56e5a08092f7.png)'
id: totrans-71
prefs: []
type: TYPE_IMG
zh: '![](../Images/076484d011994423377d05dba4643b31.png)![](../Images/f47ad8752446f8cdccdf56e5a08092f7.png)'
- en: Let’s get recommendations from the Collaborative Filtering Model →
id: totrans-72
prefs: []
type: TYPE_NORMAL
zh: 让我们从协同过滤模型中获取推荐 →
- en: '![](../Images/e67b2d492610b6b63dcbb6ea3f458ee1.png)'
id: totrans-73
prefs: []
type: TYPE_IMG
zh: '![](../Images/e67b2d492610b6b63dcbb6ea3f458ee1.png)'
- en: 'And the photo and map for the first recommended trail:'
id: totrans-74
prefs: []
type: TYPE_NORMAL
zh: 以及第一个推荐小径的照片和地图:
- en: '![](../Images/b06915379122cf1299c6aece3e2436b9.png)'
id: totrans-75
prefs: []
type: TYPE_IMG
zh: '![](../Images/b06915379122cf1299c6aece3e2436b9.png)'
- en: '**Step 3b: Content-based Filtering**'
id: totrans-76
prefs: []
type: TYPE_NORMAL
zh: '**步骤 3b: 基于内容的过滤**'
- en: Since Collaborative filtering is based on the interaction dataset, this means
that it won’t work for the cold-user scenario. If a new user hasn’t explored any
trails, we won’t be able to find similar users to this new user. That’s why we
need Content-based filtering to fill the gap.
id: totrans-77
prefs: []
type: TYPE_NORMAL
zh: 由于协同过滤基于交互数据集,这意味着它在冷启动用户场景下不起作用。如果新用户没有探索任何路线,我们将无法找到与该新用户相似的用户。这就是为什么我们需要基于内容的过滤来填补这个空白。
- en: 'Content-based filtering is all about finding “similar trails”, therefore we
need to decide two things:'
id: totrans-78
prefs: []
type: TYPE_NORMAL
zh: 基于内容的过滤完全是关于寻找“相似的路线”,因此我们需要决定两件事:
- en: similar in what characteristics
id: totrans-79
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 在什么特征上相似
- en: how to measure similarity
id: totrans-80
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 如何测量相似性
- en: 'The first question is a feature engineering question. The below features are
selected as trail characteristics:'
id: totrans-81
prefs: []
type: TYPE_NORMAL
zh: 第一个问题是特征工程问题。以下特征被选为路线特征:
- en: 'Numerical features: `stat_climb` , `stat_descent` , `stat_distance` , `rating`'
id: totrans-82
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 数值特征:`stat_climb` 、`stat_descent` 、`stat_distance` 、`rating`
- en: 'Text features: `title`, `difficulty_title` , `trailtype` , `direction` and
`country_title`'
id: totrans-83
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 文本特征:`title`、`difficulty_title`、`trailtype`、`direction` 和 `country_title`
- en: As for the second question, we use cosine similarity which is a commonly used
measure in terms of RecSys practice.
id: totrans-84
prefs: []
type: TYPE_NORMAL
zh: 对于第二个问题,我们使用余弦相似度,这是在推荐系统实践中常用的度量方法。
- en: 💻 Code example for feature engineering and similarity calculation
id: totrans-85
prefs: []
type: TYPE_NORMAL
zh: 💻 特征工程和相似度计算的代码示例
- en: '[PRE6]'
id: totrans-86
prefs: []
type: TYPE_PRE
zh: '[PRE6]'
- en: '🌄 DEMO: recommend trails for Trevor using Content-based Filtering'
id: totrans-87
prefs: []
type: TYPE_NORMAL
zh: 🌄 DEMO:使用基于内容的过滤为 Trevor 推荐路线
- en: Let’s find trails similar to Trevor’s second most explored trail (‘A-Line —
Lower’)
id: totrans-88
prefs: []
type: TYPE_NORMAL
zh: 让我们找到与 Trevor 的第二条最常探索的路线(‘A-Line — Lower’)相似的路线
- en: '![](../Images/6828e474a35e00a6a97a82eff7f6e82d.png)![](../Images/0986a632ebcd8e64f77ca19a62f4c360.png)'
id: totrans-89
prefs: []
type: TYPE_IMG
zh: '![](../Images/6828e474a35e00a6a97a82eff7f6e82d.png)![](../Images/0986a632ebcd8e64f77ca19a62f4c360.png)'
- en: Wow! The content-based recommender is pretty impressive, which recommended most
medium to hard difficulty levels, and downhill direction trails. My coworker Trevor
is a serious mountain biker, and he’s quite pleased about such recommendations!
id: totrans-90
prefs: []
type: TYPE_NORMAL
zh: 哇!基于内容的推荐器非常令人印象深刻,它推荐了大多数中等到困难难度的山地车路线和下坡方向的路线。我的同事 Trevor 是一个资深的山地车骑行者,他对这样的推荐非常满意!
- en: OK, now the hard work is done. I’d like to share a few reflections.
id: totrans-91
prefs: []
type: TYPE_NORMAL
zh: 好吧,现在艰苦的工作已经完成。我想分享一些反思。
- en: 🏔️ 2\. Things to consider for preparing hackathon type of projects
id: totrans-92
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 🏔️ 2. 准备黑客马拉松类型项目时需要考虑的事项
- en: 'The biggest lesson I learned from participating in hackathon type of projects
is that “less is more”. It’s very tempting to build a recommender system including
both the data science component, the backend database component, and the frontend
or just streamlit component for a nicer demo. But hackathons are typical with
time limitations, you need to be aware of which parts you are willing to sacrifice
a bit to give time for other things. For me, such parts are:'
id: totrans-93
prefs: []
type: TYPE_NORMAL
zh: 从参与黑客马拉松类型的项目中我学到的最大教训是“少即是多”。尽管很诱人去构建一个包括数据科学组件、后端数据库组件以及前端或只是用于更好演示的 streamlit
组件的推荐系统,但黑客马拉松通常时间有限,你需要意识到哪些部分你愿意稍微牺牲一点,以便留出时间做其他事情。对我来说,这些部分是:
- en: 'EDA and data visualization: exploratory data analysis is like a black hole,
it can suck you in and it feels no obvious ending point. The same goes for data
visualization, you could spend endless time to make one chart pretty and prettier,
but that would only benefit your project from 80 to 82, which is the opposite
of the 80–20 rule. My solutions are `pandas-profiling` and `altair` . Before performing
any data analysis, I would use `pandas-profiling` to check the missing values,
the general distribution, and correlations, this will help me narrow down the
data I can keep and prepare for feature engineering. `altair` is really simple
to use and with a few compacted lines, you can easily make nice-looking bar charts
and scatter+distribution charts. One of my tips is that I would just copy a few
code snippets on the notebook beforehand, and only use those already in my notebook
to remove any further temptations.'
id: totrans-94
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: EDA 和数据可视化:探索性数据分析就像一个黑洞,它可以把你吸进去,并且没有明显的结束点。数据可视化也是如此,你可能会花费无尽的时间来美化一个图表,但这只会将你的项目从
80 提升到 82,这与 80-20 规则相反。我的解决方案是`pandas-profiling` 和 `altair`。在进行任何数据分析之前,我会使用
`pandas-profiling` 来检查缺失值、总体分布和相关性,这将帮助我缩小可以保留的数据范围并为特征工程做准备。`altair` 使用起来非常简单,通过几行紧凑的代码,你可以轻松制作出漂亮的条形图和散点+分布图。我有一个小技巧,就是提前在笔记本中复制几个代码片段,只使用已经在笔记本中的那些代码,以避免任何进一步的诱惑。
- en: The demo is about storytelling. You can spend all-nighters to have the fanciest
and most complex hackathon project, but you need to reserve enough time to prepare
how you want to tell the story. A story can start with a Why (the motivation and
background of why you want to do this project), a What (what happened, what’s
surprising, what works), a little bit of How (keep it at a high level, equations
and code are impressive in itself but it creates unnecessary intimidation and
fatigue for people who only have 5 mins to understand your project). For example,
this post is definitely too long and has too many details to demo standard. Lastly,
adding a bit of drama and humor is always an advantage.
id: totrans-95
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 演示是关于讲故事的。你可以熬夜制作最华丽和最复杂的黑客松项目,但你需要留出足够的时间来准备如何讲述这个故事。一个故事可以从 Why(你为什么要做这个项目的动机和背景)、What(发生了什么,什么令人惊讶,什么有效)、一点点
How(保持在高层次,方程式和代码本身很令人印象深刻,但它会给那些只有5分钟时间理解你的项目的人带来不必要的威慑和疲惫)开始。例如,这篇文章绝对过长,细节过多,不符合演示标准。最后,加入一些戏剧性和幽默感总是一个优势。
- en: 🏔️ 3\. Further considerations for productionize recommender system
id: totrans-96
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 🏔️ 3\. 将推荐系统生产化的进一步考虑
- en: 'Lastly, if you made it this far, and you really want to productionize the prototype
recommender system, there are three pieces of ghost knowledge* I’d like to share:'
id: totrans-97
prefs: []
type: TYPE_NORMAL
zh: 最后,如果你已经看到这里,并且真的想将原型推荐系统投入生产,我有三点幽灵知识*想与大家分享:
- en: '*ghost knowledge (*[*source*](https://vickiboykis.com/2021/03/26/the-ghosts-in-the-data/)*)
:*'
id: totrans-98
prefs: []
type: TYPE_NORMAL
zh: '*幽灵知识 (*[*来源*](https://vickiboykis.com/2021/03/26/the-ghosts-in-the-data/)*)
:*'
- en: Knowledge that is present somewhere in the epistemic community, and is perhaps
readily accessible to some central member of that community, but it is not really
written down anywhere and it’s not clear how to access it.
id: totrans-99
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 知识存在于某个认知社区中,也许某个社区的核心成员很容易获得,但实际上并没有被写下来,也不清楚如何获取它。
- en: 'Batch vs real-time recommender system: batch system sounds simple and straightforward,
but it has its own complications. If you use `implicit` for collaborative filtering
and using `cosine similarity` to retrieve similar content, it is very likely the
latency will not satisfy your frontend API requirement. You can try to improve
it using indexing like FAISS, but it might still be higher than 500ms. That’s
why you might consider doing batch computations for each user and each item. Depending
on the size of users and items, the batch recommendations might take more than
4hours and it will not refresh until you retrain the model. A quick tip is that
only using users have any interactions in the past 90 days, or only do batch updates
for items published in the past 90 days.'
id: totrans-100
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 批量 vs 实时推荐系统:批量系统听起来简单直接,但它有其自身的复杂性。如果你使用`implicit`进行协同过滤并使用`cosine similarity`检索类似内容,很可能延迟无法满足你的前端
API 要求。你可以尝试使用像 FAISS 这样的索引来提高速度,但它可能仍然高于500ms。这就是为什么你可能考虑对每个用户和每个项目进行批量计算。根据用户和项目的规模,批量推荐可能需要超过4小时,并且直到你重新训练模型才会刷新。一个快速提示是,仅使用过去90天内有任何互动的用户,或仅对过去90天内发布的项目进行批量更新。
- en: '“Cold-Start” might not be as severe as you think: I remember when I first learned
the recommender system, books and talks made sure to emphasize the “cold-start”
problem as if it is the most difficult scenario to solve. But for Outside Feed,
this problem is not even top-3\. The reason is that a lot of people are used to
reading the freshest and most recent content on their Outside Feed, and it is
fairly easy to rank content by publish date to surface the new items (cold items),
thus it is easy to guarantee new items get pageview (interactions) data. For us,
the much harder problem is actually “how to recommend ever-green content”? If
we recommend an all-time popular article first written in 2016, even though this
piece of content might be relevant to users'' tastes, users might get the impression
that we didn’t have fresh content to recommend.'
id: totrans-101
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: “Cold-Start” 可能并不像你想象的那么严重:我记得当我第一次学习推荐系统时,书籍和讲座总是强调“cold-start”问题,好像这是最难解决的场景。然而,对于
Outside Feed 来说,这个问题甚至不在前3名。原因是很多人习惯于在 Outside Feed 上阅读最新的内容,因此通过发布日期对内容进行排序,以便展示新项(冷项)是相当容易的,从而能够保证新项获得页面浏览(互动)数据。对我们来说,更难的问题实际上是“如何推荐永恒的内容”?如果我们推荐一篇最初于2016年撰写的全时热门文章,即使这篇内容可能符合用户的口味,用户可能会觉得我们没有新鲜的内容可以推荐。
- en: 'The third piece of knowledge is an anecdote I heard from another recommender
system practitioner (Andy - the CTO from [miso.ai](https://miso.ai/)): he once
built a homepage recommender for a clothing website. Even though the learn-to-rank
ranking system learned that the “Black” color is likely to be the most popular
one, thus it recommended the first page with all clothing in different styles
but all in black color. Well, the model isn’t wrong, but it looks terrible at
first glance and it performed poorly in terms of enticing people to buy clothes.
He joked that he’s been studying recommender systems for years and years, but
nobody mentioned that a good color complementary design might have a huge effect.'
id: totrans-102
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 第三个知识点是我从另一位推荐系统从业者(Andy - 来自[miso.ai](https://miso.ai/)的CTO)那里听到的趣闻:他曾经为一个服装网站构建了一个主页推荐系统。尽管学习排名的系统了解到“黑色”可能是最受欢迎的颜色,因此推荐了首页上全是黑色的不同风格的服装。好吧,模型并没有错,但乍一看效果很糟糕,并且在吸引人们购买衣物方面表现不佳。他开玩笑说,他研究推荐系统已经很多年了,但没有人提到好的颜色搭配设计可能会有很大的影响。
- en: That’s it for my last medium post in the year 2022\. Thank you tremendously
for the support. If you have any thoughts or reflections or better yet — more
recsys anecdotes, please send them my way! Love to compile them into the fascinating
book I dream to write < The Pleasure and Sorrow of Building RecSys> 🐳~~~
id: totrans-103
prefs: []
type: TYPE_NORMAL
zh: 这就是我在2022年的最后一篇中等帖子。非常感谢你的支持。如果你有任何想法或反思,或者更好——更多的推荐系统趣闻,请发送给我!我很乐意将它们编入我梦想写的迷人书籍《构建推荐系统的乐趣与悲哀》中
< The Pleasure and Sorrow of Building RecSys> 🐳~~~