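Recall 01: Item-Based Collaborative Filtering (ItemCF)
Item-Based Collaborative Filtering, abbreviated ItemCF
How ItemCF works: if a user likes item 1, and item 1 is similar to item 2, then the user will probably like item 2 as well.
1. How to compute the similarity between two items.
2. How to estimate a user's interest in a candidate item.
3. How to use indexes to perform recall quickly online.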
![](http://139.9.1.231/wp-content/uploads/2023/02/image-69-1024x441.png)
Implementing ItemCF
![](http://139.9.1.231/wp-content/uploads/2023/02/image-70-1024x564.png)
The more the audiences of two items overlap, the more similar the two items are.
![](http://139.9.1.231/wp-content/uploads/2023/02/image-71.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-72.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-73.png)
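The audience-overlap idea above can be sketched in a few lines. This is a minimal illustration, assuming the common cosine-style form (shared users divided by the geometric mean of the two audience sizes) and ignoring how strongly each user liked the item. The `liked_by` data and the function name are made up for the example.

```python
import math

def item_similarity(users_a: set, users_b: set) -> float:
    """Cosine-style overlap between the audiences of two items."""
    if not users_a or not users_b:
        return 0.0
    common = users_a & users_b
    return len(common) / math.sqrt(len(users_a) * len(users_b))

# Hypothetical toy data: item -> set of users who liked it.
liked_by = {
    "item1": {"u1", "u2", "u3", "u4"},
    "item2": {"u2", "u3", "u4", "u5"},
    "item3": {"u9"},
}

print(item_similarity(liked_by["item1"], liked_by["item2"]))  # 0.75
print(item_similarity(liked_by["item1"], liked_by["item3"]))  # 0.0
```

Items 1 and 2 share three of their four users, so they score high. Item 3 shares no audience with item 1 and scores zero.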
The complete ItemCF pipeline:
Step 1: offline precomputation
![](http://139.9.1.231/wp-content/uploads/2023/02/image-74.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-75-1024x350.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-76-1024x353.png)
Step 2: online recall:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-77.png)
Why use indexes:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-78.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-79-1024x473.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-80.png)
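The two-step flow can be sketched as follows, assuming two precomputed indexes: a user-to-item index (recently liked items with an interest score) and an item-to-item index (the most similar items with a similarity score). A candidate's score is taken to be the sum of interest times similarity over the items that led to it. All names, parameters, and data here are hypothetical.

```python
from collections import defaultdict

def itemcf_recall(user_id, user_to_items, item_to_items, last_n=2, top_k=2, limit=3):
    """Score each candidate as sum(like(user, item) * sim(item, candidate))."""
    scores = defaultdict(float)
    recent = user_to_items.get(user_id, [])[:last_n]
    seen = {item for item, _ in recent}
    for item, like in recent:
        for candidate, sim in item_to_items.get(item, [])[:top_k]:
            if candidate not in seen:  # do not recommend what the user already has
                scores[candidate] += like * sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]

# Hypothetical indexes, both built offline.
user_to_items = {"u1": [("a", 2.0), ("b", 1.0)]}
item_to_items = {
    "a": [("c", 0.9), ("d", 0.5)],
    "b": [("c", 0.4), ("e", 0.8)],
}

result = itemcf_recall("u1", user_to_items, item_to_items)
print([candidate for candidate, _ in result])  # ['c', 'd', 'e']
```

Because both lookups hit precomputed indexes, the online cost is at most `last_n * top_k` score updates per request, no matter how large the item catalog is.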
Recall 02: The Swing recall channel
ItemCF (drawbacks):
![](http://139.9.1.231/wp-content/uploads/2023/02/image-81-1024x627.png)
The Swing model (assigns weights to users):
![](http://139.9.1.231/wp-content/uploads/2023/02/image-82.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-83.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-84.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-85.png)
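A minimal sketch of Swing similarity, assuming the commonly cited form: for every pair of users who both like the two items, add `1 / (alpha + overlap)`, where `overlap` is the number of items the two users have in common. Summing over unordered pairs of distinct users is an assumption of this sketch. The data is invented.

```python
from itertools import combinations

def swing_similarity(item_a, item_b, user_likes, alpha=1.0):
    """User pairs with heavily overlapping tastes contribute less."""
    both = [u for u, items in user_likes.items() if item_a in items and item_b in items]
    score = 0.0
    for u1, u2 in combinations(both, 2):
        overlap = len(user_likes[u1] & user_likes[u2])
        score += 1.0 / (alpha + overlap)
    return score

# Hypothetical data: u1 and u2 behave like members of the same small circle.
user_likes = {
    "u1": {"x", "y", "p", "q", "r"},
    "u2": {"x", "y", "p", "q", "r"},
    "u3": {"x", "y"},
}

sim = swing_similarity("x", "y", user_likes)
print(round(sim, 4))  # 0.8333
```

The pair (u1, u2) overlaps on five items and adds only 1/6, while each pair involving u3 overlaps on two items and adds 1/3.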
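Recall 03: User-Based Collaborative Filtering (UserCF)
How UserCF works: if user 1 is similar to user 2, and user 2 likes an item, then user 1 will probably like that item too.
Key questions:
1. How to compute the similarity between two users.
2. How to estimate a user's interest in a candidate item.
3. How to use indexes to perform recall quickly online.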
![](http://139.9.1.231/wp-content/uploads/2023/02/image-86-1024x445.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-87-1024x597.png)
User similarity:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-88.png)
A popular item is likely to be liked by both users no matter how similar they really are, so popular items need to be down-weighted:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-89-1024x610.png)
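The down-weighting can be sketched like this, assuming each shared item contributes `1 / log(1 + n)`, where `n` is the number of users who like that item, and the sum is normalized by the geometric mean of the two users' item counts. The function name and data are hypothetical.

```python
import math

def user_similarity(items_a: set, items_b: set, item_popularity: dict) -> float:
    """Shared niche items count for more than shared popular items."""
    if not items_a or not items_b:
        return 0.0
    # Every shared item is liked by at least these two users, so log(1 + n) > 0.
    weighted = sum(1.0 / math.log(1 + item_popularity[i]) for i in items_a & items_b)
    return weighted / math.sqrt(len(items_a) * len(items_b))

# Hypothetical popularity counts: number of users who like each item.
popularity = {"hit": 10000, "niche": 10, "x": 50, "y": 50}

sim_via_hit = user_similarity({"hit", "x"}, {"hit", "y"}, popularity)
sim_via_niche = user_similarity({"niche", "x"}, {"niche", "y"}, popularity)
print(sim_via_niche > sim_via_hit)  # True
```

Both pairs of users share exactly one item, but sharing the niche item yields the higher similarity.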
The complete UserCF recall pipeline
Step 1: offline computation
![](http://139.9.1.231/wp-content/uploads/2023/02/image-90.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-91-1024x462.png)
Step 2: online recall
![](http://139.9.1.231/wp-content/uploads/2023/02/image-92.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-93-1024x459.png)
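A sketch of the online step, assuming a user-to-user index (the most similar users with a similarity score) and a user-to-item index (recently liked items with an interest score), both built offline. A candidate is scored as the sum of user similarity times the neighbor's interest. Names and data are hypothetical.

```python
from collections import defaultdict

def usercf_recall(user_id, user_to_users, user_to_items, top_k=2, last_n=2, limit=3):
    """Score each candidate as sum(sim(user, neighbor) * like(neighbor, item))."""
    scores = defaultdict(float)
    seen = {item for item, _ in user_to_items.get(user_id, [])}
    for neighbor, sim in user_to_users.get(user_id, [])[:top_k]:
        for item, like in user_to_items.get(neighbor, [])[:last_n]:
            if item not in seen:
                scores[item] += sim * like
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]

# Hypothetical indexes.
user_to_users = {"u1": [("u2", 0.9), ("u3", 0.5)]}
user_to_items = {
    "u1": [("a", 1.0)],
    "u2": [("a", 1.0), ("b", 2.0)],
    "u3": [("b", 1.0), ("c", 3.0)],
}

recalled = usercf_recall("u1", user_to_users, user_to_items)
print([item for item, _ in recalled])  # ['b', 'c']
```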
Recall 04: Vector recall, handling discrete features
One-hot encoding and embedding
![](http://139.9.1.231/wp-content/uploads/2023/02/image-94.png)
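The relationship between the two encodings can be shown with a tiny example: looking up a row of the embedding table gives the same result as multiplying the one-hot vector by the table. The category list and the table values below are made up for illustration. A real table would be learned during training.

```python
def one_hot(index, num_categories):
    """Vector of zeros with a single 1 at the category's index."""
    vec = [0.0] * num_categories
    vec[index] = 1.0
    return vec

def embed(index, table):
    """Embedding lookup: pick one row of the table."""
    return table[index]

def one_hot_times_table(one_hot_vec, table):
    """Multiply a one-hot row vector by the embedding table."""
    dim = len(table[0])
    return [sum(one_hot_vec[c] * table[c][d] for c in range(len(table))) for d in range(dim)]

countries = ["China", "US", "India"]
index_of = {name: i for i, name in enumerate(countries)}
table = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]  # hypothetical 3 x 2 embedding table

us = index_of["US"]
print(one_hot(us, len(countries)))                              # [0.0, 1.0, 0.0]
print(embed(us, table))                                         # [0.3, -0.4]
print(one_hot_times_table(one_hot(us, len(countries)), table))  # [0.3, -0.4]
```

The lookup avoids the multiplication entirely, which matters when the number of categories is in the millions.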
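Recall 05: Vector recall, matrix completion and nearest neighbor search
Matrix completion is one vector recall channel. In essence, it embeds the user ID and the item ID and uses the inner product of the two embedding vectors to estimate the user's interest in the item. Note that matrix completion has many drawbacks and performs far worse than the two-tower model in practice. Vector recall requires nearest neighbor search.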
![](http://139.9.1.231/wp-content/uploads/2023/02/image-95.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-96.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-97.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-98.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-99-1024x618.png)
Train on the green entries to predict the gray values, then use those predictions to recall items for the user.
![](http://139.9.1.231/wp-content/uploads/2023/02/image-100.png)
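A toy sketch of matrix completion, assuming plain stochastic gradient descent on squared error over the observed entries only. The hyperparameters, the function names, and the six observed (user, item, interest) triples are all invented for illustration.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train(observed, num_users, num_items, dim=4, lr=0.05, epochs=300, seed=0):
    """Learn user embeddings A and item embeddings B so that dot(A[u], B[i]) fits y."""
    rng = random.Random(seed)
    A = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_users)]
    B = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_items)]
    for _ in range(epochs):
        for u, i, y in observed:
            err = y - dot(A[u], B[i])
            for d in range(dim):
                a, b = A[u][d], B[i][d]
                A[u][d] += lr * err * b  # gradient step on the user embedding
                B[i][d] += lr * err * a  # gradient step on the item embedding
    return A, B

def mse(observed, A, B):
    return sum((y - dot(A[u], B[i])) ** 2 for u, i, y in observed) / len(observed)

# Hypothetical observed entries (the "green" cells).
observed = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 1.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 0.0)]

A0, B0 = train(observed, 3, 3, epochs=0)  # untrained baseline
A, B = train(observed, 3, 3)
print("MSE before:", mse(observed, A0, B0), "after:", mse(observed, A, B))
print("predicted interest of user 0 in item 2 (a 'gray' cell):", dot(A[0], B[2]))
```

Only ID embeddings are learned here, so the model sees no user or item attributes. That is one of the drawbacks the two-tower model addresses.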
Approximate nearest neighbor search:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-101.png)
Step 1: partition the vectors into regions
![](http://139.9.1.231/wp-content/uploads/2023/02/image-102.png)
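Step 2: represent each region with a single vector. Given a query vector, find the closest region vector and return the items in that region.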
![](http://139.9.1.231/wp-content/uploads/2023/02/image-103.png)
Step 3: compute similarities only against the items in that region.
![](http://139.9.1.231/wp-content/uploads/2023/02/image-104.png)
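The three steps can be sketched as below. The region vectors are assumed to be given (in practice they might come from a clustering step such as k-means), and cosine similarity is used throughout. The vectors and names are hypothetical, and this is not how any particular ANN library is implemented.

```python
import math

def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def nearest_region(vec, centroids):
    return max(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))

def build_region_index(item_vectors, centroids):
    """Steps 1 and 2: assign every item to the region whose vector is closest."""
    index = {c: [] for c in range(len(centroids))}
    for item_id, vec in item_vectors.items():
        index[nearest_region(vec, centroids)].append(item_id)
    return index

def ann_search(query, item_vectors, centroids, index, top_k=2):
    """Step 3: compare the query only with items in its nearest region."""
    candidates = index[nearest_region(query, centroids)]
    return sorted(candidates, key=lambda i: -cosine(query, item_vectors[i]))[:top_k]

# Hypothetical region vectors and item vectors.
centroids = [[1.0, 0.0], [0.0, 1.0]]
item_vectors = {"a": [0.9, 0.1], "b": [0.8, 0.3], "c": [0.1, 0.9], "d": [0.2, 0.7]}

region_index = build_region_index(item_vectors, centroids)
print(ann_search([1.0, 0.2], item_vectors, centroids, region_index))  # ['a', 'b']
```

The search is approximate because the true nearest neighbor may sit just across a region boundary and never be compared with the query.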
Recall 06: The two-tower model, an upgrade of matrix completion
![](http://139.9.1.231/wp-content/uploads/2023/02/image-105.png)
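The two-tower model, also called DSSM, is the single most important recall channel in recommender systems. It has two towers: a user tower and an item tower. Each tower outputs one vector that represents the user or the item, and the inner product or cosine similarity of the two vectors serves as the estimate of interest. There are three ways to train the two-tower model: pointwise, pairwise, and listwise.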
![](http://139.9.1.231/wp-content/uploads/2023/02/image-106-1024x596.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-107-1024x575.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-108-1024x569.png)
Training the two-tower model:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-109.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-110.png)
pointwise:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-111.png)
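Pointwise training treats each (user, item) pair as an independent binary classification example. One possible loss is sketched below, assuming a logistic loss that pushes the cosine similarity toward +1 for positives and toward -1 for negatives. The `scale` factor is a made-up hyperparameter, and the exact loss in the slides may differ.

```python
import math

def pointwise_loss(cos_sim: float, label: int, scale: float = 5.0) -> float:
    """Logistic loss on one (user, item) pair; label is 1 (positive) or 0 (negative)."""
    sign = 1.0 if label == 1 else -1.0
    return math.log(1.0 + math.exp(-sign * scale * cos_sim))

# A positive pair is cheap when the vectors agree and costly when they disagree.
print(pointwise_loss(0.9, 1) < pointwise_loss(-0.9, 1))  # True
# A negative pair behaves the other way around.
print(pointwise_loss(-0.9, 0) < pointwise_loss(0.9, 0))  # True
```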
pairwise:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-112-1024x514.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-113.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-114.png)
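Pairwise training looks at a triplet (user, positive item, negative item) and asks the positive to score higher than the negative. Two common losses are sketched here, assuming the standard triplet hinge and triplet logistic forms. `margin` and `sigma` are hyperparameters chosen only for this example.

```python
import math

def triplet_hinge_loss(cos_pos: float, cos_neg: float, margin: float = 0.5) -> float:
    """Zero once the positive beats the negative by at least the margin."""
    return max(0.0, cos_neg + margin - cos_pos)

def triplet_logistic_loss(cos_pos: float, cos_neg: float, sigma: float = 5.0) -> float:
    """Smooth variant: always positive, shrinking as the gap grows."""
    return math.log(1.0 + math.exp(sigma * (cos_neg - cos_pos)))

print(triplet_hinge_loss(0.9, 0.1))  # 0.0, the gap 0.8 exceeds the margin
print(triplet_hinge_loss(0.3, 0.2) > 0.0)  # True, the gap 0.1 is too small
print(triplet_logistic_loss(0.9, 0.1) < triplet_logistic_loss(0.3, 0.2))  # True
```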
listwise:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-115.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-116-1024x478.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-117.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-118.png)
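Listwise training takes one positive and several negatives at once. A minimal sketch, assuming the usual construction: apply softmax over the cosine similarities and minimize the cross-entropy against the positive. `temperature` is a hypothetical knob added for illustration.

```python
import math

def listwise_loss(cos_pos: float, cos_negs: list, temperature: float = 1.0) -> float:
    """Negative log of the softmax probability assigned to the positive item."""
    logits = [cos_pos / temperature] + [c / temperature for c in cos_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

good = listwise_loss(0.9, [-0.5, -0.8])  # positive clearly ahead of the negatives
bad = listwise_loss(0.1, [0.5, 0.8])     # negatives score higher than the positive
print(good < bad)  # True
```

Minimizing this loss raises the positive's cosine and lowers the negatives' cosines at the same time.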
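Recall 07: The two-tower model, choosing positive and negative samples
This part covers how to choose positive and negative samples for the two-tower model (also called DSSM). Positive samples are items the user clicked. Negative samples are items eliminated during recall or ranking, and they are divided into easy negatives and hard negatives.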
![](http://139.9.1.231/wp-content/uploads/2023/02/image-119.png)
Negative samples:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-120-1024x343.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-121.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-122.png)
Easy negatives: in-batch negatives
![](http://139.9.1.231/wp-content/uploads/2023/02/image-123-1024x472.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-124.png)
Popular items are sampled as negatives too often. The fix:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-125.png)
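One widely used fix is a sampling-bias correction: during training, subtract the log of each item's sampling probability from its score, and use the plain cosine at serving time. The sketch below assumes that correction together with an in-batch softmax loss in which the diagonal holds the positives. Function names and probabilities are hypothetical.

```python
import math

def corrected_logits(cos_row, sampling_probs):
    """Training-time score: cos(user, item_j) - log p_j."""
    return [c - math.log(p) for c, p in zip(cos_row, sampling_probs)]

def in_batch_softmax_loss(cos_matrix, sampling_probs):
    """cos_matrix[i][j] = cos(user_i, item_j); item_i is user_i's positive."""
    total = 0.0
    for i, row in enumerate(cos_matrix):
        logits = corrected_logits(row, sampling_probs)
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum - logits[i]
    return total / len(cos_matrix)

# Item 0 is popular (sampled often), item 1 is rare.
probs = [0.5, 0.01]
adjusted = corrected_logits([0.2, 0.2], probs)
print(adjusted[1] > adjusted[0])  # True
print(in_batch_softmax_loss([[0.8, 0.1], [0.2, 0.7]], probs))
```

With equal cosines, the popular item ends up with the lower corrected score, so as a negative it is pushed down less. This offsets how often it appears as a negative.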
Hard negatives:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-126.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-127.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-128-1024x658.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-129.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-130.png)
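Recall 08: Online serving and model updates for the two-tower model
Before online serving starts, the item vectors must be stored in a vector database or ANN library such as Milvus, Faiss, or HnswLib, so that nearest neighbor search (exact KNN or approximate ANN) can run against them. When a user sends a recommendation request, the user tower computes a user vector on the fly from the user ID and user profile, and that vector is used as the query for nearest neighbor search in the vector database.
The model needs regular updates, which come in two kinds: full updates (daily) and incremental updates (real-time). A full update trains the whole model, including the embedding layers and the fully connected layers. An incremental update trains only the embedding layers.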
![](http://139.9.1.231/wp-content/uploads/2023/02/image-131-1024x629.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-132.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-133.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-134.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-135.png)
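The offline/online split can be sketched with an in-memory stand-in for the vector database. `VectorStore` and `user_tower` below are hypothetical placeholders: a real system would call a vector database client and a trained neural network, and would use ANN rather than the brute-force scan shown here.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

class VectorStore:
    """In-memory stand-in for a vector database; brute-force cosine search."""

    def __init__(self):
        self._vectors = {}

    def add(self, item_id, vector):
        self._vectors[item_id] = normalize(vector)

    def search(self, query, top_k):
        q = normalize(query)
        scored = [(sum(a * b for a, b in zip(q, v)), item_id)
                  for item_id, v in self._vectors.items()]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [item_id for _, item_id in scored[:top_k]]

def user_tower(id_embedding, profile_features):
    """Hypothetical stand-in for the user tower (a real one is a neural network)."""
    return [e + f for e, f in zip(id_embedding, profile_features)]

# Offline: item vectors are computed once and stored.
store = VectorStore()
store.add("note_sports", [1.0, 0.0])
store.add("note_food", [0.0, 1.0])
store.add("note_travel", [0.7, 0.7])

# Online: the user vector is computed at request time and used as the query.
query = user_tower([0.8, 0.1], [0.1, 0.0])
print(store.search(query, top_k=2))  # ['note_sports', 'note_travel']
```

Item vectors are precomputed because there are far too many items to run the item tower per request. Only one user vector is needed per request, so it can be computed live and reflect the user's latest state.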
Model updates:
Full update: update the user and item vectors
![](http://139.9.1.231/wp-content/uploads/2023/02/image-136.png)
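Incremental update: refresh the user-side model every few hours
Why this is needed: a user's interests can change at any moment, so the user representation has to be refreshed continuously. Only the user embedding parameters are updated. The other parameters stay fixed.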
![](http://139.9.1.231/wp-content/uploads/2023/02/image-137.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-138-1024x429.png)
Note: each day's full update starts training from the model produced by the previous day's full update.
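The interplay of the two update types can be sketched with a toy model. Everything here is a simplification for illustration: the model is a plain dict, each user embedding is a single number, and the "gradients" are supplied directly instead of being computed from data.

```python
import copy

def full_update(yesterday_full_model, day_gradients):
    """Train every parameter, starting from yesterday's full checkpoint."""
    model = copy.deepcopy(yesterday_full_model)
    for uid, grad in day_gradients["user_embedding"].items():
        model["user_embedding"][uid] = model["user_embedding"].get(uid, 0.0) - grad
    model["dense"] = [w - g for w, g in zip(model["dense"], day_gradients["dense"])]
    return model

def incremental_update(serving_model, stream_gradients):
    """Online learning: update only the ID embeddings, keep dense layers frozen."""
    model = copy.deepcopy(serving_model)
    for uid, grad in stream_gradients.items():
        model["user_embedding"][uid] = model["user_embedding"].get(uid, 0.0) - grad
    return model

full_day1 = {"user_embedding": {"u1": 1.0}, "dense": [0.5, 0.5]}

# During the day: incremental updates change embeddings only.
serving = incremental_update(full_day1, {"u1": 0.25, "u2": -0.5})
print(serving)  # {'user_embedding': {'u1': 0.75, 'u2': 0.5}, 'dense': [0.5, 0.5]}

# Next day: the full update starts from full_day1, not from the serving model.
full_day2 = full_update(full_day1, {"user_embedding": {"u1": 0.5}, "dense": [0.25, 0.0]})
print(full_day2)  # {'user_embedding': {'u1': 0.5}, 'dense': [0.25, 0.5]}
```

In this sketch the incrementally updated serving model is discarded once the next full model is ready. The full model was trained on the complete day of data, so it does not lose anything the incremental steps had learned.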
Question: can we do only incremental updates and skip full updates?
![](http://139.9.1.231/wp-content/uploads/2023/02/image-139.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-140.png)
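Recall 09: Improving the two-tower model with self-supervised learning
Self-supervised learning is a way to improve the two-tower model, and applying it can lift business metrics. The method was proposed by Google in 2021 and has been widely validated in industry, including at Xiaohongshu.
The long-tail effect: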
![](http://139.9.1.231/wp-content/uploads/2023/02/image-141.png)
Reference: Tiansheng Yao et al. Self-supervised Learning for Large-scale Item Recommendations. In CIKM, 2021.
![](http://139.9.1.231/wp-content/uploads/2023/02/image-142-1024x662.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-143.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-144.png)
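Self-supervised learning:
Different feature transformations of the same item should yield vectors that are as similar as possible, while vectors of different items should be as different as possible.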
![](http://139.9.1.231/wp-content/uploads/2023/02/image-145-1024x551.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-146-1024x560.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-147.png)
Feature transformation methods:
1. Random Mask
![](http://139.9.1.231/wp-content/uploads/2023/02/image-148.png)
2. Dropout
![](http://139.9.1.231/wp-content/uploads/2023/02/image-149.png)
3. Complementary features
![](http://139.9.1.231/wp-content/uploads/2023/02/image-150.png)
The best approach: random mask applied to an entire group of correlated features at once
![](http://139.9.1.231/wp-content/uploads/2023/02/image-151.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-152.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-153.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-154-1024x563.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-155-1024x525.png)
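Two pieces of the self-supervised setup can be sketched in plain Python: a random-mask transformation that replaces feature values with a default, and a contrastive loss that pulls two views of the same item together while pushing different items apart. The loss is assumed to be softmax cross-entropy over cosine similarities. Feature names, the default token, and the vectors are hypothetical.

```python
import math
import random

def random_mask(features, mask_prob, rng, default="<default>"):
    """Replace each feature value with a default token with probability mask_prob."""
    return {name: (default if rng.random() < mask_prob else value)
            for name, value in features.items()}

def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def contrastive_loss(views_a, views_b):
    """views_a[i] and views_b[i] are item-tower outputs for two views of item i."""
    total = 0.0
    for i, a in enumerate(views_a):
        logits = [cosine(a, b) for b in views_b]
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum - logits[i]  # the matching view is the target class
    return total / len(views_a)

rng = random.Random(0)
item = {"category": "photography", "city": "Shanghai"}
print(random_mask(item, 0.5, rng))

aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

In training, the two views would come from passing two differently transformed copies of an item's features through the item tower. The vectors above are written by hand to keep the example short.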
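Recall 10: Other recall channels
Geographic recall includes GeoHash recall and same-city recall. Author recall covers authors the user follows, authors the user has interacted with, and similar authors. Cache recall stores notes that scored high in fine ranking but have not yet been exposed.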
![](http://139.9.1.231/wp-content/uploads/2023/02/image-156.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-157-1024x283.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-158.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-159.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-160.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-161.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-162.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-163.png)
![](http://139.9.1.231/wp-content/uploads/2023/02/image-164.png)
Six other recall channels:
![](http://139.9.1.231/wp-content/uploads/2023/02/image-165.png)
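Most of these channels reduce to chained index lookups. Two of them are sketched below, assuming a GeoHash-to-notes index and a user-to-authors plus author-to-notes pair of indexes, each list already sorted with the preferred notes first. The GeoHash strings, IDs, and function names are invented for illustration.

```python
def geohash_recall(user_geohash, geohash_to_notes, limit=2):
    """Return the top notes published in the user's GeoHash cell."""
    return geohash_to_notes.get(user_geohash, [])[:limit]

def followed_author_recall(user_id, user_to_authors, author_to_notes, per_author=1):
    """Return the latest notes from each author the user follows."""
    notes = []
    for author in user_to_authors.get(user_id, []):
        notes.extend(author_to_notes.get(author, [])[:per_author])
    return notes

# Hypothetical indexes.
geohash_to_notes = {"wtw3s": ["n1", "n2", "n3"]}
user_to_authors = {"u1": ["author_a", "author_b"]}
author_to_notes = {"author_a": ["n7", "n4"], "author_b": ["n9"]}

print(geohash_recall("wtw3s", geohash_to_notes))                        # ['n1', 'n2']
print(followed_author_recall("u1", user_to_authors, author_to_notes))  # ['n7', 'n9']
```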