Expectation Estimation via Importance Sampling: A Derivation of Sampled Softmax
1. Background
In recommendation retrieval systems, a two-tower model is typically trained with a log-softmax loss. Let $[B]$ denote the mini-batch, $[C]$ the global corpus, and $s(x, y)$ the similarity score between query $x$ and item $y$. The loss function is:
$$
\begin{aligned}
\mathcal{L} &= -\frac{1}{B} \sum_{i \in [B]} \log \frac{e^{s(x_i, y_i)}}{\sum_{j \in [C]} e^{s(x_i, y_j)}} \\
&= -\frac{1}{B} \sum_{i \in [B]} \Big\{ s(x_i, y_i) - \log \sum_{j \in [C]} e^{s(x_i, y_j)} \Big\}
\end{aligned}
$$
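To make the setup concrete, here is a minimal NumPy sketch of this full-corpus loss; the toy sizes, random embeddings, and variable names are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): B queries, C corpus items, d-dim embeddings
B, C, d = 4, 1000, 16
x = rng.normal(size=(B, d))        # query-tower outputs x_i
y = rng.normal(size=(C, d))        # item-tower outputs y_j
pos = np.arange(B)                 # assume item i is the positive for query i

s = x @ y.T                        # similarity scores s(x_i, y_j), shape (B, C)

# Numerically stable log partition: log sum_j e^{s(x_i, y_j)}
m = s.max(axis=1, keepdims=True)
log_Z = (m + np.log(np.exp(s - m).sum(axis=1, keepdims=True))).squeeze(1)

# L = -(1/B) sum_i { s(x_i, y_i) - log Z_i }
loss = -np.mean(s[np.arange(B), pos] - log_Z)
print(loss)
```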
Taking the gradient of the loss with respect to the model parameters $\theta$:
$$
\begin{aligned}
\nabla_\theta \mathcal{L} &= -\frac{1}{B} \sum_{i \in [B]} \Big\{ \nabla_\theta s(x_i, y_i) - \sum_{j \in [C]} \frac{e^{s(x_i, y_j)}}{\sum_{k \in [C]} e^{s(x_i, y_k)}}\, \nabla_\theta s(x_i, y_j) \Big\} \\
&= -\frac{1}{B} \sum_{i \in [B]} \Big\{ \nabla_\theta s(x_i, y_i) - \sum_{j \in [C]} P(y_j \mid x_i)\, \nabla_\theta s(x_i, y_j) \Big\} \\
&= -\frac{1}{B} \sum_{i \in [B]} \Big\{ \underbrace{\nabla_\theta s(x_i, y_i)}_{\text{part one}} - \underbrace{\mathbb{E}_{P}\big[\nabla_\theta s(x_i, y_j)\big]}_{\text{part two}} \Big\}
\end{aligned}
$$
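The step from the first to the second line relies on the identity that the gradient of the log-partition term with respect to each score is exactly the softmax $P(y_j \mid x_i)$. A quick finite-difference check of this identity on toy scores (a sketch, not part of any training code):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
s = rng.normal(size=n)             # toy scores s(x_i, y_j) for one query

def log_Z(s):
    # Numerically stable log sum_k e^{s_k}
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

P = np.exp(s - log_Z(s))           # softmax, i.e. P(y_j | x_i)

# Central finite difference of d logZ / d s_j for each j
eps = 1e-6
grad = np.array([(log_Z(s + eps * np.eye(n)[j]) - log_Z(s - eps * np.eye(n)[j])) / (2 * eps)
                 for j in range(n)])
print(np.abs(P - grad).max())      # close to 0: the softmax is this gradient
```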
Observe that the second part of the gradient is the expectation of $\nabla_\theta s(x_i, y_j)$ under the target distribution $P$. Because the corpus is enormous, evaluating the partition function is prohibitively expensive, so the expectation (and hence the gradient) must be approximated. A common approach is importance sampling: draw a small set of items and use them to approximate the expectation, which yields sampled softmax. This post derives the sampled softmax formula step by step; it is offered for study, and corrections are welcome.
2. Derivation
Let $P$ be the target distribution and $Q$ a proposal distribution. The basic idea of importance sampling is to sample from the easier-to-sample distribution $Q$ instead:
$$
\begin{aligned}
\mathbb{E}_{P}\big[\nabla_\theta s(x_i, y_j)\big] &= \sum_{j \in [C]} P(y_j \mid x_i)\, \nabla_\theta s(x_i, y_j) \\
&= \sum_{j \in [C]} \frac{P(y_j \mid x_i)}{Q(y_j \mid x_i)}\, Q(y_j \mid x_i)\, \nabla_\theta s(x_i, y_j) \\
&= \mathbb{E}_{Q}\Big[\frac{P(y_j \mid x_i)}{Q(y_j \mid x_i)}\, \nabla_\theta s(x_i, y_j)\Big] \\
&\approx \frac{1}{B} \sum_{j \in [B]} \frac{P(y_j \mid x_i)}{Q(y_j \mid x_i)}\, \nabla_\theta s(x_i, y_j)
\end{aligned}
$$
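A minimal sketch of this Monte Carlo estimator, assuming a uniform proposal $Q$ and a random scalar stand-in for each $\nabla_\theta s(x_i, y_j)$ (both toy choices):

```python
import numpy as np

rng = np.random.default_rng(2)
C = 1000
s = rng.normal(size=C)                  # scores for one query over the corpus
g = rng.normal(size=C)                  # scalar stand-in for grad s(x_i, y_j)

P = np.exp(s - s.max())
P /= P.sum()                            # target distribution P(y_j | x_i)
Q = np.full(C, 1.0 / C)                 # proposal: uniform over the corpus

exact = (P * g).sum()                   # E_P[g]

B = 5000
idx = rng.choice(C, size=B, p=Q)        # draw B items from Q
approx = np.mean(P[idx] / Q[idx] * g[idx])   # (1/B) sum (P/Q) g
print(exact, approx)
```

Note that this estimator still evaluates the fully normalized $P$, partition function and all; removing that dependence is exactly what the remaining steps accomplish.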
Here $\frac{P(y_j \mid x_i)}{Q(y_j \mid x_i)}$ is the importance weight. The closer $Q$ is to $P$, the closer the weights are to 1 and the lower the variance of the estimator. In the last step we draw $B$ samples from $Q$ and average to approximate the expectation.
With this approximation in hand, we substitute the softmax form of $P(y_j \mid x_i)$:
$$
\begin{aligned}
\mathbb{E}_{P}\big[\nabla_\theta s(x_i, y_j)\big] &\approx \frac{1}{B} \sum_{j \in [B]} \frac{P(y_j \mid x_i)}{Q(y_j \mid x_i)}\, \nabla_\theta s(x_i, y_j) \\
&= \frac{1}{B} \sum_{j \in [B]} \frac{e^{s(x_i, y_j)}}{Q(y_j \mid x_i) \sum_{k \in [C]} e^{s(x_i, y_k)}}\, \nabla_\theta s(x_i, y_j) \\
&= \frac{1}{B} \sum_{j \in [B]} \frac{e^{s(x_i, y_j) - \ln Q(y_j \mid x_i)}}{\sum_{k \in [C]} e^{s(x_i, y_k)}}\, \nabla_\theta s(x_i, y_j)
\end{aligned}
$$
The problem is that computing $P(y_j \mid x_i)$ brings the partition function back in, so the cost is still prohibitive. The fix is to also rewrite the partition function as an expectation and approximate it with the same $B$ samples:
$$
\begin{aligned}
\sum_{k \in [C]} e^{s(x_i, y_k)} &= \sum_{k \in [C]} Q(y_k \mid x_i) \cdot \frac{1}{Q(y_k \mid x_i)}\, e^{s(x_i, y_k)} \\
&= \mathbb{E}_{Q}\Big[\frac{e^{s(x_i, y_k)}}{Q(y_k \mid x_i)}\Big] \\
&= \mathbb{E}_{Q}\big[e^{s(x_i, y_k) - \ln Q(y_k \mid x_i)}\big] \\
&\approx \frac{1}{B} \sum_{k \in [B]} e^{s(x_i, y_k) - \ln Q(y_k \mid x_i)}
\end{aligned}
$$
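A quick numerical sanity check of this partition-function estimator, again with an assumed uniform proposal:

```python
import numpy as np

rng = np.random.default_rng(3)
C = 1000
s = rng.normal(size=C)                  # scores for one query

Q = np.full(C, 1.0 / C)                 # uniform proposal (assumption)
exact_Z = np.exp(s).sum()               # true partition function

B = 5000
idx = rng.choice(C, size=B, p=Q)
approx_Z = np.mean(np.exp(s[idx] - np.log(Q[idx])))   # (1/B) sum e^{s - ln Q}
print(exact_Z, approx_Z)
```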
Define the corrected score $s^c(x_i, y_j) = s(x_i, y_j) - \ln Q(y_j \mid x_i)$. Substituting both approximations yields the final formula:
$$
\begin{aligned}
\mathbb{E}_{P}\big[\nabla_\theta s(x_i, y_j)\big] &\approx \frac{1}{B} \sum_{j \in [B]} \frac{e^{s^c(x_i, y_j)}}{\frac{1}{B} \sum_{k \in [B]} e^{s^c(x_i, y_k)}}\, \nabla_\theta s(x_i, y_j) \\
&= \sum_{j \in [B]} \frac{e^{s^c(x_i, y_j)}}{\sum_{k \in [B]} e^{s^c(x_i, y_k)}}\, \nabla_\theta s(x_i, y_j)
\end{aligned}
$$
This completes the derivation. In practice, sampled softmax only needs a small number of negative samples: apply the $\ln Q$ correction to the scores and feed the corrected scores into an ordinary log-softmax, as sketched below, which drastically reduces the computation. The approximation does introduce bias, however, so much follow-up research focuses on improving the quality of the sampling distribution and correcting the bias.
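A minimal sketch of such a corrected loss, assuming the scores and sampling probabilities have already been computed; the function and argument names are illustrative, not from any particular library:

```python
import numpy as np

def sampled_softmax_loss(pos_scores, neg_scores, log_q_pos, log_q_neg):
    """Log-softmax over logQ-corrected scores (a sketch; names are illustrative).

    pos_scores: (B,)   s(x_i, y_i) for each query's positive item
    neg_scores: (B, N) s(x_i, y_j) for N sampled negatives per query
    log_q_*:    ln Q(y | x) for the corresponding items under the sampler
    """
    # Corrected scores s^c = s - ln Q; column 0 holds the positive
    logits = np.concatenate([(pos_scores - log_q_pos)[:, None],
                             neg_scores - log_q_neg], axis=1)
    # Numerically stable log-softmax over the (1 + N) corrected scores
    m = logits.max(axis=1, keepdims=True)
    log_softmax = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -log_softmax[:, 0].mean()    # -(1/B) sum_i log p(positive_i)

# Toy usage: 4 queries, 64 negatives sampled uniformly from a 10k-item corpus
rng = np.random.default_rng(4)
B, N, C = 4, 64, 10_000
loss = sampled_softmax_loss(rng.normal(size=B), rng.normal(size=(B, N)),
                            np.full(B, np.log(1 / C)), np.full((B, N), np.log(1 / C)))
print(loss)
```

The correction is applied to the positive as well as the negatives here, which treats the positive as just one more term in the corrected softmax.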
References
[1] Yang J, Yi X, Zhiyuan Cheng D, et al. Mixed negative sampling for learning two-tower neural networks in recommendations[C]. Companion Proceedings of the Web Conference 2020, 2020: 441-447.
[2] Bengio Y, Senécal J S. Adaptive importance sampling to accelerate training of a neural probabilistic language model[J]. IEEE Transactions on Neural Networks, 2008, 19(4): 713-722.
[3] Jean S, Cho K, Memisevic R, et al. On using very large target vocabulary for neural machine translation[J]. arXiv preprint arXiv:1412.2007, 2014.