Paper Reading and Analysis: Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification

admin • 2023-02-10 20:00 • Artificial Intelligence

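The paper proposes a unified message passing model (UniMP).

It rests on two simple but effective ideas:

(a) Combining node feature propagation with label propagation.

UniMP uses both node features and labels during training and inference. A label embedding converts the partially observed node labels from one-hot vectors into dense vectors of the same kind as the node features. A multi-layer Graph Transformer network takes the node features and the embedded labels as input and propagates information between nodes, so each node can aggregate both feature and label information from its neighbors.

(b) Masked label prediction.

Because node labels are fed in as input, using those same labels as the supervision target causes label leakage and poor performance at inference. To address this, the paper proposes a masked label prediction strategy: randomly mask the labels of some training instances, then predict them. This simple and effective training method borrows from masked word prediction in BERT [Devlin et al., 2018] and simulates the process of transferring label information from labeled examples to unlabeled examples in the graph.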
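A minimal sketch of the two ideas in plain Python may help. The names `mask_labels`, `build_inputs`, `mask_rate`, and `label_emb` are hypothetical, and adding the label embedding to the node features is an assumption about how the two are merged; this is not the authors' code.

```python
import random


def mask_labels(train_labels, mask_rate, rng):
    """Split labeled training nodes into visible inputs and masked targets.

    train_labels maps node id -> class id for labeled training nodes.
    Visible labels are fed to the model; masked labels appear only in the loss.
    """
    nodes = sorted(train_labels)
    rng.shuffle(nodes)
    n_masked = int(round(mask_rate * len(nodes)))
    masked = set(nodes[:n_masked])
    visible = {v: y for v, y in train_labels.items() if v not in masked}
    targets = {v: y for v, y in train_labels.items() if v in masked}
    return visible, targets


def build_inputs(features, visible, label_emb):
    """Add a dense label embedding to the features of nodes with a visible label.

    Assumption: features and embeddings share one dimension and are summed.
    Nodes without a visible label keep their features unchanged.
    """
    inputs = []
    for v, feat in enumerate(features):
        if v in visible:
            emb = label_emb[visible[v]]
            inputs.append([f + e for f, e in zip(feat, emb)])
        else:
            inputs.append(list(feat))
    return inputs


rng = random.Random(0)
train_labels = {0: 2, 1: 0, 2: 1, 3: 2, 4: 0, 5: 1, 6: 0, 7: 2, 8: 1, 9: 0}
features = [[0.5, 0.5] for _ in range(10)]
# Hypothetical fixed embeddings; in a real model these would be learned.
label_emb = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

visible, targets = mask_labels(train_labels, mask_rate=0.3, rng=rng)
inputs = build_inputs(features, visible, label_emb)
```

The loss would be computed only on the nodes in `targets`, whose labels the model never sees as input, which is what prevents the leakage described above.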

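Experimental results:

UniMP is evaluated on three semi-supervised classification datasets from the Open Graph Benchmark (OGB), where it achieves new state-of-the-art results on all tasks: 82.56% ACC on ogbn-products, 86.42% ROC-AUC on ogbn-proteins, and 73.11% ACC on ogbn-arxiv. The paper also reports ablation studies that evaluate the effectiveness of the unified method, along with a thorough analysis of how label propagation improves model performance.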

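Graph Neural Networks:

Feature propagation at layer $l$:

$H^{(l+1)} = \sigma\left(D^{-1} A H^{(l)} W^{(l)}\right)$

where $A$ is the adjacency matrix, $D$ is the diagonal degree matrix used to normalize $A$, $H^{(l)}$ is the feature representation at layer $l$, $\sigma$ is the activation function, and $W^{(l)}$ is the learnable weight of layer $l$.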
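As a rough illustration, one propagation step of the form $H^{(l+1)} = \sigma(D^{-1} A H^{(l)} W^{(l)})$ can be written in plain Python. The helper names `matmul` and `gcn_layer`, the choice of ReLU for $\sigma$, and the guard for isolated nodes are assumptions made for this sketch.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, h, w):
    """One feature-propagation step: ReLU(D^{-1} A H W)."""
    # D^{-1} A: divide each row of A by the node degree (its row sum).
    # max(..., 1) avoids division by zero for isolated nodes (an assumption).
    norm_adj = [[a / max(sum(row), 1) for a in row] for row in adj]
    z = matmul(matmul(norm_adj, h), w)
    return [[max(0.0, v) for v in row] for row in z]  # ReLU plays the role of sigma


# Path graph 0 - 1 - 2 with 2-dimensional node features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, -1.0], [0.5, 0.5]]

h_next = gcn_layer(adj, h, w)
print(h_next)  # [[0.5, 0.5], [1.25, 0.0], [0.5, 0.5]]
```

Each row of the result is the degree-averaged neighbor feature, transformed by `w` and passed through the activation.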

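Label propagation algorithms

Traditional algorithms such as the label propagation algorithm (LPA) use only the labels and the relations between nodes to make predictions. LPA assumes that connected nodes have similar labels and propagates labels iteratively through the graph. Given an initial label matrix $\hat{Y}^{(0)}$, whose rows are one-hot label indicator vectors $\hat{y}_i^{0}$ (for labeled nodes) or zero vectors (for unlabeled nodes), the simple iterative update of LPA is:

$\hat{Y}^{(l+1)} = D^{-1} A \hat{Y}^{(l)}$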
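A small sketch of one LPA iteration, assuming the row-normalized update $\hat{Y}^{(l+1)} = D^{-1} A \hat{Y}^{(l)}$. The function name `lpa_step` and the toy graph are illustrative. Practical LPA variants often reset the rows of labeled nodes after every step; that is left out here.

```python
def lpa_step(adj, y):
    """Apply one label-propagation step: each node averages its neighbors' label rows."""
    num_classes = len(y[0])
    new_y = []
    for row in adj:
        deg = max(sum(row), 1)  # guard against isolated nodes (an assumption)
        new_y.append([
            sum(a * y[j][c] for j, a in enumerate(row)) / deg
            for c in range(num_classes)
        ])
    return new_y


# Star graph: node 1 is unlabeled and connected to labeled nodes 0, 2, 3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
# One-hot rows for labeled nodes, a zero row for the unlabeled node 1.
y0 = [[1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

y1 = lpa_step(adj, y0)
predicted = y1[1].index(max(y1[1]))  # class with the largest propagated score
```

Node 1 has two neighbors of class 0 and one of class 1, so its propagated row is roughly `[2/3, 1/3]` and the predicted class is 0.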

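Combining GNN and LPA

The community has worked on combining GNN and LPA for semi-supervised classification. APPNP [Klicpera et al., 2018] and TPN [Liu et al., 2019] use a GCN to predict soft labels and then propagate them with Personalized PageRank. However, these works still treat the partial node labels only as a supervised training signal. GCN-LPA is the work most closely related to UniMP, since it also takes partial node labels as input. It combines GNN and LPA more indirectly, though, using LPA only during training to regularize the edge weights of a GAT model. UniMP, by contrast, combines GNN and LPA directly inside the network and propagates both node features and labels during training and prediction. Moreover, whereas the regularization strategy of GCN-LPA applies only to GNNs with trainable edge weights, such as GAT [Veličković et al., 2017] and GAAN [Zhang et al., 2018], the UniMP training strategy extends easily to various GNNs, such as GCN and GAT, to further improve their performance.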

Algorithm: open-source implementation in torch_geometric.nn (the TransformerConv layer)

torch_geometric.nn — pytorch_geometric documentation (pytorch-geometric.readthedocs.io)

$\mathbf{x}^{\prime}_i = \mathbf{W}_1 \mathbf{x}_i + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j} \mathbf{W}_2 \mathbf{x}_{j},$

where the attention coefficients $\alpha_{i,j}$ are computed via multi-head dot product attention:

$\alpha_{i,j} = \textrm{softmax} \left( \frac{(\mathbf{W}_3 \mathbf{x}_i)^{\top} (\mathbf{W}_4 \mathbf{x}_j)}{\sqrt{d}} \right)$
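To make the attention formula concrete, here is a single-head sketch in plain Python for one target node. It computes $\alpha_{i,j}$ as a softmax over the neighbors of node $i$ of the scaled dot product between $\mathbf{W}_3 \mathbf{x}_i$ and $\mathbf{W}_4 \mathbf{x}_j$, then returns $\mathbf{W}_1 \mathbf{x}_i + \sum_j \alpha_{i,j} \mathbf{W}_2 \mathbf{x}_j$. The function name `attend`, the list-of-lists weight layout, and the example values are assumptions; the torch_geometric implementation is batched and multi-head and is not reproduced here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    return [dot(row, v) for row in m]


def attend(i, x, neighbors, w1, w2, w3, w4):
    """Single-head update of node i from its neighbors."""
    query = matvec(w3, x[i])
    d = len(query)
    # Scaled dot-product score between the query of i and the key of each neighbor j.
    scores = [dot(query, matvec(w4, x[j])) / math.sqrt(d) for j in neighbors]
    # Softmax over the neighborhood (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Skip term W1 x_i plus the attention-weighted sum of neighbor values W2 x_j.
    out = matvec(w1, x[i])
    for a, j in zip(alpha, neighbors):
        value = matvec(w2, x[j])
        out = [o + a * v for o, v in zip(out, value)]
    return out, alpha


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

out, alpha = attend(0, x, [1, 2], identity, identity, identity, identity)
```

With identity weights, neighbor 2 is more aligned with node 0 than neighbor 1 is, so it receives the larger attention coefficient.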

in_channels (int or tuple) – Size of each input sample, or -1 to derive the size from the first input(s) to the forward method. A tuple corresponds to the sizes of source and target dimensionalities.
out_channels (int) – Size of each output sample.
heads (int, optional) – Number of multi-head-attentions. (default: 1)
concat (bool, optional) – If set to False, the multi-head attentions are averaged instead of concatenated. (default: True)
beta (bool, optional) –

If set, will combine aggregation and skip information via

$\mathbf{x}^{\prime}_i = \beta_i \mathbf{W}_1 \mathbf{x}_i + (1 - \beta_i) \underbrace{\left(\sum_{j \in \mathcal{N}(i)} \alpha_{i,j} \mathbf{W}_2 \mathbf{x}_j \right)}_{=\mathbf{m}_i}$

where:

$\beta_i = \textrm{sigmoid}(\mathbf{w}_5^{\top} [ \mathbf{W}_1 \mathbf{x}_i, \mathbf{m}_i, \mathbf{W}_1 \mathbf{x}_i - \mathbf{m}_i ])$

(default: False)
dropout (float, optional) – Dropout probability of the normalized attention coefficients which exposes each node to a stochastically sampled neighborhood during training. (default: 0)
edge_dim (int, optional) –

Edge feature dimensionality (in case there are any). Edge features are added to the keys after linear transformation, that is, prior to computing the attention dot product. They are also added to final values after the same linear transformation. The model is:

$\mathbf{x}^{\prime}_i = \mathbf{W}_1 \mathbf{x}_i + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j} \left( \mathbf{W}_2 \mathbf{x}_{j} + \mathbf{W}_6 \mathbf{e}_{ij} \right),$
where:


$\alpha_{i,j} = \textrm{softmax} \left( \frac{(\mathbf{W}_3 \mathbf{x}_i)^{\top} (\mathbf{W}_4 \mathbf{x}_j + \mathbf{W}_6 \mathbf{e}_{ij})}{\sqrt{d}} \right)$

(default: None)
bias (bool, optional) – If set to False, the layer will not learn an additive bias. (default: True)
root_weight (bool, optional) – If set to False, the layer will not add the transformed root node features to the output and the option beta is set to False. (default: True)
**kwargs (optional) – Additional arguments of conv.MessagePassing.

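    from typing import Optional, Tuple, Union

    from torch import Tensor
    from torch_geometric.nn import MessagePassing
    from torch_geometric.typing import Adj, OptTensor, PairTensor


    class TransformerConv(MessagePassing):
        # Signature excerpt only; the method bodies are elided.
        def __init__(
            self,
            in_channels: Union[int, Tuple[int, int]],
            out_channels: int,
            heads: int = 1,
            concat: bool = True,
            beta: bool = False,
            dropout: float = 0.,
            edge_dim: Optional[int] = None,
            bias: bool = True,
            root_weight: bool = True,
            **kwargs,
        ):
            ...

        def forward(self, x: Union[Tensor, PairTensor], edge_index: Adj,
                    edge_attr: OptTensor = None, return_attention_weights=None):
            ...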
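The `beta` option in the parameter list combines the skip term $\mathbf{W}_1 \mathbf{x}_i$ and the aggregated message $\mathbf{m}_i$ through a learned gate $\beta_i = \textrm{sigmoid}(\mathbf{w}_5^{\top} [\mathbf{W}_1 \mathbf{x}_i, \mathbf{m}_i, \mathbf{W}_1 \mathbf{x}_i - \mathbf{m}_i])$. A plain-Python sketch of that gate for a single node follows; the function name `gated_skip` and its arguments are hypothetical, and the inputs are assumed to be already transformed (`root` stands for $\mathbf{W}_1 \mathbf{x}_i$, `msg` for $\mathbf{m}_i$).

```python
import math


def gated_skip(root, msg, w5):
    """Blend the skip term and the aggregated message with a sigmoid gate.

    root: W1 x_i, msg: m_i, w5: gate weights of length 3 * len(root).
    """
    diff = [r - m for r, m in zip(root, msg)]
    concat = root + msg + diff  # [W1 x_i, m_i, W1 x_i - m_i]
    logit = sum(w * c for w, c in zip(w5, concat))
    beta = 1.0 / (1.0 + math.exp(-logit))
    out = [beta * r + (1.0 - beta) * m for r, m in zip(root, msg)]
    return out, beta


root = [2.0, 0.0]
msg = [0.0, 4.0]
w5 = [0.0] * 6  # zero gate weights give beta = 0.5, an even blend

out, beta = gated_skip(root, msg, w5)
print(beta, out)  # 0.5 [1.0, 2.0]
```

With zero gate weights the output is the plain average of the two terms; training `w5` lets each node decide how much of its own representation to keep.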
