小样本学习与智能前沿公众号

ECCV 2020 | Robust Re-Identification by Multiple Views Knowledge Distillation

This work designs a training strategy that allows distilling high-level knowledge from a set of views depicting a target object. We propose Views Knowledge Distillation (VKD), which pins this visual variety as a supervision signal within a teacher-student framework, where the teacher educates a student that observes fewer views. As a result, the student not only outperforms its teacher but also achieves state-of-the-art results on the image-to-video task.

Paper: https://link.springer.com/chapter/10.1007%2F978-3-030-58607-2_6
Code: https://github.com/aimagelab/VKD

Introduction


Motivation

As observed in [10], a large gap in Re-ID performance still subsists between V2V and I2V.

 VKD

we propose Views Knowledge Distillation (VKD), which transfers the knowledge lying in several views in a teacher-student fashion. VKD devises a two-stage procedure, which pins the visual variety as a teaching signal for a student who has to recover it using fewer views.

Main Contributions

  • i) the student outperforms its teacher by a large margin, especially in the Image-To-Video setting;
  • ii) a thorough investigation shows that the student focuses more on the target compared to its teacher and discards uninformative details;
  • iii) importantly, we do not limit our analysis to a single domain, but instead achieve strong results on Person, Vehicle and Animal Re-ID.

Related works

  • Image-To-Video Re-Identification.
  • Knowledge Distillation

Method

Fig. 2: Overview of VKD. The student network is optimised to mimic the teacher's behaviour using only a few views.

 

Our proposal frames the training algorithm as a two-stage procedure, as follows:

  • First step (Sect. 3.1): the backbone network is trained for the standard Video-To-Video setting.
  • Second step (Sect. 3.2): we appoint it as the teacher and freeze its parameters. Then, a new network with the role of the student is instantiated. As depicted in Fig. 2, we feed frames representing different views as input to the teacher and ask the student to mimic the same outputs from fewer frames.


Teacher Network

The network weights are initialised from ImageNet pre-training, and a few modifications are made to the architecture.

First, we discard the final ReLU activation function and the last classification layer in favour of a BNNeck. Second, to benefit from fine-grained spatial details, the stride of the last residual block is reduced from 2 to 1.

Set Representation.

Here, we naively compute the set-level embedding F(S) through a temporal average pooling. While we acknowledge better aggregation modules exist, we do not place our focus on devising a new one, but instead on improving the earlier feature extractor.
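The set-level embedding described above is just a mean over the time axis of the per-frame embeddings. A minimal NumPy sketch (array shapes are illustrative assumptions, not the authors' code):

```python
import numpy as np

def set_embedding(frame_embeddings: np.ndarray) -> np.ndarray:
    """Temporal average pooling: collapse the time axis.

    frame_embeddings: (N, D) array holding one D-dim embedding per frame.
    Returns a single (D,) set-level embedding F(S).
    """
    return frame_embeddings.mean(axis=0)

# Example: 8 frames with 4-dim embeddings
frames = np.arange(32, dtype=float).reshape(8, 4)
f_set = set_embedding(frames)
```

Any permutation-invariant aggregator (attention pooling, learned weights) could replace the mean; the paper deliberately keeps this part simple.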

Teacher Optimisation.

We train the base network - which will be the teacher during the following stage - combining a classification term $\mathcal{L}_{CE}$ (cross-entropy) with the triplet loss $\mathcal{L}_{TR}$. The first can be formulated as:

$$\mathcal{L}_{CE} = -\,\mathbf{y}^{\top} \log \hat{\mathbf{y}} \quad (1)$$

where $\mathbf{y}$ and $\hat{\mathbf{y}}$ denote the one-hot label and the softmax output, respectively. $\mathcal{L}_{TR}$ encourages distance constraints in feature space, pulling sets of the same identity closer and pushing sets of different identities further apart. It can be written as:

$$\mathcal{L}_{TR} = \left[\, m + D(S_a, S_p) - D(S_a, S_n) \,\right]_{+} \quad (2)$$

where $S_p$ and $S_n$ denote the hardest positive and the hardest negative sets for the anchor $S_a$ within the batch, and $m$ is the margin.

Views Knowledge Distillation (VKD)


Views Knowledge Distillation (VKD) stresses this idea by forcing a student network $F_{\theta_S}(\cdot)$ to match the outputs of the teacher $F_{\theta_T}(\cdot)$. In doing so, we: i) allow the teacher to access frames $S^T = (s_1, s_2, s_3, \dots, s_N)$ from different viewpoints; ii) force the student to mimic the teacher's output starting from a subset $S^S = (s_1, s_2, s_3, \dots, s_M)$ with cardinality $M < N$ (in our experiments, $M = 2$ and $N = 8$). The frames in $S^S$ are uniformly sampled from $S^T$ without replacement. This asymmetry between the teacher and the student leads to a self-distillation objective, where the latter can achieve better solutions despite inheriting the same architecture as the former.
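The asymmetric view assignment above can be illustrated with a short sketch, assuming the student's M frames are drawn uniformly without replacement from the teacher's N frames (M=2 and N=8 as in the paper):

```python
import numpy as np

def split_teacher_student(frames, m=2, rng=None):
    """Teacher sees all N frames; the student receives M < N of them,
    sampled uniformly without replacement."""
    if rng is None:
        rng = np.random.default_rng(0)
    teacher_set = list(frames)                 # S^T: all N views
    idx = rng.choice(len(frames), size=m, replace=False)
    student_set = [frames[i] for i in idx]     # S^S: M distinct views
    return teacher_set, student_set

teacher, student = split_teacher_student(list(range(8)), m=2)
```

Because both networks share the same architecture, the only handicap the student faces is this reduced input, which is exactly what the distillation objective exploits.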

VKD casts the knowledge distillation loss as the KL divergence between the temperature-softened output distributions of teacher and student:

$$\mathcal{L}_{KD} = \tau^{2} \sum_{c=1}^{C} p^{T}_{c} \log \frac{p^{T}_{c}}{p^{S}_{c}} \quad (3)$$

where $p^{T}$ and $p^{S}$ are the softmax outputs of teacher and student smoothed by a temperature $\tau$, and $C$ is the number of classes.
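This distillation term follows the usual temperature-softened KL divergence from Hinton et al.'s knowledge distillation; a NumPy sketch (the temperature value is a placeholder, not the paper's setting):

```python
import numpy as np

def softmax(z: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Numerically stable softmax with temperature tau."""
    z = z / tau
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits: np.ndarray, student_logits: np.ndarray,
            tau: float = 4.0) -> float:
    """KL(p_teacher || p_student) on temperature-softened distributions,
    scaled by tau^2 to keep gradient magnitudes comparable."""
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return float(tau ** 2 * kl.mean())
```

The loss vanishes when the student reproduces the teacher's distribution exactly and grows as the two diverge.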

In addition to fitting the output distribution of the teacher (Eq. 3), our proposal devises additional constraints on the embedding space learnt by the student. In detail, VKD encourages the student to mirror the pairwise distances spanned by the teacher. Indicating with

$$D^{T}[i,j] = \left\| F_{\theta_T}(S^{T}_{i}) - F_{\theta_T}(S^{T}_{j}) \right\|_{2}$$

the distance induced by the teacher between the i-th and j-th sets (the same notation $D^{S}[i,j]$ also holds for the student), VKD seeks to minimise:

$$\mathcal{L}_{DP} = \frac{1}{B^{2}} \sum_{i,j=1}^{B} \left( D^{T}[i,j] - D^{S}[i,j] \right)^{2} \quad (4)$$

where B equals the batch size.
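The distance-preservation idea reduces to matching two pairwise distance matrices; a minimal NumPy sketch (the mean-squared reduction is an assumption):

```python
import numpy as np

def pairwise_dist(F: np.ndarray) -> np.ndarray:
    """Euclidean distance matrix D[i, j] between set-level embeddings,
    F of shape (B, D) -> D of shape (B, B)."""
    diff = F[:, None, :] - F[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def distance_preservation_loss(f_teacher: np.ndarray,
                               f_student: np.ndarray) -> float:
    """Penalise deviations between the teacher's and the student's
    pairwise distance matrices over the batch."""
    d_t = pairwise_dist(f_teacher)
    d_s = pairwise_dist(f_student)
    return float(((d_t - d_s) ** 2).mean())
```

Note that only the distances, not the embeddings themselves, are matched: the student is free to place identities elsewhere in its space as long as the relative geometry survives.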

Since the teacher can access multiple views, we argue that the distances spanned in its space yield a robust description of the corresponding identities. From the student's perspective, distance preservation provides additional semantic information. It therefore retains an effective supervision signal, whose optimisation is more challenging since fewer images are available to the student.

Student Optimisation.

The VKD overall objective combines the distillation terms ($\mathcal{L}_{KD}$ and $\mathcal{L}_{DP}$) with the ones optimised by the teacher - $\mathcal{L}_{CE}$ and $\mathcal{L}_{TR}$ - that promote higher conditional likelihood w.r.t. ground truth labels. To sum up, VKD aims at strengthening the features of a CNN in Re-ID settings through the following optimisation problem:

$$\mathcal{L}_{student} = \mathcal{L}_{CE} + \mathcal{L}_{TR} + \alpha \mathcal{L}_{KD} + \beta \mathcal{L}_{DP} \quad (5)$$

where $\alpha$ and $\beta$ are hyperparameters balancing the contributions. Empirically, we found it beneficial to initialise the student from the teacher's weights, except for the last convolutional block, which is re-initialised from ImageNet pre-training. We argue this represents a good compromise between exploring new configurations and exploiting the capabilities the teacher has already acquired.
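Putting the pieces together, the student objective is a weighted sum of the four terms; a trivial sketch (the α and β values below are placeholders, not the paper's settings):

```python
def student_objective(l_ce: float, l_tr: float, l_kd: float, l_dp: float,
                      alpha: float = 1.0, beta: float = 1.0) -> float:
    """Overall VKD objective: supervised terms plus distillation terms."""
    return l_ce + l_tr + alpha * l_kd + beta * l_dp

# Example with made-up per-term loss values
total = student_objective(0.5, 0.2, 0.1, 0.05, alpha=1.0, beta=0.5)
```

In a real training loop each argument would be the batch average of the corresponding loss, and α, β would be tuned on a validation split.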

Experiments

Datasets

Person Re-ID

  • MARS
  • Duke-Video-ReID

Vehicle Re-ID

  • VeRi-776

Animal Re-ID

  • Amur Tiger

Self-distillation

Table 1 reports the comparisons for different backbones: in the vast majority of the settings, the student outperforms its teacher.

As additional proof, the plots in Fig. 3 draw a comparison between models before and after distillation. VKD improves metrics considerably on all three datasets, as highlighted by the gap between the teachers and their corresponding students. Surprisingly, this often applies when comparing lighter students with deeper teachers: as an example, ResVKD-34 scores better than even ResNet-101 on VeRi-776, regardless of the number of images sampled for a gallery tracklet.

Comparison with State-Of-The-Art

Image-To-Video.


Tables 2, 3 and 4 report a thorough comparison with current state-of-the-art (SOTA) methods, on MARS, Duke and VeRi-776 respectively. As is common practice [3, 10, 32], we focus our analysis on ResNet-50, and in particular on its distilled variants ResVKD-50 and ResVKD-50bam. Our method clearly outperforms other competitors, with an increase in mAP w.r.t. top-scorers of 6.3% on MARS, 8.6% on Duke and 5% on VeRi-776. This result is fully in line with our goal of conferring robustness when just a single image is provided as a query. In doing so, we do not make any task-specific assumption, thus rendering our proposal easily applicable to both person and vehicle Re-ID.

Video-To-Video.


Analogously, we conduct experiments on the V2V setting and report results in Table 5 (MARS) and Table 6 (Duke). Here, VKD yields the following results: on the one hand, on MARS it pushes a baseline architecture such as ResVKD-50 close to NVAN and STE-NVAN [22], the latter being tailored for the V2V setting. Moreover, when exploiting spatial attention modules (ResVKD-50bam), it establishes new SOTA results, suggesting that a positive transfer occurs also when matching tracklets. On the other hand, the same does not hold true for Duke, where exploiting video features as in STA [8] and NVAN appears rewarding. We leave the investigation of further improvements on V2V to future works. As of today, our proposal is the only one guaranteeing consistent and stable results under both I2V and V2V settings.

Analysis on VKD

In the Absence of Camera Information.


Distilling Viewpoints vs. Time.


VKD Reduces the Camera Bias.


Can Performance of the Student be Obtained Without Distillation?


Student Explanation.


Cross-distillation.


On the Impact of Loss Terms.


Conclusion

Effective Re-ID methods require visual descriptors that are robust to changes in both background appearance and viewpoint. Moreover, their effectiveness should hold even when the query comprises a single image. To meet these goals, we proposed Views Knowledge Distillation (VKD), a teacher-student approach in which the student observes only a small subset of the input views. This strategy encourages the student to discover better representations: as a result, it outperforms its teacher at the end of training. Importantly, VKD proves robust across diverse domains (person, vehicle and animal), surpassing the I2V state of the art by a wide margin. Through extensive analysis, we highlight that the student exhibits stronger focus on the target and reduced camera bias.
