Since Google released the BERT model in October 2018, a steady stream of improved pre-training models (Pre-Training Model, PTM) has emerged, continually empowering the NLP field. What impressive new models have appeared over the past two years or so? And how would one go about building the strongest possible pre-trained model? Recently, Zhang Junlin, a PhD graduate of the Institute of Software, Chinese Academy of Sciences and a senior algorithm expert at Sina Weibo's AI Lab, set out to answer a series of questions about pre-trained models on the basis of the existing technical literature.
![](https://image.jiqizhixin.com/uploads/editor/5632b415-911d-4a3c-9c3c-5156f2830d1e/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/4ac81356-fbfa-4dd5-b9df-7ce906feb688/640.jpeg)
- Further increasing the amount of pre-training data improves model performance;
- Extending pre-training time, i.e. increasing the number of pre-training steps, improves model performance;
- Dramatically enlarging the batch size used in pre-training clearly improves model performance;
- The Next Sentence Prediction sub-task can be dropped from pre-training; it is unnecessary;
- Dynamic masking of the input text helps (see the sketch after this list).
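To make the dynamic-masking point concrete, here is a minimal sketch of the idea, assuming a BERT-style WordPiece vocabulary; the function and parameter names (`dynamic_mask`, `mask_prob`) are illustrative, not from the article. The key difference from static masking is that mask positions are re-sampled every time a sequence is fed to the model, so across epochs the model sees differently masked variants of the same sentence:

```python
import random

MASK_ID = 103        # [MASK] id in bert-base-uncased's vocabulary
VOCAB_SIZE = 30522   # bert-base-uncased vocabulary size

def dynamic_mask(token_ids, mask_prob=0.15):
    """Re-sample masked positions on every call (dynamic masking).

    Static masking would instead run this once during preprocessing
    and reuse the same masked copy of each sequence for every epoch.
    """
    masked = list(token_ids)
    labels = [-100] * len(token_ids)   # -100: position ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok            # model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = MASK_ID                       # 80%: [MASK]
            elif r < 0.9:
                masked[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return masked, labels

# Call this inside the data loader so every epoch yields fresh maskings:
# masked_ids, labels = dynamic_mask(example_ids)
```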
In this section we summarize the research conclusions available so far on model structure, and introduce the five common model structures. Admittedly, "model structure" is not an entirely precise term here, because beyond the structure itself a model generally also involves a self-supervised learning method; the common learning methods are AutoEncoding (AE) and AutoRegressive (AR). AE corresponds to what we usually call a bidirectional language model, while AR denotes a left-to-right unidirectional language model.
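For reference, the two learning methods optimize different likelihoods. In standard notation (ours, following common usage), the AR objective factorizes the joint probability left to right, while the AE (masked LM) objective reconstructs masked tokens from a corrupted input $\hat{\mathbf{x}}$, with $m_t = 1$ iff position $t$ is masked:

```latex
% AR (left-to-right language model)
\max_{\theta}\; \log p_{\theta}(\mathbf{x})
  = \sum_{t=1}^{T} \log p_{\theta}\!\left(x_t \mid \mathbf{x}_{<t}\right)

% AE (BERT-style masked language model)
\max_{\theta}\; \sum_{t=1}^{T} m_t \,\log p_{\theta}\!\left(x_t \mid \hat{\mathbf{x}}\right)
```

The AR model conditions only on the left context, whereas the AE model can attend to both sides of each masked position; this two-sided conditioning is what "bidirectional" refers to here.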
![](https://image.jiqizhixin.com/uploads/editor/9ce969e3-72dc-468a-832a-b6405a69b4a2/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/88a4a98a-e938-4147-a7b2-021f0fab2349/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/83049557-bd31-4ca3-bbf1-0081e1f4bcc9/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/53bc6fe3-a557-4aaf-8077-2f6646bc7aba/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/a6279509-0b8f-42a6-a3f6-e77e4c4ecfaa/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/f02b07da-5d4a-4137-87e0-972edae01e8a/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/ca7c5c22-5135-4a4b-a50b-22f758affc7e/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/52426af6-1a71-46f6-8ccd-8334d742af02/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/786ce7fd-0c8f-4108-892e-874a14808183/640.png)
![](https://image.jiqizhixin.com/uploads/editor/d9ae44fe-55dd-45f8-8b18-82a3587a313a/640.png)
![](https://image.jiqizhixin.com/uploads/editor/d2566337-d62b-4908-bb9c-b47f9a5932ea/640.png)
![](https://image.jiqizhixin.com/uploads/editor/f9a99b94-abcd-4b7d-824e-3ca4c55ff69d/640.png)
![](https://image.jiqizhixin.com/uploads/editor/173ff472-a62e-4fd2-9c53-11d9ce9a0c99/640.png)
![](https://image.jiqizhixin.com/uploads/editor/848856a6-2298-4b31-be51-81ed23c99137/640.png)
![](https://image.jiqizhixin.com/uploads/editor/0eec1beb-d773-419c-a885-b06ccf1bd610/640.png)
![](https://image.jiqizhixin.com/uploads/editor/7f4e31ce-ee2d-4d75-925c-14af89ede384/640.png)
![](https://image.jiqizhixin.com/uploads/editor/2a822c9a-8751-455e-84e2-72c9334cd4c2/640.png)
![](https://image.jiqizhixin.com/uploads/editor/d646f920-261d-4bb1-af14-9ec23f6fc260/640.png)
![](https://image.jiqizhixin.com/uploads/editor/812e71be-d4d8-4e2b-8ffa-fd3b90c76959/640.png)
![](https://image.jiqizhixin.com/uploads/editor/221e24a4-f20f-4dc0-8eeb-5a3e9edbe0e6/640.png)
![](https://image.jiqizhixin.com/uploads/editor/9fa99ca6-7590-4648-9835-2a3e4f10135f/640.png)
![](https://image.jiqizhixin.com/uploads/editor/f1cc9eb1-053c-437b-8932-b6f225a33cbd/640.png)
![](https://image.jiqizhixin.com/uploads/editor/0e375079-82e2-47bf-a9c6-052be23bd63b/640.png)
![](https://image.jiqizhixin.com/uploads/editor/4d38db5a-81f7-48aa-bb1a-e42a746159c3/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/b5dacc5e-8e55-44b5-a343-9038d62ea828/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/51d3f010-6244-44b0-9c47-8de6f7cffd0c/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/8f732d5d-ed3a-4442-80f4-ab2ea9be799c/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/16c40c93-f678-4d5e-9056-07065d4b2942/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/7780cdff-c275-4d1a-9239-e2aeac8f9208/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/f893c12f-e950-4fc0-b807-e5133cf3fb17/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/219370e8-b007-426e-8ec9-0122665bb094/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/029bb848-7e01-47e1-a96b-d268884996ed/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/735af02e-024a-4b96-8689-d81fcc17104b/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/bc9ee4b2-18ba-4622-a41e-91a7f8dac85a/640.jpeg)
First, and most importantly, there is probably an urgent need to build large-scale alignment data across modalities. At present the scale of "image-text" aligned data is acceptable, though expanding it further would undoubtedly help; for other combinations of modalities, large-scale standard alignment data is scarce, and this will severely constrain the development of multimodal pre-training. Data clearly has to come first; it is the prerequisite for advancing the technology.
Second, it appears that a number of lessons already validated in free-text pre-training research should, by extrapolation, transfer directly to multimodal pre-training. A typical one: increase model complexity while scaling up the data. Increasing model complexity covers both the image feature extractor (experiments have already verified that deepening the ResNet model clearly improves results) and the Transformer itself, i.e. adding layers and enlarging the hidden size; this is likely the first-choice means of substantially improving multimodal pre-training. Similarly, masking spans rather than individual tokens in the text pre-training task (some work already does this; see the sketch below), along with training optimizations such as enlarging the batch size and extending training time, should all be beneficial. As for training objectives, the current cross-modal alignment task still resembles an NSP-style sentence classification task and is clearly on the easy side; harder alignment tasks, as well as fine-grained entity-level alignment tasks, could be introduced to strengthen the modality-alignment model.
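To illustrate the span-masking idea mentioned above, here is a minimal sketch assuming SpanBERT-style practice (geometric span lengths, clipped at a maximum); the names `sample_mask_spans`, `mask_budget`, `p`, and `max_span` are ours:

```python
import random

def sample_mask_spans(num_tokens, mask_budget=0.15, p=0.2, max_span=10):
    """Select contiguous spans to mask until ~mask_budget of tokens is covered.

    Span lengths follow a clipped geometric distribution (mean roughly 1/p),
    so whole phrases get masked instead of isolated single tokens.
    """
    target = int(num_tokens * mask_budget)
    masked = set()
    while len(masked) < target:
        length = 1
        while random.random() >= p and length < max_span:
            length += 1                              # geometric span length
        start = random.randrange(max(1, num_tokens - length))
        masked.update(range(start, start + length))
    return sorted(masked)

# sample_mask_spans(128) -> ~19 token indices, grouped into contiguous runs
```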
Third, one could move from the current two modalities toward genuinely multimodal setups, for example dynamic joint training over three modalities. Today's common configurations are "text-image" or "text-video", i.e. two-modality structures; going forward, joint pre-training over three or even more modalities is worth considering, such as "text-image-audio" or "text-video-audio". Of course, the prerequisite for doing so is, once again, having multimodal alignment data in the first place.
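As a purely illustrative extrapolation of what a three-modality alignment objective could look like (this is not a method from the article): a symmetric InfoNCE-style contrastive loss summed over the three modality pairs, with placeholder per-modality encoders:

```python
import torch
import torch.nn.functional as F

def pairwise_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    logits = a @ b.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(a.size(0))     # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_alignment_loss(text_emb, image_emb, audio_emb):
    """Align three modalities by summing the three pairwise contrastive losses."""
    t = F.normalize(text_emb, dim=-1)
    i = F.normalize(image_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    return pairwise_nce(t, i) + pairwise_nce(t, a) + pairwise_nce(i, a)

# Usage, with hypothetical per-modality encoders over aligned triples:
# loss = trimodal_alignment_loss(text_enc(x_t), image_enc(x_i), audio_enc(x_a))
```

This is only one possible design: pairwise contrastive alignment is the simplest extension of two-modality objectives, while a fused cross-modal Transformer with harder matching tasks, as discussed above, is a natural alternative.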
![](https://image.jiqizhixin.com/uploads/editor/25597169-a80f-47a8-8b4f-287a87a5a9b6/640.jpeg)
![](https://image.jiqizhixin.com/uploads/editor/37cfb6e5-4892-4137-b3fb-a1a372bf72d4/640.jpeg)