This post isn't going to ramble on for too long: no subheadings this time, and you basically won't need to go back to the original paper~
This is the first study of MLP architectures for vision-and-language (VL) fusion, with experiments on 5 VL tasks and 5 datasets. Results:
These findings confirm that MLPs can align VL features effectively without self-attention. So the natural question: can MLPs replace the VL model architecture outright? The answer is no. Even with pre-training, all-MLP models remain sub-optimal in accuracy compared with state-of-the-art VL models, although a pre-trained MLP can still beat a Transformer that has not been pre-trained (hardly surprising).
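To make the "fusion without self-attention" point concrete, here is a minimal Mixer-style sketch of what MLP-based VL fusion can look like: text tokens and visual features are concatenated into one sequence, a token-mixing MLP exchanges information across positions (and therefore across modalities), and a channel-mixing MLP transforms each token. This is only my own illustration, not the exact architecture from the paper; the name MLPFusionBlock and all dimensions are made up.

```python
# A minimal sketch (PyTorch) of MLP-based VL fusion, Mixer-style.
# Illustration only; not the paper's exact architecture.
import torch
import torch.nn as nn


class MLPFusionBlock(nn.Module):
    def __init__(self, num_tokens: int, dim: int,
                 token_hidden: int = 256, channel_hidden: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token-mixing MLP: operates along the sequence axis, so text and
        # visual tokens exchange information without self-attention.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, token_hidden),
            nn.GELU(),
            nn.Linear(token_hidden, num_tokens),
        )
        self.norm2 = nn.LayerNorm(dim)
        # Channel-mixing MLP: operates per token along the feature axis.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden),
            nn.GELU(),
            nn.Linear(channel_hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim), tokens = [text tokens ; visual tokens]
        y = self.norm1(x).transpose(1, 2)          # (batch, dim, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)  # mix across tokens / modalities
        x = x + self.channel_mlp(self.norm2(x))    # mix across channels
        return x


if __name__ == "__main__":
    batch, n_text, n_visual, dim = 2, 20, 36, 768
    text_feats = torch.randn(batch, n_text, dim)      # e.g. word embeddings
    visual_feats = torch.randn(batch, n_visual, dim)  # e.g. region/patch features
    fused = torch.cat([text_feats, visual_feats], dim=1)
    block = MLPFusionBlock(num_tokens=n_text + n_visual, dim=dim)
    print(block(fused).shape)  # torch.Size([2, 56, 768])
```

One design note: the token-mixing linear layer fixes the sequence length, which is one practical difference from self-attention (which handles variable-length inputs naturally).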
The story so far has been that MLP models can rival Transformers on classification tasks; so what about multimodal tasks? Hence this paper. Its contributions are as follows:
A brief word on two strands of related work: vision-and-language pre-training, and MLPs applied to vision and to language.
Conclusion: essentially what the abstract already says.
Outlook: scale up the data and the models, provided there is enough money to burn!
I will also be reading the papers below one by one later on; let's keep improving together!
[5] Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In ECCV, 2020.
[10] Tejas Gokhale, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. VQA-LOL: Visual question answering under the lens of logic. In ECCV, 2020.
[14] Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, and Aida Nematzadeh. Decoupling the role of data, attention, and losses in multimodal transformers. arXiv preprint arXiv:2102.00529, 2021.
[32] Linjie Li, Zhe Gan, and Jingjing Liu. A closer look at the robustness of vision-and-language pre-trained models. arXiv preprint arXiv:2012.08673, 2020.
[33] Linjie Li, Jie Lei, Zhe Gan, and Jingjing Liu. Adversarial VQA: A new benchmark for evaluating the robustness of VQA models. In ICCV, 2021.
[48] Meet Shah, Xinlei Chen, Marcus Rohrbach, and Devi Parikh. Cycle-consistency for robust visual question answering. In CVPR, 2019.
[50] Sasha Sheng, Amanpreet Singh, Vedanuj Goswami, Jose Alberto Lopez Magana, Wojciech Galuba, Devi Parikh, and Douwe Kiela. Human-adversarial visual question answering. In NeurIPS, 2021.
Closing remarks
And that is the rather hasty end of this post. As always: the conclusions are what matter, so if you don't have time, there's no need to read the full paper. The references listed at the end of the post, though, are still worth a look~