[Paper Reading] Learning Transferable Visual Models From Natural Language Supervision (2021)

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch o
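
To make the pre-training task concrete, here is a minimal sketch of "predicting which caption goes with which image" within a batch. The abstract does not spell out the objective, so everything here is an assumption for illustration: the embeddings are plain lists standing in for the outputs of an image encoder and a text encoder, the loss is a symmetric cross-entropy over scaled cosine similarities, and the names `caption_matching_loss`, `predict_caption`, and the `temperature` value of 0.07 are hypothetical choices rather than anything quoted from the text above.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed in a stable way."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def caption_matching_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss over a batch where image i is paired with caption i.

    The temperature is an assumed illustrative value, not taken from the text.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j]: scaled cosine similarity between image i and caption j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Each image should pick its own caption (rows) ...
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # ... and each caption should pick its own image (columns).
    loss_texts = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_images + loss_texts) / 2


def predict_caption(image_emb, text_embs):
    """Return the index of the caption most similar to the image."""
    img = l2_normalize(image_emb)
    scores = [
        sum(a * b for a, b in zip(img, l2_normalize(txt))) for txt in text_embs
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

With this setup, a batch whose image and caption embeddings line up yields a small loss, while a batch with shuffled captions yields a large one. The same similarity scoring can then rank arbitrary captions for a new image, which is what makes the learned representation usable beyond a fixed label set.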
