
Paper Reading: Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos

Original post
05/13 14:22

Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos

We propose an unsupervised method for reference resolution in instructional videos, where the goal is to temporally link an entity (e.g., “dressing”) to the action (e.g., “mix yogurt”) that produced it. The key challenge is the inevitable visual-linguistic ambiguities arising from the changes in both visual appearance and referring expression of an entity in the video. This challenge is amplified by the fact that we aim to resolve references with no supervision. We address these challenges by learning a joint visual-linguistic model, where linguistic cues can help resolve visual ambiguities and vice versa. We verify our approach by learning our model unsupervisedly using more than two thousand unstructured cooking videos from YouTube, and show that our visual-linguistic model can substantially improve upon state-of-the-art linguistic only model on reference resolution in instructional videos.

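The paper proposes an unsupervised method for reference resolution in instructional videos. The goal is to temporally link an entity (e.g., "dressing") to the action (e.g., "mix yogurt") that produced it. The key challenge is the unavoidable visual-linguistic ambiguity that arises because both the visual appearance of an entity and the expression used to refer to it change over the course of a video, and resolving references with no supervision makes this harder still. The authors address this by learning a joint visual-linguistic model in which linguistic cues help resolve visual ambiguities and vice versa. They train the model without supervision on more than two thousand unstructured cooking videos from YouTube and show that it substantially improves on the state-of-the-art linguistic-only model for reference resolution in instructional videos.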
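To make the task concrete, here is a toy sketch of what "linking an entity to the action that produced it" could look like once per-pair scores are available. This is not the paper's model: the function name `resolve_references`, the score matrices `ling` and `vis`, and the mixing `weight` are all hypothetical, and the scores are hand-picked numbers rather than anything learned. The sketch only shows the idea that a visual cue can overturn a link that language alone would get wrong.

```python
from typing import List, Optional, Sequence


def resolve_references(
    entity_steps: Sequence[int],
    ling: Sequence[Sequence[float]],
    vis: Sequence[Sequence[float]],
    weight: float = 0.5,
) -> List[Optional[int]]:
    """Link each entity to the index of the earlier action that best explains it.

    entity_steps[i] is the step at which entity i is mentioned.
    ling[i][j] and vis[i][j] are hypothetical linguistic and visual
    compatibility scores between entity i and action j.
    weight=1.0 uses language only; weight=0.0 uses vision only.
    """
    links: List[Optional[int]] = []
    for i, step in enumerate(entity_steps):
        candidates = range(step)  # only earlier actions can have produced the entity
        if not candidates:
            links.append(None)  # nothing precedes it, e.g. a raw ingredient
            continue
        best = max(
            candidates,
            key=lambda j: weight * ling[i][j] + (1 - weight) * vis[i][j],
        )
        links.append(best)
    return links


# Hand-made example: step 0 "mix yogurt", step 1 "chop lettuce",
# step 2 "add dressing to lettuce". The entity "dressing" appears at step 2.
entity_steps = [2]
ling = [[0.4, 0.6]]  # assumed: language alone slightly prefers "chop lettuce"
vis = [[0.9, 0.2]]   # assumed: the dressing looks like the output of "mix yogurt"

language_only = resolve_references(entity_steps, ling, vis, weight=1.0)
joint = resolve_references(entity_steps, ling, vis, weight=0.5)
```

With these made-up scores, the language-only setting links "dressing" to step 1, while the joint setting links it to step 0 ("mix yogurt"). In the paper the corresponding quantities are learned without supervision rather than supplied by hand.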

 
