Research Area:  Machine Learning
Schemata are structured representations of complex tasks that can aid artificial intelligence by allowing models to break such tasks down into intermediate steps. We propose a novel system that induces schemata from web videos and generalizes them to capture unseen tasks, with the goal of improving video retrieval performance. Our system proceeds in three major phases: (1) given a task with related videos, we construct an initial schema by using a joint video-text model to match video segments with text representing steps from wikiHow; (2) we generalize schemata to unseen tasks by leveraging language models to edit the text within existing schemata, which allows our schemata to cover a more extensive range of tasks from a small amount of learning data; (3) we conduct zero-shot instructional video retrieval with the unseen task names as queries. Our schema-guided approach outperforms existing methods for video retrieval, and we demonstrate that the schemata induced by our system are better than those generated by other models.
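As a rough illustration of phase (3), the sketch below ranks candidate videos for an unseen task by matching its (induced and edited) schema steps against sampled video frames with a CLIP-style joint video-text model. This is not the authors' released code: the specific checkpoint, the frame sampling, and the max-then-mean scoring rule are our assumptions.

```python
# Minimal sketch of schema-guided zero-shot video retrieval with a
# CLIP-style joint video-text model (model choice and scoring rule
# are illustrative assumptions, not the paper's exact method).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_texts(texts):
    # Encode text queries (task name + schema steps) and L2-normalize.
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_frames(frames):
    # Encode sampled video frames (PIL images) and L2-normalize.
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def score_video(frames, task_name, schema_steps):
    # Match each query (the task name and every schema step) to its
    # best frame, then average: a video scores high when it covers
    # all of the schema's steps, not just the task name.
    queries = embed_texts([task_name] + schema_steps)  # (1 + S, d)
    frame_feats = embed_frames(frames)                 # (F, d)
    sims = queries @ frame_feats.T                     # (1 + S, F)
    return sims.max(dim=1).values.mean().item()

# Hypothetical usage: rank candidate videos for an unseen task.
schema = ["peel the mango", "slice along the pit", "dice the flesh"]
videos = {"vid_a": [Image.new("RGB", (224, 224))]}  # sampled frames per video
ranked = sorted(videos, key=lambda v: score_video(videos[v], "cut a mango", schema), reverse=True)
```

Averaging over per-step maxima is one plausible way to reward videos that depict every schema step; the paper's exact aggregation may differ.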
Keywords:  
Language Grounded
Multimodal Schema
Instructional Video Retrieval
Machine Learning
Deep Learning
Author(s) Name:  Yue Yang, Joongwon Kim, Artemis Panagopoulou, Mark Yatskar, Chris Callison-Burch
Journal name:  
Conference name:  
Publisher name:  arXiv (cs.CV: Computer Vision and Pattern Recognition)
DOI:  10.48550/arXiv.2111.09276
Volume Information:  
Paper Link:  https://arxiv.org/abs/2111.09276