Research Area:  Machine Learning
This paper introduces multimodal question answering, a new interface for community-based question answering services. By letting users formulate queries with photos in addition to text, multimodal question answering overcomes the limitations of text-only input when users ask questions about visually distinctive objects. Such an interface is especially useful when a user becomes curious about an interesting object in the environment and wants to learn about it: the user simply takes a photo and asks a question, in a manner that is both situated (from a mobile device) and intuitive (requiring no verbal description of the object). We propose a system architecture for multimodal question answering, describe an algorithm for searching the database, and report findings from two prototype studies.
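The abstract mentions an algorithm for searching the database but this record gives no details of it. Purely as an illustrative sketch, and not the authors' actual method, the Python below shows one plausible way a photo-plus-question query could be scored against stored question-answer pairs by blending visual and textual similarity; all names here (`Entry`, `score_entry`, the weight `alpha`) are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class Entry:
    """A previously answered question: image features, question text, answer."""
    image_vec: list[float]
    question: str
    answer: str

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def text_overlap(q1: str, q2: str) -> float:
    """Jaccard overlap of the questions' word sets (a stand-in for text retrieval)."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def score_entry(query_vec: list[float], query_text: str,
                entry: Entry, alpha: float = 0.5) -> float:
    """Blend the two modalities; alpha (hypothetical) weights the image side."""
    return (alpha * cosine(query_vec, entry.image_vec)
            + (1 - alpha) * text_overlap(query_text, entry.question))

# Hypothetical usage: rank database entries against a photo+question query.
db = [
    Entry([0.9, 0.1, 0.0], "what building is this", "It is the campus library."),
    Entry([0.1, 0.8, 0.3], "what flower is this", "That is a tulip."),
]
query_vec, query_text = [0.85, 0.15, 0.05], "which building is this"
best = max(db, key=lambda e: score_entry(query_vec, query_text, e))
print(best.answer)  # -> "It is the campus library."
```

The single blending weight is only one possible design; the point of the sketch is that the photo modality lets the system match visually distinctive objects that the text alone could not identify.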
Keywords:  
Multimodal question answering
Multimodal
Mobile devices
Text modality
Applied computing
Author(s) Name:  Tom Yeh, Trevor Darrell
Journal name:  
Conference name:  Proceedings of the 13th International Conference on Intelligent User Interfaces
Publisher name:  ACM
DOI:  10.1145/1378773.1378841
Volume Information:  
Paper Link:   https://dl.acm.org/doi/abs/10.1145/1378773.1378841