PhD Thesis on Multimodal Representation and Learning

Research Area:  Machine Learning


Deep learning has remarkably improved the state of the art in speech recognition, visual object detection, object recognition and text processing. The majority of these techniques focus on a single modality (images, text, speech, etc.); however, real-world scenarios present data in a multimodal fashion: we see objects, hear sounds, feel textures, smell odours and taste flavours. Multimodality refers to the fact that real-world concepts can be described by multiple modalities. Moreover, recent years have seen an explosion of multimodal data on the web. Typically, users combine text, image, audio or video to sell a product on an e-commerce platform or to express views on social media. It is well known that multimodal data may provide richer information for capturing a particular "concept" than a single modality. For example, consider adverts typically available on an e-commerce platform: one pair of adverts has seemingly similar captions but dissimilar images, while another pair has different captions but seemingly similar images. Such scenarios are typically faced in multimodal classification. Similarly, consider examples of multimodal data collected from a social media platform: if only the text descriptions are considered, entities may be wrongly labelled, so visual context is beneficial for resolving ambiguities. In addition, different modalities generally carry different kinds of information, which together may provide a richer understanding; for example, the sight of a flower may evoke happiness, while its scent might not be pleasant. Multimodal information may therefore be useful for making an informed decision.

Name of the Researcher:  Shah Nawaz

Name of the Supervisor(s):  Prof. Ignazio Gallo

Year of Completion:  2019

University:  University of Insubria

