BreMM19 | Ewerth

Ralph Ewerth
Leibniz University | Hanover, Germany

Towards automatic interpretation of multimodal information? – Computational approaches in the context of the visual/verbal divide

The focus of the 4th Bremen Conference on Multimodality is on empirical inroads to multimodal data research. In this talk, we address this topic from the computer science perspective. We give an overview of work on automatic understanding of multimodal data, in particular from the perspective of (multimedia) information retrieval. We start with a brief survey of the capabilities of today’s systems to interpret visual information and compare them with human performance. Which visual objects and artefacts can machines recognize well, and how precise are they compared with human perception? From a computational perspective, it is difficult to “understand” the (intended) meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that the automatic understanding and interpretation of a single source of information (e.g., textual, visual or audio) is already difficult, and it is even more difficult to model and understand the interplay of two different modalities. While the visual/verbal divide has been investigated in the communication sciences for years, as summarized for instance by Bateman (2014), it has rarely been considered from an information retrieval perspective. To address this gap, we present machine learning approaches that automatically recognize semantic cross-modal relations defined along several dimensions: cross-modal mutual information, semantic correlation, and the relative abstractness level. These dimensions were introduced in our own previous work (Henning and Ewerth, 2017; Otto et al., 2019) and rely on Martinec and Salway’s (2005) taxonomy. The presented approaches utilize deep neural networks and multimodal embeddings. Because deep learning typically requires a large amount of training data, we describe two strategies for coping with this requirement.
Finally, we outline possible use cases in the fields of search as learning, indexing of open educational resources and scientific papers, or automatic interpretation of multimodal news and online communication.


J. Bateman (2014). Text and image: A critical introduction to the visual/verbal divide. Routledge.

C. A. Henning & R. Ewerth (2017). Estimating the Information Gap between Textual and Visual Representations. In Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), Bucharest, Romania, 14–22. ACM.

R. Martinec & A. Salway (2005). A system for image–text relations in new (and old) media. Visual Communication, 4(3), 337–371.

C. Otto, S. Holzki, & R. Ewerth (2019). “Is this an example image?” – Predicting the relative abstractness level of image and text. In European Conference on Information Retrieval (ECIR), Cologne, Germany. Accepted for publication.


Prof. Dr. Ralph Ewerth studied computer science, with a minor in psychology, at the Universities of Frankfurt am Main and Marburg and graduated in 2002 (Diploma). From 2002 to 2012 he worked as a research assistant at the Universities of Siegen and Marburg, and in 2008 he received his PhD in Marburg for work on machine learning methods for automatic video analysis. From 2012 to 2015 he was Professor for Image Processing and Media Technology at the Jena University of Applied Sciences. Since 2015, Prof. Ewerth has been Professor at the Leibniz Universität Hannover and Head of the Visual Analytics Research Group at TIB (Leibniz Information Centre for Science and Technology). Since 2016, he has been a member of the L3S Research Center in Hannover. Prof. Ewerth has published more than 60 scientific articles. His main research interests are automatic understanding of multimedia/multimodal data, multimedia information retrieval, image-text relations (vision and language), and search as learning.
