BreMM19 | Cassens & Wegener

Jörg Cassens and Rebekah Wegener
University of Hildesheim | Hildesheim, Germany
Paris Lodron University of Salzburg | Salzburg, Austria

Multi-modal Markers for Meaning: Making Use of Behavioural, Acoustic and Textual Cues to Automatically Identify Importance in Lectures

Identifying and extracting important information from monologic interaction is a difficult task for humans and modelling this for intelligent systems is a big challenge. Here, we concentrate on academic lectures as a specific form of monologic interaction. One goal is to support students through a system alerting them to important aspects of a lecture.

Importance is a difficult concept to work with, and one that superficially could be replaced by concepts such as salience or prominence. There is an established body of research on detecting saliency in both video and audio data [2]. We argue, however, that retaining the concept of importance is crucial for making a distinction between the speaker-driven concepts of salience and prominence and the contextual concept of importance.

Meaning making in human interaction is most often multi-modal and this feature can be exploited for dynamic, automated, context-dependent information extraction [4]. Drawing on semiotic models of expression, gesture, and behaviour; linguistic models of text structure and sound; and a rich model of context, we argue that the combination of these modalities to form multi-modal ensembles through data triangulation provides a better basis for information extraction than each modality alone. By using a rich model of context [3] that maps the unfolding of the text in real time with features of the context, it is also possible to advance from notification about importance to producing real-time, query-driven summarization on demand.
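
As a minimal sketch of the triangulation idea, assuming each modality has already been reduced to an importance score between 0 and 1 per lecture segment, the scores could be fused and a segment flagged only when several modalities agree. The names `SegmentCues`, `fuse` and `flag_important`, the equal weights and all thresholds are hypothetical choices for illustration; they do not describe the system reported in [5].

```python
from dataclasses import dataclass


@dataclass
class SegmentCues:
    """Hypothetical per-modality importance scores in [0, 1] for one segment."""
    behavioural: float
    acoustic: float
    textual: float

    def as_tuple(self):
        return (self.behavioural, self.acoustic, self.textual)


def fuse(cues, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three modality scores (assumed late fusion)."""
    scores = cues.as_tuple()
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def flag_important(segments, threshold=0.6, per_modality=0.5, min_agreeing=2):
    """Return indices of segments whose fused score is high AND which are
    supported by at least `min_agreeing` modalities (triangulation)."""
    flagged = []
    for index, cues in enumerate(segments):
        agreeing = sum(score >= per_modality for score in cues.as_tuple())
        if fuse(cues) >= threshold and agreeing >= min_agreeing:
            flagged.append(index)
    return flagged


# Illustrative values only, not measured data.
segments = [
    SegmentCues(behavioural=0.9, acoustic=0.8, textual=0.7),   # all cues agree
    SegmentCues(behavioural=0.95, acoustic=0.1, textual=0.1),  # one cue only
    SegmentCues(behavioural=0.2, acoustic=0.3, textual=0.1),   # no cue
]
print(flag_important(segments))
```

In this sketch a single strong cue, such as an emphatic gesture without acoustic or textual support, does not trigger a notification on its own; that is one simple way to express the claim that the ensemble is a better basis than each modality alone.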

One goal of the research presented is to develop a system to support the development of student listening and note-taking skills by alerting them to important aspects of academic lectures. In this presentation, we will discuss how our theoretical and empirical findings have guided the design and development of an early implementation of such a system [5] and how its evaluation feeds back into the theory. We will also discuss the principles, technologies and methods from natural language processing, machine learning and knowledge representation that have been combined to form a computing pipeline for detection of importance.

The work presented is part of a wider theory-guided action research program that uses multi-modal markers of importance to automatically extract key information from lectures and summarise it, as a step towards being able to identify and track contextually relevant importance in spoken language in real time. We base our contextual analysis on models we have proposed earlier [1]. Actual system development combines human-centred and feature-driven approaches.

Regarding future work, lectures are not the only domain where notification of importance and summarisation are useful. Moving from monologic (lecture) to dialogic situations, doctor-patient consultations are one of the next targets. Increasing in complexity, multi-participant situations such as team meetings (potentially including multi-language features) can also be considered.


  1. Butt, D., Wegener, R., Cassens, J.: Modelling behaviour semantically. In: Brézillon, P., Blackburn, P., Dapoigny, R. (eds.) Proceedings of CONTEXT 2013. pp. 343-349. No. 8175 in LNCS, Springer, Annecy, France (2013)
  2. Evangelopoulos, G., Zlatintsi, A., Potamianos, A., Maragos, P., Rapantzikos, K., Skoumas, G., Avrithis, Y.: Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Transactions on Multimedia 15(7), 1553-1568 (2013)
  3. Hasan, R.: Situation and the definition of genre. In: Grimshaw, A. (ed.) What’s going on here? Complementary Analysis of Professional Talk: volume 2 of the multiple analysis project. Ablex, Norwood, NJ, USA (1994)
  4. Maskey, S., Hirschberg, J.: Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In: Interspeech 2005. pp. 621-624 (2005)
  5. Ude, J., Schüller, B., Wegener, R., Cassens, J.: A pipeline for extracting multimodal markers for meaning in lectures. In: Cassens, J., Wegener, R., Kofod-Petersen, A. (eds.) Proceedings of the Tenth International Workshop on Modelling and Reasoning in Context. pp. 16-21. No. 2134 in CEUR Workshop Proceedings, Aachen, Germany (July 2018).


Jörg Cassens: Jörg Cassens is a lecturer and senior researcher in media informatics at the University of Hildesheim, Germany. His main research interests are the applicability of socio-technical, psychological and semiotic theories for design, implementation and deployment of intelligent systems. Working at the intersection of Human-Computer Interaction and Artificial Intelligence, he is particularly interested in the usability of and user experience with computational systems. He has worked on the development of psychologically sound context models, interfaces based on meaning bearing behaviour and on requirements engineering methodologies for intelligent systems.

Rebekah Wegener: Rebekah Wegener is a lecturer and senior researcher in linguistics and semiotics at the Paris Lodron University Salzburg, Austria, and co-founder of learning technology startup Audaxi in Sydney, Australia. Her research interests include context modelling, theoretical and applied linguistics as well as intelligent learning and teaching technologies. She is currently working on models of context for text understanding and multimodal environments as well as behavioural interfaces for artificial intelligence.
