IRMS and Expressive Gesture Analysis

The basic concept in the paradigm of Reflexive Interaction is to establish a dialogue between the user and the machine, in which the user tries to “teach” the machine his/her musical language. The effectiveness of such a dialogue will be greatly increased by introducing the dimensions of expressivity and emotion. The MIROR project will include the design of an expressive interface for real-time analysis of emotions through multimodal, expressive features, to further develop children's capacity for improvisation, composition and creative performance. Expressivity and emotion have been demonstrated to be key factors in human-machine interaction (e.g., see research on Affective Computing, Picard 1997). It has also long been recognized that embodiment, in terms of movement, gesture, and emotion, plays a relevant role in children's musical learning (Delalande 1993, Custodero 2005, McPherson 2006, Imberty & Gratier 2007). Moreover, recent studies show that complex intra-personal and inter-personal phenomena are based on non-verbal communication through human full-body movement (Wallbott 1998). An innovative aspect of the MIROR project will be to endow Interactive Reflexive Musical Systems with the capability of exploiting mechanisms of expressive emotional communication. In particular, this novel generation of IRMS (emotional IRMS) will integrate modules able to analyze and process in real time the expressive content conveyed by users through their full-body movement and gesture.

Concretely, techniques will be developed for extracting expressive motion features characterizing the behaviour of the users (e.g., their motoric activation, contraction/expansion, hesitation, impulsivity, fluency) and for mapping such features onto real-time control of acoustic parameters, in order to render expressivity in a musical performance. Inputs for such techniques will be images from one or more video cameras and possibly biometric sensors (e.g., accelerometers embedded in hand-held mobile devices). Machine learning techniques will be applied to classify motion into expressive categories and to compute trajectories in high-level emotional spaces derived from dimensional theories of emotion (e.g., Russell, 1980, Tellegen et al., 1999). As a result, such techniques will enable the use of expressive gesture for improvisation (e.g., users can mould and modify in real time, by means of their expressive gesture, the sound and music content they are creating), for composition (e.g., the choice of the mapping between expressive features and acoustic parameters is a compositional issue), and for creation (e.g., the availability of high-level gesture features provides novel possibilities and novel languages for music creation).
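The dimensional approach described above can be illustrated with a minimal sketch: normalized motion features are projected onto a two-dimensional valence/arousal space (in the spirit of Russell's circumplex), and the resulting point is matched against prototype positions of basic emotion categories. The feature weights and prototype coordinates below are illustrative assumptions, not values from the project.

```python
import math

# Illustrative prototype positions of basic emotions in (valence, arousal)
# space; the coordinates are assumptions for the sake of the sketch.
EMOTION_PROTOTYPES = {
    "joy":      (0.8, 0.6),
    "anger":    (-0.7, 0.7),
    "sadness":  (-0.6, -0.5),
    "serenity": (0.6, -0.4),
}

def features_to_valence_arousal(activation, fluency, expansion):
    """Map normalized motion features (each in [0, 1]) to a point in the
    valence/arousal plane. The weights are illustrative: fluent, expanded
    movement reads as positive valence; high activation as high arousal."""
    valence = 2.0 * (0.5 * fluency + 0.5 * expansion) - 1.0
    arousal = 2.0 * activation - 1.0
    return valence, arousal

def classify(valence, arousal):
    """Nearest-prototype classification in the 2D emotional space."""
    return min(EMOTION_PROTOTYPES,
               key=lambda e: math.dist((valence, arousal),
                                       EMOTION_PROTOTYPES[e]))

# Highly activated, fluent, expanded movement lands near "joy".
v, a = features_to_valence_arousal(activation=0.9, fluency=0.8, expansion=0.7)
print(classify(v, a))  # joy
```

In a real system the mapping from features to the emotional space would of course be learned from labelled movement data rather than hand-weighted, but the pipeline shape (features → dimensional space → category) is the same.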

Visual feedback will also be designed to respond to the user's expressive motor behaviour. The figure below shows an example of a system developed by UNIGE, in which expressive gesture features (mainly motoric activation, contraction/expansion, impulsivity, fluency), classified with respect to basic emotion categories, are used to control in real time the expressive performance of a music piece in an improvisation session. The system also includes visual feedback in which colours are associated with the recognized emotions (Castellano et al., 2007).
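The colour-feedback idea amounts to a lookup from recognized emotion category to display colour. A minimal sketch follows; the palette is an assumption for illustration, not the one used by Castellano et al.

```python
# Illustrative emotion-to-colour mapping for visual feedback.
# The RGB palette is an assumption, not the cited system's actual choice.
EMOTION_COLOURS = {
    "joy":      (255, 215, 0),    # warm yellow
    "anger":    (220, 20, 60),    # red
    "sadness":  (70, 130, 180),   # cool blue
    "serenity": (144, 238, 144),  # soft green
}

def feedback_colour(emotion, default=(128, 128, 128)):
    """Return the RGB colour for a recognized emotion; neutral grey
    when the emotion is unknown or no emotion is detected."""
    return EMOTION_COLOURS.get(emotion, default)

print(feedback_colour("anger"))  # (220, 20, 60)
```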

fig. 1.3

Figure 1.3. An installation using the EyesWeb XMI Platform (from Castellano et al. 2007). The coloured silhouettes are examples of visual feedback as presented to the user. The silhouette corresponds to the current instant in time; previous movements with different expressivity remain visible as traces left behind it. The current input frame is shown on the left-hand side of each silhouette.

Research on emotional IRMS will be grounded in the EyesWeb XMI platform (Camurri et al., 2005; see Figure 1.4). EyesWeb XMI (for eXtended Multimodal Interaction) is an open platform that supports multimodal processing in both its conceptual and technical aspects, and that allows fast development of robust application prototypes for use in artistic performances, in education, and in interactive multimedia installations and applications.

The EyesWeb XMI platform consists of two main components: a kernel and a graphical user interface (GUI). The kernel is a dynamically pluggable component which takes care of most of the tasks performed by EyesWeb XMI. Since XMI is designed to be extended with third-party software modules (which we call blocks), the first task performed by the kernel is the registration and organization of such extensions into a coherent set of libraries. Most importantly, the kernel contains the EyesWeb execution engine, which manages the actual execution of EyesWeb applications (which we call patches). The GUI manages interaction with the user and provides all the features needed to design patches. Kernel and GUI are not tightly coupled: several GUIs may therefore be developed for different applications.

The main new feature introduced in EyesWeb XMI is the possibility of synchronizing multimodal streams of data having different clocks. EyesWeb XMI can simultaneously and transparently support a wide range of devices (e.g., video cameras, microphones, physiological sensors, shock sensors, accelerometers), allowing the integrated development of expressive multimodal interactive systems. As for interoperability, an instance of EyesWeb XMI can easily connect with other instances on the same machine or on other machines, and can also connect with other applications, either locally or remotely, through standard network protocols (e.g., TCP/IP).
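To make the kernel/block/patch model concrete, the following sketch shows how a kernel might register third-party blocks and execute a patch as an ordered chain of them. This is not the EyesWeb API (EyesWeb is a visual dataflow environment, not a Python library); it is a minimal illustration of the concepts, with made-up block names.

```python
# A minimal sketch of the block/patch dataflow idea: a kernel registers
# third-party processing blocks and runs a patch built from them.
# This is NOT the actual EyesWeb XMI API.

class Kernel:
    def __init__(self):
        self._blocks = {}  # registry of available block types

    def register_block(self, name, fn):
        """Register a third-party processing block under a name."""
        self._blocks[name] = fn

    def run_patch(self, patch, data):
        """Execute a patch: an ordered list of block names, each block
        transforming the data stream produced by the previous one."""
        for name in patch:
            data = self._blocks[name](data)
        return data

def smooth(xs):
    """Illustrative block: moving average over a 3-sample window."""
    out = []
    for i in range(len(xs)):
        window = xs[max(0, i - 1):i + 2]
        out.append(sum(window) / len(window))
    return out

kernel = Kernel()
kernel.register_block("smooth", smooth)
# Illustrative block: threshold a motion-energy stream into activity flags.
kernel.register_block("threshold", lambda xs: [x > 0.5 for x in xs])

print(kernel.run_patch(["smooth", "threshold"], [0.2, 0.9, 0.8, 0.1]))
# [True, True, True, False]
```

In the real platform, blocks additionally carry typed input/output pins and timing metadata so that the execution engine can synchronize streams with different clocks; the sketch above deliberately omits that machinery.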

fig. 1.4

Figure 1.4: A running EyesWeb application for expressive gesture processing. Expressive gesture qualities such as energy and fluidity are extracted and mapped into a 2D space (trajectory in the upper-left window). Occupation rates for regions of that space are then computed. In the case displayed in the figure, a high proportion of quick and fluid gestures is detected.