Strange that I didn’t see this before. Apparently, the W3C has published a draft specification for multimodal annotation called EMMA, the Extensible MultiModal Annotation markup language. The abstract of the document reads:
The W3C Multimodal Interaction working group aims to develop specifications to enable access to the Web using multimodal interaction. This document is part of a set of specifications for multimodal systems, and provides details of an XML markup language for containing and annotating the interpretation of user input. Examples of interpretation of user input are a transcription into words of a raw signal, for instance derived from speech, pen or keystroke input, a set of attribute/value pairs describing their meaning, or a set of attribute/value pairs describing a gesture. The interpretation of the user’s input is expected to be generated by signal interpretation processes, such as speech and ink recognition, semantic interpreters, and other types of processors for use by components that act on the user’s inputs such as interaction managers.
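To get a feel for what "containing and annotating the interpretation of user input" could look like in practice, here is a minimal sketch in Python. The sample document is my own illustration, not taken from the abstract: the namespace URI, the `emma:interpretation` element and the `emma:mode`, `emma:medium`, `emma:confidence` and `emma:tokens` attributes reflect my understanding of the EMMA drafts and should be checked against the specification. The `origin` and `destination` elements and the `read_interpretations` helper are purely hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI as I understand it from the EMMA drafts.
EMMA_NS = "http://www.w3.org/2003/04/emma"

# Illustrative document: one spoken utterance, its word transcription
# (tokens) and a set of attribute/value pairs describing its meaning.
SAMPLE = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1"
      emma:medium="acoustic" emma:mode="voice"
      emma:confidence="0.75"
      emma:tokens="flights from boston to denver">
    <origin>Boston</origin>
    <destination>Denver</destination>
  </emma:interpretation>
</emma:emma>
"""


def read_interpretations(xml_text):
    """Return one dict per emma:interpretation element found in xml_text."""
    root = ET.fromstring(xml_text)
    results = []
    for interp in root.iter(f"{{{EMMA_NS}}}interpretation"):
        # Helper to read an attribute in the EMMA namespace (None if absent).
        def annotation(name, node=interp):
            return node.get(f"{{{EMMA_NS}}}{name}")

        confidence = annotation("confidence")
        results.append({
            "id": interp.get("id"),
            "medium": annotation("medium"),
            "mode": annotation("mode"),
            "confidence": float(confidence) if confidence is not None else None,
            "tokens": annotation("tokens"),
            # Application-specific children carry the attribute/value pairs.
            "slots": {child.tag: child.text for child in interp},
        })
    return results


if __name__ == "__main__":
    for item in read_interpretations(SAMPLE):
        print(item)
```

The annotations (mode, confidence, tokens) live in the EMMA namespace, while the meaning itself sits in application-defined elements, which is roughly the split between "annotating" and "containing" that the abstract describes.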
I am a bit sceptical of systems that claim to be “multimodal”, as they usually turn out to be more mono- or bimodal than truly multimodal, but I will have to read the specification before I can draw any conclusions.