Several multimodal modules combining input modalities such as speech and gesture have already been implemented (cf. CMC'98, IJCAI'97, CMC'95, IMMI'95). These multimodal modules could be compared on the basis of their evaluation in user studies. However, user studies are conducted with different protocols from one multimodal module to another, which makes such comparison difficult. Multimodal systems can also be compared on the basis of the number and complexity of the modalities they process. Yet it seems inappropriate to take the monomodal aspects of multimodal modules into account when comparing them.
In this paper, we propose two new criteria for comparing multimodal modules from a software-engineering point of view: how well these multimodal modules address the upgrading and composing problems.