This paper introduces and evaluates a collaborative task designed to elicit auditory-visual dialogs. The task was based on the viewing of two versions of the same cartoon film, edited so that information from the two incomplete versions had to be shared in a consecutive fashion in order to reconstruct the story. The aim of this design was to elicit a relatively balanced dialog between the two participants as the story is pieced together from beginning to end. The current paper describes the production of a corpus consisting of audio, video and motion capture data from 22 pairs of Australian English-speaking participants, and presents results on turn distribution and raw prosodic features. Our analysis showed that the task could produce relatively balanced dialogs, although this was not the case for all pairs. Analysis of raw prosodic features did not suggest that convergence occurred over the course of the conversation, but replicated earlier findings of greater similarity between conversational partners than between non-partners.