We present a technique for the automated assessment of public speaking and presentation proficiency based on the analysis of concurrently recorded speech and motion capture data. For the Kinect motion capture data, we examine both time-aggregated and time-series-based features. The former are statistical functionals of body-part position and/or velocity computed over the entire recording; the latter, dubbed histograms of co-occurrences, capture how often different broad postural configurations co-occur at different time lags over the course of the multimodal time series. We examine the relative utility of these features, along with curated features derived from the speech stream, in predicting human-rated scores of different aspects of public speaking and presentation proficiency. We further show that these features outperform the human inter-rater agreement baseline for a subset of the analyzed aspects.
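To make the co-occurrence feature concrete, the following is a minimal sketch, not the authors' exact implementation. It assumes that each frame of motion capture data has already been quantized into one of a small number of discrete postural configurations (e.g., by clustering), and that co-occurrence counts are accumulated at a hand-picked set of frame lags and then normalized by recording length; the paper's actual choice of lags, directionality, and normalization may differ.

```python
import numpy as np

def cooccurrence_histogram(labels, num_configs, lags=(1, 5, 10)):
    """Count how often postural configuration `a` at frame t co-occurs with
    configuration `b` at frame t + lag, for each lag in `lags`.

    labels      : 1-D integer array, one quantized postural configuration per frame
    num_configs : number of distinct configurations (assumed cluster count)
    lags        : frame offsets at which co-occurrence is counted (assumed values)
    """
    labels = np.asarray(labels)
    hist = np.zeros((len(lags), num_configs, num_configs))
    for k, lag in enumerate(lags):
        # Pair each frame's configuration with the configuration `lag` frames later.
        for a, b in zip(labels[:-lag], labels[lag:]):
            hist[k, a, b] += 1
    # Normalize each lag slice so recordings of different lengths are comparable.
    totals = hist.sum(axis=(1, 2), keepdims=True)
    hist = np.divide(hist, totals, out=np.zeros_like(hist), where=totals > 0)
    return hist.ravel()  # flatten into a fixed-length feature vector

# Example: 4 hypothetical postural configurations over a short recording
frames = [0, 0, 1, 2, 2, 2, 3, 1, 0, 0, 1, 3]
features = cooccurrence_histogram(frames, num_configs=4, lags=(1, 3))
print(features.shape)  # (2 lags * 4 * 4,) = (32,)
```

The resulting fixed-length vector can then be fed to a standard regressor alongside the time-aggregated and speech-derived features when predicting the human-rated proficiency scores.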