Multi-level segmentation of speech signals has grown steadily in popularity. It yields a rich representation that captures both coarse and fine acoustic information in a uniform framework. However, determining which segments are useful, and how they should be combined to arrive at the correct segmentation of the acoustic signal, has proved difficult. The approach described in this paper uses segment classification confidences together with dynamically generated segment duration constraints for disambiguation. An experimental upper bound on the performance achievable with duration constraints is determined. Experiments show that the resulting segmentation compares well with a manual phonetic transcription.
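To make the disambiguation idea concrete, the following is a minimal sketch, not the method of this paper, of how classification confidences and duration constraints might jointly select one segmentation from a multi-level set of candidate segments. Everything in it is an assumption made for illustration: the `Segment` record, the `best_segmentation` function, the duration-weighted log-confidence score, and the fixed per-label `duration_limits` table (the paper generates its duration constraints dynamically, which this sketch does not attempt).

```python
import math
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    start: int    # first frame of the segment (inclusive)
    end: int      # frame after the last one (exclusive)
    label: str    # hypothesised class label
    conf: float   # classifier confidence, assumed to lie in (0, 1]


def best_segmentation(segments, total_frames, duration_limits):
    """Return the contiguous chain of segments covering [0, total_frames)
    with the highest duration-weighted log-confidence. Segments whose length
    falls outside the (min, max) limits for their label are discarded.
    Returns an empty list when no admissible chain covers the signal."""
    by_start = defaultdict(list)
    for seg in segments:
        lo, hi = duration_limits.get(seg.label, (1, total_frames))
        if lo <= seg.end - seg.start <= hi:
            by_start[seg.start].append(seg)

    # best[t] holds (score, chain) for the best chain covering [0, t).
    best = {0: (0.0, [])}
    for t in range(total_frames):
        if t not in best:
            continue  # no admissible chain ends at frame t
        score, chain = best[t]
        for seg in by_start[t]:
            # Weighting by duration keeps coarse and fine segments comparable.
            cand = score + (seg.end - seg.start) * math.log(seg.conf)
            if seg.end not in best or cand > best[seg.end][0]:
                best[seg.end] = (cand, chain + [seg])
    return best.get(total_frames, (float("-inf"), []))[1]


# One coarse segment competes with two finer ones over the same 10 frames.
candidates = [
    Segment(0, 10, "V", 0.5),
    Segment(0, 4, "s", 0.9),
    Segment(4, 10, "a", 0.8),
]
unconstrained = best_segmentation(candidates, 10, {})
constrained = best_segmentation(candidates, 10, {"s": (1, 3)})
```

In this toy input the finer pair wins on confidence alone, while a hypothetical limit of at most 3 frames for "s" rules that pair out and leaves the coarse segment as the only admissible cover.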