We propose a simple yet novel multi-layer model for phonetic classification. The model combines a frame-level transformation of the acoustic signal with segment-level phone classification. Our key contribution is the study of new temporal pooling strategies that interface these two levels by determining how frame scores are converted into segment scores. On the TIMIT benchmark, we match the best performance obtained using a single classifier. We further use diversity in pooling strategies to generate candidate classifiers with complementary performance characteristics, which perform even better as an ensemble. Without the use of any phonetic knowledge, our ensemble model achieves a phone classification error rate of 16.96%. Although our data-driven approach is exhaustive, the combinatorial inflation is confined to the smaller, segmental half of the system.
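
To make the role of temporal pooling concrete, the following is a minimal sketch of how a variable-length matrix of frame scores might be collapsed into a fixed-length segment vector. It is an illustration under stated assumptions, not the model evaluated here: the function name `pool_segment`, the three strategies (`"mean"`, `"max"`, `"parts"`), and the `n_parts` parameter are all hypothetical choices made for this example.

```python
import numpy as np


def pool_segment(frame_scores, strategy="mean", n_parts=3):
    """Collapse (n_frames, n_dims) frame scores into one segment vector.

    Hypothetical helper for illustration; the strategies below are
    assumed examples of temporal pooling, not the ones studied here.
    """
    scores = np.asarray(frame_scores, dtype=float)
    if scores.ndim != 2 or scores.shape[0] == 0:
        raise ValueError("expected a non-empty (n_frames, n_dims) array")

    if strategy == "mean":
        # Average each score dimension over all frames of the segment.
        return scores.mean(axis=0)

    if strategy == "max":
        # Keep the strongest response of each dimension over time.
        return scores.max(axis=0)

    if strategy == "parts":
        # Split the segment into n_parts contiguous chunks along time,
        # mean-pool each chunk, and concatenate the results so that
        # coarse temporal order is preserved.
        whole = scores.mean(axis=0)
        chunks = np.array_split(scores, n_parts, axis=0)
        # Segments shorter than n_parts yield empty chunks; fall back
        # to the whole-segment mean for those.
        pooled = [c.mean(axis=0) if len(c) else whole for c in chunks]
        return np.concatenate(pooled)

    raise ValueError(f"unknown pooling strategy: {strategy!r}")


if __name__ == "__main__":
    # Three frames, two score dimensions (toy values).
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    for name in ("mean", "max", "parts"):
        print(name, pool_segment(frames, strategy=name))
```

Each strategy yields a segment representation whose length does not depend on the number of frames, which is what lets a single segment-level classifier consume segments of different durations. Training one such classifier per pooling strategy is one plausible way to obtain the diverse candidates that an ensemble would then combine.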