The performance of a speech recogniser, or of any other pattern classifier, depends strongly on the input features: to perform well, the feature set needs to be both highly discriminative and compact. Linear discriminant analysis (LDA) is a common data-driven method for finding linear transformations that map large feature vectors onto smaller ones while retaining most of their discriminative power. LDA, however, oversimplifies the problem by condensing all class information into only two scatter matrices, thereby losing important information about the individual class distributions. We therefore propose a new approach, based on the mutual information or minimum classification error paradigm, which takes all information on the individual class distributions into account while searching for an optimal subspace, thus avoiding the crude approximations made by LDA. Experiments show that the proposed scheme provides more discriminative feature vectors, leading to substantially better recognition results.
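
For reference, the two scatter matrices that LDA relies on can be written in the textbook form below. This is a sketch of standard LDA only, not of the proposed method. The notation is assumed here for illustration and does not come from the text: $C$ classes, $N_c$ training vectors $x_i$ in class $c$, class means $\mu_c$, global mean $\mu$, and a $d \times p$ projection matrix $A$ with $p < d$.

```latex
\begin{align}
  S_W &= \sum_{c=1}^{C} \sum_{x_i \in c} (x_i - \mu_c)(x_i - \mu_c)^{\top}
      && \text{(within-class scatter)} \\
  S_B &= \sum_{c=1}^{C} N_c \, (\mu_c - \mu)(\mu_c - \mu)^{\top}
      && \text{(between-class scatter)} \\
  \hat{A} &= \arg\max_{A} \frac{\left| A^{\top} S_B A \right|}{\left| A^{\top} S_W A \right|}
      && \text{(LDA criterion)}
\end{align}
```

In this formulation the columns of $\hat{A}$ are commonly taken as the $p$ leading eigenvectors of $S_W^{-1} S_B$. The individual class covariances enter only through their pooled sum $S_W$, so any difference in shape between the class distributions is invisible to the criterion.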