This paper describes a framework for phonological concept formation, the task of acquiring an efficient representation of phonemes from spoken word samples without using any transcriptions other than the identity of each word. Phoneme models are represented as networks of segments, each of which forms a compact distribution of spectral features. We call this representation a phonological concept. Learning is performed as a search through a hypothesis space in which each hypothesis is produced by modifying a set of phoneme models. This system can potentially improve speech recognition performance.
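
To make the representation and the search concrete, the following is a minimal sketch, not the method of the paper. It assumes that a segment can be modelled as a diagonal Gaussian over spectral features, that a hypothesis is a mapping from phoneme labels to models, and that the search is a greedy hill climb. The abstract does not specify the modification operators or the evaluation criterion, so `neighbors` and `score` are hypothetical caller-supplied functions, and all names here (`Segment`, `PhonemeModel`, `search`, `shift_mean`, `toy_score`) are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class Segment:
    """A compact distribution of spectral features (assumed: diagonal Gaussian)."""
    mean: Tuple[float, ...]
    var: Tuple[float, ...]


@dataclass(frozen=True)
class PhonemeModel:
    """A network of segments: nodes plus directed arcs between node indices."""
    segments: Tuple[Segment, ...]
    arcs: Tuple[Tuple[int, int], ...] = ()


# A hypothesis is one complete set of phoneme models (label -> model).
Hypothesis = Dict[str, PhonemeModel]


def search(
    initial: Hypothesis,
    neighbors: Callable[[Hypothesis], Iterable[Hypothesis]],
    score: Callable[[Hypothesis], float],
    max_steps: int = 100,
) -> Tuple[Hypothesis, float]:
    """Greedy hill climbing (an assumed strategy): move to the best-scoring
    modified hypothesis until no modification improves the score."""
    current, best = initial, score(initial)
    for _ in range(max_steps):
        candidates = [(score(h), h) for h in neighbors(current)]
        if not candidates:
            break
        top_score, top = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            break  # local optimum: no modification helps
        current, best = top, top_score
    return current, best


# Toy usage: one phoneme with one 1-D segment. The hypothetical operator
# shifts the segment mean by +/-1, and the toy score prefers a mean of 3.
def shift_mean(h: Hypothesis) -> Iterable[Hypothesis]:
    for label, model in h.items():
        seg = model.segments[0]
        for delta in (-1.0, 1.0):
            new_seg = Segment(mean=(seg.mean[0] + delta,), var=seg.var)
            yield {**h, label: PhonemeModel(segments=(new_seg,), arcs=model.arcs)}


def toy_score(h: Hypothesis) -> float:
    return -sum((m.segments[0].mean[0] - 3.0) ** 2 for m in h.values())


start = {"a": PhonemeModel(segments=(Segment(mean=(0.0,), var=(1.0,)),))}
final, final_score = search(start, shift_mean, toy_score)
```

In a real system the score would presumably measure how well the models account for the word-labelled speech samples, and the operators would split, merge, or re-link segments. Both are left abstract here because the source does not define them.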