Sound-imitation words, a sound-related subset of onomatopoeia, are important for computer-human interaction and for automatic tagging of sound archives. The main problem in automatically recognizing sound-imitation words is that their literal representation depends on the listener and is influenced by a particular cultural history. Based on our preliminary experiments on this dependency and on sonority theory, we found that the process of transforming environmental sounds into syllable-structure expressions is mostly listener-independent, whereas the process of transforming syllable-structure expressions into sound-imitation words is mostly listener-dependent and influenced by culture. This paper focuses on the former, listener-independent process and presents a three-stage architecture for automatically transforming environmental sounds into sound-imitation words: segmenting sound signals into syllables, identifying the syllable structure as morae, and recognizing morae as phonemes.
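
The three stages can be pictured as a simple composition of functions. The sketch below is only an illustration under stated assumptions, not the method of this paper: the names `segment_by_energy` and `transform`, the frame-energy threshold used for segmentation, and the two pluggable callables are all hypothetical, and the stand-ins in the usage lines carry no acoustic meaning.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[int, int]  # (start, end) sample indices, end exclusive


def segment_by_energy(signal: Sequence[float], frame_len: int,
                      threshold: float) -> List[Segment]:
    """Stage 1 stand-in (assumption): group consecutive fixed-length frames
    whose mean energy exceeds `threshold` into one syllable-like segment."""
    segments: List[Segment] = []
    start = None
    for i in range(0, len(signal), frame_len):
        frame = signal[i:i + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = i  # a new segment begins at this frame
        elif start is not None:
            segments.append((start, i))  # a quiet frame closes the segment
            start = None
    if start is not None:
        segments.append((start, len(signal)))
    return segments


def transform(signal: Sequence[float],
              segmenter: Callable[[Sequence[float]], List[Segment]],
              structure_of: Callable[[Sequence[float]], str],
              phonemes_of: Callable[[Sequence[float], str], str]) -> List[str]:
    """Chain the three stages: segments -> syllable structure -> phonemes."""
    result = []
    for start, end in segmenter(signal):
        chunk = signal[start:end]
        structure = structure_of(chunk)                # stage 2 (pluggable)
        result.append(phonemes_of(chunk, structure))   # stage 3 (pluggable)
    return result


# Toy usage with placeholder stages; the labels are arbitrary.
toy_signal = [0.0] * 4 + [1.0] * 8 + [0.0] * 4 + [0.5] * 4 + [0.0] * 4
toy_output = transform(
    toy_signal,
    segmenter=lambda s: segment_by_energy(s, frame_len=4, threshold=0.1),
    structure_of=lambda chunk: "long" if len(chunk) > 4 else "short",
    phonemes_of=lambda chunk, structure: f"<{structure}>",
)
```

Passing stages 2 and 3 as callables keeps the sketch neutral about how syllable structure and phonemes are actually recognized; only the order of the stages is taken from the description above.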