In this paper we present a new speech recognition strategy that uses diphones as the primary recognition unit within a time-event neural network (TENN) framework. TENN identifies a speech unit in two phases: event detection followed by classification. We investigate two implementation configurations, an integrated and a cascaded system, and report on their performance. Preliminary results show that, for some of the most frequent diphone classes in Finnish, diphone-level recognition rates of 93-97% are possible.
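
To make the two-phase idea concrete, the following is a minimal sketch of a cascaded arrangement in which event detection runs first and classification is applied only at the detected events. It is an illustration under stated assumptions, not the TENN implementation: the names `detect_events`, `classify_events` and `recognize_cascaded`, the peak-picking rule, the `threshold` and `context` parameters, and the use of plain callables in place of trained neural networks are all hypothetical. The integrated configuration is not shown.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]  # one feature vector per time frame (assumed representation)


def detect_events(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Phase 1 (assumed rule): keep frames whose detector score is a
    local peak at or above the threshold."""
    events = []
    for t in range(1, len(scores) - 1):
        is_peak = scores[t] > scores[t - 1] and scores[t] >= scores[t + 1]
        if is_peak and scores[t] >= threshold:
            events.append(t)
    return events


def classify_events(
    frames: Sequence[Frame],
    events: Sequence[int],
    classifier: Callable[[Sequence[Frame]], str],
    context: int = 2,
) -> List[Tuple[int, str]]:
    """Phase 2: label each detected event from a window of frames around it."""
    labels = []
    for t in events:
        lo = max(0, t - context)
        hi = min(len(frames), t + context + 1)
        labels.append((t, classifier(frames[lo:hi])))
    return labels


def recognize_cascaded(
    frames: Sequence[Frame],
    detector: Callable[[Frame], float],
    classifier: Callable[[Sequence[Frame]], str],
    threshold: float = 0.5,
    context: int = 2,
) -> List[Tuple[int, str]]:
    """Cascade: score every frame, pick events, then classify only at events."""
    scores = [detector(frame) for frame in frames]
    events = detect_events(scores, threshold)
    return classify_events(frames, events, classifier, context)
```

In this sketch `detector` and `classifier` stand in for whatever trained models are used; any function that maps a frame to a score and a window of frames to a diphone label can be passed in.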