This paper describes a many-to-many voice conversion model that filters the speaker vector to control high-level attributes such as speaking rate while preserving voice timbre. To control the speaking rate alone, the speaker vector must be decomposed into a speaking-rate vector and a vector capturing the remaining attributes. The challenge is to learn such disentangled representations with little or no annotated data. To address this difficulty, we propose an approach that combines a conditional filtering method with data augmentation. Experimental results show that our method disentangles complex attributes without annotation and controls speaking rate and voice timbre independently. Audio samples are available on our web page.