This paper tackles the challenge of “live” one-shot voice conversion (VC), which performs conversion across arbitrary speakers in a streaming fashion while retaining high intelligibility and naturalness. We propose a VC model based on hybrid unsupervised and supervised learning, trained with a two-stage strategy. Specifically, we first employ an unsupervised disentanglement framework that separates speech representations of different granularity using a mutual information constraint and vector quantization. We then augment linguistic content modeling with a supervised ASR acoustic encoder. To perform live conversion, we build the model from streamable neural networks and run it in streaming mode with sliding windows, as sketched below. Experimental results demonstrate that the proposed method achieves performance comparable to offline VC solutions in speech naturalness, intelligibility, and speaker similarity, while being efficient enough for practical real-time applications. Audio samples are available online for demonstration.
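To make the sliding-window streaming mode concrete, the following is a minimal Python sketch of a chunk-by-chunk conversion loop. The window and hop sizes, the `model.convert` interface, and the frame format are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

# Assumed sizes for illustration only; the paper's actual window/hop
# configuration may differ.
WINDOW = 40  # frames of input context per inference call
HOP = 20     # new frames emitted per call; the rest is overlap/history

def stream_convert(model, frame_source, target_embedding):
    """Convert an incoming frame stream chunk by chunk.

    `frame_source` yields acoustic feature frames (e.g., mel frames);
    `target_embedding` is a fixed one-shot target-speaker embedding.
    Both `model` and its `convert` method are hypothetical stand-ins
    for a streamable VC network.
    """
    buffer = []
    for frame in frame_source:
        buffer.append(frame)
        if len(buffer) < WINDOW:
            continue  # wait until a full analysis window is buffered
        window = np.stack(buffer)  # exactly WINDOW frames at this point
        converted = model.convert(window, target_embedding)
        yield converted[-HOP:]     # emit only the newest hop of frames
        buffer = buffer[HOP:]      # slide the window forward by one hop
```

Because each call re-sees WINDOW − HOP frames of past context but emits only the newest HOP frames, latency is bounded by the window length while the overlap preserves continuity at chunk boundaries.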