This work presents a novel deep generative model for unsupervised learning of sparse binary feature representations with data-adaptive dimensionality directly from continuous speech streams. While it shares the key assumption of unbounded latent dimensionality with previously proposed Bayesian non-parametric approaches, our model can capture much richer, non-Markovian dependencies among its latent representations. The present work investigates the model's ability to learn linguistically meaningful representations under challenging, realistic scenarios: we train the model on highly speaker-imbalanced datasets and evaluate it on the ABX phone discriminability test. Despite a substantial handicap, namely limited or no access to speaker information during training, our model achieves performance competitive with the state-of-the-art model.