Current speech and language processing relies on massive amounts of annotations and textual resources to train acoustic and language models. This is not sustainable for the majority of the world's languages, and prevents us from addressing the full complexity and mutability of conversational speech. Yet, young children across all linguistic communities autonomously learn how to communicate in their native language(s) before they even know how to read and write. In this talk, I identify three roadblocks along the path of reverse engineering this ability: unsupervised structure discovery, multimodal contextual grounding, and data-efficient learning. I review recent work conducted in these three areas, present results and lessons from the "Zero Resource" challenge series, and propose a path forward to improving the technology and ultimately building fully autonomous text-free language processing systems.