Unsupervised word learning from unlabeled speech is a fundamental problem in zero-resource speech processing, and it enables dialogue agents to learn new words directly from spoken utterances. Embedded segmental K-means (ES-KMeans) is a representative unsupervised word segmentation method. However, it has a heterogeneous structure consisting of word boundary search based on dynamic programming, segment embedding, and K-means clustering, which prevents unified optimization. This paper proposes an end-to-end neural network version of the ES-KMeans model. We use a memory network to hold a dictionary of word embeddings, realizing word boundary search as forward propagation and clustering as backward propagation. Moreover, we replace the fixed embedding function of the original method with a learnable neural network. Experimental results on the ZeroSpeech Challenge 2020 package show that the proposed approach outperforms state-of-the-art methods.
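To make the memory-network idea concrete, the following is a minimal toy sketch (not the paper's actual implementation): a memory matrix holds K word embeddings, a forward pass soft-assigns a segment embedding to dictionary entries via attention over distances, and a backward-style update pulls the attended entries toward the segment, i.e. a soft K-means step realized as a gradient-like update. All names (`MemoryDictionary`, `num_words`, `lr`) are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

class MemoryDictionary:
    """Toy memory network holding num_words word embeddings of dimension dim.

    forward  : attention over memory entries (soft cluster assignment)
    backward : move entries toward the segment, weighted by attention
               (a soft K-means / gradient-style update)
    """
    def __init__(self, num_words, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.memory = rng.normal(size=(num_words, dim))

    def forward(self, segment):
        # Attention weights from negative squared Euclidean distances.
        d2 = ((self.memory - segment) ** 2).sum(axis=1)
        return softmax(-d2)

    def backward(self, segment, weights, lr=0.5):
        # Each entry moves toward the segment in proportion to its weight.
        self.memory += lr * weights[:, None] * (segment - self.memory)

# Usage: one forward/backward step on a single segment embedding.
dic = MemoryDictionary(num_words=3, dim=2)
seg = np.array([1.0, 1.0])
before = ((dic.memory - seg) ** 2).sum(axis=1).min()
w = dic.forward(seg)
dic.backward(seg, w)
after = ((dic.memory - seg) ** 2).sum(axis=1).min()
```

In the actual model, the segment embeddings would come from the learnable embedding network, and boundary search would score candidate segmentations using these attention responses rather than a single segment.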