In spatial-audio-enabled systems, evaluating the quality of spatialization is essential. This paper proposes a new objective metric that measures the spatialization quality (SQ) between any pair of binaural signals while remaining agnostic to speech content and signal duration. We formulate SQ as a metric learning problem and compute a deep-feature distance on embeddings learned with a triplet loss and multi-task learning, using direction-of-arrival estimation and binaural speech synthesis as auxiliary tasks. We demonstrate the robustness of our model in terms of localization in seen and unseen contexts, monotonicity with increasing angular distance, content invariance, and retrieval performance. Experiments show that our metric correlates well with publicly available subjective ratings and that it yields improvements when used as a differentiable loss in a binaural speech enhancement system.
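
To make the metric-learning formulation concrete, the following is a minimal sketch of a triplet loss over embedding vectors. It assumes a Euclidean distance and a margin of 1.0, neither of which is specified above. The three-dimensional embeddings are hypothetical placeholders for the outputs of a learned encoder, and the function names are illustrative only.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form, not taken from the paper).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = l2_distance(anchor, positive)
    d_neg = l2_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical embeddings: the positive stands for a signal spatialized
# like the anchor, the negative for one rendered from a distant direction.
anchor = [0.0, 0.0, 0.0]
positive = [0.3, 0.4, 0.0]
negative = [3.0, 4.0, 0.0]

print("d(anchor, positive):", l2_distance(anchor, positive))
print("d(anchor, negative):", l2_distance(anchor, negative))
print("triplet loss:", triplet_loss(anchor, positive, negative))
```

Under these assumptions, the same distance function applied to the embeddings of two binaural signals would serve as the SQ score at inference time, with smaller distances indicating more similar spatialization.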