Advanced end-to-end ASR systems encode speech signals by means of a multi-layer network architecture. In Wav2vec2.0, for example, a CNN serves as a feature encoder, on top of which transformer layers map the high-dimensional CNN representations to the elements of some lexicon. Compared to the previous generation of 'modular' ASR systems, it is much more difficult to interpret the processing and representations in an end-to-end system from a phonetic point of view. We built a Wav2vec2.0-based end-to-end system for producing broad phonetic transcriptions of Dutch. In this paper we investigate to what extent the CNN features and the representations in several transformer layers of a pre-trained and fine-tuned model reflect widely shared phonetic knowledge. To that end, we analyze distances between phones, as well as the phonetic features of the most-activated phones in the output of an MLP classifier operating on the representations in several layers.
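The distance analysis mentioned above can be sketched as follows. This is an illustrative example, not the paper's actual pipeline: it assumes frame-level layer activations have already been extracted and aligned to phone labels, pools them into one mean vector ("centroid") per phone, and computes pairwise cosine distances between the centroids. The function name `phone_distance_matrix` and the toy data are inventions for this sketch.

```python
import numpy as np

def phone_distance_matrix(frames, labels):
    """Pairwise cosine distances between mean phone representations.

    frames: (n_frames, dim) array of layer activations (e.g. CNN or
            transformer outputs), one row per time frame.
    labels: length-n_frames sequence of phone labels aligned to the frames.
    Returns the sorted phone inventory and an (n_phones, n_phones)
    cosine-distance matrix.
    """
    phones = sorted(set(labels))
    labels = np.asarray(labels)
    # Mean representation (centroid) per phone.
    centroids = np.stack([frames[labels == p].mean(axis=0) for p in phones])
    # Cosine distance: 1 - (a . b) / (|a| |b|), via unit-normalized rows.
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T
    return phones, dist

# Toy example with random stand-in "activations" for three phones.
rng = np.random.default_rng(0)
frames = rng.normal(size=(300, 16))
labels = ["a"] * 100 + ["i"] * 100 + ["s"] * 100
phones, dist = phone_distance_matrix(frames, labels)
```

With real activations, such a matrix can be compared across layers to see whether phonetically similar phones (e.g. sharing place or manner features) end up closer together than dissimilar ones.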