A fundamental task for an agent to understand an environment acoustically is to detect sound source location, i.e. direction of arrival~(DoA), and semantic label. The task is challenging: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy~(amplitude) contours, DoA is encoded in inter-channel phase differences; lastly, although microphone sensors are spatially sparse, the recorded waveforms are temporally dense due to high sampling rates. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic features, such as GCC-PHAT and Mel spectrograms, so as to benefit from mature 2D image based deep neural networks. We instead propose a novel end-to-end trainable framework, named \textit{SoundDoA}, that learns sound source DoA directly from raw waveforms. We first use a learnable front-end to dynamically encode sound source semantics and DoA-relevant features into a compact representation. A backbone network consisting of two identical sub-networks with layerwise communication is then proposed to further learn semantic label and DoA both separately and jointly. Extensive experimental results on the DCASE 2020 sound event detection and localization dataset demonstrate the superiority of SoundDoA over existing methods.
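To make the described pipeline concrete, the following is a minimal PyTorch sketch of a raw-waveform front-end followed by a two-branch backbone with layerwise communication. All module names, layer widths, the additive feature exchange, and the output head shapes are illustrative assumptions for exposition, not the authors' exact design.

\begin{verbatim}
# Minimal sketch (assumed PyTorch): learnable front-end over raw multichannel
# audio + two identical branches (semantics / DoA) with layerwise communication.
import torch
import torch.nn as nn


class LearnableFrontEnd(nn.Module):
    """Encodes raw multichannel waveforms into a compact frame-level representation."""

    def __init__(self, in_channels=4, out_channels=64, kernel_size=512, stride=256):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, stride=stride)
        self.norm = nn.BatchNorm1d(out_channels)
        self.act = nn.ReLU()

    def forward(self, wav):                 # wav: (batch, mics, samples)
        return self.act(self.norm(self.conv(wav)))   # (batch, channels, frames)


class Branch(nn.Module):
    """One sub-network of the backbone; both branches share this structure."""

    def __init__(self, dim=64, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv1d(dim, dim, 3, padding=1),
                          nn.BatchNorm1d(dim), nn.ReLU())
            for _ in range(num_layers)
        ])


class SoundDoABackbone(nn.Module):
    """Two identical branches with layerwise feature exchange (hypothetical heads)."""

    def __init__(self, dim=64, num_layers=3, num_classes=14):
        super().__init__()
        self.sed_branch = Branch(dim, num_layers)
        self.doa_branch = Branch(dim, num_layers)
        self.sed_head = nn.Conv1d(dim, num_classes, 1)      # per-frame class logits
        self.doa_head = nn.Conv1d(dim, 3 * num_classes, 1)  # per-class (x, y, z) DoA

    def forward(self, feat):
        s, d = feat, feat
        for sed_layer, doa_layer in zip(self.sed_branch.layers, self.doa_branch.layers):
            # layerwise communication: each branch sees the other's features
            # (a simple additive exchange is assumed here)
            s_new = sed_layer(s + d)
            d_new = doa_layer(d + s)
            s, d = s_new, d_new
        return self.sed_head(s), self.doa_head(d)


if __name__ == "__main__":
    wav = torch.randn(2, 4, 24000)             # 2 clips, 4 mics, 1 s at 24 kHz
    feat = LearnableFrontEnd()(wav)
    sed_logits, doa = SoundDoABackbone()(feat)
    print(sed_logits.shape, doa.shape)         # (2, 14, frames), (2, 42, frames)
\end{verbatim}

In this sketch the two branches learn their tasks separately through their own layers while the additive exchange before each layer lets semantic and DoA features inform one another, mirroring the "separately and jointly" learning described above.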