Traditionally, a speaker diarization system comprises multiple components that extract and cluster speaker embeddings. End-to-end diarization is more desirable because it allows a single model to be optimized, rather than the several separate components of a traditional setup. Moreover, end-to-end diarization systems can handle overlapped speech. The recently proposed self-attentive end-to-end diarization model with encoder-decoder based attractors (EEND-EDA) can process speech from an unknown number of speakers and has reported performance comparable to that of traditional systems. In this work, we aim to improve the EEND-EDA model. First, we increase the robustness of the model by incorporating an additive margin penalty that minimizes the intra-class variance. Second, we propose replacing the Transformer encoders with Conformer encoders to capture local information. Third, we propose using convolutional subsampling and upsampling instead of manual subsampling alone. Our proposed improvements yield a 21.6% relative reduction in diarization error rate (DER) on the full evaluation set of Track 2 of the DIHARD III challenge.
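
To make the additive margin idea concrete, the sketch below shows the generic additive-margin softmax loss for a single sample. It is an illustration only and is not claimed to be the formulation used in this work: the function name `am_softmax_loss`, the margin of 0.35, and the scale of 30 are assumptions chosen for the example, and how the penalty is attached to the EEND-EDA objective is not specified in this abstract.

```python
import math


def am_softmax_loss(cosines, target, margin=0.35, scale=30.0):
    """Generic additive-margin softmax loss for one sample (illustrative).

    cosines: cosine similarities between an embedding and each class centre.
    target:  index of the true class.
    margin, scale: hypothetical example values, not taken from the paper.
    """
    # Subtract the margin from the target-class cosine only, so the true
    # class must win by at least `margin` before the loss becomes small.
    logits = [
        scale * (c - margin if j == target else c)
        for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp over all classes.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    # Cross-entropy with respect to the margin-penalized target logit.
    return log_norm - logits[target]


# The same similarities are penalized more heavily once a margin is applied.
similarities = [0.8, 0.1, -0.2]
loss_plain = am_softmax_loss(similarities, target=0, margin=0.0)
loss_margin = am_softmax_loss(similarities, target=0, margin=0.35)
```

Because the margin lowers only the target-class logit, the loss stays larger until same-class embeddings lie closer to their class centre, which is the sense in which such a penalty reduces intra-class variance.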