End-to-end neural speaker diarization (EEND) systems are currently of high interest because the approach handles overlapped speech naturally and can be trained to optimize the diarization decision directly. Several recent studies have further improved EEND, for example by proposing new network structures for the encoder module or by integrating EEND with the clustering methods that are well established in speaker-embedding-based diarization. In this paper, we propose an alternative EEND backend that replaces the LSTM-based attractor estimator with a non-autoregressive approach based on a Transformer decoder. Moreover, we introduce an iterative method that alternately refines the system decision and the attractors. Finally, we present results obtained by additionally regularizing the proposed system with an Additive Angular Softmax speaker classification loss. We achieve up to 15% relative improvement over the baseline on 2-speaker real recordings from the CALLHOME dataset and up to 18% on simulated 2-speaker mixtures.
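
For readers unfamiliar with the regularizer, the following is a minimal sketch of an additive angular margin softmax loss for a single embedding. It assumes that "Additive Angular Softmax" refers to the common formulation in which a margin is added to the angle between the normalized embedding and the normalized weight vector of the target class. The function name `aam_softmax_loss`, the arguments `scale` and `margin`, and their default values are illustrative assumptions, not settings taken from this work, and the sketch does not show how the loss is attached to the diarization model.

```python
import math


def aam_softmax_loss(embedding, class_weights, label, scale=30.0, margin=0.2):
    """Additive angular margin softmax loss for one embedding (illustrative).

    embedding:     list of floats, the speaker embedding
    class_weights: list of lists, one weight vector per speaker class
    label:         index of the true speaker class
    scale, margin: hypothetical hyperparameters, not values from the paper
    """

    def l2_normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    emb = l2_normalize(embedding)
    logits = []
    for idx, weight in enumerate(class_weights):
        w = l2_normalize(weight)
        # Cosine of the angle between the embedding and the class weight,
        # clamped to guard acos against rounding error.
        cos_theta = max(-1.0, min(1.0, sum(a * b for a, b in zip(emb, w))))
        if idx == label:
            # Add the margin to the target-class angle only. This sketch does
            # not special-case theta + margin > pi, as practical code often does.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)

    # Cross-entropy over the scaled logits, using a stable log-sum-exp.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]
```

With `margin=0` the function reduces to a softmax cross-entropy over scaled cosine similarities; a positive margin makes the target class harder to satisfy, which pushes embeddings of the same speaker closer together in angle.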