End-to-end (E2E) automatic speech recognition (ASR) systems can be built from several model families, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based, and mask-predict models. Each of these architectures has its own pros and cons, so practitioners may switch between them depending on application requirements. Instead of building separate models, we propose a joint modeling scheme in which four decoders (CTC, attention, RNN-T, mask-predict) share a single encoder; we refer to this as 4D modeling. In addition, we propose to 1) train 4D models with a two-stage strategy that stabilizes multitask learning and 2) decode them with a novel time-synchronous one-pass beam search. We demonstrate that jointly trained 4D models improve the performance of each individual decoder. Further, we show that our joint CTC/RNN-T/attention decoding outperforms the previously proposed joint CTC/attention decoding.
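As a rough illustration of the shared-encoder idea, the following PyTorch sketch wires one Transformer encoder to four decoder heads. Every module choice, dimension, and name here (`FourDModel`, `ctc_head`, and so on) is an illustrative assumption, not the paper's implementation; real 4D systems use far richer decoders, masking schedules, and losses.

```python
import torch
import torch.nn as nn

class FourDModel(nn.Module):
    """Hypothetical skeleton: one shared encoder feeding four decoder heads."""

    def __init__(self, input_dim=80, d_model=256, vocab=500):
        super().__init__()
        # Shared encoder: a small Transformer stack over acoustic frames.
        self.proj = nn.Linear(input_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.embed = nn.Embedding(vocab, d_model)

        # 1) CTC head: frame-level token posteriors (class 0 = blank).
        self.ctc_head = nn.Linear(d_model, vocab)
        # 2) Attention decoder: autoregressive Transformer decoder.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.att_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.att_head = nn.Linear(d_model, vocab)
        # 3) RNN-T: prediction network plus an additive joiner.
        self.pred_net = nn.LSTM(d_model, d_model, batch_first=True)
        self.rnnt_head = nn.Linear(d_model, vocab)
        # 4) Mask-predict: non-autoregressive decoder over (masked) tokens.
        mlm_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.mlm_decoder = nn.TransformerDecoder(mlm_layer, num_layers=2)
        self.mlm_head = nn.Linear(d_model, vocab)

    def forward(self, feats, tokens):
        enc = self.encoder(self.proj(feats))   # (B, T, d_model), shared by all heads
        emb = self.embed(tokens)               # (B, U, d_model)

        ctc_logits = self.ctc_head(enc)        # (B, T, vocab)

        # Causal mask so the attention decoder stays autoregressive.
        U = tokens.size(1)
        causal = torch.triu(
            torch.full((U, U), float("-inf"), device=tokens.device), diagonal=1
        )
        att_logits = self.att_head(self.att_decoder(emb, enc, tgt_mask=causal))

        pred, _ = self.pred_net(emb)           # (B, U, d_model)
        # Joiner combines every (frame, label) pair -> (B, T, U, vocab).
        rnnt_logits = self.rnnt_head(torch.tanh(enc.unsqueeze(2) + pred.unsqueeze(1)))

        # In training, `tokens` here would be partially masked for mask-predict.
        mlm_logits = self.mlm_head(self.mlm_decoder(emb, enc))

        return {"ctc": ctc_logits, "att": att_logits,
                "rnnt": rnnt_logits, "mlm": mlm_logits}

# Toy forward pass: 2 utterances, 120 frames of 80-dim features, 12 tokens each.
model = FourDModel()
out = model(torch.randn(2, 120, 80), torch.randint(1, 500, (2, 12)))
for name, logits in out.items():
    print(name, tuple(logits.shape))
```

In multitask training, the four decoder losses would presumably be combined as a weighted sum (CTC loss, cross-entropy for the attention and mask-predict heads, and the transducer loss), which is the kind of joint objective the two-stage strategy is meant to stabilize.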