Given the success of diffusion models in synthesizing realistic speech, we investigate how diffusion can be incorporated into adaptive text-to-speech systems. Inspired by the adaptable layer norm modules for Transformers, we adopt a new diffusion-model backbone, the Diffusion Transformer, for acoustic modeling. Specifically, the adaptive layer norm in this architecture conditions the diffusion network on text representations, which in turn enables parameter-efficient adaptation. We show that the new architecture is a faster alternative to its convolutional counterpart for general text-to-speech, while demonstrating a clear advantage in naturalness and similarity over the Transformer for few-shot and few-parameter adaptation. In the zero-shot scenario the new backbone is a decent alternative, but the main benefit of this architecture is enabling high-quality parameter-efficient adaptation when finetuning is performed.
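To make the conditioning mechanism concrete, the following is a minimal NumPy sketch of adaptive layer norm as typically used in a Diffusion Transformer block: the layer norm itself has no learned affine parameters, and a small network instead predicts a per-feature scale and shift from the conditioning vector (here standing in for a text representation). All names, shapes, and the single-linear-layer conditioner are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension with no learned affine:
    # scale and shift are supplied by the conditioning network instead.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaptive_layer_norm(x, cond, W, b):
    # Predict per-feature scale (gamma) and shift (beta) from the
    # conditioning vector (e.g. a text representation).
    gamma_beta = cond @ W + b               # (batch, 2 * dim)
    gamma, beta = np.split(gamma_beta, 2, axis=-1)
    return gamma * layer_norm(x) + beta

# Toy shapes: hidden states x, conditioning vectors cond.
rng = np.random.default_rng(0)
dim, cond_dim = 8, 4
x = rng.standard_normal((2, dim))
cond = rng.standard_normal((2, cond_dim))
W = rng.standard_normal((cond_dim, 2 * dim)) * 0.01  # hypothetical conditioner weights
b = np.zeros(2 * dim)

out = adaptive_layer_norm(x, cond, W, b)
print(out.shape)  # (2, 8)
```

Because only the small conditioning projection (W, b here) needs to change per speaker or style, finetuning just these modulation parameters is one way such an architecture supports parameter-efficient adaptation.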