This paper proposes a novel method for jointly generating spoken and written text from input speech, expanding the use cases of speech-based applications. Spoken text generated by speech-to-spoken-text systems, i.e., speech recognition systems, contains disfluencies and lacks punctuation marks. Thus, spoken text is often converted into written text with a spoken-text-to-written-text system. However, such cascading prevents overall optimization and is computationally expensive. Although speech-to-written-text systems that directly output written text from speech have also been developed, they cannot produce spoken text. To efficiently produce both spoken and written text from speech, our key advance is to handle a joint sequence of spoken and written text in a single autoregressive model. This enables both texts to be generated correctly, with their mutual dependencies taken into account, in a single decoding process. Our experiments demonstrate the effectiveness of the proposed method.
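To make the joint-sequence idea concrete, below is a minimal sketch of one plausible serialization: spoken tokens, a separator, then written tokens, all produced in a single autoregressive pass and split afterwards. Everything here is a hypothetical illustration; the separator and end tokens (`SEP`, `EOS`), the `step_fn` stand-in for one speech-conditioned decoder step, and all function names are assumptions, since the abstract does not specify the paper's actual serialization or decoding details.

```python
# Hypothetical sketch of joint spoken/written decoding; not the
# paper's actual implementation.

SEP = "<sep>"  # assumed separator between the spoken and written halves
EOS = "<eos>"  # assumed end-of-sequence token


def build_joint_target(spoken_tokens, written_tokens):
    """Concatenate spoken and written tokens into one training target."""
    return spoken_tokens + [SEP] + written_tokens + [EOS]


def split_joint_output(tokens):
    """Split a decoded joint sequence back into spoken and written parts."""
    if EOS in tokens:
        tokens = tokens[: tokens.index(EOS)]
    if SEP in tokens:
        i = tokens.index(SEP)
        return tokens[:i], tokens[i + 1 :]
    return tokens, []  # no separator emitted: written part is empty


def greedy_joint_decode(step_fn, max_len=256):
    """One autoregressive pass yielding spoken text, then written text.

    step_fn(prefix) stands in for a single decoder step conditioned on
    the speech encoding; it returns the next token given the prefix.
    """
    prefix = []
    for _ in range(max_len):
        token = step_fn(prefix)
        prefix.append(token)
        if token == EOS:
            break
    return split_joint_output(prefix)


# Toy usage with a fixed token sequence standing in for a real decoder:
spoken, written = greedy_joint_decode(
    lambda prefix: ["uh", "hello", "there", SEP,
                    "Hello", "there", ".", EOS][len(prefix)]
)
# spoken  -> ["uh", "hello", "there"]
# written -> ["Hello", "there", "."]
```

Under these assumptions, placing the spoken half before the separator lets the written half condition on the already-generated spoken tokens, which is one natural way a single decoding process could exploit the dependencies between the two texts.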