Esophageal speech is an alternative speech production method for people who have undergone laryngectomy, but it often suffers from reduced intelligibility. This paper proposes three lightweight speech enhancement methods trained with the loss given by end-to-end automatic speech recognition models. The proposed methods are based on frame mask (FM), ideal ratio mask (IRM), and voice conversion (VC) techniques. Evaluations show that speech enhanced by the proposed methods improves on the original esophageal speech in terms of recognition rates by both automatic speech recognition systems and human evaluators, naturalness as assessed by mean opinion scores, and the number of detected voicing segments.