Fundamental frequency (f0) estimation, also known as pitch
tracking, has been a long-standing research topic in the speech and
signal processing community. Many pitch estimation algorithms, however,
fail in noisy conditions or introduce large delays due to long analysis
frames or Viterbi decoding, which relies on future frames.
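As a rough illustration of where such delays come from, the helper below estimates the algorithmic delay of a frame-based tracker. It assumes that each estimate is attributed to the centre of its analysis window, so half a window of future samples is needed, and that a fixed-lag Viterbi decoder waits a number of extra hops before committing to a decision. The function name and the example parameters (1024-sample window, 160-sample hop, 16 kHz) are illustrative assumptions, not the settings of any particular algorithm.

```python
def algorithmic_delay_ms(window_len: int, hop: int, lookahead_frames: int, fs: int) -> float:
    """Rough algorithmic delay (ms) of a frame-based pitch tracker.

    Assumptions: each pitch estimate belongs to the centre of its analysis
    window, so window_len // 2 future samples are needed, and a fixed-lag
    Viterbi decoder waits `lookahead_frames` further hops before deciding.
    """
    future_samples = window_len // 2 + lookahead_frames * hop
    return 1000.0 * future_samples / fs


# Illustrative values: 1024-sample window and 10 ms hop at 16 kHz.
frame_only = algorithmic_delay_ms(1024, 160, lookahead_frames=0, fs=16000)
with_viterbi = algorithmic_delay_ms(1024, 160, lookahead_frames=10, fs=16000)
print(frame_only, with_viterbi)  # 32.0 132.0
```

Under these assumptions the window alone costs 32 ms, and ten frames of Viterbi lookahead add another 100 ms, well above the approx. 5 ms analysis delay targeted in this work.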
In this study, we
propose a deep learning-based pitch estimation algorithm, LACOPE, which
is trained in a joint pitch estimation and speech enhancement framework.
In contrast to previous work, this algorithm allows for a configurable
latency down to zero algorithmic delay. This is achieved by exploiting
the smoothness of the pitch trajectory: a recurrent neural network
compensates for the delay introduced by the feature computation by
predicting the pitch at a desired point in time, allowing a trade-off
between pitch accuracy and latency.
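To illustrate why a smooth pitch trajectory makes delay compensation possible, and why accuracy drops as more delay is compensated, the sketch below swaps the recurrent network for a plain least-squares line fit over the last few frames. This is not the LACOPE predictor; the function `extrapolate_pitch`, the 5-frame history, and the synthetic vibrato trajectory are hypothetical choices made only for illustration. The prediction horizon plays the role of the feature-computation delay (in frames) that has to be bridged.

```python
import numpy as np


def extrapolate_pitch(history_hz: np.ndarray, horizon: int, order: int = 1) -> float:
    """Predict f0 `horizon` frames after the last observed frame.

    A least-squares polynomial fit over the recent history stands in for the
    recurrent predictor; it relies only on the trajectory being locally smooth.
    """
    t = np.arange(len(history_hz))
    coeffs = np.polyfit(t, history_hz, order)
    return float(np.polyval(coeffs, len(history_hz) - 1 + horizon))


# Synthetic smooth trajectory: 150 Hz with +/-10 Hz vibrato, period of 50 frames.
frames = np.arange(200)
f0 = 150.0 + 10.0 * np.sin(2 * np.pi * frames / 50)

history_len = 5
errors = []
for horizon in range(5):  # horizon = delay (in frames) to be compensated
    abs_err = []
    for n in range(history_len + horizon, len(f0)):
        # Only frames up to n - horizon are available; predict frame n.
        hist = f0[n - horizon - history_len + 1 : n - horizon + 1]
        abs_err.append(abs(extrapolate_pitch(hist, horizon) - f0[n]))
    errors.append(float(np.mean(abs_err)))

print([round(e, 3) for e in errors])  # mean absolute error (Hz) per horizon
```

The printed errors grow with the horizon, which mirrors the accuracy/latency trade-off described above: the further ahead the predictor must look, the less the local smoothness assumption holds.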
We integrate the pitch
estimation into a speech enhancement framework for hearing aids. For
this application, we allow a delay of approximately 5 ms on the analysis
side. The pitch estimate is then used to construct a comb filter in the
frequency domain as a post-processing step that removes intra-harmonic
noise, i.e., noise between the pitch harmonics.
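A frequency-domain comb filter driven by a pitch estimate can be sketched as a per-bin gain that is 1 at multiples of f0 and drops to a floor in between. The raised-cosine bump shape, the function name `comb_filter_gains`, and the parameters `width_hz` and `floor` are hypothetical design choices; the actual filter used in the described framework may differ. In a real system, f0 would be supplied frame by frame by the pitch estimator.

```python
import numpy as np


def comb_filter_gains(f0_hz: float, n_fft: int, fs: int,
                      width_hz: float = 100.0, floor: float = 0.1) -> np.ndarray:
    """Per-bin gains for one STFT frame: 1 at multiples of f0, `floor` in between.

    Hypothetical design: a raised-cosine bump of total width `width_hz`
    centred on every harmonic k * f0 (k >= 1).
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    if f0_hz <= 0:  # unvoiced frame: leave the spectrum untouched
        return np.ones_like(freqs)
    # Distance from each bin to the nearest harmonic (bins below f0/2 map to k = 1).
    k = np.maximum(np.round(freqs / f0_hz), 1.0)
    dist = np.abs(freqs - k * f0_hz)
    bump = np.where(dist < width_hz / 2,
                    0.5 * (1.0 + np.cos(2.0 * np.pi * dist / width_hz)), 0.0)
    return floor + (1.0 - floor) * bump


# Usage on a synthetic noisy frame with f0 = 250 Hz (5 harmonics + white noise).
fs, n_fft, f0 = 16000, 512, 250.0
t = np.arange(n_fft) / fs
rng = np.random.default_rng(0)
frame = sum(np.sin(2 * np.pi * f0 * h * t) for h in range(1, 6))
frame = frame + 0.3 * rng.standard_normal(n_fft)
spec = np.fft.rfft(frame * np.hanning(n_fft))
gains = comb_filter_gains(f0, n_fft, fs)
enhanced = np.fft.irfft(spec * gains, n=n_fft)
```

`width_hz` trades off how much of each harmonic's spectral main lobe is preserved against how much of the noise between harmonics is removed, while `floor` limits the maximum attenuation to avoid overly aggressive suppression.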
Our pitch estimation performance is on par with state-of-the-art (SOTA)
algorithms such as PYIN and CREPE for speech in all noise conditions,
while introducing minimal latency.