Hao Tang 唐顥

CTC does not always induce a proper probability distribution

This note discusses why connectionist temporal classification (CTC) does not always induce a proper probability distribution.1 The result is perhaps not surprising to many, but it is never formally written down. At the high level, the probability distribution is not proper when repeating labels appear in the label sequence. I will briefly review the definition of CTC, formally define the problem, and discuss the sufficient conditions for the induced distribution to be proper.

Suppose we have a sequence of T frames x=(x1,,xT) and a sequence of K labels y=(y1,,yK) for ykL for k=1,,K and for some label set L. Let B:(L{})L be a function that collapses repeating labels and removes blanks (). CTC defines the probability distribution p(y|x)=zB1(y)p(z|x) where B1 is the pre-image of B. In words, if zB1(y), then y=B(z). Each zt is assumed to be independent of zt for tt, so we have p(z|x)=t=1Tp(zt|x). The sequence z is often called an alignment or a path.

Every time we refer to a function as a probability distribution, we should always ask ourselves whether the distribution is well defined. For a function p to be a proper probability distribution, it should satisfy p(ω)0 for all ω is the domain and ωp(ω)=1. Here is the central question of this note: Given the definition of CTC above, is it true that yp(y|x)=1?

Sufficient conditions for properness

Before looking at p(y|x), we first show that p(z|x) is proper. Since p(zt|x)'s are almost always the result of a softmax, we can assume p(zt|x) is proper. It is also not hard to see that zp(z|x)=z1zTp(z|x)=z1zTt=1Tp(zt|x)=t=1Tztp(zt|x)=t=1T1=1, so p(z|x) is proper as long as p(zt|x)'s are.

Given that p(z|x) is proper and that p(y|x) is a sum based on p(z|x), p(y|x) should be proper if the sum is carefully handled. We will use the following fact about probability distributions. Let Ω be the event space of distribution p. If A1,,An are disjoint subsets of Ω and i=1nAi=Ω, then i=1np(Ai)=p(i=1nAi)=p(Ω)=1. The key condition here is disjointness (AiAj= for any i and j) and coverage (i=1nAi=Ω, or for any ωΩ, there exists i{1,,n} such that ωAi).

To apply the above fact, let Ω be the domain of p(z|x), i.e., (L{})T. The function p(y|x) sums over B1(y), a subset of Ω. If the subsets are disjoint and they cover everything in Ω, then yp(y|x)=1 and p(y|x) is proper. More formally, disjointness requires B1(y)B1(y)= for any y and y, and coverage requires that there is a y for every z such that zB1(y). These form the sufficient condition for p(y|x) to be proper.

In the case of CTC, it is trivially true that any alignment z there is a label sequence y such that y=B(z)—we simply apply B on z to get a y. The deciding condition is whether B1(y) and B1(y) are disjoint for any y and y. What happens when there are y and y such that B1(y)B1(y)? If we train to maximize p(y|x), the probability mass of p(y|x) would also increase.

Implementation of B1

To compute the marginalization efficiently, care needs to be taken when implementing B1. A common approach is to implement B1 as a graph (or more precisely a finite-state acceptor), and run dynamic programming on it.

Since B removes duplicate symbols and blanks, a common approach to implement B1 is to insert zero or more blanks between each symbol and to repeat the symbols one or more times. For example, B1(c a t)=c+a+t+, where I have used regular expression to indicate how many times a symbol can be repeated. However, this implementation causes a problem when there are repeating symbols in the label sequence. For example, if B1(m e e t)=m+e+e+t+, the sequence m m e e e t t belongs to B1(m e e t), but the sequence also belongs to B1(m e t). In other words, B1(m e t)B1(m e e t). This particular implementation not only fails to be faithful to the defition of B (because B(m m e e e t t)=m e t), but also violates the sufficient conditions of properness.2

A more careful implementation of B1 would be to insert one or more blanks between repeating symbols, to insert zero or more blanks otherwise, and to repeat the symbols one or more times. With this implementation, B1(m e e t)=m+e++e+t+. Note that there has to be at least one blank between the two e's.3

Other variants

Since properness depends on the choice of B, there are a few other variants that guarantee properness. For example, avoid repeating symbols while allowing blanks (Graves, 2012); only allow repeating symbols and avoid using blanks altogether (Collobert et al., 2016); create a separate symbol for each pair of symbols (Zweig et al., 2017). These variants are considered generalizations of CTC and some are not called CTC anymore.

Acknowledgement

I would like to thank Shubham Toshniwal for the wonderful discussion on the different CTC implementations and their consequences.
1 This note is the result of preparing a guest lecture on end-to-end speech recognition for MIT 6.345. The properness was supposed to be a sanity check but turned out to be false.
2 Note that blanks need to be removed after removing the duplicate symbols. Otherwise, B(m m e  e t t)=m e t, and all the sequences that we think should go into B1(m e e t) belong to B1(m e t).
3 In fact, the subtlety was taken care of in equation (6) of (Graves et al., 2006).