CTC does not always induce a proper probability distribution
This note discusses why connectionist temporal classification (CTC)
does not always induce a proper probability distribution.1
The result is perhaps not surprising to many, but to my knowledge
it has never been formally written down.
At a high level, the induced distribution fails to be proper when
repeating labels appear in the label sequence.
I will briefly review the definition
of CTC, formally define the problem, and discuss the sufficient
conditions for the induced distribution to be proper.
Suppose we have a sequence of frames $x_{1:T} = (x_1, \dots, x_T)$
and a sequence of labels $y_{1:K} = (y_1, \dots, y_K)$,
with $y_k \in L$ for $k = 1, \dots, K$, $K \le T$, and for some label set $L$.
Let $B$ be a function that
collapses repeating labels and removes blanks ($\varnothing$).
CTC defines the probability distribution
$$p(y_{1:K} \mid x_{1:T}) = \sum_{\pi \in B^{-1}(y_{1:K})} p(\pi \mid x_{1:T}),$$
where $B^{-1}(y_{1:K})$ is the pre-image of $y_{1:K}$ under $B$.
In words, if $B(\pi) = y_{1:K}$, then $\pi \in B^{-1}(y_{1:K})$.
Each $\pi_t$ is assumed to be independent of $\pi_{t'}$ for $t \neq t'$,
so we have
$$p(\pi \mid x_{1:T}) = \prod_{t=1}^{T} p(\pi_t \mid x_{1:T}).$$
The sequence $\pi = (\pi_1, \dots, \pi_T)$ is often called an alignment or a path.
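To make the definition concrete, here is a brute-force sketch of the marginalization in Python. It is exponential in the number of frames and only meant to mirror the definition; the helper names and the use of `_` as the blank symbol are my own choices for illustration.

```python
import itertools

import numpy as np

BLANK = "_"

def collapse(path):
    """B: merge adjacent repeated symbols, then remove blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return "".join(s for s in merged if s != BLANK)

def ctc_prob(y, probs, symbols):
    """p(y | x): sum p(pi | x) over all paths pi with B(pi) == y.

    probs is a (T, len(symbols)) array of per-frame posteriors;
    p(pi | x) factorizes over frames by the independence assumption.
    """
    T = probs.shape[0]
    total = 0.0
    for path in itertools.product(range(len(symbols)), repeat=T):
        if collapse([symbols[i] for i in path]) == y:
            total += np.prod([probs[t, i] for t, i in enumerate(path)])
    return total

# per-frame posteriors from a softmax, so each row sums to one
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 3))  # T = 3 frames, symbols {_, a, b}
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
p_ab = ctc_prob("ab", probs, ["_", "a", "b"])
```

Summing `ctc_prob` over every label sequence reachable from the length-3 paths gives exactly one, which previews the properness argument that follows.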
Every time we refer to a function as a probability distribution,
we should ask ourselves whether the distribution is well defined.
For a function $p$ to be a proper probability distribution,
it should satisfy $p(\omega) \geq 0$ for all $\omega$ in the domain $\Omega$
and $\sum_{\omega \in \Omega} p(\omega) = 1$.
Here is the central question of this note:
Given the definition of CTC above, is it true that
$$\sum_{y_{1:K}} p(y_{1:K} \mid x_{1:T}) = 1,$$
where the sum is over all possible label sequences?
Sufficient conditions for properness
Before looking at $p(y_{1:K} \mid x_{1:T})$, we first show that $p(\pi \mid x_{1:T})$ is proper.
Since the $p(\pi_t \mid x_{1:T})$'s are almost always the result of a softmax,
we can assume each $p(\pi_t \mid x_{1:T})$ is proper. It is also not hard to see that
$$\sum_{\pi} p(\pi \mid x_{1:T}) = \sum_{\pi_1} \cdots \sum_{\pi_T} \prod_{t=1}^{T} p(\pi_t \mid x_{1:T}) = \prod_{t=1}^{T} \sum_{\pi_t} p(\pi_t \mid x_{1:T}) = 1,$$
so $p(\pi \mid x_{1:T})$ is proper as long as the $p(\pi_t \mid x_{1:T})$'s are.
Given that $p(\pi \mid x_{1:T})$ is proper and that $p(y_{1:K} \mid x_{1:T})$ is a sum based on
$p(\pi \mid x_{1:T})$, the distribution $p(y_{1:K} \mid x_{1:T})$ should be proper if the sum is carefully handled.
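The exchange of sum and product in this identity can also be checked numerically; the following is a toy check with made-up shapes, not anything from the note itself.

```python
import itertools

import numpy as np

# per-frame posteriors from a softmax over 3 symbols, T = 4 frames
rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# left-hand side: sum over all 3^4 paths of the per-frame product
lhs = sum(
    np.prod([probs[t, i] for t, i in enumerate(path)])
    for path in itertools.product(range(3), repeat=4)
)

# right-hand side: product over frames of the per-frame sums,
# each of which is 1 because the rows come from a softmax
rhs = np.prod(probs.sum(axis=1))
```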
We will use the following fact about probability distributions.
Let $\Omega$ be the event space of distribution $p$.
If $A_1, \dots, A_n$ are disjoint subsets of $\Omega$ and
$A_1 \cup \dots \cup A_n = \Omega$, then
$$\sum_{i=1}^{n} \sum_{\omega \in A_i} p(\omega) = \sum_{\omega \in \Omega} p(\omega) = 1.$$
The key conditions here are disjointness ($A_i \cap A_j = \varnothing$ for any $i \neq j$)
and coverage ($A_1 \cup \dots \cup A_n = \Omega$, or equivalently,
for any $\omega \in \Omega$, there exists an $A_i$ such that $\omega \in A_i$).
To apply the above fact, let $\Omega$ be the domain of $p(\pi \mid x_{1:T})$,
i.e., $\Omega = (L \cup \{\varnothing\})^T$.
The function $p(y_{1:K} \mid x_{1:T})$ sums over $B^{-1}(y_{1:K})$, a subset of $\Omega$.
If the subsets $B^{-1}(y_{1:K})$ are disjoint
and they cover everything in $\Omega$, then
$$\sum_{y_{1:K}} p(y_{1:K} \mid x_{1:T}) = \sum_{y_{1:K}} \sum_{\pi \in B^{-1}(y_{1:K})} p(\pi \mid x_{1:T}) = 1,$$
and $p(y_{1:K} \mid x_{1:T})$ is proper.
More formally, disjointness requires $B^{-1}(y_{1:K}) \cap B^{-1}(y'_{1:K'}) = \varnothing$
for any $y_{1:K} \neq y'_{1:K'}$, and coverage requires that there is a $y_{1:K}$ for every
$\pi \in \Omega$ such that $\pi \in B^{-1}(y_{1:K})$.
These form the sufficient conditions for $p(y_{1:K} \mid x_{1:T})$ to be proper.
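Under the true collapsing function, coverage and disjointness hold by construction, because a function maps every path to exactly one label sequence. A small enumeration makes this concrete (with `_` as the blank; the helper is my own sketch):

```python
import itertools

BLANK = "_"

def collapse(path):
    """B: merge adjacent repeated symbols, then remove blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return "".join(s for s in merged if s != BLANK)

# group every length-4 path over {_, a, b} into the pre-image
# of its collapsed label sequence
paths = list(itertools.product("_ab", repeat=4))
preimages = {}
for path in paths:
    preimages.setdefault(collapse(path), set()).add(path)

# the pre-images are disjoint and cover all paths,
# so their sizes add up to the size of the whole path space
covered = sum(len(s) for s in preimages.values())
```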
In the case of CTC, it is trivially true that for
any alignment $\pi$ there is a label sequence $y_{1:K}$
such that $\pi \in B^{-1}(y_{1:K})$: we simply apply $B$ on $\pi$ to get a $y_{1:K}$.
The deciding condition is whether $B^{-1}(y_{1:K})$ and
$B^{-1}(y'_{1:K'})$ are disjoint for any $y_{1:K} \neq y'_{1:K'}$.
What happens when there are $y_{1:K} \neq y'_{1:K'}$ such that
$B^{-1}(y_{1:K}) \cap B^{-1}(y'_{1:K'}) \neq \varnothing$?
If we train to maximize $p(y_{1:K} \mid x_{1:T})$,
the probability mass of $p(y'_{1:K'} \mid x_{1:T})$ would also increase.
Implementation of $B^{-1}$
To compute the marginalization efficiently,
care needs to be taken when implementing $B^{-1}$.
A common approach is to implement $B^{-1}(y_{1:K})$
as a graph (or more precisely, a finite-state acceptor)
and to run dynamic programming on it.
Since $B$ removes duplicate symbols and blanks,
a common approach to implement $B^{-1}$ is to
insert zero or more blanks between the symbols
and to repeat each symbol one or more times.
For example,
$$B^{-1}(\text{be}) = \varnothing^* \text{b}^+ \varnothing^* \text{e}^+ \varnothing^*,$$
where I have used regular expressions to indicate how many times
a symbol can be repeated.
However, this implementation causes a problem when there are
repeating symbols in the label sequence. For example, if
$$B^{-1}(\text{bee}) = \varnothing^* \text{b}^+ \varnothing^* \text{e}^+ \varnothing^* \text{e}^+ \varnothing^*,$$
the sequence $\text{bee}$ belongs to $B^{-1}(\text{bee})$,
but the sequence $\text{bee}$ also belongs to $B^{-1}(\text{be})$.
In other words, $B^{-1}(\text{bee}) \cap B^{-1}(\text{be}) \neq \varnothing$.
This particular implementation not only fails to be faithful to the definition
of $B^{-1}$ (because $B(\text{bee}) = \text{be} \neq \text{bee}$),
but also violates the sufficient conditions of properness.2
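The overlap is easy to demonstrate by writing the naive construction as an actual regular expression, with `_` standing in for the blank (the function name is my own): for the label sequence bee, the path bee matches the patterns for both bee and be.

```python
import re

def naive_preimage(y):
    """Naive B^{-1} as a regex: blanks (_) optional everywhere,
    each symbol repeated one or more times."""
    return "_*" + "_*".join(c + "+" for c in y) + "_*"

# the path "bee" sits in both pre-images, so they are not disjoint
in_bee = re.fullmatch(naive_preimage("bee"), "bee") is not None
in_be = re.fullmatch(naive_preimage("be"), "bee") is not None
```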
A more careful implementation of $B^{-1}$ would be
to insert one or more blanks between repeating symbols,
to insert zero or more blanks otherwise,
and to repeat the symbols one or more times.
With this implementation,
$$B^{-1}(\text{bee}) = \varnothing^* \text{b}^+ \varnothing^* \text{e}^+ \varnothing^+ \text{e}^+ \varnothing^*.$$
Note that there has to be at least one blank between the two e's.3
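This more careful construction can also be written as a regular expression, with `_` standing in for the blank (the helper name is my own): for the label sequence bee, the path bee is now rejected, while be_e is accepted.

```python
import re

def careful_preimage(y):
    """B^{-1} as a regex: a mandatory blank (_) between repeated
    symbols, optional blanks elsewhere, each symbol repeated."""
    pattern = "_*"
    for i, c in enumerate(y):
        if i > 0:
            pattern += "_+" if c == y[i - 1] else "_*"
        pattern += c + "+"
    return pattern + "_*"

# "bee" collapses to "be", so it must not be in B^{-1}(bee);
# "be_e" keeps the two e's apart and collapses to "bee"
rejects_bee = re.fullmatch(careful_preimage("bee"), "bee") is None
accepts_blank = re.fullmatch(careful_preimage("bee"), "be_e") is not None
```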
Other variants
Since properness depends on the choice of $B$,
there are a few other variants that guarantee properness:
for example, avoiding repeated symbols while allowing blanks
(Graves, 2012),
allowing repeated symbols while avoiding blanks altogether
(Collobert et al., 2016),
or creating a separate symbol for each pair of symbols
(Zweig et al., 2017).
These variants can be seen as generalizations of CTC, and some are not
called CTC anymore.
Acknowledgement
I would like to thank Shubham Toshniwal
for the wonderful discussion on the different CTC implementations
and their consequences.
This note is the result of preparing a guest lecture
on end-to-end speech recognition for MIT 6.345. Properness
was supposed to be a sanity check, but it turned out not to hold in general.